基于redis实现的分布式时间序列存储Roshi

2019 年 2 月 7 日 大数据和云计算技术


上面截图是redis.iohttps://redis.io/documentation)对Roshi的推荐,点进去学习了一下,推荐给大家。

Roshi的原始使用场景源于Sound Cloud(音频分享界的youtube,界面简洁无广告)。SoundCloud流主要通过你的社交图来表示与你相关的东西,按时间顺序排列,最新的第一个。因此需要一个高性能基于时间序列的KV存储。详细的场景见文末原始博客介绍。


Roshi这个非常适合社交等场景使用,Roshi的特点是:

1roshi非常简洁,5000go代码,2300行是测试代码。

2、支持CDRT,最终一致性。

3、核心是基于RedisZSET存储数据

4roshi上层是无状态的全分布式

5Roshi提供的是基于redis的基于时间戳数据的高性能存储。

6、提供高层API

  • Insert(key, timestamp, value)

  • Delete(key, timestamp, value)

  • Select(key, offset, limit) []TimestampValue


原始链接:https://developers.soundcloud.com/blog/roshi-a-crdt-system-for-timestamped-events

      

Roshi: a CRDT system for timestamped events

By Peter Bourgon

Let's talk about the stream.

The SoundCloud stream represents stuff that's relevant to you primarily via your social graph, arranged in time order, newest-first. The atom of that data model, an event, is a simple enough thing.

  • Timestamp

  • User who did the thing

  • Identifier of the thing that was done

For example,

  • At 2014-04-01T13:14:15.034Z,

  • a-trak

  • Reposted track skrillex/duck-sauce-nrg-skrillex-kill

If you followed A-Trak, you'd want to see that repost event in your stream. Easy. The difficult thing about time-ordered events is scale, and there are basically two strategies for building a large-scale time-ordered event system.

Data models

Fan out on write means everybody gets an inbox.

 

That's how it works today: we use Cassandra, and give each user a row in a column family. When A-Trak reposts Skrillex, we fan-out that event to all of A-Trak's followers, and make a bunch of inserts. Reads are fast, which is great. But writes carry perverse incentives: the more followers you have, the longer it takes to persist all of your updates. Storage requirements are also quadratic against user growth and follower count (i.e. affiliation density). And mutations, e.g. changes in the social graph, become costly or unfeasible to implement at the data layer. It works, but it's unwieldy in a lot of dimensions.

At some point, those caveats and restrictions started affecting our ability to iterate on the stream. To keep up with product ideas, we needed to address the infrastructure. And rather than tackling each problem in isolation, we thought about changing the model.

The alternative is fan in on read.

 

When A-Trak reposts Skrillex, it's a single append to A-Trak's outbox. When users view their streams, the system will read the most recent events from the outboxes of everyone they follow, and perform a merge. Writes are fast, storage is minimal, and since streams are generated at read time, they naturally represent the present reality. (It also opens up a lot of possibilities for elegant implementations of product features and experiments.)

Of course, reads are difficult. If you follow thousands of users, making thousands of simultaneous reads, time-sorting, merging, and cutting within a typical request-response deadline isn't trivial. As far as we know, nobody operating at our scale builds timelines via fan-in-on-read. And we presume that's due at least in part to the challenges of reads.

Yet we saw potential here. Storage reduction was actually huge: we projected a complete fan-in-on-read data size for all users on the order of a hundred gigabytes. At that size, it's feasible to keep the data set in memory, distributed among commodity servers. The problem then becomes coördination: how do you reliably and correctly populate that data system (writes), and materialize views from up to thousands of sources by hard deadlines (reads)?

Enter the CRDT

If you're into so-called AP data systems, you've probably run into the term CRDTrecently. CRDTs are conflict-free replicated data types: data structures for distributed systems. The tl;dr on CRDTs is that by constraining your operations to only those which are associative, commutative, and idempotent, you sidestep a lot of the complexity in distributed programming. (See: ACID 2.0 and/or CALM theorem.) That, in turn, makes it straightforward to guarantee eventual consistency in the face of failure.

With a bit of thinking, we were able to map a fan-in-on-read stream product to a data model that could be implemented with a specific type of CRDT. We were then able to focus on performance, optimizing our reads without becoming overwhelmed by incidental complexity imposed by the consistency model.

Roshi

The result of our work is Roshi, a distributed storage system for time-series events. It implements what we believe is a novel CRDT set type, closely resembling a LWW-element-set with inline garbage collection. At its core, it uses the Redis ZSET sorted set to store state, and orchestrates self-repairing reads and writes on top, in a stateless operational layer. We spent a long while optimizing the read path to support our latency and QPS requirements, and we're confident that Roshi will accommodate our exponential growth for years. It took about six developer months to build, and we're in the process of rolling it out now.

Roshi is fully open-source, and all the gory technical details are in the repository, so please do check it out. I hope it's easy to grok: at the time of writing, it's 5000 lines of Go, of which 2300 are tests. And we intend to keep the codebase lean, explicitly not adding features that are outside of the tightly defined problem domain.

Open-sourcing our work naturally serves the immediate goal of providing usable software to the community. We hope that Roshi may be a good fit for problems in your organizations, and we look forward to collaborating with anyone who's interested in contributing. Open-sourcing also serves another, perhaps more interesting goal, which is advancing a broader discussion about software development. The obvious reaction to Roshi is to ask why we didn't implement it with an existing, proven data system like Cassandra. But we too often underestimate the costs of doing that: costs like mapping your domain to the generic language of the system, learning the subtleties of the implementation, operating it at scale, and dealing with bugs that your likely novel use cases may reveal. There are even second-degree costs: when software engineering is reduced to plumbing together generic systems, software engineers lose their sense of ownership, which is the foundation of craftsmanship and software quality.

Given a well-defined problem, a specific solution may be far less costly than a generic version: there's a smaller domain translation, a much smaller surface area, and less operational friction. We hope that Roshi stands in evidence for the case that the practice of software engineering can be a more thoughtful and crafted process. Software that is "invented here" can, in the right circumstances, deliver outstanding business value.

Roshi was a team effort. I'm deeply indebted to the amazing work of Tomás SenartBjörn Rabenstein, and Johan Uhle, without whom Roshi would have never been possible.


登录查看更多
0

相关内容

Redis 是一个使用 C 语言写成的,开源的 key-value 数据库。
【图神经网络(GNN)结构化数据分析】
专知会员服务
114+阅读 · 2020年3月22日
TensorFlow Lite指南实战《TensorFlow Lite A primer》,附48页PPT
专知会员服务
68+阅读 · 2020年1月17日
机器学习入门的经验与建议
专知会员服务
90+阅读 · 2019年10月10日
Transferring Knowledge across Learning Processes
CreateAMind
26+阅读 · 2019年5月18日
PHP使用Redis实现订阅发布与批量发送短信
安全优佳
7+阅读 · 2019年5月5日
使用 Canal 实现数据异构
性能与架构
20+阅读 · 2019年3月4日
利用动态深度学习预测金融时间序列基于Python
量化投资与机器学习
18+阅读 · 2018年10月30日
LibRec 精选:基于LSTM的序列推荐实现(PyTorch)
LibRec智能推荐
50+阅读 · 2018年8月27日
分布式TensorFlow入门指南
机器学习研究会
4+阅读 · 2017年11月28日
【推荐】用Tensorflow理解LSTM
机器学习研究会
36+阅读 · 2017年9月11日
【推荐】TensorFlow手把手CNN实践指南
机器学习研究会
5+阅读 · 2017年8月17日
Arxiv
9+阅读 · 2019年4月19日
Embedding Logical Queries on Knowledge Graphs
Arxiv
3+阅读 · 2019年2月19日
Rapid Customization for Event Extraction
Arxiv
7+阅读 · 2018年9月20日
Arxiv
5+阅读 · 2018年4月30日
VIP会员
相关VIP内容
【图神经网络(GNN)结构化数据分析】
专知会员服务
114+阅读 · 2020年3月22日
TensorFlow Lite指南实战《TensorFlow Lite A primer》,附48页PPT
专知会员服务
68+阅读 · 2020年1月17日
机器学习入门的经验与建议
专知会员服务
90+阅读 · 2019年10月10日
相关资讯
Transferring Knowledge across Learning Processes
CreateAMind
26+阅读 · 2019年5月18日
PHP使用Redis实现订阅发布与批量发送短信
安全优佳
7+阅读 · 2019年5月5日
使用 Canal 实现数据异构
性能与架构
20+阅读 · 2019年3月4日
利用动态深度学习预测金融时间序列基于Python
量化投资与机器学习
18+阅读 · 2018年10月30日
LibRec 精选:基于LSTM的序列推荐实现(PyTorch)
LibRec智能推荐
50+阅读 · 2018年8月27日
分布式TensorFlow入门指南
机器学习研究会
4+阅读 · 2017年11月28日
【推荐】用Tensorflow理解LSTM
机器学习研究会
36+阅读 · 2017年9月11日
【推荐】TensorFlow手把手CNN实践指南
机器学习研究会
5+阅读 · 2017年8月17日
Top
微信扫码咨询专知VIP会员