Streaming Messages from Kafka into Redshift in near Real-Time

This is the sixth post in a series covering Yelp's real-time streaming data infrastructure. Our series explores in-depth how we stream MySQL updates in real-time with an exactly-once guarantee, how we automatically track & migrate schemas, how we process and transform streams, and finally how we connect all of this into datastores like Redshift and Salesforce.
Read the posts in the series:

  • Billions of Messages a Day - Yelp's Real-time Data Pipeline
  • Streaming MySQL tables in real-time to Kafka
  • More Than Just a Schema Store
  • PaaStorm: A Streaming Processor
  • Data Pipeline: Salesforce Connector

The Yelp Data Pipeline gives developers a suite of tools to easily move data around the company. We have outlined three main components of the core Data Pipeline infrastructure so far. First, the MySQLStreamer replicates MySQL statements and publishes them into a stream of schema-backed Kafka topics. Second, the Schematizer provides a centralized source of truth about each of our Kafka topics. It persists the Avro schema used to encode the data in a particular topic, the owners of this data, and documentation about various fields. Finally, our stream processor, PaaStorm, makes it simple to consume from the Data Pipeline, perform transformations, and publish back into the Pipeline. Together, these tools can be used to get data to the people who care about it most.
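As a rough illustration of what consuming one of these schema-backed topics looks like, here is a minimal Python sketch. The topic name, the hard-coded Avro schema, and the libraries used (kafka-python and fastavro) are illustrative assumptions; the actual Schematizer and PaaStorm interfaces are Yelp-internal and differ from this.

    # Minimal sketch of consuming a schema-backed topic. The topic name, toy
    # schema, and choice of libraries are assumptions for illustration only;
    # this is not the PaaStorm or Schematizer API.
    import io

    import fastavro
    from kafka import KafkaConsumer

    # In the real pipeline the Schematizer owns the writer schema; a toy
    # schema is hard-coded here to keep the example self-contained.
    BUSINESS_SCHEMA = {
        "type": "record",
        "name": "business",
        "fields": [
            {"name": "id", "type": "long"},
            {"name": "name", "type": "string"},
            {"name": "review_count", "type": "int"},
        ],
    }

    consumer = KafkaConsumer(
        "refresh_primary.business",          # hypothetical topic name
        bootstrap_servers=["localhost:9092"],
        auto_offset_reset="earliest",
    )

    for message in consumer:
        # Each message value is an Avro-encoded row change; decoding it with
        # the writer schema yields a plain dict for downstream transforms.
        record = fastavro.schemaless_reader(io.BytesIO(message.value), BUSINESS_SCHEMA)
        print(record["id"], record["name"])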
  Let’s focus on the “Targets” box in our diagram of the Data Pipeline. We can see that transformed data has to make its way into target data stores before services can use it.
   
[Figure: Diagram of the Data Pipeline, with transformed data flowing into the "Targets" data stores]

  After abstracting much of the data transfer and transformation into the Data Pipeline infrastructure, we still have to connect data inside Kafka topics to the final data targets that services use. Each data target has its own idiosyncrasies and therefore requires a separate connector.
One of the most popular tools for doing analytics processing at Yelp is Redshift. It makes computation and aggregation easy with its column-based storage, and also scales well for performing complicated joins across tables with billions of rows. This makes Redshift a great data warehouse that analysts, data scientists, and engineers can use to interactively get answers to their complex data questions. This post will focus on a Redshift connector: a service that uses PaaStorm to read from Kafka and load data into a Redshift cluster.
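To make the shape of such a connector concrete, the sketch below batches decoded messages, stages them in S3, and issues a Redshift COPY through psycopg2. The bucket, table, cluster endpoint, and IAM role are placeholders, and the real connector (built on PaaStorm) handles far more, so treat this as a rough outline rather than the actual implementation.

    # Rough outline of the Kafka -> S3 -> Redshift COPY path. All names and
    # credentials below are placeholders; the real connector is built on
    # PaaStorm and is considerably more involved.
    import json

    import boto3
    import psycopg2

    def flush_batch_to_redshift(records, batch_id):
        """Stage a batch of dicts in S3 as JSON lines, then COPY it into Redshift."""
        s3 = boto3.client("s3")
        key = "redshift_connector/business/{}.json".format(batch_id)
        body = "\n".join(json.dumps(r) for r in records)
        s3.put_object(Bucket="example-staging-bucket", Key=key, Body=body.encode("utf-8"))

        conn = psycopg2.connect(
            host="example-cluster.redshift.amazonaws.com",
            port=5439, dbname="analytics", user="loader", password="...",
        )
        with conn, conn.cursor() as cur:
            cur.execute(
                """
                COPY business
                FROM 's3://example-staging-bucket/{key}'
                IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy'
                FORMAT AS JSON 'auto'
                """.format(key=key)
            )
        conn.close()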
  The Data Warehouse is dead. Long live the Data Warehouse.

  Our legacy ETL system served us for many years in moving data from our production application databases into our data warehouse. The diagram below outlines the high-level infrastructure of this system.

[Figure: High-level architecture of the legacy ETL system]

We had triggers on various tables in our MySQL databases listening for row changes. On each row change, we would write to a table changelog. A worker tailing this changelog would then create ETL tasks and put them on our job queue. Next, a pool of workers would perform transformations on each row according to custom logic written for each type of ETL. The output of these transformations would be written to MySQL, where they would sit temporarily before being dumped to S3 and finally COPY'd into Redshift.
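As a rough sketch of just the changelog-tailing step described above (the per-ETL transform logic, the MySQL staging table, and the S3/COPY stages are omitted, and the table, column, and queue names are invented for illustration):

    # Sketch of the legacy changelog tailer: poll the changelog table and
    # enqueue one ETL task per row change. Table and column names are
    # illustrative, not the actual schema.
    import time

    import pymysql

    def tail_changelog(enqueue_task, poll_interval=1.0):
        conn = pymysql.connect(host="localhost", user="etl", password="...", db="yelp")
        last_seen_id = 0
        while True:
            with conn.cursor(pymysql.cursors.DictCursor) as cur:
                cur.execute(
                    "SELECT id, table_name, row_id, change_type "
                    "FROM table_changelog WHERE id > %s ORDER BY id",
                    (last_seen_id,),
                )
                for row in cur.fetchall():
                    # Each changelog row becomes a task on the job queue, where
                    # a worker pool applies the per-table ETL transform.
                    enqueue_task(row["table_name"], row["row_id"], row["change_type"])
                    last_seen_id = row["id"]
            conn.commit()  # end the read snapshot so newer changelog rows become visible
            time.sleep(poll_interval)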
  This system served us well for getting data from our MySQL databases into Redshift and, despite the complexity, it was made robust through various alerts and tooling we had built over the years. However, this system suffered from major drawbacks which made it hard to scale.
One of the most pressing concerns was that it would take precious developer time to write custom ETLs and their associated unit tests, get them through code review, and push them into our monolithic code base. Furthermore, these code changes would often require performing schema migrations in both MySQL and Redshift, across development and production clusters, every time a field we cared about in the underlying source table changed.
  Worst of all was that this system was designed to support a single Redshift cluster (i.e. the data warehouse) which meant that each team that managed a Redshift cluster would have to build their own systems for ingesting the data they cared about. We had reached a tipping point and desperately needed a new system that would better scale with the needs of our growing company.
  The Redshift Connector: A New Hope