This is the sixth post in a series covering Yelp's real-time streaming data infrastructure. Our series explores in depth how we stream MySQL updates in real-time with an exactly-once guarantee, how we automatically track & migrate schemas, how we process and transform streams, and finally how we connect all of this into datastores like Redshift and Salesforce.
Read the posts in the series:
Billions of Messages a Day - Yelp's Real-time Data Pipeline
Streaming MySQL tables in real-time to Kafka
More Than Just a Schema Store
PaaStorm: A Streaming Processor
Data Pipeline: Salesforce Connector
The Yelp Data Pipeline gives developers a suite of tools to easily move data around the company. We have outlined three main components of the core Data Pipeline infrastructure so far. First, the MySQLStreamer replicates MySQL statements and publishes them into a stream of schema-backed Kafka topics. Second, the Schematizer provides a centralized source of truth about each of our Kafka topics. It persists the Avro schema used to encode the data in a particular topic, the owners of this data, and documentation about various fields. Finally, our stream processor, PaaStorm, makes it simple to consume from the Data Pipeline, perform transformations, and publish back into the Pipeline. Together, these tools can be used to get data to the people who care about it most.
Let’s focus on the “Targets” box in our diagram of the Data Pipeline. We can see that transformed data has to make its way into target data stores before services can use it.
After abstracting much of the data transfer and transformation into the Data Pipeline infrastructure, we still have to connect data inside Kafka topics to the final data targets that services use. Each data target has its own idiosyncrasies and therefore requires a separate connector.
One of the most popular tools for doing analytics processing at Yelp is Redshift. It makes computation and aggregation easy with its column-based storage, and also scales well for performing complicated joins across tables with billions of rows. This makes Redshift a great data warehouse that analysts, data scientists, and engineers can use to interactively get answers to their complex data questions. This post will focus on a Redshift connector: a service that uses PaaStorm to read from Kafka and load data into a Redshift cluster.
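Conceptually, such a connector consumes messages from a Kafka topic, stages them in batches, and bulk-loads them into Redshift. The sketch below is a hypothetical, heavily simplified illustration of that flow using generic off-the-shelf clients (kafka-python, boto3, psycopg2); the topic, bucket, table, cluster, and IAM role names are all made up, and the real connector is built on PaaStorm rather than raw clients, as the rest of this post describes.

```python
# Hypothetical sketch of a Kafka -> S3 -> Redshift flow; all names
# (topic, bucket, table, cluster, IAM role) are illustrative only.
import psycopg2                   # Redshift speaks the Postgres protocol
import boto3                      # used to stage batches in S3
from kafka import KafkaConsumer   # kafka-python client

consumer = KafkaConsumer("business_topic", bootstrap_servers="kafka:9092")
s3 = boto3.client("s3")
redshift = psycopg2.connect(
    host="example-cluster.redshift.amazonaws.com", port=5439,
    dbname="warehouse", user="loader", password="example-password",
)

batch = []
for message in consumer:
    batch.append(message.value)  # raw message bytes
    if len(batch) >= 1000:
        # Stage the batch in S3 as newline-delimited JSON...
        key = "staging/business_topic/batch-0001.json"
        s3.put_object(Bucket="example-bucket", Key=key,
                      Body=b"\n".join(batch))
        # ...then bulk-load the staged file into the target table.
        with redshift.cursor() as cursor:
            cursor.execute(
                "COPY business FROM 's3://example-bucket/" + key + "' "
                "IAM_ROLE 'arn:aws:iam::123456789012:role/example-copy-role' "
                "FORMAT AS JSON 'auto'"
            )
        redshift.commit()
        batch = []
```

Batched COPYs from S3 are the idiomatic way to load Redshift: row-by-row inserts perform poorly against its column-based storage, which is part of what makes a dedicated connector worth building.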
The Data Warehouse is dead. Long live the Data Warehouse.
Our legacy ETL system served us for many years in moving data from our production application databases into our data warehouse. The diagram below outlines the high-level infrastructure of this system.
We had triggers on various tables in our MySQL databases listening for row changes. On each row change, we would write to a table changelog. A worker tailing this changelog would then create ETL tasks and put them on our job queue. Next, a pool of workers would perform transformations on each row according to custom logic written for each type of ETL. The output of these transformations would be written to MySQL, where they would sit temporarily before being dumped to S3 and finally COPY'd into Redshift.
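To make that flow concrete, here is a minimal, hypothetical sketch of the changelog-tailing step. The table_changelog schema and the enqueue callback are made-up illustrations; the real worker relied on Yelp-internal tooling.

```python
# Hypothetical sketch of the legacy changelog-tailing worker; the
# table_changelog schema and enqueue callback are illustrative only.
import json
import time

import pymysql  # assumed MySQL client


def tail_changelog(connection, enqueue, last_id=0):
    """Poll the changelog table and turn each row change into an ETL task."""
    while True:
        with connection.cursor() as cursor:
            cursor.execute(
                "SELECT id, table_name, row_id, change_type "
                "FROM table_changelog WHERE id > %s ORDER BY id",
                (last_id,),
            )
            for change_id, table_name, row_id, change_type in cursor.fetchall():
                # A pool of workers picks up each task and applies the
                # per-table transformation logic before staging the result.
                enqueue(json.dumps({
                    "table": table_name,
                    "row_id": row_id,
                    "type": change_type,
                }))
                last_id = change_id
        time.sleep(1)  # simple polling; a real tailer would stream continuously
```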
This system served us well for getting data from our MySQL databases into Redshift and, despite the complexity, it was made robust through various alerts and tooling we had built over the years. However, this system suffered from major drawbacks that made it hard to scale.
One of the most pressing concerns was that each custom ETL took precious developer time: writing the transformation and its associated unit tests, getting through code review, and pushing the change into our monolithic code base. Furthermore, these code changes would often require performing schema migrations in both MySQL and Redshift, across development and production clusters, every time a field we cared about in the underlying source table changed.
Worst of all, this system was designed to support a single Redshift cluster (i.e. the data warehouse), which meant that each team managing its own Redshift cluster had to build its own system for ingesting the data it cared about. We had reached a tipping point and desperately needed a new system that would better scale with the needs of our growing company.
The Redshift Connector: A New Hope