Highlights from the “Real World Change Data Capture” Podcast (Part 1)
Datacoral founder and CEO Raghu Murthy appeared on episode 177 of the Data Engineering Podcast with host Tobias Macey in March 2021. The discussion focused primarily on Change Data Capture (CDC) as a methodology for data integrations, drawing attention to some of the challenges of CDC integrations and drawbacks of non-CDC pipelines, such as incremental pulls and snapshot pipelines.
The 50-minute podcast provides a quick, high-level overview of how Raghu and Datacoral view the changing world of data integrations. If you’d like to go deeper, we have written in much greater detail about data connectors, Change Data Capture, and using metadata to future-proof your data stack. Raghu’s perspective is noteworthy given his extensive data management experience with Facebook, Yahoo!, and other prevalent data-driven enterprises.
The content below is based directly on excerpts from the podcast and has been lightly edited for readability.
What is Change Data Capture and when is it useful in data integrations?
CDC, or Change Data Capture, is a solution for a very specific problem in the context of analytics. In general, production databases like Oracle, MySQL, and Postgres are used to build applications. They are transaction processing systems, so they allow applications to make changes to the data and also serve that data. For purposes of disaster recovery or high availability, these databases maintain their own replication logs that allow replicas to be created from the production databases. Those change logs are often made available so that other clients can read the changes and perform other operations on them.
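To make the idea concrete, here is a minimal sketch of what a change log looks like and how a downstream consumer can replay it into a copy of the table. The event shapes and field names are illustrative, not any particular database's wire format:

```python
# Replay a database change log (CDC events) into a downstream copy.
# Event shapes here are illustrative, not a real replication protocol.

def apply_change(replica, event):
    """Apply one insert/update/delete change event to a table modeled
    as a dict keyed by primary key, as a CDC consumer would."""
    op, key = event["op"], event["key"]
    if op == "insert":
        replica[key] = dict(event["row"])
    elif op == "update":
        replica[key].update(event["changes"])
    elif op == "delete":
        replica.pop(key, None)
    return replica

# An ordered change log, as a replication slot might expose it.
change_log = [
    {"op": "insert", "key": 1, "row": {"id": 1, "plan": "free"}},
    {"op": "update", "key": 1, "changes": {"plan": "pro"}},
    {"op": "insert", "key": 2, "row": {"id": 2, "plan": "free"}},
    {"op": "delete", "key": 2},
]

replica = {}
for event in change_log:
    apply_change(replica, event)

print(replica)  # {1: {'id': 1, 'plan': 'pro'}}
```

Replaying the events in order reproduces the source table exactly, which is what lets a warehouse copy stay faithful without ever querying the production database directly.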
In the case of analytics, the data in your production databases is probably the most valuable data in your data warehouse. This is the data that your application is using. It could contain transactions for billing purposes, information about how your product is being used, or any number of use cases. But if an analyst wants to query that data, they can't run queries against the production database, because those queries would overload it.
[Learn more about why queries can overload production databases in our Change Data Capture 101 piece.]
See how data changes over time with Change Data Capture
There is another aspect of understanding the data that doesn't get talked about that much: understanding the data requires you not only to analyze it at a point in time, but also to have a sense of how it is changing over time. That is what gives you a true understanding of what the data means. The source of truth for the data is your application and its business logic. The application is making changes to the production database, so as an analyst, to truly understand what is happening with that data, you also want to know how it is changing.
CDC is the perfect solution when you weigh the risk of running analysis on the production database against the need to understand how the data is changing. You get a reliable log of all the changes happening in your data and put it into a different system for analysis, rather than trying to do the analysis in the production database itself.
That combination is what makes CDC really interesting for analytics.
Change Data Capture sounds similar to event-driven software architecture. Is that the same idea?
Yes, absolutely. But building applications that respond to and generate a queue of events is actually pretty hard. Most application developers don't really do that. It takes a lot of scaffolding to build applications using an event-driven software architecture. Typically, applications make changes directly to the database rather than creating a queue of events.
The good news is that the databases themselves are producing that queue of change events on the “other end”. So, CDC is a way to set up analytics as if the applications are event-driven!
Common mistakes and misconceptions about Change Data Capture
We talk about this with our customers a lot, especially ones that already have a solution or want to build their own. CDC is a hard problem to solve. There are a lot of challenges you have to deal with when working with a CDC solution.
CDC Challenge #1: Data security
The first challenge is data security. Do you want to expose your production databases to the internet? If you’re okay with that, then maybe a SaaS solution could work for you. Otherwise you have to build something in-house within your environment.
CDC Challenge #2: Setup
The second challenge is around the initial setup itself. CDC only gives you the ability to fetch changes from the time you've deployed the connector or enabled CDC. So there's a lot of historical data you need to synchronize to your data warehouse first. Then there's the handoff from the historical sync into Change Data Capture, where you're fetching changes.
For these historical syncs themselves, there are a lot of challenges. How do you make sure you don't overwhelm the source system when you're doing historical syncs? How do you make sure you're doing them fast? Do you have enough intelligence to figure out how much concurrency the source systems allow, so that you get the necessary throughput without overwhelming the source systems?
[Read our approach to fast historical syncs for more information.]
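The historical sync and handoff described above can be sketched as follows. This is a simulation, not Datacoral's implementation: the source table is an in-memory list, the chunk size and concurrency cap are illustrative, and the log position is a stand-in for a real log sequence number:

```python
# Sketch of a historical sync followed by a CDC handoff. The source is
# simulated; concurrency is capped so the backfill can't overwhelm it.
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENCY = 4   # tune to what the source database can absorb
CHUNK_SIZE = 1000     # rows fetched per chunk

def fetch_chunk(source_rows, offset):
    """Fetch one key-range chunk from the (simulated) source table."""
    return source_rows[offset:offset + CHUNK_SIZE]

def historical_sync(source_rows):
    """Backfill the full table in chunks with bounded concurrency."""
    offsets = range(0, len(source_rows), CHUNK_SIZE)
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENCY) as pool:
        chunks = pool.map(lambda o: fetch_chunk(source_rows, o), offsets)
        return [row for chunk in chunks for row in chunk]

# Handoff: record the change-log position *before* snapshotting, then
# replay changes from that position afterward so nothing falls in the gap.
source = [{"id": i} for i in range(2500)]
start_position = 107  # illustrative log sequence number captured up front
warehouse = historical_sync(source)
# ...the CDC reader then resumes from start_position...
print(len(warehouse))  # 2500
```

Capturing the log position before the snapshot starts is the key detail: changes made during the backfill are replayed afterward instead of being lost between the two phases.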
CDC Challenge #3: Maintenance
The next challenge is maintaining CDC on an ongoing basis. How is your CDC solution performing? There are cases where you see replication lag, either because the CDC system itself is slow or because you're fetching the changes from a replica that is itself lagging. How do you deal with that replica lag? How do you deal with cases where a row may have been updated, but not all columns have been updated accurately? Then it's not a high-fidelity copy. Are you able to do data quality checks using things like checksums to make sure that the copy you have in the data warehouse is a high-fidelity copy of the source?
If your data volumes grow, is your CDC solution able to auto-scale or do you have to reprovision or stop your replication and restart it? Even there, when you stop and restart, do you end up losing data? What happens when, in the case of Postgres CDC, you drop the replication slot? Then when you bring it up again and there’s changes that have been lost, how do you handle those?
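One way to handle the dropped-slot scenario above is to detect the gap explicitly on restart. The sketch below uses plain integers as stand-ins for log sequence numbers (LSNs); the function name and positions are illustrative:

```python
# Sketch of gap detection after a replication restart. Positions are
# illustrative integers standing in for log sequence numbers (LSNs).

def needs_resync(last_applied_pos, earliest_available_pos):
    """If the log no longer reaches back to the position we last applied
    (e.g. a dropped Postgres replication slot let the WAL be recycled),
    changes were lost and a fresh historical sync is required."""
    return earliest_available_pos > last_applied_pos + 1

# The warehouse last applied change 500, but after the slot was dropped
# and recreated the oldest retained change is 740: a gap, so re-sync.
print(needs_resync(500, 740))  # True
print(needs_resync(500, 501))  # False: log resumes where we left off
```

The safe answer to a detected gap is usually a new historical sync for the affected tables, since there is no way to reconstruct the lost changes from the log alone.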
CDC Challenge #4: Data quality
This specific problem itself around data quality is actually pretty tricky. You know that you want a high-fidelity copy, but the source data is also changing. If the source is changing, you don’t know at what point in time you have to perform the check because there might be a bunch of changes that have not yet been applied in the destination. How do you make sure that what’s in the destination is an accurate representation of what was in the source a few seconds or minutes ago? Being able to do that in a consistent and reliable way is a pretty hard problem.
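One common way to make this check well-defined while the source keeps changing is to compare checksums of both sides as of the same change-log position, rather than comparing live tables. A minimal sketch, with illustrative table contents and an invented log position:

```python
# Sketch of a checksum-based fidelity check. Comparing live tables is
# unreliable while the source keeps changing, so both sides are hashed
# as of the same change-log position. All names here are illustrative.
import hashlib

def table_checksum(rows):
    """Order-independent checksum over a table's rows."""
    digests = sorted(
        hashlib.sha256(repr(sorted(r.items())).encode()).hexdigest()
        for r in rows
    )
    return hashlib.sha256("".join(digests).encode()).hexdigest()

# Source state as of log position 42, and the warehouse copy after
# applying every change up to that same position.
source_at_pos_42 = [{"id": 1, "plan": "pro"}, {"id": 2, "plan": "free"}]
warehouse_at_pos_42 = [{"id": 2, "plan": "free"}, {"id": 1, "plan": "pro"}]

match = table_checksum(source_at_pos_42) == table_checksum(warehouse_at_pos_42)
print(match)  # True: same rows, so the copy is high fidelity at position 42
```

Because the checksum is order-independent, the two sides can store rows in any physical order; a mismatch at an agreed-upon log position is what signals a fidelity problem.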
You can go on and on providing solutions to these challenges. CDC elevates data integration to a level it hasn't reached so far: providing a transactional guarantee that all the data in the warehouse is a reliable copy of what was in the source.
[For solutions to these challenges, read The 3 Things to Keep in Mind While Building the Modern Data Stack.]
Can Change Data Capture data connectors work with APIs and third-party data sources?
Change Data Capture is just another way of bringing in data into a data warehouse. In our metadata-first approach, the CDC connectors generate the same metadata as any other connectors about how fresh the data is.
When transformations are defined, their data input dependencies can be inferred. The only time a transformation's output can change is when its underlying data dependencies change. So a robust way of running transformations is to synchronize them with changes to their underlying data inputs.
With metadata readily available from the connectors, it is easy to run the transformations only when their inputs are updated by connectors. For example, if the connectors are bringing in data every hour and the transformation is a daily report, it is easy to automatically figure out that the daily transformation has to wait until all 24 hours of connector data have arrived.
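The hourly-to-daily dependency check above can be sketched in a few lines. The metadata structure is illustrative; a real system would read partition arrival times from the connectors' metadata store:

```python
# Sketch of metadata-driven scheduling: a daily transformation runs only
# once all 24 hourly partitions for that day have landed.
from datetime import date, datetime, timedelta

def hourly_partitions(day):
    """All 24 hourly partition timestamps for a given day."""
    start = datetime(day.year, day.month, day.day)
    return {start + timedelta(hours=h) for h in range(24)}

def ready_to_run(day, landed_partitions):
    """True once connector metadata shows every hour of `day` arrived."""
    return hourly_partitions(day) <= set(landed_partitions)

day = date(2021, 3, 1)
landed = [datetime(2021, 3, 1) + timedelta(hours=h) for h in range(23)]
print(ready_to_run(day, landed))   # False: hour 23 still missing
landed.append(datetime(2021, 3, 1, 23))
print(ready_to_run(day, landed))   # True: all 24 hours have arrived
```

The same check generalizes to any cadence mismatch between a connector and a transformation: the transformation's schedule is derived from its inputs' freshness metadata rather than hand-coded.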
We are seeing some APIs offering CDC-like capabilities. Stripe is a really good example. They have APIs to fetch data from different places. They also offer an event stream of every single change that has happened across all of the different objects within Stripe.
There’s a lot of movement towards this way of doing CDC for all kinds of applications and services. It correlates with business requirements and expectations. If there are more and more analysts saying, “Okay, I want consistent data. I want to be able to do analysis or even build data applications with really consistent data across different services,” there may be a push toward these kinds of source systems to also offer a way to get these replication logs.
Here's a quick example of how this might typically play out. Let's say data is coming in from Salesforce every day and we are using a data integration tool like Fivetran. You also have a CDC connector pulling data in a continuous way but updating the data warehouse every hour. When you want to run a daily report, you want to make sure the entire day's data has arrived from Salesforce via the Salesforce connector. You'll also want all 24 hourly CDC loads for that day to have been applied to the data warehouse before your reporting runs.
This synchronization is typically hand-coded in data pipelines using workflow managers like Airflow. In Datacoral's case, we started with a metadata-first approach to orchestration. We automatically wait for all of the relevant data for a given day to arrive before running that day's transformation, without any additional programming.
We work with other tools and have our own connectors, too. We have about 80 connectors that are pulling from APIs, file systems, and databases, in addition to our CDC connectors. All of these connectors generate standard metadata about data freshness – that’s how we are able to synchronize transformations. We make the most of our shared metadata layer.
Going deeper with Change Data Capture
This is part one of a two-part summary of Raghu’s Data Engineering Podcast appearance. In our next post we will dive deeper into ways you can integrate CDC into your data stack and improve your data flows. Raghu will share alternatives to CDC, what to consider when deciding on CDC, the moving pieces in a CDC workflow, performance challenges, and when not to use CDC.