Highlights from the “Real World Change Data Capture” Podcast (Part 2)
In our previous article, What Is Change Data Capture (CDC) Part 1, we highlighted several Change Data Capture-related questions and answers from episode 177 of the Data Engineering Podcast. Tobias Macey hosted Datacoral’s founder and CEO, Raghu Murthy, and spent 50 minutes discussing Datacoral’s real-world usage of Change Data Capture data pipelines.
In this article, we go a little deeper with CDC by covering:
- Alternatives to CDC and use cases for more batch-oriented approaches
- What to consider before you implement CDC and the barriers to entry
- Shortcomings and benefits of off-the-shelf CDC tools
We have written in much greater detail about data connectors, Change Data Capture, and using metadata to future-proof your data stack.
The content below is based directly on excerpts from the podcast and has been lightly edited for readability.
Recap: When is Change Data Capture useful?
CDC, or Change Data Capture, is a solution for a very specific problem in the context of analytics. It provides a complete picture of all data changes within a source database, including hard deletes. It differs from snapshots or incremental pulls in that every single change, no matter how big or small, is recorded in a change log, which the database typically uses for replication. That log is the source from which CDC connectors pull data and load it into a data warehouse or a data lake.
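To make the idea of replaying a change log concrete, here is a minimal Python sketch. The event shape (`op`, `key`, `row`) and the function name `apply_change_log` are assumptions made for illustration; real change logs differ by database. It only shows why a log-based copy can reflect hard deletes.

```python
from typing import Any


def apply_change_log(replica: dict[int, dict[str, Any]],
                     changes: list[dict[str, Any]]) -> dict[int, dict[str, Any]]:
    """Replay change events, in order, onto an in-memory copy of a table."""
    for change in changes:
        op, key = change["op"], change["key"]
        if op in ("insert", "update"):
            # Hypothetical format: each event carries the full new row.
            replica[key] = change["row"]
        elif op == "delete":
            # A hard delete in the source removes the row from the copy too.
            replica.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return replica


# Hypothetical change log for a small "users" table.
change_log = [
    {"op": "insert", "key": 1, "row": {"name": "Ada", "plan": "free"}},
    {"op": "insert", "key": 2, "row": {"name": "Grace", "plan": "free"}},
    {"op": "update", "key": 1, "row": {"name": "Ada", "plan": "pro"}},
    {"op": "delete", "key": 2},
]

replica = apply_change_log({}, change_log)
print(replica)
```

After the replay, the copy holds only the updated row for key 1, because the delete of key 2 was part of the log.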
Datacoral uses CDC augmented by a powerful metadata layer to provide industry-leading capabilities with end-to-end pipeline visibility, automated data lineage awareness, and on-the-fly data schema updates.
What are the alternatives to Change Data Capture?
In the context of database integrations, the alternatives to CDC are snapshot fetches or incremental fetches. Both are done by running SQL queries on the source database that fetch batches of data periodically.
In the snapshot data integration method, all of the data from the source is fetched periodically. This data would then be loaded to replace the previous data in the destination warehouse. Snapshot data integration is the easiest to understand and also the simplest to build. But it can get expensive as the data volume increases.
In the incremental data integration method, only modified rows are fetched periodically. These modified rows are then applied to the destination warehouse. Incremental data integration is more efficient than snapshot, but it has some drawbacks. For example, deletes are not propagated to the destination.
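The difference between the two methods can be sketched with an in-memory SQLite database standing in for the source. The table, the `updated_at` column used as a watermark, and the function names are all assumptions for illustration. The sketch shows that an incremental fetch picks up the modified row but never learns about the deleted one, while a fresh snapshot does.

```python
import sqlite3

# In-memory SQLite database standing in for the source (an assumption).
src = sqlite3.connect(":memory:")
src.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
src.executemany(
    "INSERT INTO users VALUES (?, ?, ?)", [(1, "Ada", 100), (2, "Grace", 100)]
)


def snapshot_fetch(conn):
    """Fetch every row; the result replaces the previous copy entirely."""
    query = "SELECT id, name, updated_at FROM users"
    return {row[0]: row for row in conn.execute(query)}


def incremental_fetch(conn, destination, watermark):
    """Fetch only rows modified after the watermark and apply them."""
    query = "SELECT id, name, updated_at FROM users WHERE updated_at > ?"
    rows = conn.execute(query, (watermark,)).fetchall()
    for row in rows:
        destination[row[0]] = row
    # The new watermark is the latest modification time seen so far.
    return max([watermark] + [row[2] for row in rows])


incremental_copy = {}
watermark = incremental_fetch(src, incremental_copy, 0)

# Between fetches: one row is updated, another is hard-deleted.
src.execute("UPDATE users SET name = 'Ada L.', updated_at = 200 WHERE id = 1")
src.execute("DELETE FROM users WHERE id = 2")

watermark = incremental_fetch(src, incremental_copy, watermark)
snapshot_copy = snapshot_fetch(src)

print(sorted(incremental_copy))  # ids in the incremental copy
print(sorted(snapshot_copy))     # ids in the snapshot copy
```

The incremental copy still contains the row with id 2, since a deleted row can no longer match the `updated_at` filter.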
Both snapshot and incremental data integrations can only be done in batch mode and are run periodically. This means that intermediate changes made between the periodic fetches are never propagated to the destination warehouse: if a row changes several times between two fetches, only its latest state is captured.
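A small Python sketch of that gap, under stated assumptions: the `Source` class below is hypothetical and keeps both the current rows, which is what a periodic query sees, and an ordered log of every write, which is what a CDC connector would read.

```python
class Source:
    """Toy source that keeps current state plus an ordered log of writes."""

    def __init__(self):
        self.rows = {}
        self.log = []

    def write(self, key, value):
        self.rows[key] = value
        self.log.append((key, value))


source = Source()
polled_states = []

source.write("order-1", "created")
polled_states.append(source.rows["order-1"])  # first periodic fetch

# Two changes happen before the next fetch runs.
source.write("order-1", "paid")
source.write("order-1", "refunded")
polled_states.append(source.rows["order-1"])  # second periodic fetch

logged_states = [value for key, value in source.log if key == "order-1"]

print(polled_states)  # states seen by periodic fetches
print(logged_states)  # states recorded in the change log
```

The periodic fetches never observe the "paid" state, while the log records all three.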
What are the cases where a batch-oriented approach would be preferable?
A batch-oriented data integration approach means that data records fetched from the source are grouped into batches before being applied to the destination warehouse. So, while snapshot, incremental, or CDC data integrations are about *what* data is fetched and/or applied to the destination warehouse, batch-oriented vs. real-time stream-oriented approaches indicate *how* the data is fetched and/or applied to the destination warehouse.
In the case of CDC, even though you are fetching a stream of changes, you can batch them up to load them efficiently and update tables inside a warehouse. CDC used to be really expensive because it was hard to get right, and there were no pay-as-you-go cloud native offerings that were easy to set up. But with solutions like Datacoral that are really easy to set up, robust, and pay-as-you-go, the only reason nowadays not to use CDC for data integrations is if the source does not support it.
The added advantage of CDC is that it can support real-time streaming, unlike snapshot or incremental ways of fetching data. Snapshot and incremental are typically only batch-oriented and don’t have the option of being real-time streaming.
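As a rough illustration of batching a change stream before loading, the Python sketch below groups changes into micro-batches and compacts each batch so that only the last change per key is applied. The event format and the names `micro_batches` and `compact_batch` are assumptions, and it presumes each event carries the full row. It is not a description of how Datacoral or any specific tool does this.

```python
def micro_batches(stream, size):
    """Group an ordered stream of changes into lists of at most `size`."""
    batch = []
    for change in stream:
        batch.append(change)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def compact_batch(changes):
    """Keep only the last change per key, split into upserts and deletes."""
    latest = {}
    for change in changes:
        latest[change["key"]] = change  # a later change overrides earlier ones
    upserts = {k: c["row"] for k, c in latest.items() if c["op"] != "delete"}
    deletes = [k for k, c in latest.items() if c["op"] == "delete"]
    return upserts, deletes


# Hypothetical stream of changes, in commit order.
stream = [
    {"op": "insert", "key": 1, "row": {"status": "created"}},
    {"op": "update", "key": 1, "row": {"status": "paid"}},
    {"op": "insert", "key": 2, "row": {"status": "created"}},
    {"op": "delete", "key": 2},
    {"op": "update", "key": 1, "row": {"status": "shipped"}},
]

warehouse = {}
batch_count = 0
for batch in micro_batches(stream, size=3):
    upserts, deletes = compact_batch(batch)
    warehouse.update(upserts)      # one bulk upsert per batch
    for key in deletes:
        warehouse.pop(key, None)   # one bulk delete per batch
    batch_count += 1

print(batch_count, warehouse)
```

Compaction trades away the intermediate states within a batch in exchange for fewer writes to the warehouse; keeping the raw changes in a separate history table is one way to retain both.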
What factors should you consider for a Change Data Capture data integration?
There are three factors to consider when deciding whether CDC is the right solution for you:
- Whether the data source supports CDC
- Whether the data is valuable enough to justify CDC
- Whether CDC is a known and understood option
Does the data source support Change Data Capture?
It all starts with whether a data source actually can support Change Data Capture. Most modern databases allow you to replicate changes from the database to a different system. You get change logs that you can read, which is great for replication, but you need to make changes to the source. The data source needs to be configured to support replication. If you don’t have access to the database or enough permissions on the database to enable replication, then CDC is not even an option.
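What "configured to support replication" means depends on the database. As one assumed example, PostgreSQL's logical decoding requires `wal_level` to be `logical` and needs at least one replication slot and one WAL sender available. The Python helper below is a hypothetical pre-flight check over settings you have already read from the source (for instance with `SHOW wal_level`); the function name and input dictionary are illustrative only.

```python
def cdc_readiness_problems(settings):
    """Return a list of reasons why log-based CDC would not work yet.

    `settings` maps PostgreSQL parameter names to their current values,
    as strings, the way the server reports them.
    """
    problems = []
    if settings.get("wal_level") != "logical":
        problems.append("wal_level must be 'logical'")
    if int(settings.get("max_replication_slots", 0)) < 1:
        problems.append("max_replication_slots must be at least 1")
    if int(settings.get("max_wal_senders", 0)) < 1:
        problems.append("max_wal_senders must be at least 1")
    return problems


# Example values for a server that has not been set up for logical decoding.
current = {
    "wal_level": "replica",
    "max_replication_slots": "10",
    "max_wal_senders": "10",
}

print(cdc_readiness_problems(current))
```

Changing `wal_level` requires a server restart and administrative access, which is exactly the kind of permission barrier described above.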
Is the data valuable enough to justify Change Data Capture?
Secondly, your data may not be valuable enough to justify investing in CDC. If you have some configuration data sitting somewhere and you just want a copy of it, you don’t really need highly reliable, high-fidelity copies in your data warehouse, so there is no need to invest in CDC. You can run something very simple that reads the data every so often and then applies changes to the warehouse. That said, the cost of CDC is going down rapidly, so this is becoming less of a reason.
Is Change Data Capture a known and understood option?
The third barrier to entry is awareness: you may not know that CDC is an option, or what CDC is at all. Hopefully, people are learning about CDC as more people talk about it.
CDC actually raises the bar for what a good data integration is. If all data integration could be done as CDC in a simple way, then everybody would want it because it gives you really reliable copies. You can see not only point-in-time snapshots but also exactly what has changed. Doing it well and correctly is complex, though, and requires much more than “set it and forget it.”
Beyond that, the technology has to become easier to use. Even at Datacoral, we have built a package to do CDC, but many parts have to come together beyond the CDC connector itself to have a solution that works reliably. I think over time, both of these are going to improve: One, the technology will become a lot easier to use, with less friction, and two, more and more people will learn about it and want it. CDC is definitely, in our minds at least, a better way to do data integration than what has been happening so far.
What about Change Data Capture shortcomings and off-the-shelf tooling?
There are multiple CDC-specific solutions available on the market. I broadly classify them into three different categories.
The first category is the data integration providers who also offer CDC connectors. These connectors can handle the basic functionality of shipping the change logs from the source and applying them to the destination, but they’re not fully featured. They may not support all the kinds of changes that need to be handled, especially schema changes. They may put too much load on the source systems, or they may not be able to provide data quality guarantees. They can still handle the basics of a CDC data integration.
The second category is traditional CDC providers, starting with Oracle GoldenGate. They are really sophisticated and provide pretty much everything that you want from a CDC solution. They have been around for a very long time, and they started out as monoliths. But they’re a pretty big, heavy lift to integrate into your overall stack and operate. And it also turns out that they’re incredibly expensive.
The third category is the many homegrown solutions in use. Some are put together by teams themselves using open source technologies like Debezium. Others, like ours, come from people who have taken end-to-end data engineering platforms and built a CDC solution on top of them.
Datacoral began by looking like the first category, which is a connector that can be added with a few clicks. Over time, our platform and connectors have evolved significantly and now we are much closer to a fully-featured CDC solution, specifically around data integration.
Going Deeper with Change Data Capture
The Datacoral blog has several articles containing in-depth technical analysis of Change Data Capture and other data connector types. Our Datacoral platform documentation is also freely available with extensive insights into what our platform does and how it works. We provide more than 80 connectors for APIs, databases, events, and file systems, allowing for fast and easy configuration.
Datacoral offers unparalleled observability out of the box, with end-to-end fully managed pipelines and metadata-driven orchestration. Getting started with Datacoral is easy. You can request more information on our website, or sign up for a 30-day free trial of our product.