Let’s pretend you work at a small identity protection startup and you’re getting acquired by a Fortune 100 corporation. You and 120 of your peers are waiting on huge acquisition checks, the champagne is on order, and there’s palpable, anxious excitement in the air. Morale is high and you’re all ready to join the big leagues. There are nervous smiles everywhere you look and you’re waiting to go on a hiring spree to start scaling. You’re feeling good…
Until an email comes in from the acquiring corporation’s security and compliance group. You’re being asked some intense questions for acquisition due diligence. They are looking to develop comprehensive histories of user activity on your product, ensure revenue was recognized at the right time, align development work ticket documentation with product releases, understand the impact of releases and changes to the platform, and more. In short: They want a complete picture of your company’s data from start to finish.
Is your data stack able to deliver any of this? And if so, can you guarantee the data will be accurate, fresh, and complete? Will your company be able to get through due diligence and make all your colleagues’ dreams come true?
For many companies, the answer is either “no” or “it’s complicated.” Even modern data stacks built with the most popular tools can have trouble delivering such a comprehensive recollection of events. User data can change quickly and many data connectors cannot provide a complete picture of what happens to a piece of data over time. Why?
Modern companies have data shared across dozens, if not hundreds, of services. Each service comes with its own data store, requiring a customized data connector with its own level of sophistication. Some connectors are only as capable as the service’s API and others work by asking for the latest data snapshot since the prior request. Even the most sophisticated incremental data checks either miss hard deletes or have expensive retrospective ways of capturing the complete picture. When assembling dozens (or hundreds) of these connectors to pipe data into a warehouse, infrastructural and programmatic complexity increases exponentially. And there’s still no guarantee of data accuracy, freshness, and completeness.
The good news is there is a viable, time-tested solution to this problem: Change Data Capture (CDC). Let’s explore how it works, what problems it solves, why it’s hard (even deceptively hard), and how Datacoral helps its customers use CDC to have a complete picture of their data’s dynamic narrative. We’re going to assume you’ve at least skimmed our Data Connectors 101 and our CDC Overview articles.
What is CDC?
Change Data Capture is a way of replicating data changes from databases such as MySQL and PostgreSQL by reading the database’s change log files rather than querying the tables themselves. Unlike their non-CDC counterparts, CDC connectors don’t run heavy SQL queries on the database itself. As a result, CDC doesn’t put much burden on production databases. These connectors also capture more holistic data due to their use of change log files.
Change logs are rather self-explanatory. They are log files containing descriptions of every data change within a database. When a user adds an item to a shopping cart, that event gets written to a log. When that user removes the item from the cart, that event gets written to the same log. CDC will capture both the addition and deletion of the item from the cart. The net result is nothing in the cart, right? Well, a non-CDC connector could miss the shopping cart event altogether if snapshots are taken before the item is added and after it’s deleted.
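The shopping cart scenario can be sketched in a few lines of Python. The event format below is hypothetical (every real change log has its own encoding), but it shows why replaying the log preserves activity that snapshots miss:

```python
# Minimal sketch of a change log for the shopping-cart example.
# The event shape (op, table, pk, row) is hypothetical, for illustration only.
change_log = [
    {"op": "insert", "table": "cart_items", "pk": 42, "row": {"item": "headphones"}},
    {"op": "delete", "table": "cart_items", "pk": 42},
]

def replay(log):
    """Rebuild table state by applying each logged change in order."""
    state = {}
    for event in log:
        if event["op"] in ("insert", "update"):
            state[event["pk"]] = event["row"]
        elif event["op"] == "delete":
            state.pop(event["pk"], None)
    return state

# Snapshots taken before the insert and after the delete both see an empty
# cart -- the add/remove activity is invisible to them.
final_state = replay(change_log)
assert final_state == {}          # net result: nothing in the cart

# CDC, by contrast, delivers both events downstream.
events_captured = len(change_log)  # the insert *and* the delete
```

Replaying the log yields the same final state a snapshot would see, but the log itself still contains both intermediate events for analysis.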
Figure: A Change Data Capture (CDC) connector in action
Having a deep and valuable understanding of your most important data requires visibility into every change made to every data point across the entire organization, not just the state of the data when a snapshot is requested. This requires a high degree of functionality and coordination in the data stack and CDC is the best technique to accomplish this for your production databases.
Whereas an incremental connector runs a query that says, “Hey MySQL, give me the latest version of the data since the last time I asked,” a CDC pipeline asks the database, “What are all the changes since the last time I was here?”
This subtle difference in the requests (“latest data” vs. “what exactly has changed”) has enormous implications. Let’s explore those implications, how CDC can make a difference in your current data stack, and how we at Datacoral use CDC to enable our customers to build industry-leading data analytics capabilities.
What problems does CDC solve?
Log-based CDC is powerful, time-tested technology. It’s as efficient as it is reliable and solves many problems that are inherent to snapshot-based pipelines. It is a capability built into every major database offering and has been used extensively to fulfill data replication needs in disaster recovery and data warehouse implementations.
We’re going to discuss three main problems that CDC solves:
- Capturing complete datasets containing every data change
- High efficiency replication for large datasets
- Handling deleted data for tables of all sizes
Capturing Complete Datasets
The most important feature of CDC is that it can capture every data change within a database. Most non-CDC connectors cannot and simply rely on snapshots of data at the time of extraction.
Though snapshot pipelines are easier to build and deploy, they have some real downsides. First, they generate more load on a production database. Snapshots are essentially `select * from production_data` queries. Any database administrator can tell you how risky and expensive such queries are, especially when there is a lot of activity hitting the production database. Though every startup wants to go viral and suddenly have thousands of new users every hour, running a `select *` query during this peak usage period can easily take down the very service all those new customers want to use. Expensive operations plus in-demand production databases equals a recipe for potential disaster.
Another benefit CDC provides regarding data completeness is the auditability of the data. Snapshots don’t provide audit-quality data. You cannot trace everything that happened to a data point over time with snapshots, even with frequent snapshots. Even if snapshots are configured to run at an optimized frequency, there’s no guarantee of capturing 100% of the data changes. In domains such as healthcare or finance, or when you are reporting on critical revenue, invoicing and customer behavior data, your team might have requirements to show the exact ways in which a particular data record has changed over time. Only CDC provides such guarantees.
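As a rough illustration of what audit-quality history looks like, here is a minimal Python sketch (with a hypothetical event shape) that reconstructs every change ever made to a single record from a stream of CDC events:

```python
# Hedged sketch: building an audit trail for one record from CDC events.
# The event shape (ts, op, pk, row) is hypothetical, not a vendor format.
events = [
    {"ts": "2021-03-01T09:00", "op": "insert", "pk": 7, "row": {"status": "draft"}},
    {"ts": "2021-03-01T09:05", "op": "update", "pk": 7, "row": {"status": "sent"}},
    {"ts": "2021-03-01T09:30", "op": "update", "pk": 7, "row": {"status": "paid"}},
]

def audit_trail(events, pk):
    """Return every change made to one record, in order."""
    return [(e["ts"], e["op"], e.get("row")) for e in events if e["pk"] == pk]

history = audit_trail(events, pk=7)
# Every intermediate status (draft -> sent -> paid) is preserved, whereas an
# hourly snapshot would have recorded only the final "paid" state.
```

This is the guarantee snapshots cannot make: the full lineage of a record, not just its latest value.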
High Efficiency Replication for Large Datasets
Dealing with datasets of tens of millions of rows (or more) can be taxing. Not only are there computational costs to unoptimized queries, particularly on unindexed columns, but there are significant operational costs with potential cascading impacts. Databases of that size are often sources of truth for complex application architectures. One faulty or expensive query can cause problems that extend well beyond the query execution itself with the potential to bring down multiple production systems. And, believe it or not, there are a lot of data connectors that hit production databases directly.
Snapshots become impractical once tables reach tens of millions of rows, and even incremental data fetches strain at that scale. I described this situation in my Data Connectors 101 piece, but let’s go over it quickly here.
Snapshot and incremental queries are constructed along these lines: `select * from production_table`, filtered to anything modified since the last query ran. These SQL queries put a lot of load on the database. For architectures with demanding response-time requirements (such as financial transactions), slowing anything down is intolerable.
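For concreteness, here is roughly how an incremental connector builds that request. The table and column names are hypothetical, and a production connector would use bind parameters rather than string formatting, but the pattern is the point: the filter still runs against the production table itself.

```python
# Hedged sketch of how an incremental connector typically builds its query.
# Table/column names are hypothetical; real connectors use bind parameters.
def incremental_query(table, cursor_column, last_watermark):
    """Construct the 'give me everything since last time' query.

    This still scans and filters the production table itself, so the load
    lands on the production database -- unlike CDC, which reads log files.
    """
    return (
        f"SELECT * FROM {table} "
        f"WHERE {cursor_column} > '{last_watermark}' "
        f"ORDER BY {cursor_column}"
    )

sql = incremental_query("orders", "updated_at", "2021-03-01 00:00:00")
```

If `updated_at` isn’t indexed, or if rows are hard-deleted, this query is both expensive and incomplete.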
CDC, on the other hand, is efficient precisely because it never queries production tables. It uses the change logs produced natively by the database itself: MySQL, Postgres, and other major databases are built to produce these logs with ease, and such logs are already common in large data architectures, where they replicate data to disaster-recovery systems and other destinations. By depending entirely on log files, CDC does not affect database performance and does not pose risks to production.
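To make this concrete, the sketch below consumes change events shaped like the JSON emitted by Postgres’s wal2json logical-decoding plugin (simplified here, and read from a literal rather than a live replication slot) and applies them to an in-memory replica:

```python
import json

# Hedged sketch of consuming logical-decoding output. The payload mimics the
# JSON shape emitted by Postgres's wal2json plugin, trimmed for illustration;
# a real stream arrives via a replication slot, not a string literal.
payload = json.dumps({
    "change": [
        {"kind": "insert", "table": "users",
         "columnnames": ["id", "email"],
         "columnvalues": [1, "ada@example.com"]},
        {"kind": "insert", "table": "users",
         "columnnames": ["id", "email"],
         "columnvalues": [2, "grace@example.com"]},
        {"kind": "delete", "table": "users",
         "oldkeys": {"keynames": ["id"], "keyvalues": [1]}},
    ]
})

replica = {}  # in-memory stand-in for the warehouse copy of the table
for change in json.loads(payload)["change"]:
    if change["kind"] == "insert":
        row = dict(zip(change["columnnames"], change["columnvalues"]))
        replica[row["id"]] = row
    elif change["kind"] == "delete":
        key = dict(zip(change["oldkeys"]["keynames"],
                       change["oldkeys"]["keyvalues"]))
        replica.pop(key["id"], None)

# The replica now holds only user 2 -- and no query ever touched the source.
```

Everything the warehouse needs, including the hard delete of user 1, arrived through the log stream alone.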
Handling Deleted Data for Tables of All Sizes
CDC is the best way to capture deletes for large tables. I mentioned the shopping cart example earlier where a snapshot is unlikely to have captured an item that was added to and removed from the cart between snapshots. The net result may be that the customer did not buy that item, but that’s not the whole story. Data analytics teams may wish to understand why that item was added to the cart and why it was removed.
Knowing when rows have been deleted is important for understanding how users are using an application or service. Data teams without access to CDC connectors are forced to use other methods of understanding deletes. These methods can be computationally expensive, complex to manage, and heavily dependent upon custom software. All of these drawbacks inevitably slow teams down and reduce the ability to understand what is happening within the app. Worse, there is no assurance of data quality and timeliness.
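A common CDC-less workaround illustrates the cost: pull the full set of primary keys on every run and diff it against the previous run. The sketch below shows both the method and its blind spot.

```python
# Without CDC, detecting hard deletes usually means comparing full snapshots:
# pull *every* primary key twice and diff the sets. The cost of each run
# grows with table size, since both key sets must be fetched in full.
def detect_deletes(previous_keys, current_keys):
    """Keys present last run but missing now were hard-deleted in between.

    Blind spot: a row inserted *and* deleted between the two snapshots
    never appears in either key set, so it is invisible to this method.
    """
    return set(previous_keys) - set(current_keys)

deleted = detect_deletes(previous_keys={1, 2, 3, 4}, current_keys={1, 3})
# Rows 2 and 4 are flagged -- but only because they existed at snapshot time.
```

A log-based connector sees every delete event directly, with no full-table comparison at all.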
Why is CDC Deceptively Hard?
It can be tempting to believe that CDC is “just another data connector” that can be switched on with a simple toggle. The move to CDC is more strategic than it is tactical and requires planning and foresight. Most other connectors are “set it and forget it,” but CDC requires a little more management. It might otherwise sound like god-tier greatness, but it’s not without its own set of risks!
CDC uses log files. Large, active databases write lots of data to these log files. During peak usage, it is not surprising to see log files grow multiple times faster than normal. And any database administrator who’s managed such situations can tell you how quickly disk space can disappear. A database without disk space is a stuck database. And an application with a stuck database is dead in the water. PostgreSQL uses Write-Ahead Logs (WAL) for CDC data and only clears these logs after the log data has been successfully read from the replication slot. If replication fails, the log files grow until the disk is full. Without a fail-safe solution for this problem, your analytics needs can bring down the entire application.
A related concern for CDC is replication lag. Since data is being read continuously from the source (via log files), we might see a growing gap between how much data is read and how much data is written. If there’s a sudden increase in data volume, log data might not be read as quickly as it is being written. In other words, the replication process can’t keep up and the destination ends up lagging behind the source. The data connector should provide a clear indication of how “caught up” it is so analytics teams understand the data they’re dealing with. While replication lag won’t take down production, it can cause problems for data quality and freshness, leaving analytics teams assembling piecemeal data.
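One way to quantify that lag in Postgres is to compare the server’s current WAL position against the position the connector has confirmed. The sketch below converts LSNs to byte offsets; the LSN values are hypothetical, and in a live system they would come from `pg_current_wal_lsn()` and `pg_replication_slots.confirmed_flush_lsn`.

```python
# Hedged sketch: measuring replication lag in bytes from Postgres LSNs.
# An LSN like '16/B374D848' encodes a WAL byte position as two hex halves:
# (high << 32) | low.
def lsn_to_bytes(lsn):
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def replication_lag_bytes(current_wal_lsn, confirmed_flush_lsn):
    """How far the consumer trails the writer, in bytes of WAL.

    In production these values would be read from pg_current_wal_lsn()
    and pg_replication_slots.confirmed_flush_lsn; here they're passed in.
    """
    return lsn_to_bytes(current_wal_lsn) - lsn_to_bytes(confirmed_flush_lsn)

lag = replication_lag_bytes("16/B374D848", "16/B0000000")
# A lag that only grows means the connector can't keep up -- and the retained
# WAL is eating disk until it catches up.
```

Alerting on this number (and on disk headroom) is exactly the kind of operational discipline CDC demands.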
Since CDC connectors are generally more complicated (continuous streaming, replication lag, and so on), they need more monitoring and alerting than regular connectors. This is an area often missed by both vendors and teams building CDC systems in-house. More monitoring and alerting requires more organizational discipline and collaboration across teams (DevOps, DBAs, data engineers, data analysts, etc.). Non-CDC connectors running DB queries don’t cause operational issues if they stop running. If a CDC connector stops running, as we’ve already seen, there is real risk to manage.
Finally, CDC connectors also need to support enterprise features such as fast historical syncs, schema-change handling, and metadata capture wherever possible. None of these are simple, which is why very few data stacks check all these boxes.
As we’ve seen, CDC is a time-tested and powerful implementation pattern, but it’s not cost- or risk-free. Going with CDC is not a trivial decision or implementation detail. It’s strategic and requires coordination across several disciplines. It’s not a “set it and forget it” pipeline. The benefits are significant, but they demand additional maturity and expertise for long-term management.
CDC has traditionally been reserved only for teams with large data-engineering organizations, or enterprise level budgets. We don’t think that should be the case and Datacoral enables teams of all sizes to take advantage of this powerful style of replication. While there is so much more that we could have covered in this article (and we will in a later article), we hope that this helps explain why we are so excited about the transformative nature of these connectors. To learn more about how your team – whether it’s just you or you and fifteen others – can avail yourself of our CDC connectors, visit our website or drop us a note at [email protected]