This article was originally posted on Towards Data Science here.
Data integrations have been around for decades, but there has been a recent explosion of new, compelling data integration companies offering cloud-native, easy-to-configure connectors and quick access to high-value analytics. These offerings come from companies like Fivetran, Stitch Data, Matillion, and Airbyte. It can be difficult for any data team to know whether to stick to their existing data stack or to adopt these new tools. Though it’s great to see data integrations becoming more mainstream and new approaches emerging, we think there’s a growing gap being created by new entrants to the space. Some of today’s harder problems are either being ignored or promised in a future release, but customers need solutions today. Let’s take a critical look at where data integrations are now and where we believe they need to go in the future.
TL;DR: All data integrations should be built with a log-based Change Data Capture (CDC) architecture!
What makes for a robust data integration?
Data integration is the process of bringing together data from multiple sources into a centralized repository like a data warehouse or a data lake. Integrations and connectors extract data in bulk (for the most part) from these sources and load the extracted data into a data warehouse for further cleaning and transformation. Then the data is blended together to produce meaningful business insights.
Data analysts rely on centralized data from multiple sources to develop these insights. The goal for the analyst is to get the best understanding of the data by slicing and dicing it. Data is best understood when there’s a clear idea about how it is changing over time. But not all data integrations are created equal, which makes it difficult to get a clear idea of data changes. Very few integration architectures capture all data changes, which is why we believe Change Data Capture is the best design pattern for data integrations.
CDC is superior because it provides a complete picture of how data changes over time at the source — what we call the “dynamic narrative” of the data. Log-based CDC provides a low-overhead, high-performance way to capture every single data change, regardless of size. Through CDC, data integrations can extract data changes in a lightweight way, offering high value at low risk to production infrastructure.
An example: Getting the dynamic narrative of data in e-commerce
Let’s consider an e-commerce company where its users make purchases by adding products into a shopping cart. An analyst at the company is trying to determine the popularity of a set of similar products. In order to understand why some products are less popular, the analyst investigates all users’ shopping cart changes. Having a record of every change to a cart can provide a deeper understanding of several considerations:
- Does the user frequently add a product from a specific brand to the cart, but not purchase it?
- Is the user replacing one brand with another?
- Is the user dropping carts before purchasing?
Shopping cart changes provide a dynamic narrative of user behavior and a richer understanding about the performance of products. The analyst can make better suggestions to the business or product teams with access to a complete data set.
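To make this concrete, here is a minimal sketch of what analysis over a cart change log might look like. The event shape (`op`, `sku`, `qty`) and the helper functions are hypothetical, chosen only to illustrate the questions above; a real event log would carry timestamps, user IDs, and richer payloads.

```python
# A hypothetical change log for one user's shopping cart. Each event records
# a single mutation, so the full "dynamic narrative" is preserved.
cart_events = [
    {"op": "add",      "sku": "brand_a_shoe", "qty": 1},
    {"op": "remove",   "sku": "brand_a_shoe", "qty": 1},  # swapped out...
    {"op": "add",      "sku": "brand_b_shoe", "qty": 1},  # ...for a competitor
    {"op": "purchase", "sku": "brand_b_shoe", "qty": 1},
]

def brand_swaps(events):
    """Count remove-then-add pairs, a signal that one brand replaced another."""
    swaps = 0
    for prev, curr in zip(events, events[1:]):
        if prev["op"] == "remove" and curr["op"] == "add":
            swaps += 1
    return swaps

def abandoned(events):
    """A cart with adds but no purchase event was dropped before checkout."""
    ops = {e["op"] for e in events}
    return "add" in ops and "purchase" not in ops

print(brand_swaps(cart_events))  # -> 1
print(abandoned(cart_events))    # -> False
```

Note that both questions are only answerable because every intermediate event survives; a table that stores just the cart's final contents cannot distinguish a brand swap from a cart that never held the first brand at all.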
Getting shopping cart data
Let’s look at different ways of getting the data for shopping cart changes for the example above. There are three ways in which data integrations sync data from the source:
- Snapshots
- Incremental fetches
- Change Data Capture
Each data sync mechanism works differently and provides the data analyst varying levels of support.
Snapshots: Can handle deletes, but can be wasteful and expensive — incomplete data
Periodic snapshots of tables/objects from a data source provide a high-fidelity copy of the data at points in time. Analyzing snapshots can provide the analyst visibility into data changes over time, but can be wasteful and expensive. The duration of a snapshot pull increases in proportion to the volume of data changes. This can cause stress at the data source, risking production workload performance and availability.
In other words, the more data changes at the source, the longer it takes to pull a snapshot. And the longer it takes to pull a snapshot, the higher the risk to performance. Since these snapshots are often run directly against a production database, this can be a big problem for customers and internal stakeholders.
Even if some objects/rows are not changing in the source, they are still fetched in the snapshot pull, resulting in unwanted fetches. High-performance applications and up-to-the-minute analysis require a lot of care and attention. There’s no time for “extra” data, rows, or computation.
Finally, since snapshots are fetched periodically, the data changes occurring between fetches are not visible to the data analyst. To compensate, the analyst looks at the changes between snapshots and makes assumptions. The granularity of the changes depends on the frequency at which the snapshots are taken. All shopping cart changes that happened between snapshots may be lost, and the analyst has to piece together a story based on incomplete information.
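The lossiness of diffing snapshots can be shown in a few lines. This is a simplified sketch (carts modeled as lists of SKUs, with invented snapshot data), but the point carries over to real tables: a diff between two snapshots recovers only the net change, never the sequence of edits in between.

```python
# Two hourly snapshots of a cart table. Between 10:00 and 11:00 the user made
# three changes: added shoe_b, removed shoe_a, then added shoe_c.
snapshot_10 = {"cart_1": ["shoe_a"]}
snapshot_11 = {"cart_1": ["shoe_b", "shoe_c"]}

def diff(old, new):
    """Recover changes by comparing two snapshots of the same table."""
    changes = []
    for cart in old.keys() | new.keys():
        before, after = set(old.get(cart, [])), set(new.get(cart, []))
        for sku in after - before:
            changes.append(("add", cart, sku))
        for sku in before - after:
            changes.append(("remove", cart, sku))
    return sorted(changes)

# The diff sees only the net effect -- two adds and one remove, with no
# ordering. Whether shoe_b replaced shoe_a is invisible to the analyst.
print(diff(snapshot_10, snapshot_11))
```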
Incremental fetches: Less expensive than snapshots, but can’t handle deletes efficiently — also, incomplete data
Periodic incremental fetches are better than snapshots in some respects. They fetch only the changed data since the last pull, which is not as expensive as fetching all the data all the time. But they also miss data updates between fetches.
The analyst in our example still has to make assumptions to fill data gaps. If user data is incrementally pulled every 15 minutes and there was more than one update to a cart in that 15-minute interval, the incremental pull will only fetch the final state of the cart after the last update is applied at the source. Any other changes in that 15-minute interval are not available for analysis. So the analyst likely won’t see every change to a user’s shopping cart.
In addition, incremental fetches typically don’t handle hard deletes well. Deleted data does not show up in an incremental pull as it no longer exists at the source. In fact, snapshots handle hard deletes better because snapshot fetches automatically include hard deletes.
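The hard-delete blind spot follows directly from how incremental fetches are usually implemented: a query filtered on an `updated_at`-style cursor column. The column name and rows below are hypothetical, but the mechanism is the common one.

```python
from datetime import datetime, timedelta

now = datetime(2021, 1, 1, 10, 0)
last_cursor = now - timedelta(minutes=15)  # time of the previous pull

# Source rows with an updated_at column. A typical incremental fetch runs
# the equivalent of "WHERE updated_at > :last_cursor" against the source.
source_rows = {
    1: {"id": 1, "updated_at": now - timedelta(minutes=5)},   # changed recently
    3: {"id": 3, "updated_at": now - timedelta(hours=2)},     # unchanged
}
# Row 2 was hard-deleted after the last pull -- it is simply gone from the
# table, so no filter on updated_at can ever return it.

def incremental_fetch(rows, cursor):
    """Return only rows modified since the cursor, as an incremental pull would."""
    return [r for r in rows.values() if r["updated_at"] > cursor]

fetched = incremental_fetch(source_rows, last_cursor)
print([r["id"] for r in fetched])  # only row 1; the delete of row 2 is invisible
```

The destination still holds row 2 after this pull, silently diverging from the source until the next full snapshot reconciles it.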
Change Data Capture: Complete data — no compromises!
Unlike snapshots and incremental fetches, CDC provides a complete picture of how data changes over time at the source. For connectors to support pulling data through CDC, the sources must support publishing every single change to every data element. This is a tall order for most sources.
Fortunately, databases natively write complete and robust change logs for disaster recovery and high availability use cases. Connectors for databases using CDC pull data from these change logs. The robustness of the change logs transfers directly to the data integration itself. CDC elevates the game for data integrations and offers a complete view of the dynamic narrative of the data.
To better understand how CDC provides a complete view of the data, let’s look at the illustration below.
The illustration above shows the behavior of snapshot, incremental, and CDC approaches for the same hourly data latency. The source table gets two sets of changes between 10:00 and 11:00.
Snapshot provides a high fidelity copy (source table and destination table are the same at 11:00) but does not capture the intermediate changes. Thus, the analyst has no visibility into the following missing data:
- Update 1 to 1′
- Insert 4 (only 4′ is visible)
- Insert 5
- Delete 5
In addition, even the unchanged row (row 3) is fetched unnecessarily. On the other hand, incremental fetches are more efficient, as they only fetch changed rows, but they suffer from the same missing data problem as snapshot fetches.
CDC clearly has the most complete picture of how the source data has evolved over time. There are no wasteful row fetches, all deletes are captured, and no data changes were missed.
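The CDC behavior in the illustration can be verified mechanically: replaying the log-ordered events against the 10:00 state of the table reproduces the 11:00 state exactly, including the short-lived row 5. This is a toy sketch (event dicts and string values are invented), but replaying ordered change events is the core of how a CDC connector keeps the destination in sync.

```python
# Change-log events between 10:00 and 11:00, in commit order: row 1 updated,
# row 4 inserted then updated, row 5 inserted then deleted. Row 3 untouched.
events = [
    {"op": "update", "id": 1, "value": "1'"},
    {"op": "insert", "id": 4, "value": "4"},
    {"op": "insert", "id": 5, "value": "5"},
    {"op": "update", "id": 4, "value": "4'"},
    {"op": "delete", "id": 5},
]

def apply_events(table, events):
    """Apply inserts, updates, and deletes in log order to a replica table."""
    table = dict(table)  # leave the caller's copy untouched
    for e in events:
        if e["op"] == "delete":
            table.pop(e["id"], None)
        else:  # insert and update both upsert the latest value
            table[e["id"]] = e["value"]
    return table

source_at_10 = {1: "1", 3: "3"}
print(apply_events(source_at_10, events))  # {1: "1'", 3: "3", 4: "4'"}
```

Unlike the snapshot diff, every intermediate event (the original value of row 4, the insert and delete of row 5) is retained in the log and available for analysis, even though the final table state matches.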
You might be asking yourself, “If CDC is so great, then why doesn’t everyone use it?” That’s a great question and we have a simple answer: It takes a lot of effort to build and maintain a robust CDC connector. In addition, the source and destination systems of the CDC data integration must also be amenable to efficient CDC pipelines. We describe our learnings around what it takes to build a robust CDC connector in the next section.
What does it take to get a robust CDC data integration?
Systems that support CDC should offer the following for a robust data integration:
- Capture all changes for each data item, including hard deletes
- Allow for introspection (i.e. offer ways to detect what objects were added, removed, or edited)
- Allow fast bulk fetch of all the data for the initial load, also known as “historical syncs”
Three different systems must work together for a data integration to successfully provide CDC:
- Source system
- CDC connector
- Destination warehouse
Finally, these are the requirements for a functional CDC connector:
- Schema mapping: Create tables in the destination warehouse that correspond to the source schema/objects during the connector add process.
- Bulk sync: Bulk fetch and load for initial loads or for recovery from failures. Ability to prepopulate historical data that existed prior to the connector add process.
- Ongoing change data capture: Propagate all inserts, deletes, and updates from source to destination.
- Data freshness/quality guarantees: Provide visibility and guarantees around data freshness and data quality.
- Propagate schema changes: Detect and apply schema changes like new tables added, tables dropped, columns added, columns dropped, and columns updated.
In the table below, we list how the three different systems should be designed for different parts of the functionality listed above to get a robust CDC solution for a given pair of source and destination.
Blueprint for a CDC data integration
The picture above illustrates the end-to-end pipeline needed to build a robust CDC solution for any source system. Multiple components need to be orchestrated together to not only get the data flowing, but also to provide data freshness and quality guarantees.
- CDC Event Reader — Fetches the change log from the source system efficiently and ensures the source system does not get backed up. The change logs consist of changes to all objects or tables in the source, so the reader also has to split out the change log into table-specific changes which are then applied to the corresponding data tables in the destination.
- Metadata Sensor — Captures the table/object structure in the source system and maps the structure (schema) to the destination table schema. On an ongoing basis, when there are changes to the source table/object structure, the metadata sensor also orchestrates the propagation of the changes to the destination tables, and performs historical syncs as needed using the bulk data reader.
- Bulk Data Reader — Fetches data directly from the source system (as opposed to only fetching the change log). For databases, the bulk reader can run SQL queries on the source database and retrieve the results. The SQL queries could be for either the initial historical sync or to compute synopses (aggregates and checksums) for data quality checks.
- Transformation Pipeline — Computes synopses on the destination tables and performs the actual data quality checks. This pipeline can generate events when data is available with quality guarantees or can generate alerts when data quality checks fail.
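As an illustration of the synopsis idea used by the bulk data reader and transformation pipeline, here is a minimal sketch of an order-independent checksum over a table. The row shapes are invented, and a production check would be computed inside the source and destination databases rather than in Python, but the comparison pattern is the same:

```python
import hashlib

def synopsis(rows):
    """A cheap synopsis: row count plus an order-independent checksum.
    Comparing synopses from source and destination flags drift without
    shipping full tables around."""
    digest = 0
    for row in rows:
        h = hashlib.sha256(repr(sorted(row.items())).encode()).hexdigest()
        digest ^= int(h, 16)  # XOR makes the checksum order-independent
    return {"count": len(rows), "checksum": digest}

source = [{"id": 1, "total": 40}, {"id": 2, "total": 15}]
dest_ok = [{"id": 2, "total": 15}, {"id": 1, "total": 40}]
dest_bad = [{"id": 1, "total": 40}]

assert synopsis(source) == synopsis(dest_ok)   # same data, any order -> pass
assert synopsis(source) != synopsis(dest_bad)  # missed row -> raise an alert
print("quality checks passed")
```

When the synopses match, the pipeline can publish a "data ready with quality guarantees" event downstream; when they don't, it alerts instead of letting bad data flow into analytics.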
In this section, we have provided a blueprint for how one can build a CDC data integration pipeline. We have also written in the past about how Datacoral has built CDC connectors using our end-to-end data pipeline implementation built with a metadata-first architecture.
What source systems currently support CDC data integrations?
CDC for data integrations has many benefits, but not every source system can offer them. Different types of data sources provide different data retrieval capabilities. We classify those sources into the following categories:
- Databases
- SaaS services with APIs
- File systems
- Event streams
The table below summarizes how a connector can pull data from different types of sources using the extraction types we’ve discussed.
As you can see, not all source systems are amenable to CDC connectors. Databases have long provided all of the properties required to build CDC connectors. We believe that services with bulk data integration APIs should also offer change logs to capture each change to each object within the service. Webhook-based services are getting there, but typically only offer a subset of the changes. Data integrations still need to use other APIs to fetch updated objects that are not covered by webhooks, resulting in gaps in data similar to periodic snapshot or incremental fetches. Services like Stripe are leading the way in providing a CDC API and we hope others follow suit.
A lot has been written about how event-driven architecture is the future, but in many cases event-driven architecture is also tied to real-time requirements. We believe that event-driven architecture supports much richer data analysis, even if the analysis was done by batching up the events!
We are excited to see so many companies entering the data integration space. They are bringing more data analytics awareness to a multitude of companies. Business and technology leaders are getting a taste of the dynamic narrative of data. But snapshots and incremental fetches aren’t enough. We want to see CDC everywhere so analysts and data teams can truly harness the power of a data stack.
Snapshots support hard deletes and high-fidelity data, but they don’t capture the complete picture and come with performance risks. Incremental fetches are often more performant than snapshots, but miss hard deletes and are also incomplete. Change Data Capture is the only architecture that allows for a complete picture of the data with few performance risks. Companies like Stripe are offering event-driven APIs with complete change logs for objects within their environment. We want to see more of that.
It takes a lot of effort and strategy to build a robust and efficient CDC pipeline. Several systems need to be coordinated, functional requirements must be met, and metadata should be leveraged for automated schema change detection and other great features.
At Datacoral, we have built CDC connectors for databases like PostgreSQL and MySQL using our metadata-first data pipeline. Having worked on CDC along with 80 other connectors, we are convinced that CDC elevates the game of data integrations. Data integrations without CDC provide a mere shadow of the full picture of the data for analytics. If all sources start supporting event APIs, all data integrations will finally become a piece of software.