Change Data Capture (CDC) is a common pattern for replicating data from databases to data warehouses by reading the database's change logs rather than fetching data from tables. Many data engineering teams we have spoken to (many of whom are now our customers) have mentioned the complexity of running a CDC system at scale. In this post, we will cover why CDC is important, what CDC looks like for common production databases, and the most common challenges teams face when running it.
Why is CDC important?
Many data-driven companies today replicate data from different data sources into a data warehouse for further analytics. This helps them understand how their users are using their products, how to increase revenue from existing customers, and how to streamline customer acquisition. Production databases such as MySQL and PostgreSQL scale really well for transactional queries (inserts, updates, and deletes), but cloud data warehouses such as Snowflake and AWS Redshift scale much better for analytical queries. Broadly, there are two important methods to sync data from production databases to a data warehouse:
- Running SQL queries: In this method, the connector syncs data from the production database by running SQL queries to fetch data directly from the tables.
- Reading from database write-ahead logs: In this method, the connector reads the database's write-ahead log (for example, the binary log in MySQL, logical replication in Postgres, the oplog in MongoDB, the transaction log in SQL Server). The connector then translates those log entries into updates to tables in the data warehouse. This method is called Change Data Capture (CDC).
Advantages of SQL queries
- Flexibility on what to sync from the source database: SQL queries can filter, join, and aggregate data from tables before syncing the results to the data warehouse. This gives analysts full control over what gets synced in order to satisfy data governance needs – for example, masking columns that contain PII, or fetching only summaries instead of row-level data (see the sketch after this list).
- Simpler setup: the connector can run as a regular user against a replica of the production database; there is no need to connect to the primary database at all.
- Easier handling of tables without primary keys, especially small ones: simply fetch the entire table on every sync and overwrite it at the destination.
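As an illustration of the governance point above, a query-based sync might enforce masking and incremental fetching before data ever leaves the source. This is only a sketch: the table, columns, and watermark value are hypothetical, and the watermark would come from the previous sync run.

```sql
-- Query-based sync with governance rules applied at the source.
SELECT
  id,
  md5(email) AS email_hash,  -- mask a PII column before it leaves the database
  plan,
  updated_at
FROM users
WHERE updated_at > '2021-06-01 00:00:00'  -- high-water mark from the previous sync
ORDER BY updated_at;
```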
Advantages of CDC over SQL queries
- Large dataset sizes: When the dataset at the source is large, fetching all of the data on every sync with a SQL query becomes inefficient. This also limits how frequently the data can be synced.
- Fetching every change: SQL queries run at some frequency capture only the diff between two snapshots of the dataset. CDC, by contrast, fetches every single update to a row, even when the same row is updated multiple times between syncs. This matters not just for observing all activity at the source, but also for database auditing (illustrated after this list).
- Capturing deleted records: Deleted records can only be captured through CDC, because SQL queries against the source tables no longer return rows once they have been deleted.
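To make the last two points concrete, here is a hypothetical sequence of changes to a single row inside one sync window, and in comments, roughly how PostgreSQL's built-in test_decoding plugin would report them. A query-based sync running after the fact would see no trace of this row; the CDC stream captures all three changes individually.

```sql
-- Three changes to the same row between two query-based syncs:
UPDATE accounts SET balance = 90 WHERE id = 7;
UPDATE accounts SET balance = 80 WHERE id = 7;
DELETE FROM accounts WHERE id = 7;

-- Roughly how test_decoding reports the change stream
-- (abbreviated; BEGIN/COMMIT markers omitted):
--   table public.accounts: UPDATE: id[integer]:7 balance[integer]:90
--   table public.accounts: UPDATE: id[integer]:7 balance[integer]:80
--   table public.accounts: DELETE: id[integer]:7
```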
Beyond the reasons above, the benefits of CDC (especially log-based CDC) have been discussed at length before. At Datacoral, we support both ways of replicating data from production databases, but if any of the conditions above holds true for a customer, we recommend syncing data via CDC. We have connectors for many databases; in the sections below we will focus on CDC for MySQL and PostgreSQL.
CDC in different databases
In this section, we will discuss how Change Data Capture works for some common production databases and how Datacoral has designed our CDC connectors for the best possible experience. The figure below shows a sample architecture for replicating data from a production database (MySQL or Postgres) to Snowflake or Redshift via CDC.
Historical syncs
Datacoral performs historical syncs for all tables by connecting to a read replica, if one is available. This reduces the load on the primary database and leads to faster historical syncs.
Change Data Capture
MySQL: MySQL supports replicating data from its binary logs. Binary logging needs to be enabled and configured appropriately in the MySQL database – instructions can be found in our documentation.
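As a quick sketch, these are the kinds of checks a binlog-based CDC setup typically involves (exact names and recommended values can vary by MySQL version; consult the MySQL documentation and our docs):

```sql
SHOW VARIABLES LIKE 'log_bin';           -- binary logging must be ON
SHOW VARIABLES LIKE 'binlog_format';     -- row-based logging is required: ROW
SHOW VARIABLES LIKE 'binlog_row_image';  -- FULL captures all columns per change
SHOW MASTER STATUS;                      -- current binlog file and position
```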
PostgreSQL: PostgreSQL offers a few different ways of capturing changes. A commonly used pattern is to read from the write-ahead log (WAL). The WAL records changes at the storage level, so it needs to be converted into application-specific data through a process called logical decoding. Changes are read from a replication slot, which represents an ordered stream of changes that can be replayed for a client. At Datacoral, we use SQL to retrieve changes from the replication slot, which are then loaded into the data warehouse.
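As a minimal sketch of the mechanics, using PostgreSQL's built-in test_decoding output plugin and an arbitrary slot name (production setups usually use a plugin such as wal2json or pgoutput):

```sql
-- Create a logical replication slot:
SELECT pg_create_logical_replication_slot('cdc_demo', 'test_decoding');

-- Read and consume changes accumulated since the last read; consuming
-- advances the slot so PostgreSQL can reclaim the underlying WAL:
SELECT lsn, xid, data
FROM pg_logical_slot_get_changes('cdc_demo', NULL, NULL);

-- Peek without consuming (handy for debugging):
SELECT lsn, xid, data
FROM pg_logical_slot_peek_changes('cdc_demo', NULL, NULL);
```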
Data Quality Checks
Data quality checks can be set up for any connector to compare data between the source and destination tables – these give consumers of data in the warehouse a high level of confidence that they are working with reliable data. This is a topic we have discussed before.
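As a simple illustration (not our actual implementation), a basic check might run the same aggregates against the source table and the warehouse table and compare the two result rows; any divergence signals a replication problem. Table and column names here are hypothetical.

```sql
SELECT
  count(*)        AS row_count,
  min(id)         AS min_id,
  max(id)         AS max_id,
  max(updated_at) AS latest_update
FROM orders;
```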
Common challenges with CDC
High Data Volumes
Depending on the underlying database, high update volumes can put a lot of pressure on the CDC system. In PostgreSQL, for example, it is important that every process reading from the WAL keeps up, because records are not removed from the WAL until they have been fully read, and the WAL continues to consume disk space in the meantime. This can even lead to the database running out of disk space – and to a production database outage! It is therefore crucial to have built-in mechanisms to handle sudden increases in data volume, so that the CDC connector does not affect the running of the production database.
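On PostgreSQL 10 and later, the WAL held back by each replication slot can be inspected directly – a minimal monitoring sketch:

```sql
-- How much WAL each replication slot is retaining. If a slot's consumer
-- stalls, retained_wal grows until the disk fills.
SELECT
  slot_name,
  active,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
    AS retained_wal
FROM pg_replication_slots;
```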
Replication Lag
Different analytics use cases require different guarantees on “replication lag”: the time between a transaction committing in the primary database and its becoming available in the data warehouse. While many factors contribute to the overall replication lag (data volumes in the production database, load on the data warehouse, etc.), Datacoral can support replication lag as low as 5 minutes.
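One common way to measure end-to-end lag – a sketch, not necessarily how any particular connector does it – is a heartbeat table that is written on the source and read in the warehouse after replication. The table name is hypothetical, and PostgreSQL syntax is used throughout for brevity.

```sql
-- Hypothetical heartbeat table on the source:
CREATE TABLE IF NOT EXISTS cdc_heartbeat (
  id         int PRIMARY KEY,
  emitted_at timestamptz NOT NULL
);

-- Written periodically on the source:
INSERT INTO cdc_heartbeat (id, emitted_at)
VALUES (1, now())
ON CONFLICT (id) DO UPDATE SET emitted_at = excluded.emitted_at;

-- Read in the warehouse once the row has replicated; the difference
-- between now and emitted_at is the end-to-end replication lag:
SELECT now() - emitted_at AS replication_lag
FROM cdc_heartbeat
WHERE id = 1;
```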
Schema Changes
Over time, it is very common for schemas to change in databases. This can include:
- A new table getting added
- A table getting removed
- A new column being added for an existing table
- A column being dropped from an existing table
- A column being modified (renamed or undergoing a data type change) for an existing table
As the schema changes in the source database, it is important to replicate that schema change in the data warehouse alongside the data itself. Datacoral handles all of these scenarios in a consistent fashion, as described in our documentation.
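For example, a column added at the source has to be mirrored in the warehouse before any change records that reference it can be applied. The names below are hypothetical, and in practice the source type would be mapped to the warehouse's equivalent.

```sql
-- On the source database:
ALTER TABLE users ADD COLUMN referral_code varchar(32);

-- Mirrored by the connector in the warehouse:
ALTER TABLE analytics.users ADD COLUMN referral_code varchar(32);
```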
Historical Syncs
Before the updates from CDC can be applied in the data warehouse, a historical sync of each table needs to be performed. Depending on table sizes, historical syncs can take a while (from a few hours to a few days) and can be heavy operations on the underlying database. At Datacoral, we recommend performing historical syncs against a replica database (or node) instead of the primary, which leads to substantially faster syncs and very little impact on the performance of the production database. Our historical sync process delivers fast, robust replication over terabytes of data; we have described the scalability and observability challenges in historical syncs, and our solutions, in another post.
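One common technique for making large historical syncs restartable – a sketch under assumed names, not a description of our internals – is keyset pagination on the primary key, so each batch resumes from the last key seen and no single long-running transaction is needed:

```sql
SELECT *
FROM orders
WHERE id > 1000000   -- last key from the previous batch; 0 for the first batch
ORDER BY id
LIMIT 10000;         -- batch size is an arbitrary example
```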
Monitoring and Alerting
When working with a complex system such as Change Data Capture, it is important to be able to monitor it closely and get alerted when there are errors or when a human needs to take action. As an example, with PostgreSQL CDC, the replication slot can grow very large when there is a high volume of transactions in the database. Datacoral sets up automatic alerting for when the replication slot reaches certain thresholds (picked based on database size). Beyond this, the Datacoral web application provides a wealth of information about the status of CDC replication: the timestamp of the latest record fetched, when each table was last updated, the average time taken to load a batch of data into the warehouse, and so on.
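Building on the slot query shown earlier, an alerting check for PostgreSQL might look like the sketch below; the 10 GB cutoff is an arbitrary example, not a recommended threshold.

```sql
-- Flag any replication slot retaining more WAL than a threshold (PostgreSQL 10+):
SELECT slot_name
FROM pg_replication_slots
WHERE pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
      > 10::bigint * 1024 * 1024 * 1024;  -- 10 GB
```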
Conclusion
At Datacoral, we have worked with many customers over the past few years to replicate their databases to a data warehouse using CDC (with all the database and data warehouse combinations one can imagine). This has given us the luxury of understanding many different production systems and use cases, and consequently, our product has evolved to meet all these needs. In future posts on Change Data Capture, we'll dive deeper into the specific challenges mentioned above (and some we have only touched on, such as data quality).
If you’d like to learn more about our CDC connectors and how we can help with replicating data from production databases into your data warehouse, drop us a line at hello@datacoral.co or sign up for a free trial.