
Change Data Capture: 101
Introduction Let’s pretend you work at a small identity protection startup and you’re getting acquired by a Fortune 100 corporation. You and 120 of your
Changes in your production database? Keep your analytics database in lock-step using our Change Data Capture (CDC) connector.
Your production database will always change. Keeping your analytics warehouse in sync with your production database will be an ongoing challenge.
Our PostgresSQL and MySQL CDC connectors have processed over 100B changes (5B rows) to ensure that the analytics warehouse captures 100% of the changes whether it be additions, deletions, or changes.
Production databases (e.g., MySQL, PostgreSQL, etc.) scale really well for transactional queries (insert, updates, and deletes), but cloud data warehouses such as Snowflake and AWS Redshift scale much better for analytical queries.
For such analytics, data-driven companies need to replicate data from their production database and from different data sources into a data warehouse. This helps them run analytics and gain insights on their product usage at a granular level without risking interference with the production database upon which their product is built.
In this method, the connector syncs data from the production database by running SQL queries to fetch data directly from the tables. This method doesn’t necessarily log updates to records, leading to the inability to perform in-depth analyses as well as a loss of data.
In this method, the connector reads the database write ahead log (WAL) — for example, binary logs in MySQL, logical replication in PostgreSQL, and log shipping in SQL server. The connector then translates those write ahead logs to updates in tables in the data warehouse. This method is called change data capture (CDC).
Every database worth its salt writes a log of all the changes to every single row/object. In most modern databases, this log is a Write Ahead Log (WAL). The WAL contains all changes to all tables in the database in a single log.
Change Data Capture as a way to integrate production databases to data warehouses actually involves more than just reading the WAL logs even though that is a critical part of the integration.
Setting up a CDC connector involves several steps that need to be performed in the exact right order in order to not lose data and provide data quality guarantees. All of these steps are automated by Datacoral for a completely hands-off approach for data teams.
Start reading the WAL logs from the database and persist in the staging area
Start applying the changes captured in the WAL log to the tables whose historical syncs have completed
Perform a historical sync for all of the tables of interest
Perform above steps when new tables/columns show up in the source database
There are plenty of CDC connectors available in the market. But Datacoral’s end-to-end serverless data pipelines provide the ultimate flexibility, scalability, compliance, and data quality guarantees that data analysts, data scientists, and data engineers need.
Get full auditability - the entire WAL log is available for you to analyze. Pick parts of your WAL log to also be loaded to the warehouse for analysis.
Two options are better than one. Now you can decide between hard deletes and soft deletes.
Automatically handle schema changes at the source. Specify sophisticated rules on how to propagate schema level changes at the source to the destination.
Out-of-the-box Data Quality Checks to guarantee the right data, whenever you need it.
Decide how frequently you want to update the warehouse tables. Tune the pipeline easily to optimize load on your data warehouse.
Specify exactly which tables and columns you need. Datacoral will not persist *any* table or column level data that you do not need.
Introduction Let’s pretend you work at a small identity protection startup and you’re getting acquired by a Fortune 100 corporation. You and 120 of your
Change Data Capture (CDC) is a common pattern for replicating data from databases to data warehouses by syncing the database change logs rather than fetching
We use cookies on our website. If you continue to use our website, you are agreeing to our use of cookies in accordance with our Cookie Statement. For information about how to change your cookie settings, please see our Cookie Statement.