In a previous post, we looked at why replicating data using the change logs (Change Data Capture, or CDC) is important and what are the common challenges in this approach. In this post, we are going to take a deeper look at managing historical syncs while replicating production databases to a data warehouse, and is the critical first step in this process.
Why historical syncs?
The aim of any connector is to make sure that data at the source is identical to the data in the destination (within a permissible replication lag). When using a CDC connector, we only get the new set of changes from the source starting at the time we begin replication. This means that we need to first sync the current snapshot of the tables before applying the new changes captured from CDC in order to maintain a copy of the source database in the data warehouse. You can read more about historical syncs here.
Challenges during historical syncs
Historical syncs are costly
Historical sync involves fetching all the rows using SQL queries. Consider a production database with 1000 tables. Running historical syncs for all the tables at once will put a lot of load on the database. Even fetching all the rows at once from a large table can make the production server slow. The following are some of the ways Datacoral mitigates these issues:
- Performing historical syncs using a read replica instead of the primary database
- Performing historical syncs in batches based on the table sizes.
- Fetching data from huge tables in multiple pages.
Historical syncs can take days
Consider a production table with 1B rows with 1M changes every day. The historical sync of such a table can take up to a few days. Since both historical syncs and change log replication are running in parallel, the data collected from the change logs needs to be staged until the historical syncs are completed. Only then can the change logs be applied to the destination table.
There will also be smaller tables which just take a few seconds for their historical sync. These should become available for querying in the data warehouse as soon as their historical sync is completed and should not wait for historical syncs of all the tables.
Tracking historical syncs
We need to have a centralized view to monitor the progress of historical syncs. It is also useful to get a notification/event on completion of historical sync per table.
CDC Connector to be started before the historical syncs begin
The CDC Connector cannot be started after performing the historical syncs as the changes which happen between the time historical sync is completed and the start of the CDC connector will be missed. At the same time, the changes fetched from the CDC connector cannot be applied to the warehouse table unless the historical syncs are completed.
At the end of historical sync, there will be entire snapshot data from historical sync and the staged changes from the CDC connector. We need to merge the data to have a perfect replica of the tables in the data warehouse.
Automatic historical syncs for changes in source
An essential feature of any connector is monitoring schema changes. Please refer to our documentation for types of schema changes a connector has to monitor. There are specific schema changes that require historical syncs.
Schema changes which don’t require historical syncs
- A table is deleted at source.
- A new column is added in a source table, only the new changes fetched in the CDC connector will have the new column data.
- A column is deleted or altered in a source table.
Schema changes which require historical syncs
- A new table at source is automatically added to the destination.
- A user wants to sync an additional column from the source which was not being synced earlier — now, we have to update all the rows with the column.
At Datacoral, we have worked with multiple customers to set up CDC connectors to replicate data to the data warehouse over the last few years. The product has evolved to handle all the issues we have identified, especially around historical syncs.
Apart from the CDC connectors we also provide a robust way of syncing data from multiple sources. If you want to know more these connectors sign up for a free trial or send us a message at [email protected].