Data connectors are a critical component of any company’s data infrastructure stack. They replicate the company’s data from different sources into a shared data warehouse or data lake, which then enables the data team to aggregate, combine, and explore all these datasets together. When a data connector is set up, it will start reading the data and the changes as they happen at the source, but will also trigger a “historical sync” for the table to fetch all the existing data, so that the data at the destination matches data at the source exactly.
Historical syncs are needed not just at the beginning of setting up a data connector, but also when trying to recover from errors. Fast and reliable historical syncs are necessary to the working of a connector, but they don’t get the attention they deserve. We’ve previously demonstrated how Datacoral can radically speed up historical syncs using read replicas for Data Capture connectors. This article will demonstrate how Datacoral uses Partial Historical Syncs for even simpler and faster recovery from bad situations.
In this article, we will first discuss different circumstances where a historical sync is needed and then go a step further by showing how Datacoral’s Partial Historical Syncs make it even simpler and faster to recover from bad situations and keep your data pipelines healthy. Not only do we show you how, but we show you the raw performance numbers to prove it!! 😎
When are historical syncs needed?
From our earlier post:
A historical sync for a data connector refers to the one-time replication of the complete data from a source to a destination. Historical syncs can be performed for an entire source (including all the tables/objects in that source) or an individual table/object at the source.
The most common need for a historical sync arises when a connector is first being set up. At this point, data for all relevant tables needs to be made available in the destination warehouse. However, this is not where the need for historical syncs ends. Data connectors are incredibly complicated pieces of software, and sometimes, software breaks (gasp!).
Historical syncs help recover from bad situations
Here are some examples where historical syncs might need to be performed to resolve “bad” situations.
- Data issues: The data type of a column could change, or a new column might be added with a default value or manual corruption of data in the destination
- Data volume issues: An increase in data volume or spiky traffic might result in the source system being overwhelmed or unavailable. Similarly, it could lead to the data connector not scaling.
- Data source issues: Software bugs, networking, database reboots, and incorrect parameter updates can all affect data replication.
- Data warehouse issues: The warehouse could be overloaded by too many (or too heavy) queries, making data loads unsuccessful
- Data connector issues: Aside from networking and scaling issues, there could be legitimate bugs in the connector code.
In many of these situations, a common solution can be to simply run a historical sync. But what if the table has millions or even billions of records? The historical sync operation could take days to weeks, and waiting that long can be infeasible (especially when working with critical data tables). There is a better way.
Partial Historical Syncs: Making Data Recovery Faster
The key realization is that most “data outages” (for which a full historical sync is prescribed) are limited to a specific time period or a specific subset of data. This means that the full dataset doesn’t need to be synced, and focusing on the set of rows that were impacted is sufficient. At Datacoral, we call this feature a “Partial Historical Sync”. Since our documentation on Partial Historical Syncs covers this concept, we’ll let you read the details there, but in this article, we will show the performance improvements one gets to see by using a Partial Historical Sync over a Full Historical Sync.
First, we set up a PostgreSQL database and pre-populate it with data for a few tables that represent an e-commerce business. There are tables for `users`, `products`, and `transactions`. We’ll focus on a specific table called the `transactions` table, where each record is a purchase of a product by some user. This is what the `transactions` data looks like:
This table contains records with a column called `last_updated` which simulates when that record was last updated. The data in the transactions table contains records with a `last_updated` value for a 30-day period.
Next, we add Datacoral’s PostgreSQL-CDC connector that is replicating data into a Snowflake warehouse – check out this video to see what the experience is like. Within a few minutes, the connector is added successfully and data matches between the source PostgreSQL database and the destination warehouse. Now, we simulate the following scenario.
Let’s say that because of a manual error in the Snowflake warehouse, two days’’ worth of data gets deleted from the destination `transactions` table. There are two ways we could attempt to recover this data: first, by running a Full Historical Sync for the `transactions` table, and second, by running a Partial Historical Sync for the affected two-day period. In both cases, we note down the time taken to read all the relevant data from PostgreSQL into S3 and then load the data into Snowflake. We also measure the end-to-end time taken from the moment that the historical sync started to the time it was complete.
Finally, we repeat this experiment for three different table sizes: 1000, 1 million and 25 million records.
Full vs Partial Historical Syncs: Time Taken
|Sync Type||Table size||Records Fetched||Extraction Time||Load time||End-to-end time|
|PARTIAL||25M||4.36M||14m 30s||1m 5s||15m 41s|
|FULL||25M||25M||18m 47s||1m 53s||20m 43s|
Fig: The time taken to extract 25M records from the transactions table and write it into S3
The results from our experiments can be seen above, and there are a few clear findings.
- For very small tables, there is no appreciable benefit of using Partial Historical Syncs. This makes sense since a single SQL query can easily retrieve many thousands of records, and here the historical process is bottlenecked on loading to the data warehouse (which only takes a few seconds).
- For larger tables, we start to see meaningful speedups. For example, for the table with ~1M records, we start to see a speed-up for Partial Historical Syncs (which at the end of the day are only retrieving about 20% of the entire table). We see a 27% faster data extraction into S3 and a 53% faster load into Snowflake and an overall speedup of 30% for the historical sync.
- For larger tables (~25M records) and with a Partial Historical Sync for about 20% of the records, we see a 23% speedup for the data extraction into S3 and a 42% speedup in the load into the warehouse. Overall, we get a 24% speedup by running a Partial Historical sync.
Conclusion: Partial Historical Syncs Save Time!
We saw that a significant speedup can be achieved when a Partial Historical Sync is run, compared to a Full Historical Sync. While there are many other variables that could affect some of these timings in the experiment (% of table synced, presence of indices on the timestamp column, instance type in AWS RDS), we believe that this is an important feature for data connectors that allows data teams to recover from data pipeline breakages.
If we extrapolate these time savings to tables that contain billions of records (as many of our customers do), the time savings achieved will be in the 8-12 hour range. For such teams, if they are doing daily reporting of their critical numbers, those 8-12 hours make a world of a difference.
If you’re interested in replicating data from your databases into a warehouse such as Snowflake or Redshift, try us out for a free 30-day trial here. You’ll have lightning-fast historical syncs, automatic schema change handling and so much more, as you can see in our demo video here. And, of course, drop us a note at [email protected] if you have any questions!