Imagine that you’re the analytics team lead at an e-commerce company. Your company has seen rapid growth in the last few months, with thousands of new users who are highly engaged on your website. Your leadership team is pushing for the analytics team to dig into user activity data that has been carefully collected inside your production database to figure out how to make your website’s experience even better. Excitedly, one evening you spin up a data warehouse and set up data connectors to replicate data from your database into your data warehouse.
The next morning, you wake up to an angry email from your VP of Engineering. The email states that the replication of data into the warehouse, and in particular concurrent historical syncs of tens of tables, put too much load on the production database, such that the database was unable to accept writes. This meant that new users could not sign up, and existing users could not buy products on the website for hours. Having turned off the connectors, he ends the email with: “Data analytics cannot impede our day-to-day operations!”
Having learned a painful lesson, you decide to slow down the historical sync for your production database by running it one table at a time. But instead of the production database being impacted by the historical syncs or ongoing replication, by slowing down your historical sync, your team is now a full month delayed in accessing the historical data within your data warehouse. Consequently, your team is now delayed by an entire month in running any meaningful analytics.
Is there a better way? Does there always have to be a tradeoff between the speed of historical syncs and keeping your production database operational? In this post, we first define what historical sync is, when they are required and how Datacoral addresses the tradeoff mentioned earlier. While historical syncs are common across almost all data sources, we will focus on connectors for database sources in this post.
Before we go further, consider reading our Data Connectors 101 article to learn more about data connectors.
A historical sync for a data connector refers to the one-time replication of the complete data from a source to a destination. Historical syncs can be performed for an entire source (including all the tables/objects in that source) or an individual table/object at the source. There are a few different situations where historical syncs are done, although the first case is the most common.
Let’s imagine adding a Change Data Capture (CDC) connector for the first time. Let’s say your source is a PostgreSQL database with a hundred tables, some large (hundreds of millions of rows) and some small (thousands of rows), and your destination is a Snowflake data warehouse. The first step to adding a connector is to add the connection configuration (database endpoint, port, user name, password); at this point, the connector can inspect what tables exist in PostgreSQL. The connector will then create tables in the destination with the appropriate schema in Snowflake. Now what?
For each table, the connector now needs to replicate all the data (from the beginning of history) from source to destination – this is the historical sync for each table. As we discussed in our Data Connectors 101 article, there are two modes of extracting data from sources – snapshot and incremental. With snapshot mode, we are fetching all the data for a table in every run, which means that effectively a historical sync is done every run (making this feasible only for small tables). In incremental mode, however, only the data that has changed since the last run is fetched in the next run. This makes the first historical sync critical in ensuring that analytics teams have complete data available in their warehouse for analysis.
Historical syncs are also helpful in recovering from bad situations. Data connectors are complicated pieces of software, often deployed as part of a complex data infrastructure. There are many reasons why things can go wrong. For example, there can be network outages, data corruption at the source or destination, manual errors, or software bugs in the connector itself. These might affect individual tables or all tables – in either case, historical syncs are the answer. In certain instances, historical syncs can be automatically triggered – for example, we talked about how Datacoral’s connector fully-managed replication slots in PostgreSQL.
Finally, there are cases where certain schema changes require historical syncs. An example is when a new column is added to a table with a default value. In this case, all the rows of the table are impacted, and the table needs to be completely refreshed at the destination. In this case, a historical sync needs to be run for this table.
Historical syncs are common in the data integration process. They need to be fast without any impact on production systems so that full and correct data is available for analytics without any impact on operations teams.
Challenges with Historical Syncs
In the previous section, we described what a historical sync is, and when they are needed. There are many challenges (which we have covered in previous posts) historical syncs present for database connectors, as we saw in the introduction. In particular, the tradeoff between speed and stability.
First, historical syncs can be very slow, especially when dealing with many tables with large amounts of data, as described in the scenario from the previous sections. This is because the SQL queries used to fetch historical data are costly (`select * from large_table`) and can take some time (days to weeks). While the analytics team is waiting for historical syncs to complete, they are also waiting to analyze the data and generate insights that improve their business. It might be tempting to think, “Why not run historical syncs for multiple tables in parallel and save time?” However, here we run into the next challenge.
Multiple costly SQL queries running on a production database can significantly slow down other operational workloads. No one wants the Operations team to be paged at 2 AM by an alert that says that their APIs are running 50% slower, causing a significant drop in user engagement, only to then discover that this middle-of-the-night emergency was the result of historical syncs triggered by the analytics team. Analytics use cases should never bring down (or hamper) production systems!
How does Datacoral address these challenges
Datacoral has designed its connectors and platform to solve several common problems with large and time-consuming historical syncs. Whereas some of our competitors run historical syncs against the primary database, we run ours against read replicas whenever possible. Read replicas are read-only copies of the primary database that are specifically designed to allow for high-volume read traffic without performance degradation on the primary database. This means that the primary database is largely unaffected by historical syncs and our syncs take less time than the competition.
Datacoral also intelligently manages historical syncs of large tables. From our earlier example, you don’t want to trigger historical syncs for all 1000 tables all at once, and you don’t want to do it one at a time. Datacoral handles this behind the scenes with a managed queue triggered for a subset of tables at a time. The moment one of the table syncs is complete, the table switches from historical sync mode into a live sync mode using Change Data Capture (CDC). Then we automatically kick off the next historical sync for the following table in the queue.
We also control the concurrency of historical syncs, so the production database is never overloaded. Our serverless architecture is inherently scalable, so the concurrency can be increased and decreased as needed, depending on your team’s needs. Finally, with observability built-in to our metadata-first platform, you have a complete picture of your data pipelines. We have built our product to produce high throughput historical syncs for your database connectors without impacting production workloads.
You can learn more details about our historical syncs here.
When setting up connectors to replicate data from different sources into destinations such as Redshift or Snowflake, the historical sync of data can be a surprising stumbling block for teams. Teams are stuck with a false choice of slow historical syncs, or concurrent historical syncs against the primary database for large tables, which puts their database at risk. Datacoral’s database connectors solve this problem by running fast historical syncs against read-replicas while managing concurrency and switching to change data capture mode intelligently without requiring any intervention for analytics teams.