Data analysts use Datacoral connectors to replicate data from many kinds of data sources (databases, SaaS APIs, file systems, event streams, etc.) into the data warehouse of their choice (Redshift, Snowflake, or Athena). This allows them to combine, join, and transform these different kinds of data to find meaningful insights. However, when connectors are syncing data from so many different data sources, how can analysts tell whether the data is being copied over correctly? In this post, we describe how one might systematically determine the fidelity of the data being replicated into the warehouse instead of relying on ad hoc spot-checks.
The Challenge
When a connector is syncing data from a source to a destination, the only sure-fire way of checking that the data was synced correctly is by syncing that very same data to the same destination using a different connector and comparing the two. In addition, when the data in the source is changing, which is always the case in a live system, checking once is not enough; the check has to be performed continually. When the amount of data being synced is in the millions or billions of rows every day, it can become very expensive to perform two separate syncs! Finally, data integration typically means that you are changing the shape of the data, either nominally (like `int32` to `int64`) or drastically (like `json` to `varchar`), depending on the exact type mapping available in the destination data warehouse.
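To make the type-mapping problem concrete, here is a small illustrative sketch (the values and field names are hypothetical): the same logical record, once its integer is widened and its `json` payload is flattened to a `varchar`, no longer compares byte-for-byte against the source even though nothing was lost.

```python
import json

# Hypothetical source row: a 32-bit id and a json payload.
source_row = {"id": 42, "payload": {"plan": "pro", "seats": 5}}

# After replication, the warehouse stores the id as a 64-bit integer
# and the payload as a varchar (one of many possible serializations).
destination_row = {"id": 42, "payload": json.dumps({"seats": 5, "plan": "pro"})}

# The values are logically equivalent...
assert destination_row["id"] == source_row["id"]
assert json.loads(destination_row["payload"]) == source_row["payload"]

# ...but a naive byte-for-byte comparison of the serialized rows fails,
# because key order (and type width) changed during the sync.
print(str(source_row) == str(destination_row))  # False
```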
Prior Work
Thankfully, checking whether data was transferred with high fidelity is a common problem in other domains, such as storage and networking, where error detection is a core requirement. The simplest way to check for fidelity is to compute a checksum of the data being synced (transmitted) at the source and compare it against the same checksum computed at the destination. These checksums should be fast to compute, quick to compare, and produce very few false matches or false mismatches. A lot of work has been done on using checksums for both error detection and error correction; we will not go into that topic in this post. Check out the Wikipedia article about checksums as a starting point for researching this topic!
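As a minimal sketch of the idea (not how any particular connector implements it), a checksum computed over the bytes read at the source can be compared against one computed over the bytes that arrived at the destination:

```python
import hashlib

def checksum(data: bytes) -> str:
    """Return a hex digest that is cheap to compute and compare."""
    return hashlib.sha256(data).hexdigest()

# Bytes read at the source and bytes written at the destination.
source_bytes = b"2023-01-15,acct-17,129.99\n2023-01-15,acct-42,54.00\n"
destination_bytes = b"2023-01-15,acct-17,129.99\n2023-01-15,acct-42,54.00\n"

# A mismatch signals that something was corrupted or dropped in transit.
if checksum(source_bytes) != checksum(destination_bytes):
    raise ValueError("checksum mismatch: data was not synced faithfully")
```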
This same principle of using checksums for error detection can be applied to data integrations as well. In fact, there is a less well-known technique that takes a more liberal view of what the checksums themselves are: instead of byte-level checksums, it relies on synopses or summaries of the data being synced. These synopses are computed at the source, and the same connector that syncs the raw data also fetches the source synopses to the destination. Once the data is actually synced, the same synopses or summaries are computed at the destination. These destination synopses are then compared against the source synopses to verify that the synced data is accurate. This technique is called Source System Loopback Verification. From TDWI: “In this technique, you perform aggregate-based verifications of your subject areas and ensure it matches the originating data source. For example, if you are pulling information from a billing system, you can take total billing for a single day and ensure totals match on the data warehouse as well. Although this seems like an easy validation to perform, enterprises don’t often use it. The technique can be one of the best ways to ensure completeness of data.”
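To illustrate the loopback idea with the billing example from the quote above, here is a sketch that computes the same daily aggregates on both sides and compares them (the rows, columns, and totals are hypothetical):

```python
# Loopback-style check for one day of billing, sketched with in-memory rows
# (the table, columns, and values are hypothetical).
source_rows = [
    {"billing_date": "2023-01-15", "amount": 129.99},
    {"billing_date": "2023-01-15", "amount": 54.00},
]
destination_rows = [
    {"billing_date": "2023-01-15", "amount": 129.99},
    {"billing_date": "2023-01-15", "amount": 54.00},
]

def synopsis(rows):
    """Aggregate-based summary: row count and total billed for the day."""
    return {
        "row_count": len(rows),
        "total_billed": round(sum(r["amount"] for r in rows), 2),
    }

# The destination synopsis should match the synopsis computed at the source.
assert synopsis(source_rows) == synopsis(destination_rows)
```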
Datacoral’s Approach
We agree that Source System Loopback Verification is the most natural way to check for fidelity. But we also believe that analysts should have full flexibility in determining exactly which synopses are used for the verification process. Datacoral connectors have built-in verification that is available out-of-the-box. Our connectors not only fetch the raw data but also, when the source supports it, generate synopses at the source and fetch those as well. Once the raw data is fetched, our orchestrated transformations automatically generate destination synopses from the synced raw data. Once the source synopses are synced and the destination synopses computed, an automated check runs to compare the two. The whole orchestration is automated using Datacoral data events.
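A rough sketch of that verification step, using hypothetical function names rather than Datacoral’s actual interfaces, might look like this:

```python
from typing import Any, Callable

# Hypothetical orchestration sketch; these are not Datacoral's actual interfaces.
# The callables are injected so that the verification flow itself stays generic.
def verify_batch(
    fetch_source_synopses: Callable[[str], Any],
    compute_destination_synopses: Callable[[str], Any],
    batch_id: str,
) -> bool:
    """Run the fidelity check once a batch of raw data has landed in the warehouse."""
    source = fetch_source_synopses(batch_id)               # synced alongside the raw data
    destination = compute_destination_synopses(batch_id)   # derived by transformations
    return source == destination

# Example wiring with stubbed synopses, just to show the flow.
print(verify_batch(lambda b: {"rows": 100}, lambda b: {"rows": 100}, "2023-01-15"))  # True
```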
Caveats
The source system should support computing synopses in addition to supporting fetching the raw data. Databases are prime examples where one can compute synopses easily at the source using SQL queries. Other data sources are available only via APIs; in those cases, sources that offer summary/reporting APIs in addition to data export APIs are amenable to this technique.
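For a database source, a synopsis can be expressed as a plain aggregate query. The sketch below assumes a pyformat DB-API driver (such as psycopg2) and a hypothetical orders table; the same query, modulo dialect differences, would later be run against the warehouse to produce the destination synopsis:

```python
# Hypothetical synopsis query for a database source; the identical query is
# later run against the warehouse and the two result rows are compared.
SYNOPSIS_SQL = """
    SELECT COUNT(*)                 AS row_count,
           COUNT(DISTINCT order_id) AS distinct_orders,
           SUM(amount)              AS total_amount,
           MIN(created_at)          AS first_created,
           MAX(created_at)          AS last_created
    FROM orders
    WHERE created_at >= %(start)s AND created_at < %(end)s
"""

def compute_synopsis(connection, start, end):
    """Run the aggregate query over one time window and return the synopsis row."""
    with connection.cursor() as cur:
        cur.execute(SYNOPSIS_SQL, {"start": start, "end": end})
        return cur.fetchone()
```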
Confounding factors
There are several factors that affect the validity and accuracy of the checks, mostly driven by when the checks are run. The checks themselves are sensitive to the availability of up-to-date data from both the source and the destination.
- Data is live – the time when the data is synced is most likely not exactly the time when the synopses are computed. When the source system is a “live system”, the data keeps getting modified at the source, so it is nearly impossible to get an exact match between the source and destination synopses at any given point.
- There could be delays in replicating data from the source – there are several reasons why data being synced from the source can be delayed, so data for a given time period may only arrive at a later time. If the synopses are computed before all the data for that time period has arrived, even a successful check is not meaningful.
- There is a translation between the source and destination schemas – the types at the source system can be different from the types at the destination. Connectors provide a mapping between them, which sometimes causes some loss of information or a translation to values that are harder to compare.
Potential solutions for robust data quality checks
As mentioned earlier, there are several issues to consider to ensure the quality and fidelity of data syncs whenever data integrations are involved. In this section, we briefly describe how one can address each of the confounding factors.
- When data is live, depending on how the source data changes, one can perform approximate checks with custom error intervals to account for the small delta between the source and destination synopses (see the sketch after this list).
- When the data from the source can be delayed, the source synopses have to be computed only at or after the point when all the data for the time period is available for syncing. Likewise, the destination synopses have to be computed only after all of the raw data from the source has been synced.
- When the source and destination schemas are not exactly the same, the synopses have to be computed on derived columns at the source so that the exact same synopses can be computed at the destination.
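The sketch below illustrates the first and third of these solutions with hypothetical thresholds and values: an approximate comparison with a custom error interval, and a synopsis computed over a derived (normalized) column so that both sides agree despite the schema translation.

```python
import math

# Illustrative sketch; thresholds and values are not Datacoral defaults.

# 1. Live data: compare synopses with a custom error interval instead of
#    requiring exact equality.
def approx_equal(source_value, destination_value, rel_tol=0.001):
    return math.isclose(source_value, destination_value, rel_tol=rel_tol)

print(approx_equal(1_000_000, 1_000_050))  # True: the values are within 0.1%

# 2. Schema translation: compute the synopsis over a derived column so both
#    sides agree despite type mapping, e.g. round amounts to cents before
#    summing, at the source and in the warehouse alike.
def normalized_total(amounts):
    return round(sum(round(float(a), 2) for a in amounts), 2)

print(normalized_total(["129.99", 54, 10.5]))  # 194.49, whether amounts arrive as strings, ints, or floats
```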
For each of these potential solutions, Datacoral offers the building blocks that give analysts the flexibility they need to customize their data quality checks. In subsequent posts, we will go over each of the above confounding factors and show how Datacoral has solved for them, so that Datacoral customers don’t have to worry about the fidelity of the data being synced by the connectors.
Subscribe to this blog to get notified when new posts come out. Write to us at hello@datacoral.co or sign up for a free trial to see Datacoral in action in your production environment.