When running data analytics and reporting, a data team needs to be able to answer a few key questions:
- How do we know we can believe the data?
- How do we know we can believe the report?
- Are we using the latest data from production?
Whether business leaders explicitly ask these questions or not, analysts should know the answers, because confidence in their reporting depends on vouching for data quality and fidelity. Answering these questions is tricky because most companies have dozens (or hundreds) of data sources, each containing pieces of mission-critical data. Customer relationships in Salesforce, product usage data in one (or more) local databases, ads running on Facebook and Twitter, website traffic data in Google Analytics, infrastructure logs in Datadog … it’s a lot to keep track of.
Reporting hypotheses can rarely be proven with data from just one or two sources, which means data quality and fidelity are top-tier concerns for analysts building reports and running analytics on behalf of business leaders. But if you’re not a data scientist, how do you know your data warehouse meets data quality and data fidelity requirements?
Let’s talk about what these terms mean, how they impact the quality of an analyst’s work, and how data problems can be solved in complex data integrations.
What is data quality?
Data quality measures how closely the data being analyzed matches the real-world phenomena the data is supposed to represent. Several metrics can measure data quality, including:
- Data fidelity: How identical is the data between the source and destination?
- Data type matching: Does the data type in the source match the data type in the destination warehouse or lake?
- Data relevance and availability: Is all the relevant data available for problem solving?
- Data consistency: Does all the data from the source align and make sense?
These metrics help analysts understand whether the data they’re working with is relevant to the problem they’re trying to understand and solve. The same data may be good quality for one type of analysis and bad for another. For example, if you need to understand the average age of all users on a platform and you have `birth_date` and `address` fields, `birth_date` is the right data for the analysis, even if the `address` data is corrupt.
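To make the example concrete, here is a minimal sketch of computing that average age. The records and field names are hypothetical; the point is that the corrupt `address` values never enter the calculation, so the data is good quality for this analysis.

```python
from datetime import date

# Hypothetical user records: `address` is corrupt, but `birth_date` is clean.
users = [
    {"birth_date": date(1990, 5, 1), "address": "###corrupt###"},
    {"birth_date": date(1985, 11, 20), "address": None},
    {"birth_date": date(2000, 2, 14), "address": "\x00garbled"},
]

def average_age(records, today=date(2024, 1, 1)):
    """Average age in whole years; the unrelated `address` field is ignored."""
    ages = [
        today.year - r["birth_date"].year
        # Subtract 1 if the birthday hasn't happened yet this year.
        - ((today.month, today.day) < (r["birth_date"].month, r["birth_date"].day))
        for r in records
    ]
    return sum(ages) / len(ages)
```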
Data quality is much more nuanced than having pristine data that’s accurate across the board.
What does data quality mean for data integration?
Data integration is the process of replicating data from different source systems into a central data store such as a data warehouse or lake. For data integrations, data quality is directly related to the fidelity of the data copied from the source into the destination.
Replicating data into a destination requires transformations such as schema mapping. These transformations can be lossy, and data that’s representable in the source may not be representable in the destination. All of these circumstances can reduce the fidelity of data in an integration.
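As a sketch, one can think of schema mapping as a translation table that records which source-to-destination type mappings are exact and which are lossy. The mappings below are purely illustrative, not any specific tool’s behavior:

```python
# Illustrative source-to-warehouse type mapping (not a real tool's table).
# Each entry records the destination type and whether the translation is lossy.
TYPE_MAP = {
    "INT": ("INTEGER", False),             # exact: no loss
    "DECIMAL(38,10)": ("FLOAT64", True),   # lossy: precision may be dropped
    "LONGTEXT": ("VARCHAR(65535)", True),  # lossy: values beyond 64KB truncated
}

def map_type(source_type):
    """Return (destination_type, is_lossy) for a source column type."""
    dest_type, lossy = TYPE_MAP[source_type]
    return dest_type, lossy
```

Tracking lossiness per column is what lets a tool report which parts of an integration can be verified exactly and which only up to truncation or rounding.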
When measuring data quality, different analyses can have different criteria for what good quality data is.
What is a data quality guarantee?
A data quality guarantee is a metric or threshold provided by a data integration tool. As the tool moves data, it reports information about data freshness, copy fidelity, and invalid values. These metrics help analysts understand the quality of the data.
A data integration tool can also produce an alert or notification when all data for a time period has been reliably copied. This is particularly useful when building a daily financial report, which should only run when all transactions for a given day have been captured. The data integration tool should tell us that all the relevant data for a given time period is available. This feature is known as a high water mark or “close of books.”
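A minimal sketch of gating a report on that signal, assuming the integration tool exposes its latest guarantee as a high water mark timestamp (the function and variable names here are illustrative, not a real API):

```python
from datetime import datetime, timezone

def books_closed(high_water_mark: datetime, day_end: datetime) -> bool:
    """True when every record up to `day_end` is known to be in the warehouse."""
    return high_water_mark >= day_end

# Hypothetical values: the tool guarantees completeness through 00:15 UTC on
# Jan 2, so the books are closed for Jan 1 and the report can safely run.
hwm = datetime(2024, 1, 2, 0, 15, tzinfo=timezone.utc)
day_end = datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc)  # end of Jan 1 (UTC)
if books_closed(hwm, day_end):
    print("safe to run the daily financial report")
```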
What is data fidelity?
Another data quality guarantee relates to the data’s fidelity. The integration tool can compute checksums of the data at both the source and the destination. Identical checksums prove that the destination data is a high fidelity copy of the source.
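A simple way to sketch this comparison: compute an order-insensitive checksum over a table’s rows on each side and compare the digests. Real tools do this incrementally and at scale, but the principle is the same:

```python
import hashlib

def table_checksum(rows):
    """Order-insensitive checksum over a table's rows."""
    h = hashlib.sha256()
    # Sort the serialized rows so row order in the table doesn't matter.
    for row in sorted(repr(r) for r in rows):
        h.update(row.encode())
    return h.hexdigest()

source = [(1, "alice"), (2, "bob")]
destination = [(2, "bob"), (1, "alice")]  # same rows, different order
# Identical checksums show the destination is a faithful copy of the source.
assert table_checksum(source) == table_checksum(destination)
```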
There are a few data fidelity constraints and guarantees. Unless the source and destination systems are the same (e.g., MySQL as both the source and the destination), the data types on the two ends will very likely differ, so a data translation will be needed, and it might be lossy.
A nested structure (like MongoDB’s documents) can be normalized into a relational schema (like MySQL’s tables) without the translation necessarily being lossy. In that case, one can technically prove that the data from the source is exactly the same as what’s in the destination. Whether the translation preserves the data exactly depends on what the source and destination systems allow. For example, an integer in one system is likely represented identically in another.
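A small sketch of such a lossless normalization, using a made-up document shape: a nested document is split into a parent row and child rows, and the round trip reconstructs the original exactly.

```python
# Hypothetical MongoDB-style document with a nested array of orders.
doc = {"_id": 1, "name": "alice", "orders": [{"sku": "A1"}, {"sku": "B2"}]}

# Normalize into two relational tables: users and orders (child rows keep a
# foreign key back to the parent).
users_row = {"_id": doc["_id"], "name": doc["name"]}
orders_rows = [{"user_id": doc["_id"], "sku": o["sku"]} for o in doc["orders"]]

# Reconstruct the document from the relational rows.
rebuilt = {**users_row, "orders": [{"sku": r["sku"]} for r in orders_rows]}
assert rebuilt == doc  # the round trip proves no information was lost
```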
Sometimes loss is unavoidable: a source database column can be as large as a gigabyte, but a warehouse column may be capped at 4MB, or even 64KB. In those cases, data is lost because it must be truncated on the way from source to destination. Even when columns are truncated, we can prove whether the surviving data has changed by computing checksums on both sides. If the first 64KB of a string in the source has the same checksum as the string in the destination, we have a high fidelity copy up to the truncation point.
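A sketch of that prefix comparison, assuming a hypothetical 64KB destination limit: checksum only the prefix that survives truncation, on both sides.

```python
import hashlib

TRUNCATE_BYTES = 64 * 1024  # hypothetical destination column limit (64KB)

def truncated_checksum(value: str, limit: int = TRUNCATE_BYTES) -> str:
    """Checksum only the prefix that survives truncation in the destination."""
    return hashlib.sha256(value.encode()[:limit]).hexdigest()

big_source_value = "x" * (2 * TRUNCATE_BYTES)   # too large for the destination
dest_value = big_source_value[:TRUNCATE_BYTES]  # what actually gets stored
# Matching prefix checksums prove the copy is faithful up to the truncation.
assert truncated_checksum(big_source_value) == truncated_checksum(dest_value)
```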
What are data quality challenges in data integrations?
In data integrations, connectors constantly move data from one system to another, and we don’t always know that the data has been reliably copied. To prove it, we need a way to compare the data in both places. It’s a circular-seeming problem, but it can be solved using checksums.
Checksums are a good way of measuring data quality, or validity, in a data integration, especially on live systems, where the source and destination data are continuously changing. Though checksums and comparisons are important, the key question is, “When do we compute the checksum?”
If we compute at a point in time, we have to ensure that the checksum in the destination is computed against the same version of the data as in the source. Knowing when to compute the checksums is the hard problem, which is why data quality for data integrations on live data is really tricky.
Thankfully, there are ways to address these issues.
Checksums and comparisons
We only need to compare checksums for data whose versions we know in both the source and destination. This is a situation where Change Data Capture (CDC) excels, and why it is such a robust way of building and designing data integrations. If we can keep track of the versions of the data in the source and destination, we can compute checksums on those known versions.
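A minimal sketch of the idea, with made-up CDC log positions standing in for versions: the source records a checksum per log position, and the destination is compared at whatever position it has actually applied, rather than at “now.”

```python
import hashlib

def checksum_at_version(rows):
    """Order-insensitive checksum of a table snapshot at one version."""
    h = hashlib.sha256()
    for row in sorted(repr(r) for r in rows):
        h.update(row.encode())
    return h.hexdigest()

# Hypothetical: source-side checksums keyed by CDC log position.
source_versions = {
    1042: checksum_at_version([(1, "alice")]),
    1043: checksum_at_version([(1, "alice"), (2, "bob")]),
}

# The destination reports it has applied the change log through position 1042,
# so we compare checksums at 1042 — not at "now", which may already differ.
dest_position, dest_rows = 1042, [(1, "alice")]
assert source_versions[dest_position] == checksum_at_version(dest_rows)
```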
It’s impossible to run a checksum at exactly the same moment on both the source and the destination, so we need additional information about which version of the data each database held. That is why robust data quality guarantees are much harder to provide in a data integration.
Any data analysis is basically garbage in, garbage out. If we want our analysis to be meaningful, we need the data to be high quality and represent real world phenomena. A data stack that cannot give quality guarantees means analysts are flying blind.
An analyst’s entire worldview is determined by the data they’re seeing. If there is no validation of that data, any work they do will be meaningless because they can’t guarantee the data they’re looking at is an accurate representation of real-world occurrences.
For analysts to reliably present a picture of what is really happening in their products, they must be able to demonstrate that their analytics data is accurate and timely. This means data quality and data fidelity must be provably high, and the data transformations must not be lossy.
A highly functional data integration tool is the heart of a data team. Robust connectors, versioned data, fast checksum comparisons, and consistent high fidelity are some of the low-level problems a good data integration tool solves. Handling these with automation makes the life of a data analyst much easier.
If you need help building robust data pipelines with high observability and quality guarantees, you should take advantage of Datacoral’s free trial. It takes minutes to get started and connect to over 80 different data sources.