This blog was originally posted on Towards Data Science.
When building or rethinking their data stack, organizations often approach the task with a tool-first mindset. We recently looked at popular tools like Fivetran, dbt, Airflow, and Looker on Snowflake and noticed that many reference architectures and implementation plans are unnecessarily complex. It doesn’t have to be this way.
We recommended a simplified conceptual framework for thinking about building a data stack. We believe that making the metadata drive the stack provides clean separation of concerns and encourages simplicity. Plus, it helps data teams think strategically and use the right tools to fulfill their data strategy needs.
Our three-layer framework looks like this:
- Layer 1: Data Flow
- Layer 2: Metadata
- Layer 3: DevOps Tooling
Let’s take a deeper dive into why metadata is so critical to building a scalable and maintainable data stack; why metadata is an afterthought for most companies; and how to take a holistic platform approach to meeting data objectives.
Missing the Metadata
Companies across the globe are building data stacks and investing heavily in data teams, leading to a proliferation of tools for managing data processing, state management, and configuration. These tools are excellent for building pipelines that are ingesting and transforming data, but they end up creating metadata silos. These silos cause an impedance mismatch between different tools, resulting in a lack of cohesive intelligence. There is a need for a lot of glue code to integrate the tools by translating the metadata from one tool to another.
Why? For the most part, data stacks are built piecemeal based on the prevailing business requirements. It is hard to justify taking a platform approach to data architecture when you are a small company with a small amount of data. But, as the company grows, there is a proliferation of tools, leading to a fragmented view of the data stack and an exponential increase in complexity.
Business leaders want valuable insights while data teams want stable infrastructure that can support data pipelines processing millions of records. Business leaders want to make up-to-the-minute decisions based on fresh data while data teams are worried about the cascading impacts of schema changes to a transformation job. We can see how quickly the goals for a data stack can diverge.
There is a metadata-first approach to solving the problem that the broader industry is starting to see, but hasn’t been able to get a full handle on. In recent months, we’ve seen several metadata-focused products come to market, which is very encouraging. But, most of these products solve simpler, short-term problems like cataloging of data sources, dashboards, machine learning jobs, and spreadsheets, to support searching, auditing, and compliance. These metadata search tools are really useful, yet they do not solve for the platform approach that companies really need to have a coherent stack.
Today’s metadata tools centralize metadata from different systems and try to provide a single pane of glass view of what is happening across the stack. But, the metadata is not consistent across tools because each tool is built differently, which means more work has to be done to standardize the metadata. Playing whack-a-mole with metadata standardization gets old real fast. Data practitioners often find that the legacy metadata systems are not always up to date, so the tools themselves go stale along with the focus on metadata. Then, the predominant use case for metadata systems becomes auditing and compliance, which are only done when auditing and compliance needs actually arise.
If you only start thinking about a metadata system after building the rest of your data stack, you have, in some sense, already stepped off the path to a holistic data stack.
Current metadata solutions end up reinforcing the problem of a tools-first mentality. We believe data teams should think about metadata in a very different way — one that makes metadata part of the critical path of a data pipeline.
We think that metadata needs to drive the overall system. The metadata layer is key but is usually added as an afterthought to the actual data stack itself. Because metadata is usually not part of the critical path for designing data pipelines, long-term predictability within the data stack is lacking.
Data engineering as we know it today consists largely of complex integrations and custom code driven by a variety of tools. The upside, as we’ve just discussed, is the availability and commoditization of data connectors and transformation tools. They’ve made it easier than ever to unify raw datasets and compute KPIs/metrics. The downside is the lack of integration and end-to-end intelligence. More importantly, these downsides increase the difficulty of serving meaningful insights to applications, AI/ML models, and business leaders.
The data stack ends up providing data but with no guarantees on the quality of the data.
Where tools are fragmented and metadata is siloed, teams end up with fractured expertise. A handful of people become the Fivetran experts, some become the dbt wizards, and others become the Airflow gurus. Soon enough, the data stack begets a “one tool leads to another” mindset. Without native integrations between the tools, there are no unifying end-to-end views showing the health of the data pipeline, nor is there a holistic understanding of schema dependencies. And what happens when there’s a data issue? As we explored in our data framework post, the debugging flow in this case is a pretty complicated one: the analyst, analytics engineer, and data engineers each need to locate the right tool before being able to troubleshoot errors.
These are critical, everyday problems that add friction to deriving value out of a data stack. Far deeper problems can arise without a long-term, strategic view of the data stack as a platform.
High-value platforms don’t get built by accident — they require some forward-thinking and willingness to invest in architecture upfront. If the start is good, the platform will allow for organic growth within meaningful constraints. And our claim is that investing in metadata first is essential in building a sustainable data platform. Knowing and planning the metadata makes orchestration efficient and scalable. DevOps authoring tool development becomes much easier. Data and code collisions are preventatively addressed. Transformation changes and orchestration logic are kept in sync.
How do we develop a metadata mindset? We begin with data pipelines.
Data Pipelines Bring Life to a Data Stack
Data pipelines don’t get a lot of love (despite efforts to the contrary), yet they bring life to the data stack. On the surface, they seem simple enough that even non-technical folks can understand what they do. Data moves from a source to a data warehouse before being transformed to generate insights, and the pipeline’s utility is immediately known.
If a pipeline brings life to the stack, the pipeline metadata is the pulse of the stack. We see it as the lifeblood that needs to be managed and kept “clean” all the time. And the only way to keep it clean is to put the metadata layer on the critical path for a data stack’s success. We must therefore position metadata as the pipeline’s main driver; otherwise, it becomes just another source of information.
By tracking and understanding the metadata, we keep data clean. Cleanliness allows systems to run without issue and provides self-documentation and observability. Other tools can consume clean metadata with greater ease. It is very difficult, if not impossible, to understand data quality and freshness without clean metadata.
Data pipelines today are built using ETL systems, which are typically workflow managers. These systems can run jobs that are dependent on each other and allow an engineer to hand-code such jobs and dependencies into a pipeline.
In other words, data pipelines consist of jobs that move and transform data. The output data of one job is used as the input data of another job, making these jobs interdependent. Data pipelines essentially become graphs, more specifically, DAGs (Directed Acyclic Graphs), where the nodes are jobs and the edges represent dependencies between the jobs. When a data pipeline is executed, we must ensure that the jobs run in the correct order of dependency. This dependency-based execution is the job of the orchestration system.
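To make the ordering requirement concrete, here is a minimal sketch in Python using the standard library's `graphlib` module (Python 3.9+). The job names are hypothetical and not taken from any particular tool; the point is only that a valid run order can be computed from the dependency graph rather than written by hand.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each job maps to the set of jobs it depends on.
job_dependencies = {
    "ingest_orders": set(),
    "ingest_customers": set(),
    "join_orders_customers": {"ingest_orders", "ingest_customers"},
    "publish_report": {"join_orders_customers"},
}


def execution_order(dependencies):
    """Return one valid run order for the jobs.

    TopologicalSorter raises graphlib.CycleError if the graph has a cycle,
    that is, if the pipeline is not actually a DAG.
    """
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(job_dependencies)
print(order)
```

The two ingest jobs have no dependency on each other, so their relative order is not fixed. The only guarantee is that every job appears after everything it depends on.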
Typically ETL tools are built around an orchestration system; Airflow is a great example. A data engineer writes pipeline code within a workflow manager that defines these jobs and their dependencies. The workflow manager interprets this pipeline code to execute jobs in the right order. It is up to the data engineer to make sure that all the data dependencies translate correctly into job dependencies for the workflow manager to interpret.
These workflow managers are provided configuration about the jobs and their dependencies and generate metadata about job executions. But for an analyst, data dependencies are more important than job dependencies. Metadata for a data pipeline consists of the following:
- Connector configurations: How data is retrieved from the source (incrementally or as snapshots), and whether changed rows are fetched periodically or change logs are read continuously.
- Batching configurations: How often to fetch data or refresh transformations.
- Lineage: How the different transformations rely on the data coming from connectors and other transformations, and how publishers depend on the underlying data.
- Pipeline runtime metadata: Logs of successful processing steps, failures, freshness check results, data quality metrics used as barriers for tasks in the pipelines, and a record of historical syncs and reprocessing actions.
- Schema changes: A history of how the schema of the input data and the corresponding transformations has changed over time.
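As a rough illustration, the five kinds of metadata listed above could be modeled as plain data structures. This is a sketch under our own assumptions: every class, field, and table name below is hypothetical and does not describe the schema of any specific product.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ConnectorConfig:
    source: str
    mode: str            # "incremental" or "snapshot"
    change_capture: str  # "periodic_fetch" or "change_log"


@dataclass
class BatchingConfig:
    interval_minutes: int  # how often to fetch data or refresh a transformation


@dataclass
class RunRecord:
    step: str
    succeeded: bool
    freshness_ok: bool


@dataclass
class SchemaChange:
    table: str
    description: str


@dataclass
class PipelineMetadata:
    connectors: Dict[str, ConnectorConfig] = field(default_factory=dict)
    batching: Dict[str, BatchingConfig] = field(default_factory=dict)
    # Lineage: each node maps to the set of nodes it reads from directly.
    lineage: Dict[str, Set[str]] = field(default_factory=dict)
    runs: List[RunRecord] = field(default_factory=list)
    schema_changes: List[SchemaChange] = field(default_factory=list)

    def upstream_of(self, node: str) -> Set[str]:
        """Return all direct and transitive upstream nodes of `node`."""
        seen: Set[str] = set()
        stack = list(self.lineage.get(node, set()))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.lineage.get(current, set()))
        return seen


metadata = PipelineMetadata(
    connectors={"raw.orders": ConnectorConfig("mysql", "incremental", "change_log")},
    batching={"raw.orders": BatchingConfig(interval_minutes=60)},
    lineage={
        "orders_by_customer": {"raw.orders", "raw.customers"},
        "revenue_dashboard": {"orders_by_customer"},
    },
)
print(sorted(metadata.upstream_of("revenue_dashboard")))
```

Once lineage lives in a structure like this, a question such as "what feeds this dashboard?" becomes a simple graph traversal instead of a search through job definitions.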
Typically, most of this information is hidden inside of the job definitions themselves or in the metadata managed by the ETL system or workflow manager, so this information is not cleanly exposed to other applications. There is typically an integration needed into a different system which then uses this metadata for other applications.
Challenges of Siloed Pipeline Metadata
In our simplified conceptual framework article, we reviewed an example data stack consisting of:
- Fivetran for ingest
- dbt for transform
- Looker for visualization
- Snowflake for storage
- Airflow for orchestration
In this example, Airflow is the workflow manager. Tasks that trigger dbt runs are scheduled with an implicit understanding of when Fivetran typically brings data in.
We’ll use a common scenario in which Fivetran extracts data from Salesforce and MySQL, and loads it into Snowflake daily. A dbt model joins data from these two sources to compute a daily report. Typically, Fivetran brings data in every day around 1am, so Airflow could schedule dbt to run at 3am to be safe. Most days, this works.
But one day there’s a lag in pulling data from MySQL because it was overloaded, so Fivetran is delayed and doesn’t bring in data until 3:05am. By then, Airflow had already triggered a dbt run, which means that the Salesforce data was up-to-date but MySQL data was not, so the model refresh results in incorrect data.
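The failure in this scenario comes down to comparing a fixed trigger time with the time each load actually lands. A minimal sketch, using the times from the example and a hypothetical calendar date:

```python
from datetime import datetime


def missing_at_trigger(trigger_time, arrivals):
    """Return the sources whose load landed only after the transform was triggered."""
    return sorted(
        source for source, arrived_at in arrivals.items() if arrived_at > trigger_time
    )


# The date is hypothetical; the clock times follow the scenario above.
trigger = datetime(2021, 3, 1, 3, 0)  # the dbt run is scheduled for 3am
arrivals = {
    "salesforce": datetime(2021, 3, 1, 1, 0),  # on time
    "mysql": datetime(2021, 3, 1, 3, 5),       # delayed until 3:05am
}

late_sources = missing_at_trigger(trigger, arrivals)
print(late_sources)  # ['mysql']
```

A purely time-based schedule never performs this comparison, which is why the stale join goes unnoticed.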
Now imagine that Fivetran ingests MySQL data incrementally every hour, but ingests Salesforce data once every day. Immediately you run into the problem of having to wait for all 24 hours of MySQL updates for the day and one daily update of Salesforce. This problem is solved by coding up a couple of sensors in Airflow that wait for 24 runs of MySQL ingest and one day of Salesforce ingest. When the sensors are satisfied, a task triggers dbt to run.
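The condition such sensors encode can be sketched in plain Python. This deliberately avoids any Airflow API and shows only the readiness logic; the load-log format and the names below are assumptions made for illustration.

```python
from datetime import date, datetime

# Hypothetical expectation: distinct completed loads each source needs per day.
REQUIRED_LOADS = {"mysql": 24, "salesforce": 1}


def ready_for_transform(load_log, day, required=REQUIRED_LOADS):
    """Return True once every source has enough distinct loads for `day`.

    `load_log` is a list of (source, data_interval_start) tuples, one per
    completed load. Distinct intervals are counted so retries are not
    counted twice.
    """
    seen = {source: set() for source in required}
    for source, interval_start in load_log:
        if source in seen and interval_start.date() == day:
            seen[source].add(interval_start)
    return all(len(seen[source]) >= count for source, count in required.items())


day = date(2021, 3, 1)  # hypothetical date
log_partial = [("mysql", datetime(2021, 3, 1, hour)) for hour in range(23)]
log_partial.append(("salesforce", datetime(2021, 3, 1)))
log_full = log_partial + [("mysql", datetime(2021, 3, 1, 23))]

print(ready_for_transform(log_partial, day))  # False: only 23 of 24 MySQL loads
print(ready_for_transform(log_full, day))     # True
```

Note that the counts 24 and 1 are hard-coded here, just as they would be in hand-written sensors. If the ingest frequency changes, this code has to change with it.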
Several challenges arise with this type of setup. We refer to this as a “split-brain” around the metadata:
- dbt constructs its own DAG for transformations, but that DAG does not come with the same level of operational sophistication as Airflow’s. dbt is meant to be a data modeling tool rather than a fully fledged workflow manager like Airflow, so it is hard to figure out how to do efficient reprocessing without recomputing the whole DAG.
- Airflow does not have any information about data dependencies as those are encapsulated within dbt. This means that when there is a change in a query in dbt, like when the query in dbt joins three tables instead of two, then the job dependencies are not updated automatically. So Airflow might trigger a dbt run when only two of the three tables are updated, causing missing information.
- Airflow does not have information about the data loaders from Fivetran. When there are delays in Fivetran loads, or if the data loaded from Fivetran is not complete, Airflow is unaware and it can just trigger a dbt run.
- To change a daily data fetch into an hourly data fetch, teams have to change code in Airflow to appropriately handle downstream job dependencies.
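One way to picture the split-brain is as drift between two views of the same dependencies: the tables the orchestrator waits on and the tables the query actually reads. The sketch below assumes both views are available as simple mappings; the model and table names are hypothetical, and extracting the second mapping from real SQL is out of scope here.

```python
def missing_dependencies(job_upstreams, data_upstreams):
    """For each model, list tables the query reads that the orchestrator does not wait on."""
    gaps = {}
    for model, tables_read in data_upstreams.items():
        waited_on = job_upstreams.get(model, set())
        missing = tables_read - waited_on
        if missing:
            gaps[model] = sorted(missing)
    return gaps


# What the hand-written orchestration code waits for (hypothetical names).
orchestrator_waits_on = {
    "daily_report": {"salesforce.accounts", "mysql.orders"},
}
# What the query reads after it was changed to join a third table.
query_reads = {
    "daily_report": {"salesforce.accounts", "mysql.orders", "mysql.customers"},
}

gaps = missing_dependencies(orchestrator_waits_on, query_reads)
print(gaps)  # {'daily_report': ['mysql.customers']}
```

When the two views live in different tools, nothing performs this comparison automatically, so the gap surfaces only as missing data in the report.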
None of these problems are unsolvable, but they require implementing a significant amount of code in Airflow in order to understand the data dependencies and add sensors for data quality, etc. But even after that code is set up in Airflow, the ongoing maintenance of data pipelines requires expensive engineering time. Every change to a query results in work to update the Airflow pipeline. And the fact that there are multiple systems involved means that debugging problems takes way too long.
Typical problems that occur in siloed pipeline metadata systems include:
- Metadata is siloed in different systems, causing the problem of a “split-brain.” Additionally, orchestration is inconsistent across different systems like Fivetran and dbt.
- Significant responsibility rests on an external workflow manager like Airflow to coordinate processes. This requires significant coding to set up correctly.
- Ongoing operations like schema changes, historical syncs, and reprocessing are difficult to coordinate across different systems.
- Authoring is increasingly difficult because it’s hard to foresee the implications of changes.
- Debugging problems in the stack is difficult because one has to jump between multiple systems.
A Blueprint for Metadata-First Design
We propose an architecture where we first define the metadata itself. Instead of worrying about how the different parts of the data pipeline functionality work (like how to ingest data from different sources, how to transform the data itself, and how the entire pipeline is orchestrated), we first model the metadata that can exist.
Once we have defined the metadata itself, we can then build the orchestration driven by this metadata. This pipeline metadata starts off being external to the ETL system. The ETL system operates using this metadata and has very little metadata of its own. Then, the rest of the use cases can rely on the metadata to perform their operations.
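To illustrate what "orchestration driven by metadata" could look like, the sketch below derives how many upstream loads a transformation must wait for directly from batching intervals and lineage, instead of hard-coding those counts. The function names, the mapping formats, and the table names are all assumptions made for this example.

```python
def loads_per_run(source_interval_minutes, target_interval_minutes):
    """Return how many source loads make up one run of the downstream node."""
    if source_interval_minutes <= 0 or target_interval_minutes % source_interval_minutes != 0:
        raise ValueError("target interval must be a positive multiple of the source interval")
    return target_interval_minutes // source_interval_minutes


def derive_requirements(batching, lineage):
    """For each transformation, compute the loads needed from each upstream node."""
    return {
        node: {
            upstream: loads_per_run(batching[upstream], batching[node])
            for upstream in sorted(upstreams)
        }
        for node, upstreams in lineage.items()
    }


# Batching intervals in minutes and lineage, both read from the metadata layer.
batching = {"mysql.orders": 60, "salesforce.accounts": 1440, "daily_report": 1440}
lineage = {"daily_report": {"mysql.orders", "salesforce.accounts"}}

requirements = derive_requirements(batching, lineage)
print(requirements)  # {'daily_report': {'mysql.orders': 24, 'salesforce.accounts': 1}}

# Changing the fetch interval is a metadata edit; the requirement follows from it.
faster = derive_requirements({**batching, "mysql.orders": 30}, lineage)
print(faster["daily_report"]["mysql.orders"])  # 48
```

The design choice worth noting is that no orchestration code changes when the interval changes. The wait condition is recomputed from the metadata.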
The metadata layer itself can be thought of as having the following components:
- Metadata-only applications for auditing and observability
- Augmented metadata for access control
- Pipeline metadata for schema, lineage, and statistics
Once we decide that orchestration and every other part of the stack rely on the metadata, all of the components of the data stack can be designed in a coherent way. With a siloed approach, metadata is simply generated and captured into a metadata system. This makes metadata a second-class citizen in the data stack, and metadata tooling becomes useful only for audit and compliance concerns, and only as needed.
A metadata-first mindset means orchestration can automatically coordinate how schema changes, historical syncs, and reprocessing propagate across different steps of the data flow. As we’ve shown in our examples above, siloed metadata introduces many unnecessary challenges into orchestration and transformation work that would otherwise be solved by planning for metadata in advance.
Datacoral’s Approach to Metadata
Datacoral’s data pipeline platform has been built with a metadata-first approach. The advantages are clear to us because our platform helps us see the forest for the trees. We provide full transparency about the data coming in and data going out simply by monitoring metadata. In fact, we offer automatic schema discovery and change propagation. Engineers and analysts can easily write SQL queries for transformations and set data intervals for a view without worrying about collisions and data integrity problems. We help our customers become metadata-minded as soon as we begin working together.
We not only capture the data lineage; our orchestration actually leverages it to ensure that transformations are performed only when the upstream data is updated. Our connectors publish metadata whenever they ingest a new piece of data from any source. Our orchestration inspects the lineage graph to figure out which transformations can run based on each new piece of data that has been ingested. When all data that a transformation depends on is updated, the orchestration triggers the transformation. And, when the transformation runs, it again triggers the orchestration system to update dependencies within the lineage graph. Schema changes are propagated through the lineage graph as well. Our customers don’t run into data errors because our platform comes with built-in data quality checks, which are triggered whenever a new piece of data enters the data stack. Datacoral is able to ensure that all data in the warehouse is consistent.
None of this would be possible without clean metadata or without putting metadata on the critical path of the data pipeline. Our platform automatically understands data dependencies. We automate data quality checks and provide built-in observability because all of the metadata is already available. Our platform and connectors can scale to any size thanks to our award-winning serverless architecture, which offers simplicity with the ability to grow organically.
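The general idea of lineage-driven triggering can be sketched in a few lines. To be clear, this is not Datacoral's implementation; it is a simplified illustration with hypothetical node names, in which a transformation becomes runnable once all of its upstream nodes have been updated, and each run can in turn unblock further transformations.

```python
def propagate(lineage, updated_sources):
    """Return transformations in the order they become runnable.

    `lineage` maps each transformation to the set of nodes it reads from.
    `updated_sources` is the set of source nodes with newly ingested data.
    """
    updated = set(updated_sources)
    ran = []
    progress = True
    while progress:
        progress = False
        for node in sorted(lineage):
            upstreams = lineage[node]
            # Run a node once, and only when every upstream has been updated.
            if node not in updated and upstreams and upstreams <= updated:
                updated.add(node)
                ran.append(node)
                progress = True
    return ran


lineage = {
    "daily_report": {"salesforce.accounts", "mysql.orders"},
    "exec_dashboard": {"daily_report"},
}

print(propagate(lineage, {"salesforce.accounts"}))
# [] because mysql.orders has not been updated yet
print(propagate(lineage, {"salesforce.accounts", "mysql.orders"}))
# ['daily_report', 'exec_dashboard']
```

Compared with a fixed 3am schedule, nothing here depends on wall-clock time. A delayed load simply delays the downstream run instead of corrupting it.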
Our platform is not a piecemeal approach to building a data stack. Because we have a holistic understanding of metadata and the data expectations of modern companies, we don’t face the problems many teams have with an assemblage of as-needed tools. Integration and coherent design are available from day one and our customers don’t need tons of custom code integrating disparate systems.
The biggest challenge we see in the industry is the number of systems involved. We offer a model where we start with a clear definition of configuration and state management metadata, and then build the data processing and orchestration. Our services are managed so our customers don’t have to become tooling and DevOps experts. Datacoral’s pipelines connect ingest, transform, model building, and publishing services with end-to-end views of every change within the stack.
Data stacks built from a combination of tools inevitably lead to siloed metadata, fragmented expertise, and long-term data management complexity. Not only do we see this in our everyday experience with new customers, but new tools are emerging to resolve these problems. Tools-first thinking requires plenty of custom code and comes with no guarantee of end-to-end visibility of data pipelines and metadata. Even solid workflow management solutions like Airflow require constant code maintenance and significant engineering investment.
When metadata is managed as part of the critical path of a data pipeline, significant short and long-term advantages are realized. Tools like Datacoral leverage clean metadata to allow for automated lineage capture as well as schema change propagation. Transformation jobs can be quickly and easily changed with out-of-the-box visualization of cascading data dependencies. Teams are able to move quicker and they can spend more time working with their data instead of being forced to work on the plumbing.
At Datacoral, our simplified conceptual framework for a data stack doesn’t stop at the metadata layer. We have much more to share in future posts about the data flow and DevOps tooling layers. Clean metadata opens new doors for scalable, sustainable, manageable, and lower-cost data stacks. In our next article we will be highlighting some of these advantages.
We hope this article reinforces the need for conversation within the data industry around what being metadata first entails, and the benefits it provides at all levels of the modern data stack. If you have any thoughts about our metadata-first thesis, drop us a line at firstname.lastname@example.org.