How do you think about your data stack?
If your job requires analyzing data, or supporting teams who analyze your data, you will likely find yourself at some point working on data pipelines and learning about their complexity as you go along. While data pipelines are light years ahead of where they were 20 years, the day-to-day challenges data teams face today are largely the same. How have we come so far, yet stayed where we are?
We have seen tons of thoughtful work being done to promote the modern data stack. Andreessen Horowitz’s seminal thought piece on emerging architectures; Matt Turck’s exhaustive Data Landscape; and Tristan Handy’s The Modern Data Stack: Past, Present, and Future are all must-reads for the current data practitioner. (Tristan’s accompanying video about the future of data infrastructure is also a must-watch.)
However, based on our experience having worked with enormous scale data stacks in Internet companies as well as having interviewed data teams in hundreds of companies over the past 5 years, it still feels like the complexity of the modern data stack has not really reduced in the most obvious ways, so they are accepted as unavoidable. For example, even though there have been huge advances in simplifying the data stack, especially over the last five years, the cognitive load of a data team building and debugging data pipelines is still roughly the same. There are now even more tools with overlapping functionality to choose from. It is not obvious what the true implication of the tool choices is until after having put them together and working with them over a period of time. There is no simple, easy-to-understand framework that can help teams understand how to build the best data stack for their unique needs.
The goal of this article is to propose a simplified framework and start the conversation. We will not attempt to provide a comprehensive glossary of all the tools and how to fit them in (for that, we highly suggest taking a look at Matt Turck’s landscape pdf). But we will provide you with a conceptual model to help you think about the implications when you pick a tool.
We will provide a simplified framework for how you should think about building your data stack. More importantly, we hope that instead of thinking about your data stack purely in terms of the tools that are available or popular today — which leaves you at the mercy of the tools and how they evolve — you take back control of how you think about your data stack and slot in the tools you need to fulfill your needs. It does not mean that you won’t have to do the hard work of integrating the tools, but, at the least, it should help you rationalize why you are doing that hard work!
Our team at Datacoral actively uses this model to build our product and advise our customers and partners in selecting the right tools for their needs.
By sharing this model, we hope to provide data teams a better way to understand how they can fit different tools into their data stack. Ultimately, we hope to start the conversation to move toward implementing this conceptual model to build a better, more scalable modern data stack.
Who should care?
Whether you are a CTO at a Series A company that’s just getting serious about data (i.e., you don’t run analytics on a production database!) or a data director in an established company worrying about data governance and costs of managing its data, you are probably looking to uplevel your data stack to get more sophisticated with your data while also making it:
- More scalable
- Easier to manage
Most importantly, your goals are probably the same. Building a system to integrate, transform, load, and analyze your data is critical for your success. To meet your demands, hundreds of tools and technologies have become available in the market to help companies move, store, manipulate, manage, and govern their data. This includes:
- PaaS tools from public clouds
- Open-source tools
- New cloud-native SaaS tools
- A large but diminishing number of incumbents from the late 90s and early 2000s
Additionally, there are several options for every aspect of the data stack, like data integrations, query engines, ETL, data governance, machine learning, and data visualization.
The biggest problem companies face in up-leveling their data stack is making sense of what combination of tools to choose. There are far too many to choose from, and it is not clear where one tool’s functionality ends and another begins. Couple that with newer tools coming up all the time, and you never know if you made the right choices. But what is clear is that whatever set of choices you make, you will have to spend time integrating them and working through all of the issues that invariably arise. In addition, your choices also inform (or constrain) how you should operate your stack — basically deciding on the triad of people, process, and technology for your data stack.
Since it is hard to replace tools, the question is, “What analytics stack is going to serve my organization well for the next two to three years?”
We propose that such a stack should be “metadata first and metadata-driven.” Once we establish metadata as not just informational but the main driver of the entire data stack, we end up with a simple framework of three layers with the metadata layer in the middle:
- The data flow layer
- The metadata layer
- The DevOps tooling layer
With these layers, we can follow the principle of separation of concerns, clarify how we think about each part of the data stack, and understand how different tools slot in.
Using such a framework, we have found that it is easy to replace or fit in new tools and systems as they become available or as they become necessary based on the scale and complexity requirements.
Current State of Play
First, let’s examine what is considered a state-of-the-art modern data stack. (Any specific tools we mention are only to illustrate a realistic stack to make our point about working with different tools across an end-to-end architecture, not to critique the specific tools. We actually admire the capabilities provided by these tools and appreciate how they have dramatically improved how companies work with their data.)
For analytics, a combination of Fivetran, dbt, Airflow, and Looker on Snowflake is currently quite popular (this piece nicely lays out other common setups). While setting up such a data stack is a lot easier than it used to be, data teams still face day-to-day challenges with this stack. For example, let’s take the fictional Acme Company that provides SaaS billing software for medical practices. Alice, their customer success leader, consumes a Looker dashboard that summarizes all of their customer activity, including which factors cause customers to be unhappy with the product.
This Looker dashboard was built by Bob, their data analyst, who is a Looker expert. He has some models built in LookML, but mostly he is moving towards dbt because that gives him more control over what data models are available directly in their data warehouse, Snowflake.
Raj, their data engineer, set up the pipelines that pull data from Salesforce, Zendesk, and a MySQL product database into Snowflake using Fivetran. He has also written a few small scripts in Airflow to plug in services that are not offered by Fivetran. He is an aspiring data analyst and wants to move the entire organization to use dbt. He helps Bob leverage dbt to transform the raw data in the source tables from Fivetran to summaries in derived tables. He has orchestrated dbt via Airflow. This setup works quite well most of the time.
Now the time when Alice notices that the churn rate in the Looker dashboard does not match her understanding of the business and believes that the data is flawed. What happens from there?
In our interviews with hundreds of companies, we uncovered the following typical workflow to debug this flaw:
- Alice tries to dig deeper by clicking around the Looker dashboard before raising the issue with the data analyst, Bob, who built the dashboard.
- Bob verifies that the report’s query is accurate and then passes the baton to Raj, the engineer (data engineer, analytics engineer, or a part-time data engineer) in charge of the data pipeline.
- Raj then looks at each part of the pipeline from downstream to upstream tasks:
- Identify all the tables in the warehouse used in the report.
- Find all the sources for each of those tables. Sometimes they are derived tables computed via multiple transformations (via dbt); other times, they are source tables coming directly from the ingest pipelines (via Fivetran).
- Navigate to each of the logs to each of the derived tables in the dbt console to see if anything is amiss. Raj has to take notes of the timestamps when the transformed tables were updated.
- Examine the logs in Fivetran’s console and system tables that detail how each of the tables was modified. Raj has to pay close attention to the timestamps of when the source tables were updated. Then, he has to stitch together this context from Fivetran with the context he had from dbt to get a full picture of the data flow. In most cases, he finds that there are data issues when transformations in dbt were run before Fivetran updated all the source tables across the different connectors to Salesforce, Zendesk, and MySQL. So he must make sure that the timestamps of the dbt run when the derived tables were updated. He must log in to the Fivetran console or look at Fivetran system tables in Snowflake to figure out if anything went wrong with the data integrations. If nothing looks amiss in the Fivetran logs, he still can’t be sure because the data integration tool provides no data quality assurance.
- Finally, Raj must look at potential issues in the data source systems. If DevOps is unable to debug the issue, and no one on the team can get to the root of the problem, then suddenly this becomes a high priority action item for Bob, the analyst. Bob would have to build pipelines from scratch in this scenario to compare with the existing data to make sure Alice’s needs can be supported.
This entire effort takes hours, days, or sometimes weeks for the really complicated issues.
Now imagine that there is another team that wants to build machine-learned models on the same data. On top of that, the CISO of Acme has a mandate to incorporate strict access controls on all data based on whether they have PII. You can see the exponential increase in complexity of debugging issues with new systems in place. Acme ends up with an architecture that looks like the image below, where every system has to talk to every other system. There are impedance mismatches between different systems, causing the integrations to become hard and unmanageable.
We have seen reference architecture diagrams, like the “Unified Architecture for Data Infrastructure” in Andreessen Horowitz’s thought piece, that try to bring together many more tools and capabilities into what they call an emerging data architecture. These attempts show the complexity involved in thinking about the entire data stack. They seem to be projecting orthogonal concepts into a more linear-ish data flow model, and some outliers don’t really fit into the model. There are tools with overlapping functionality in different boxes, plus multiple integration points resulting in arrows between nearly every pair of boxes. This makes it difficult to figure out which subset of tools are needed now versus later and what the tradeoffs are with any of these choices. Also, the complexity of integrating the specific combination of tools is left as an exercise to the company! Many of the problems in data and data stacks can be attributed to this glue that attempts to solve integrating different tools.
A simplified three-layer framework
We believe that there is a way to disentangle the data stack if we think about metadata first. Once we establish the metadata layer not as merely informational but as the main driver of the entire data stack, we can follow the principle of separation of concerns and simplify how we think about each part of the data stack.
Once we establish what is in the metadata layer, we can build the data stack around it and make metadata the data stack driver.
Then we arrive at a conceptual model with three layers that are easy to disambiguate. While not all current tools follow this model across the board, we are encouraged by the movement to increase the prominence of metadata in general, and we think it can go a lot further. That said, the three layers look something like this:
Let’s look at what is in each layer, and then how one might think about how these layers interact with each other.
Layer 1: Data Flow
This layer consists of tools and systems that store, move, transform or visualize the data. Typically, most articles about creating the data stack focus on this layer since that is where the actual data flow happens. But each tool is currently making its own choices around metadata and DevOps tooling, which is why the overall data stack tends to become complicated. The data flow layer also includes systems that are not part of the data stack, namely:
- Sources — Systems where data resides. For example, databases, file systems, or SaaS systems.
- Destinations — Systems, and applications into which the publish tools push data.
The data flow layer includes tools and systems that get the data to flow from the sources to the destinations within the data stack.
- Ingest tools — Tools to centralize data from different sources into a data store.
- Stores — Data lakes and data warehouses, although nowadays the popular notion of a lakehouse is emerging. We are making the inherent assumption of ELT here.
- Transform tools — Tools to help clean, aggregate, and denormalize the data for analysis. We also include building features for machine learning and the model building of ML in the transform category.
- Visualization tools — Tools to visualize cleaned or aggregated data to generate insights.
- Publish/serve tools — Tools to publish the reports, aggregates, or models so that applications can use them.
There are options for each of the five categories above, and we will explore considerations for choosing a tool for each category in-depth in a forthcoming article.
Layer 2: Metadata
Metadata is a lot of things! Metadata includes the structure of data, configurations of the pipelines that created the data, relationships between different data, and other properties (freshness, who has access, etc.). We propose that being metadata-driven means that other aspects that govern the data (access control, auditing, etc.) or the data flow (orchestration) are also in the metadata layer.
The metadata layer consists of several aspects of the data and the data flow that are used and generated by the running systems. Also, we think that orchestration logic also belongs in the metadata layer since it drives the data flow. So, we propose that the metadata layer be comprised of the following:
- Schema/lineage — The core metadata of any data system. At the simplest level, schema is the table and column definitions. There can also be relationships between tables like foreign keys. Lineage is a way to capture how different data are related to each other and understand how one piece of data is generated?
- Statistics — Runtime metadata. These are metrics about different data and data flows, like different table sizes, the time it takes to compute an answer, the time it takes to ship data from one system to another, etc. These stats can optimize data flow, understand bottlenecks in the system, understand how the data volumes change over time, and more. One of the key statistics collected is data quality metrics, which provide information about whether analyses are being performed on complete and accurate data.
- Orchestration — The most critical part of the metadata layer. While others have explored orchestration’s criticality, we propose that it is critically important and should be included in the metadata layer because orchestration must be metadata-driven. Instead of just collecting and showing metadata, metadata’s insights become critical for the data flow itself. This has six main advantages:
- Metadata is always clean — If metadata is broken, the system is broken, and someone must fix it.
- Observability/auditability out of the box — There is no need for special instrumentation because metadata is always up to date.
- Data quality checks can be incorporated directly into the orchestration without additional integrations.
- Easy debuggability — Lineage and stats drive data flow and so are automatically captured. This makes it easy to debug problems in the data flow with a single pane of glass that shows the metadata in an easy to consume manner.
- Mundane operations like reprocessing or doing full refreshes (historical syncs) that should propagate through the entire data flow become a lot easier to manage.
- Authoring tools can access both the static and dynamic metadata to help data teams figure out how to make changes to their business logic without breaking data pipelines.
4. Access Control — With #1 above, one can specify who can access what schemas and tables at a granular level. Both role-based and policy-based access controls rely on the schema/lineage to provide flexibility.
5. Auditing — Keep track of what changes occurred to any of the data or data flows. If orchestration is metadata-driven, the orchestration log provides an automated way of auditing those changes. Logging from access control will provide an audit of who performed what operation on the system.
6. Observability — Use metadata to observe how the data is flowing through different systems.
Layer 3: DevOps Tooling
The DevOps tooling layer is the actual interface data teams use to build and manage data flows and administer their overall data stack. As mentioned previously, most of the data layer’s existing tools have their own DevOps workflows, so teams have to switch between tools while working on different parts of the data flow. Types of tools these teams use include:
- Authoring tools
- Different types include canvases with drag-and-drop boxes-and-arrows, web-based forms, or code written in SQL, Python, etc., and managed in a source version control system.
- Authoring tools can provide visibility (before actually running any changes, i.e., statically or at compile time) into changes made to one part of the data flow. Having robust or queryable lineage will allow for this. (This is how it ties to the metadata layer.)
- Authoring tools can use metadata statistics to determine how expensive the changes are or how much data will be processed to help the user figure out what to build.
- Teams should be able to test changes before pushing them to production.
- Tests should not interfere with production.
- Changes have to get pushed easily.
- The timing of changes to the pipeline is essential in keeping the data flows running smoothly. DevOps teams rely on deployment automation — plugged into the metadata — to push changes to the pipeline at the right time. This ensures that teams don’t accidentally break the data without knowing about it or without manual intervention. For example, if you are making a change to an hourly transformation in a data pipeline that also has a daily transformation downstream, you want to make sure that you don’t apply the change to the hourly until the end of the day so that the daily transformation does not end up with a half old version and a half new version of the hourly transformation.
- With the right runtime metadata available, debugging problems is easy.
- Automatically captured lineage allows you to trace problems in a data pipeline easily.
- It will also be straightforward to figure out which downstream processing might get affected when something upstream changes.
A framework that includes these three layers enables you to think about the people, process, and technology choices for your data stack. While the current set of systems and tools typically have default approaches to each of the layers, you can make your own data stack design choices by employing our way of thinking. Given that many of the tools have overlapping or redundant capabilities, the key to building your data stack with the existing tools is to understand the default choices these tools make for each of the different layers. From there, you can select the right combination that covers your needs in each of the three layers.
Below, we provide a sample of how you could use the lens of the three-layer framework to understand the tools and services that are commonly used in creating a data stack today:
- Data layer — Excellent Ingestion capability, limited transformation capability
- Metadata layer — Schema, tables, runtime logs, audit log of how the connectors are changed, logs of all the connector invocations to fetch the data, some statistics about the number of rows synced but not data quality metrics, orchestration via schedules unrelated to metadata
- DevOps tooling layer — Form-based web UI, no version control, maybe write programs using the connector management API, not much around testing
- Data layer — Excellent transformation capability in SQL, no other languages
- Metadata layer — Schema, tables, lineage, audit of changes, observability of runs, no access control, no orchestration
- DevOps tooling layer — SQL files in a Git repo, web UI for monitoring, deployment, no orchestration
- Data layer — No capabilities
- Metadata layer — Orchestration and observability of runs, no schema, no lineage, no access control
- DevOps tooling layer — Python code in Git repo, web UI to test and monitor
- Metadata layer — Schema, tables, access control, statistics for query planning, but no data quality metrics, no lineage
- DevOps tooling layer — Web workbench or JDBC
- Data layer — transformation via lookml, visualization
- Metadata layer — Schema, tables, access control, some lineage
- DevOps tooling layer — lookml on the UI as well as git integration, some testing
Hopefully, the above examples explain how to use the framework to understand what you are getting with each system or tool, either by design or as an artifact of the choices made by the system you chose.
Once you have a clear map of what you are getting in each layer from the combination of tools, you can start thinking about the people and process parts of the triad of people, process, and the technology of operating your data stack. Depending on how complex or manual the DevOps tooling is, you can decide to hire data engineers or data ops folks. When there is more than one system to deal with, you could figure out what kinds of processes to put in place to streamline how changes are made, and issues are debugged.
As you can imagine, exhaustively mapping out every single tool this way into the three layers is a herculean effort and not something we are planning on undertaking. We hope this framework sparks such a conversation so that, as an industry, we can help companies make the right decisions for them.
Like we mentioned earlier, having this framework hopefully allows you to choose your tools intentionally, instead of picking some tools for particular parts of your data flow requirements; only to then be surprised about how they fit in with other tools or how including them changes the people and processes of your data stack.
What have we done at Datacoral?
At Datacoral, we use this framework to inform our implementation of cloud-based data integrations and data pipelines. We started with a core implementation of centralized metadata and metadata-driven orchestration. We standardized the data and metadata interface for all components we built in the data and DevOps layers. Instead of just collecting and showing metadata, metadata is now part of the actual data flow itself, and the same Alice/Bob/Raj problem-solving workflow becomes a lot simpler.
Typically, Bob (analyst) or Raj (data engineer) have the same interface to be able to quickly pinpoint any issues because of our central metadata layer that drives the orchestration. Bob can focus on the business while Raj can spend more time modeling the data instead of debugging data quality issues in the plumbing.
In this article, we have shared with you:
- Our framework for how to make tool choices
- A way to think about data stacks and why you should be metadata first
- A glimpse of how a metadata first implementation simplifies the processes
At Datacoral, while we started with these principles, we know that we have not even scratched the surface of handling all of the issues companies face around their data stack. We hope that anyone reading this article is left with a clear framework for how to go about developing their own data stack.
Further, we hope this article sparks a conversation about how we in the data industry can reduce the complexity of how companies have to think about their data stack. If you have any thoughts about our metadata-first thesis, drop us a line at [email protected].
This post originally appeared on Towards Data Science, and is reposted here with minimal edits.