In the early days of data engineering, there were a handful of expensive tools with low interoperability. And by the early days, we mean 2000. With fewer than five key players in the market and a high barrier to entry, working with data meant becoming an expert DBA in a walled garden like Oracle or becoming a data engineer who had to build all systems from scratch. So, data engineers were more systems builders and backend engineers than DBAs.
Fast forward 20 years, and today’s landscape is entirely different. New companies enter the data engineering space every day with increasingly overlapping technologies. Tools like Fivetran, dbt, Airflow, Snowflake, Redshift, and Debezium are just a few of the options available to a data engineer. With some very basic configuration and a couple of clicks of the mouse, a data integration becomes possible.
The increasing pace of cloud computing capabilities also factors into the complexity of data engineering. Like other roles in software and product development, the boundaries of a data engineer’s role are blurring. It’s part devops, part integration engineering, part software development, part database administration, part automation engineer — when will it stop?
Perhaps it won’t!
Datacoral founder and CEO Raghu Murthy recently appeared on Ternary Data’s YouTube channel to discuss metadata-first architecture, and the discussion ended up diving deep into how the changing landscape is affecting today’s data engineers and data architects, and where those roles are heading. The article below is based largely on that conversation.
New Tooling, New Problems
Plug-and-play SaaS data tools are a double-edged sword. It’s easier than ever to set up a functional data integration, pulling data from hundreds of sources and third-party services with point-and-click data connectors. This is the tech data engineers dreamed of before 2015. It’s almost too easy to move data! But any tech that’s nearly magical has its trade-offs.
Introducing: complexity. Despite their ease of use, these tools force data engineers into a “cold start” situation with unique infrastructure requirements and overlapping functionality between adjacent tools in the stack. Moving data into a data warehouse and making reports should be a solved problem by now. The cost of this ease-of-use is a long-term commitment to each tool’s walled garden.
Many tools in today’s market perform similar functions, yet each has its own underlying implementation and metadata. These differences create what we call an “impedance mismatch” between tools, leading to a “split-brain problem”: data engineers are constantly switching between tools to determine which does what and which is better. Each tool also has its own name for data processing steps and data objects, increasing the cognitive load on the engineer.
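To make the split-brain problem concrete, here is a toy Python sketch of the kind of terminology mapping an engineer ends up carrying in their head. The tool terms below are simplified illustrations, not complete or official vocabularies:

```python
# One shared concept, three vendor vocabularies. The mapping itself is the
# cognitive load: nothing here is hard, there's just a lot of it to remember.
CANONICAL = "transformation"

# Simplified, illustrative terms -- each tool's real vocabulary is richer.
TOOL_VOCAB = {
    "dbt": "model",      # dbt expresses a SQL transformation as a "model"
    "airflow": "task",   # Airflow schedules the same work as a "task" in a DAG
    "spark": "job",      # a Spark cluster runs it as a "job"
}

def to_canonical(tool: str, term: str) -> str:
    """Translate a tool-specific term back to the one shared concept."""
    if TOOL_VOCAB.get(tool) == term:
        return CANONICAL
    raise KeyError(f"unknown term {term!r} for tool {tool!r}")

print(to_canonical("dbt", "model"))    # -> transformation
print(to_canonical("airflow", "task")) # -> transformation
```

Multiply this little table by every object type, status name, and log format in the stack, and the impedance mismatch becomes real.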
Today’s data specialists are bolting and daisy-chaining tools together, all in the name of “simplifying the data stack.” Some tools, like Fivetran, now even offer certifications for their use. There might be core functionality in a Fivetran implementation that’s worth keeping, but it may be missing a feature that’s available in another tool. So what do engineers do? They install another tool to fill the gap!
Data engineering went from building data transportation software to tool integrations pieced together with glue code. Now there are tools like Airflow to unify orchestration, and tools like Collibra to unify metadata to allow data teams to get to the holy grail of a single pane of glass for operations and governance. As new data niches emerge, new tiers of tools are sure to follow.
Complexity will continue to increase because the number of compelling tools will increase.
Data engineering has gotten harder
Years ago, data teams were small because data problems were too hard to solve with available tools. Data engineering meant systems development, which was a significant investment. As tooling improved, data teams became more capable at a lower investment cost. Then cloud computing came along and crumbled the remaining barriers to entry in the field, which paved the way for more tools, greater scale, and (most importantly) more capable data teams.
The ROI for data insights has become very clear and attractive, especially in the C-suite.
Nowadays, more companies are playing in the data space than ever. Data engineers are evaluating new tools, figuring out how to integrate them with their existing tools, and setting up their devops processes to incorporate all of the tools in the data stack. Their roles are looking more and more like devops and SRE. Their day-to-day work is filled with context switching, monitoring, building, and other tasks that often have little to do with producing insights.
Something has to give. More tooling means more integrations, which leads to rapid growth in complexity and overhead. Many of the tools on the market today have only slightly different types of functionality. Data engineers are left managing a patchwork of tools that looks like a Picasso painting.
A simple, modern data stack.
All this complexity creates gaps. How do these gaps get addressed? More tools! There are so many moving parts in today’s data stacks. Years ago, code and complexity were manageable because entropy was lower, requirements were simpler, and tech options were fewer. And, no, this isn’t some nostalgic rant about how much better the life of a data engineer used to be. The natural order of things in technology is playing out: the barriers to leveraging data have lowered significantly, making the simple use cases trivial. Companies are now trying to do more with their data, and data volumes are increasing too. Newer tools are being built to solve for the new use cases, resulting in still more tools and still more complexity.
It’s time to think about a standard way of building data stacks. Datacoral’s recommendation is a three-layer architecture consisting of data flow, metadata, and devops tooling. Our framework allows for open systems where tools can plug into a shared metadata layer. The authoring and devops tooling experience can be standardized because they’re using the same metadata layer.
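As a rough illustration of the three-layer idea, here is a minimal Python sketch of data flow tools registering into a shared metadata layer that the devops tooling then reads from. The class and method names are our own inventions for this article, not Datacoral’s actual API:

```python
from dataclasses import dataclass, field

@dataclass
class MetadataLayer:
    """A single shared store that every tool writes to and reads from."""
    entries: dict = field(default_factory=dict)

    def register(self, tool: str, produces: list) -> None:
        self.entries[tool] = {"produces": produces}

    def lineage(self) -> dict:
        return {tool: e["produces"] for tool, e in self.entries.items()}

class DataFlowTool:
    """Any ingestion or transformation tool that plugs into the shared layer."""
    def __init__(self, name: str, outputs: list, metadata: MetadataLayer):
        self.name = name
        metadata.register(name, outputs)  # announce itself in one place

class DevOpsTooling:
    """Monitoring/deployment reads the shared metadata, not each tool's API."""
    def __init__(self, metadata: MetadataLayer):
        self.metadata = metadata

    def report(self) -> dict:
        return self.metadata.lineage()

meta = MetadataLayer()
DataFlowTool("ingest_orders", ["raw.orders"], meta)
DataFlowTool("transform_orders", ["analytics.orders_daily"], meta)
print(DevOpsTooling(meta).report())
# {'ingest_orders': ['raw.orders'], 'transform_orders': ['analytics.orders_daily']}
```

The point of the sketch: because both flow tools talk to the same metadata layer, the devops layer gets a single pane of glass for free, and either flow tool can be swapped without touching the other two layers.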
The best a data architect can do today is determine what combination of tools will suit an organization’s data needs for the next couple of years. There are simply too many tools and paradigms available. Data architects need to avoid painting themselves into a corner by becoming too dependent on today’s changing landscape. As scaling and complexity needs increase, forward-thinking data architects today need to make decisions that allow for the replacement of small parts instead of ripping out the entire architecture.
Inevitably, it becomes an optimization problem. Data architects have to pick from dozens of options for each data goal they want to accomplish. Once the choices are made, they have to implement them and succeed.
Frameworks are better than tools
A good data framework helps when there is no clear option. Instead of thinking about each tool in complete isolation, tools can be evaluated by the capabilities they bring to each layer of the stack and how they fit the situation at hand. For example, our three-layer framework allows data architects to ask:
- What are the implications of this tool for the devops tooling layer?
- What will debugging look like with this new tool?
- What is the mental model on how data teams should be operating day to day?
- How does the data layer move data from point A to point B?
- How is data authored, tested, debugged, deployed, and monitored?
Our framework has allowed us to have meaningful conversations before a company builds its data stack. Before our clients pick tools for the data flow layer, we talk about the implications for the metadata and devops tooling layers. It’s a holistic approach to a forward-thinking data stack.
The future of data tooling
In the next five to 10 years, there will be more tools. That shouldn’t scare today’s data engineers, especially if we can collectively consolidate architectures. Data engineers need to coherently build data stacks. Starting with a metadata layer allows everything else to be plug-and-play. A strong and well-designed metadata layer is more important than the underlying data flow layer. When a faster or more robust data integration tool comes around, it can plug in easily and replace something else.
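Here is a toy sketch of what that plug-and-play replacement could look like when every integration tool satisfies one small interface derived from the metadata layer. The `Integration` protocol and the tool classes are hypothetical, purely for illustration:

```python
from typing import Protocol

class Integration(Protocol):
    """The one contract every data integration tool must satisfy."""
    def sync(self, source: str, destination: str) -> str: ...

class LegacyTool:
    def sync(self, source: str, destination: str) -> str:
        return f"legacy copy {source} -> {destination}"

class FasterTool:
    """A newer, faster tool that honors the same interface."""
    def sync(self, source: str, destination: str) -> str:
        return f"fast copy {source} -> {destination}"

def run_pipeline(tool: Integration) -> str:
    # The pipeline depends only on the shared interface, not the vendor.
    return tool.sync("postgres.orders", "warehouse.orders")

print(run_pipeline(LegacyTool()))
print(run_pipeline(FasterTool()))  # swapping vendors changes no pipeline code
```

When the interface (and the metadata behind it) stays stable, replacing the data flow layer really is a one-line change instead of a rip-and-replace project.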
The destiny of a data stack is determined by how robust and clean the metadata layer is.
We’re seeing a lot of funding in the data tooling space. It’s attracting many new players to the market, and their tools already overlap in functionality, blurring the boundaries between data engineering niches. A data architect today needs to pick and choose the capabilities of the tools they want to use. There will inevitably be newer tools for compliance, specialization, and other niches, and inevitably more overlap in data ingestion and transformation. Data teams will face no shortage of problems they need or want to solve day to day. However, by starting with a solid framework, such as the three-layer data stack or a metadata-driven approach, teams can build layers upon layers of functionality without compromising architectural coherence.
The role of a data engineer has gotten more difficult over the years. The complexity of data integrations and tooling has increased cognitive load and created a patchwork of technology implementations. This is likely to continue for at least another decade.
Data architects should be thinking about frameworks instead of tools. We recommend starting with a robust metadata layer that allows devops and data flow tools to be plug-and-play. This thinking deprioritizes the tools themselves and focuses attention on more important, strategic problems. If you’re looking to gather insights from your data, and abstract yourself away from the day-to-day plumbing, you can try us out here. And if you’re looking to build the day-to-day plumbing, you can follow us for more tips on LinkedIn or read more of our data-engineering how-to articles here.