Data warehouses, data lakes — and now data lakehouses. What’s next? The technology to manage your company’s data is changing so rapidly, it can be difficult to know what to do about new vendors bringing innovative data services to the market. Can they plug into your data stack? Will they compromise your data security strategy? If you have a data lake and a data warehouse, should you consolidate into a lakehouse? Is there something even newer on the market you’re missing?
That’s a lot to think about. We’re here to help.
We asked Datacoral’s CEO Raghu Murthy some key questions about data storage and compute technologies. We think these answers will help technology leaders caught between complex technical questions and increasing pressure to compete in today’s data-driven market.
When should you use a data warehouse versus a data lake?
Most companies deal primarily with structured data, and in those cases most of the business requirements are around analytics. It’s faster and cheaper to use a data warehouse, which is purpose-built for structured data. It’s also beneficial for folks who are not that technical because they can use SQL to analyze their data.
But for companies that have semi-structured or unstructured data and want to do sophisticated data processing that isn’t representable in SQL, a data lake gives them the flexibility to run multiple kinds of computation on the same data. Data lakes also support SQL on the structured parts of that data, but they’re not as performant, and they don’t support ACID transactions or the other features that make data warehouses so easy to work with.
What is an ACID transaction?
ACID is an acronym for four properties of a data processing system: Atomicity, Consistency, Isolation, and Durability. Any database offering transactional data processing will typically have all four of these properties, which dramatically simplify application development. Applications built on top of ACID databases don’t have to account for situations like data changes being lost or multiple users making conflicting concurrent changes to the same data. Because those four properties ensure data integrity, applications can focus on their business logic.
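The atomicity guarantee above can be sketched with SQLite, which provides ACID transactions out of the box. The accounts table and balances here are hypothetical; the point is that a failed transaction leaves no partial changes behind:

```python
# A minimal sketch of atomicity using SQLite (an ACID database).
# The "accounts" table and the transfer amounts are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

# Atomicity: both updates commit together, or neither does.
try:
    with conn:  # opens a transaction; rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance - 80 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 80 WHERE name = 'bob'")
        raise RuntimeError("simulated failure before commit")
except RuntimeError:
    pass

# Neither update was applied: the balances are unchanged.
balances = dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))
print(balances)  # {'alice': 100, 'bob': 50}
```

Because the database enforces atomicity, the application never has to write its own cleanup logic for a half-finished transfer.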
What is a data lakehouse?
A lakehouse is an interesting amalgamation of a data warehouse and a data lake. Companies with semi-structured data that want to run sophisticated data processing as well as analytics workloads have had to decide whether to keep their data in a warehouse, a lake, or both. Keeping it in both means duplicating the data, and with that comes the problem of keeping the copies consistent. When the same data sits in two places, there’s complexity.
Data lakes are implemented as query engines or data processing engines operating on data stored as files in open formats on a file system. These file systems don’t really support efficient in-place updates. I can’t update a file in place, so if I want to change a file, I have to make a copy of the file with the changes and then potentially drop the old file. That makes data lakes inefficient for transactions.
But of late, companies building data lakes are adding ACID properties to a data lake. What that means is I can start having a semblance of transactions inside a data lake. For example, I can perform updates as part of loading data into a data lake. I can do efficient, in-place updates of the data. So, they are calling them lakehouses since they are data lakes with data warehouse-like capabilities.
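The copy-and-swap pattern described above can be sketched in Python. The CSV file and record layout here are hypothetical, but the shape is the same: since the file can’t be updated in place, an “update” rewrites the whole file with the change applied and then atomically replaces the original:

```python
# A minimal sketch of copy-on-write updates, the pattern data lakes use
# because their file storage is immutable. File names are hypothetical.
import csv
import os
import tempfile

def update_row(path, key, new_value):
    """Rewrite `path` with the value for `key` changed, then swap it in."""
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w", newline="") as tmp, open(path, newline="") as src:
        writer = csv.writer(tmp)
        for k, v in csv.reader(src):
            writer.writerow([k, new_value if k == key else v])
    os.replace(tmp_path, path)  # atomic swap: readers see old or new, never half

# usage: "update" one record by rewriting the whole file
with open("events.csv", "w", newline="") as f:
    csv.writer(f).writerows([["id1", "pending"], ["id2", "done"]])
update_row("events.csv", "id1", "done")
```

Changing one row costs a rewrite of the entire file, which is exactly why frequent small updates are expensive in a data lake and why lakehouse table formats layer a transaction log on top of these immutable files.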
Databricks has come up with Delta Lake as its lakehouse offering. It still uses a file system to back the data, but it uses mechanisms like Write-Ahead Logs (WALs). This is similar to what transactional systems do, but the storage is on the file system. They’re trying to bridge the gap between data warehouse benefits (e.g. performant queries) and the highly sophisticated data processing that data lakes were built for.
On the other hand, data warehouse companies like Snowflake are adding data lake-like capabilities by allowing storage of unstructured data (via VARIANT types) and supporting multiple compute engines (via external user defined functions). It will be interesting to see who wins out in the long run!
Why build a data lakehouse?
A data lakehouse is still fairly new. People are excited about it because they don’t want to pay for two separate systems (a lake and a warehouse), so if they can get both in the same system, that’s what they’d like. Even cloud data warehouses like Snowflake are starting to offer more sophisticated data processing. Services like Spark on Snowflake are starting to be offered, so it’s coming from both directions: data lakes are trying to become data warehouses and vice versa, each trying to support the other’s workloads.
That’s the reason lakehouses are getting popular. It’s yet to be seen whether a single system can support all the workloads that exist. We had a move away from consolidated databases: Oracle was the classic “one size fits all” database for transactions and analytics, but people said, “We can’t have one central database.” Then newer databases came out for different applications, and now people are saying, “There’s data scattered in all these places,” and trying to consolidate back in. They know that specialized databases offer much better performance and easier programming for some workloads, but siloed data becomes a problem they have to deal with.
I think this progression is pretty common in technology in general — diversification of tools to support higher scales and different new use cases followed by consolidation to simplify management.
What are the design considerations for data warehouse and data lake pipelines?
When designing data pipelines for data lakes versus data warehouses, the main considerations revolve around the functionality each architecture supports. Typically data pipelines append, replace, or update data in tables. Data warehouses support efficient pipelines for all three types of changes to tables. But they tend to be more expensive than data lakes as data volumes grow.
On the other hand, data lakes use cheap data storage, so are less expensive than data warehouses as data volumes grow. But data lakes typically use append-only tables. There’s a notion of partitions that data lakes support and data pipelines are essentially creating new partitions for every new time period. For example, there might be a new daily partition of the same table, which may contain all the previous data. In that situation, queries only have to be made on the latest partition to get access to the data.
But if the new partition is built incrementally, then a query must fetch data from multiple partitions, which becomes inefficient. Data lake pipelines are typically done for append-only data. For example, with event streams or logs where new data comes in all the time, a data lake pipeline is typically able to handle that well. But when a table needs to be updated, typically a copy is made with the changes. It’s still possible, but it’s more of a heavy lift and not as performant.
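The partition convention described above can be sketched in Python. The directory layout (`dt=YYYY-MM-DD`) and the record shapes are hypothetical; each load appends a new dated partition carrying a full snapshot of the table, so a reader that wants the current state only scans the newest partition:

```python
# A minimal sketch of daily partitions in a data lake pipeline.
# The "events" table, dt= layout, and records are all hypothetical.
import json
import os

def write_partition(root, day, records):
    """Append-only load: each run creates a new dated partition directory."""
    part_dir = os.path.join(root, f"dt={day}")
    os.makedirs(part_dir, exist_ok=True)
    with open(os.path.join(part_dir, "part-0000.json"), "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def read_latest(root):
    """Full-snapshot layout: only the most recent partition is scanned."""
    latest = max(d for d in os.listdir(root) if d.startswith("dt="))
    with open(os.path.join(root, latest, "part-0000.json")) as f:
        return [json.loads(line) for line in f]

write_partition("events", "2021-06-01", [{"id": 1, "status": "pending"}])
# The next day's partition carries the full, updated snapshot.
write_partition("events", "2021-06-02", [{"id": 1, "status": "done"}])
print(read_latest("events"))  # [{'id': 1, 'status': 'done'}]
```

In the full-snapshot layout shown here, each partition duplicates the previous data but queries stay cheap; in the incremental layout the text describes, each partition holds only new rows, so a current-state query has to merge across many partitions.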
Data warehouses and lakes are here to stay. They are proven technologies for structured and for semi-structured or unstructured data, respectively. The technology is mature, and each serves a different purpose. The type of data a company generates or ingests makes a big difference in the business requirements and in how the data is stored and processed.
Data lakehouses are still nascent, which means the support needs are likely higher, as is the risk to production availability. Though lakehouses promise the best of both worlds at potentially lower cost due to lower storage requirements, there is some concern about the consolidation of technology and the likelihood of proprietary implementation requirements.
Datacoral integrates data from 80+ data integrations into both data warehouses (Amazon Redshift and Snowflake) and data lakes (S3+Athena). You can learn more at www.datacoral.com or write to us at [email protected].