Ever wonder what it would have been like to work on data at Yahoo or Facebook in their early days? Or how and why Apache Hive was started at Facebook? Or how Hive influenced Facebook’s data decisions as the company grew from managing terabytes to hundreds of petabytes? We’ve got you covered with some great stories today.
Influential startup founder, advisor, and investor Pete Soderling interviewed Datacoral founder and CEO Raghu Murthy about his role in shaping data infrastructure in the early days of Yahoo and Facebook. Those experiences informed his current philosophies around change data capture and metadata-first data stacks.
If you’re into the history of data science at these major companies, this interview is a great way to spend an hour. We’ll summarize Raghu’s responses to Pete’s questions below with some light editing for readability. If you want to watch or listen to the whole interview, check it out on the Data Council YouTube channel.
Question: How did you get into data at Yahoo?
“I started at Yahoo back in 2000. At the time, I didn’t have much experience with databases and was into compilers. Yahoo’s goal was to process large amounts of data with Unix programs; loading it into Oracle and processing it there was too expensive and complicated. We eventually wrote a custom SQL parser to generate the configuration for these Unix programs so we could run them on multiple machines that shared NFS-mounted NetApp filers. By 2006 or 2007, we had close to 100 terabytes of data and I enrolled in a graduate program at Stanford. It was there I learned about the research work in parallel databases and realized that we’d basically solved very similar problems without knowing about the research!
I got into data by solving hard problems as a software engineer and systems designer. I was working on data before it was called “big data.” My work focused on Unix-based data processing systems.”
Question: How did you build Hive while scaling data at Facebook?
“When I started at Facebook in 2008, they were just getting started on Hive. They were facing similar problems to what I’d seen at Yahoo: processing large amounts of data and seeking alternatives to Oracle. The alternative being tested at the time was Hadoop. They were realizing that writing MapReduce programs was hard, so they were writing a SQL compiler that would generate the MapReduce jobs – called Hive. I joined the initial team that built and open sourced Hive.
Hive was built at Facebook with the intention of democratizing data so every part of the company could leverage it. We didn’t want to have a centralized team that controlled access to it because that was the traditional way of doing things back then. We called it “the priest model” because to get to the data, you had to go through a priest.
There was a lot of security around sensitive data, like payment data and other information that required hardening, and people were worried about what was being stored and how it was being accessed. But this general movement to have everyone at the company become data aware was the impetus for Hive and pretty much every layer built on top of Hive. This included ways of building data pipelines for ingesting data without writing much code, building dashboards, and other similar work. Every product engineer knew how to interface with the data without having to jump through too many hoops.
I left the team in 2013 after five years, and in that time the data went from 50 terabytes to somewhere between 150 and 200 petabytes of managed storage across multiple data centers. That was definitely a very different trajectory of scale than what I’d seen at Yahoo. The last project I did on the data team was to make sure all that data could live and be processed across multiple data centers.
After Facebook, I ran engineering for Bebop, an enterprise application platform company acquired by Google. Shortly after, I moved on to become an EIR at Social Capital.”
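To make the SQL-to-MapReduce idea concrete, here is a minimal sketch of the kind of workflow Hive enabled: an analyst writes plain HiveQL, and Hive compiles it into MapReduce jobs behind the scenes. The host, table, and column names below are made up for illustration, and the snippet uses the open source PyHive client rather than anything specific to Facebook’s setup.

```python
# A hypothetical example of the workflow Hive enabled: an analyst writes plain
# SQL, and Hive compiles it into MapReduce jobs behind the scenes.
# Host and table names are invented for illustration.
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000)
cursor = conn.cursor()

# A HiveQL aggregation -- Hive turns this into one or more MapReduce jobs,
# so the analyst never has to write map() or reduce() code by hand.
cursor.execute(
    """
    SELECT dt, COUNT(*) AS daily_events
    FROM page_views
    WHERE dt >= '2009-01-01'
    GROUP BY dt
    """
)

for dt, daily_events in cursor.fetchall():
    print(dt, daily_events)
```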
Question: What was it like being an Engineer in Residence?
“I had built my career working on large-scale data processing systems to help companies like Yahoo and Facebook extract value from their data. Those companies operated at a scale where off-the-shelf solutions would be too expensive, and they had the resources to invest in engineers like us to meet their business requirements. Most of the companies working with Social Capital were small and would likely never hit Facebook’s scale in their entire lifetime. But they were all scrambling to hire expensive data engineers to build their data stacks. I was wondering how best to help these companies get started.”
Question: How did cloud computing and serverless tech lead to Datacoral?
“At the time, cloud technologies, and serverless computing in particular, were becoming popular. I realized that the complexity of building a data infrastructure stack could be much lower because of the primitives offered by cloud platforms and the fact that these companies’ data volumes weren’t that high. I started pulling on that thread and realized that it might be possible to build a “push-button data infrastructure stack” where you don’t really need engineering to set things up.
It was a real a-ha moment.
Instead of all this custom software engineering, you could have an analyst or data scientist who could fend for themselves: as long as they knew SQL, they could do their job without having to deal with the underlying systems and the plumbing of pipelines. That got me thinking. I started writing code and told some of Social Capital’s portfolio companies about it. They said they’d use it if I built it.
We engineered the complexity out of data. I had seen so many of these companies with data scientists or analysts who were stuck writing scripts and building pipelines instead of analyzing data. Their analytics productivity was decreasing over time. So it was a pretty clear return on investment.
And this led me to start Datacoral.”
Question: How did early customers respond?
“In the early days, most companies started off saying data engineers were a prerequisite to getting serious about data. I spoke to some of these companies, and they told me they had everything covered. Then six months later, they came back and said, “Maybe we need your help. Can you help us skip the hiring process and build everything for us?”
We started providing our product and services. As the product matured, we found that companies could actually grow their data teams with more data analysts and data scientists, and fewer data engineers. They still needed engineering help with CI/CD pipelines, but they relied on us and our product for the data processing. Datacoral allowed them to build pretty sophisticated pipelines without any expensive engineering.
So we enabled data teams to start punching above their weight, supporting multiple business units and departments with a really small team. We built and managed the pipelines; they worked on the analytics.”
Question: What’s with the focus on metadata?
“Metadata has become fairly popular recently. We focus on using metadata to drive orchestration instead of having people write code to do it.
We use the term “metadata-first” to describe our overall approach to building a future-proof data stack. As each component of your data flow moves and transforms data, it also updates a centralized metadata layer with information about how much data is moving, from which source, to which destination, and more. In the process, each component publishes events to that metadata layer. For example, an event might say, “Here is a batch of data that updated a table in the data warehouse.”
If orchestration relies on those events to trigger downstream work, you need a data dependency graph (a DAG). We made the operational metadata include the data dependencies as well, so when a new batch of data gets written to the warehouse, the orchestration can figure out what is downstream of that particular batch. Because the orchestration handles the dependency management, it can then trigger the right transformations.
So you have this really lightweight, event-driven orchestration where even the orchestration engine itself can be serverless because it is only operating on a small neighborhood of the DAG. It doesn’t have to load the entire DAG into memory. The orchestration is serverless and stateless, but is smart enough to figure out what should run and what should be triggered after that.
This design dramatically simplified the entire architecture.”
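To ground the idea, here is a minimal, hypothetical sketch of that event-driven pattern: a stateless handler receives a “batch landed” event, looks up only the immediate downstream dependencies recorded in the metadata, and triggers those transformations. The event shape, table names, and function names are invented for illustration; this is the general pattern Raghu describes, not Datacoral’s actual implementation.

```python
# A minimal sketch of metadata-driven, event-based orchestration, assuming a
# hypothetical event shape and dependency store -- not Datacoral's actual API.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BatchEvent:
    """Operational-metadata event: a batch of data landed in a warehouse table."""
    table: str       # e.g. "raw.orders"
    batch_id: str    # identifier for this batch of rows
    row_count: int   # how much data moved


# Data-dependency graph: table -> transformations that read from it.
# In a metadata-first stack, this mapping lives in the metadata layer itself.
DEPENDENCIES: Dict[str, List[str]] = {
    "raw.orders": ["staging.orders_clean"],
    "staging.orders_clean": ["analytics.daily_revenue"],
}


def handle_event(event: BatchEvent) -> None:
    """Stateless handler (e.g. a short-lived serverless function) for one event.

    It only looks at the immediate downstream neighborhood of the updated
    table, so it never needs to load the full DAG into memory.
    """
    for transformation in DEPENDENCIES.get(event.table, []):
        trigger_transformation(transformation, upstream_batch=event.batch_id)


def trigger_transformation(name: str, upstream_batch: str) -> None:
    # Placeholder: in practice this would enqueue or invoke the SQL
    # transformation, which in turn publishes its own metadata event.
    print(f"Triggering {name} for upstream batch {upstream_batch}")


if __name__ == "__main__":
    handle_event(BatchEvent(table="raw.orders", batch_id="2021-06-01T00", row_count=10_000))
```

Because the handler only ever touches one node’s downstream edges, it can run as a small stateless function, which is what makes the serverless orchestration Raghu describes feasible.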
Listen to the whole interview
There’s a lot more detail and depth to Raghu’s discussion with Pete than what we’ve featured here. Pete draws out Raghu’s early experiences at Yahoo and Facebook, his role in the development of Apache Hive, and the path that eventually led to Datacoral.
In some upcoming articles, we’ll dive a little deeper into Raghu’s thoughts on Data as a Product (DaaP) and Data as a Service (DaaS). If you’d like to learn more about Datacoral and its metadata-first platform, we strongly encourage you to check out our free trial.
And, as Raghu mentions in the interview, if you ever have questions or want to discuss any of the topics from the discussion, you can email him at raghu@datacoral.co.