How much time do you spend maintaining your data pipeline? How much end-user value does that provide? Raghu Murthy founded Datacoral as a way to abstract the low-level details of ETL so that you can focus on the actual problem that you are trying to solve. In this episode he explains his motivation for building the Datacoral platform, how it leverages serverless computing, the challenges of delivering software as a service to customer environments, and the architecture that he has designed to make batch data management easier to work with. This was a fascinating conversation with someone who has spent his entire career working on simplifying complex data problems.
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what Datacoral is and your motivation for founding it?
- How does the data-centric approach of Datacoral differ from the way that other platforms think about processing information?
- Can you describe how the Datacoral platform is designed and implemented, and how it has evolved since you first began working on it?
- How does the concept of a data slice play into the overall architecture of your platform?
- How do you manage transformations of data schemas and formats as they traverse different slices in your platform?
- Your site mentions that you have the ability to automatically adjust to changes in external APIs. Can you discuss how that manifests?
- What has been your experience, both positive and negative, in building on top of serverless components?
- Can you discuss the customer experience of onboarding onto Datacoral and how it differs between existing data platforms and greenfield projects?
- What are some of the slices that have proven to be the most challenging to implement?
- Are there any that you are currently building that you are most excited for?
- How much effort do you anticipate if and/or when you begin to support other cloud providers?
- When is Datacoral the wrong choice?
- What do you have planned for the future of Datacoral, both from a technical and business perspective?
Interviewer: Could you start by introducing yourself?
Raghu: Absolutely. Thanks, Tobias. My name is Raghu Murthy. I’ve been an engineer most of my career, working on big data processing systems since before it was called big data. I’ve worked at companies like Yahoo. This was back in 2000, when they had to process tens of terabytes of data and there weren’t many systems that could handle that, so we had to build a lot of it in house. A similar theme has followed me through the years. In 2008 I joined Facebook, where again the data volumes were growing quickly and we ended up having to build quite a lot of systems ourselves. As part of that, I worked on Apache Hive, which you’re probably familiar with. We built it and open sourced it, and over a five-year period I worked on the data infrastructure stack at Facebook as it grew from a 50 terabyte single Hadoop cluster to about 200 petabytes of data across multiple data centers. Through those years I worked in pretty much every layer of the data infrastructure stack, starting with an auto-instrumentation library called Nectar, which would get data into Hive, and then an orchestration layer on top, because when people build pipelines that turn into thousands or even millions of jobs a day, they need an orchestration system. So I built that, and then finally worked on a project to make the Facebook data infrastructure stack multitenant and multi-data-center. That came with a significant amount of learning. Over the next couple of years I did a bunch of other projects. And finally, over the past few years I’ve been working on Datacoral, mainly as a way to apply what I’ve learned over the years so that companies can just get started on their data without having to build any of the infrastructure that typically takes a significant amount of time.