Have you ever had a question and wished you could go straight to an expert for answers? That’s the premise behind Reddit’s “Ask Me Anything” sessions, in which redditors have the opportunity to ask experts quite literally anything in a real-time, written Q&A format.
Datacoral’s CEO, Raghu Murthy, recently held his own Ask Me Anything to shed light on all things data, engineering, and entrepreneurship. While questions ranged from tongue-in-cheek (“how can I recover a hacked Facebook account?”) to tough (“what are drawbacks to using Datacoral?”), we’ve pulled together a few of the most popular or interesting questions and their answers.
Answers have been lightly edited for format.
Question 1: Why build and why buy, and when does it make sense to do one over the other? (u/iamnotaldonsmith)
Answer: My general rule of thumb is to buy tools for problems that are undifferentiated and that you know every other company is also trying to solve – unless, of course, you are the company building said tool in a different way!
For things that are critical to your success, if you can’t think of tools that solve them well or cost-effectively (or both), then it makes sense to invest in building tools yourself. Of course, the prerequisite to building something is being able to hire the right talent – which is hard and expensive.
It is a tradeoff that happens everywhere, not just in the data stack. With the advent of cloud providers and SaaS vendors, the barrier to getting a good data stack has dropped dramatically over the past few years. So, companies are investing less and less engineering effort in their data stack and spending more on hiring folks to help them get value from the data. Data tool vendors and cloud providers are becoming the centers of excellence for engineering a data stack.
Question 2: As startups grow in size (both in their team size and number of users for their product), how have you seen their data infrastructure evolve? Are there good ways to evolve and bad ways to evolve? (u/courseIII)
Answer: Startups typically treat data infrastructure as an afterthought – they start off with off-the-shelf tools like Google Analytics to understand how their product is being used. Any analysis on production data is done directly on read replicas of the production databases. This is probably level 0 with regard to the maturity of data infrastructure. Executives and engineers working on the product are also working on the data.
Once they want to do more sophisticated analyses on their Google Analytics data or want to join it with data from other sources, they think about investing in a data lake or a data warehouse to centralize all of their data. At this point, they have a few critical decisions to make:
- Most important – how do I manage the metadata of all of my data? (Just kidding – no one does this!)
- Which data warehouse should I choose? The answer to this is a lot simpler nowadays – Snowflake or Redshift – although newer data warehouses and query engines are coming up.
- Which ingest tool should I use? Either pick a SaaS vendor, or build something internally since you don’t want to expose your data outside of your systems.
- Which visualization tool should I use? Again, there are plenty of good options here.
Once the data is in the data warehouse, startups get away with just running queries in their visualization tool to join the data from different sources. This is level 1 with respect to data infrastructure maturity. Until this point, having someone who knows SQL and can get credentials for data sources with the help of a DevOps person is enough.
Soon, they realize that the queries are repetitive and becoming slow. So, they have to invest in a transform and orchestration tool to pre-create aggregated/joined data so that analysis becomes easier. This is where a dedicated engineer needs to be hired or loaned to work on the data stack. At this point, depending on the choices made, your stack will either evolve reasonably or require rewrites and significant engineering investment to keep things going over time.
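As a concrete illustration of that pre-aggregation step, here is a minimal sketch using Python’s sqlite3 module as a stand-in for a data warehouse. The `raw_events` and `daily_revenue` tables are hypothetical examples, not any particular company’s schema; in practice an orchestration tool would run a transform like this on a schedule against the real warehouse.

```python
import sqlite3

# Stand-in "warehouse": an in-memory SQLite database with a raw events table.
# Table and column names here are hypothetical illustrations.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (user_id INTEGER, event_date TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [(1, "2020-01-01", 10.0), (2, "2020-01-01", 5.0), (2, "2020-01-02", 7.5)],
)

# The "transform" step an orchestration tool would run periodically:
# pre-create an aggregated table so dashboards query it instead of raw data.
conn.execute("DROP TABLE IF EXISTS daily_revenue")
conn.execute(
    """
    CREATE TABLE daily_revenue AS
    SELECT event_date,
           SUM(revenue) AS total_revenue,
           COUNT(DISTINCT user_id) AS active_users
    FROM raw_events
    GROUP BY event_date
    """
)

rows = conn.execute("SELECT * FROM daily_revenue ORDER BY event_date").fetchall()
print(rows)  # [('2020-01-01', 15.0, 2), ('2020-01-02', 7.5, 1)]
```

The point of materializing `daily_revenue` is that the expensive scan and aggregation happen once per scheduled run, rather than every time someone opens a dashboard.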
If the tools chosen require a bunch of programming, then your data stack becomes like your product software stack – which means that the more you want to do, the more engineering resources you need. This is level 2 of maturity, I’d say. Now, you have a basic functional unit of a data stack that can keep you going for a while.
The next levels of maturity involve doing more with the data – like machine learning. And also getting more streamlined processes to the data stack – like compliance, governance, auditing etc.
Our take is that companies should be able to quickly get past the first 3 levels of maturity without much engineering.
Question 3: How did you end up becoming a data architect? Was this a specialization from your school or did you start as CS/SWE and then transitioned? (u/vickus1)
Answer: “Data architect” was not even a thing when I first got started. I was a software engineer at an internet company, working on distributed systems built to process large amounts of data. My background is in computer science, and I ended up liking what I worked on, so I did graduate studies in distributed systems and databases. Data architect is a more popular role in many companies now, since the problem statement for a data stack is becoming more standardized: centralize data from different sources and make it available for analyses, for improving business processes, and for the product. What I learned while building data infrastructure has allowed me to help a few companies architect their data stacks. With Datacoral, we are providing our opinionated architecture for a data stack as a product for any company to use.