Raghu Murthy, the Founder of Datacoral, set forth his vision for a cloud-native, end-to-end data infrastructure platform in 2015. He based this vision on his experience working on the data teams at Yahoo! and Facebook, and serving as an Engineer in Residence at Social Capital. Below is his original framework upon which the Datacoral platform was built. Note that some of the companies referenced in this memo have since changed ownership, changed names, or otherwise may no longer exist in the form they embodied at the time of this memo.
Datacoral provides customers with a push-button, end-to-end data infrastructure stack on AWS that:
- Makes it easy to get value out of their data right away
- Scales with both the size and complexity of their data usage
Startups have a data infrastructure problem
Companies that are just getting started with their data infrastructure are trying to build for one or more of the following:
- Business intelligence – To understand how their business is doing
- Monitoring the health of their applications and servers – To capture and analyze application performance (latency), efficiency (memory/battery/network), and reliability (crashes and soft errors)
- Data products – To collect data from multiple sources, enrich the data, and then serve that data to their own customers
- Data-intensive products – To make their applications work via a reasonable data pipeline, perhaps for connected devices
Most startups neither want to spend the time nor have the know-how to set up a reasonable data infrastructure stack on AWS — even though AWS has all the building blocks. They instead have to wade through all the available options and invariably end up with a patchwork of AWS services, third-party services, and their own scripts to suit their needs. This setup requires constant maintenance and doesn’t really scale.
Once startups do have data they can query and analyze, it is pretty hard to collaborate on it when, for example, there are data anomalies. Most of the time when there is an anomaly, discussions happen on email threads and messaging tools where it is difficult to gather all the context to understand and analyze the anomaly. There is really no tool out there that supports such conversations around data.
Beyond analyses, companies use machine learning and data mining to get better insights and optimize their businesses. Again, it is hard to hire people with this expertise and to integrate with the solutions that are out there.
Data infrastructure companies have sales/scale/integration problems
Data infrastructure or Big Data is a crowded space. These companies can be roughly classified into the following:
- End-to-end SaaS data infrastructure companies (Mixpanel/Flurry/Kissmetrics/Google Analytics) that provide a push-button setup but are not flexible in the types of workloads they support
- End-to-end software and support contract companies (Cloudera/Hortonworks/Databricks/Altiscale/MapR) that are time-consuming and expensive but don’t fully solve the instrumentation part of the stack
- Point-solution plus support contract companies (Confluent/MemSQL) that have optimized a small sliver of the entire data infrastructure stack and need to integrate with the rest of their customers’ data stacks
- BI tools (Tableau/Looker/NetSuite/Amplitude/Qlik/Interana) that provide sophisticated ways to model and query data. However, they require a lot of setup time to model the data for analysis which may be separate from the raw event modeling. They are also expensive
- Pure consulting companies (Think Big Analytics/Silicon Valley Data Science/Snowplow Analytics) that are also time-consuming, expensive, and not suitable for startups
- Third-party integration/multiplexer companies (Fivetran/Treasure Data/Xplenty/Ajilius for integrations; mParticle/Segment for multiplexing). Integration companies provide software and services to transfer data from multiple third-party sources into a customer’s warehouse. These services are very useful but still need the customer to have and maintain a data warehouse. Multiplexing companies target companies that have not yet decided which end-to-end tools to use and so want to be able to switch between different tools.
- Data integration companies (Tamr, ClearStory, Alation) that allow customers to combine data from multiple sources but are more for schema management and are more suited to bigger companies that have many custom, disparate data silos.
What is needed?
Companies should be able to quickly build a data infrastructure stack (specifically just using AWS services) that:
- Is useful to them immediately, i.e. is end-to-end from instrumentation all the way to storage, query, and analytics
- Has guidelines on data modeling for instrumentation to allow maximum flexibility in analytics
- Allows them to scale both in size and the sophistication of how they use the data without having to do a rewrite
- Gives them full control over all of their raw data
- Is affordable
Our goal is to get to a state where companies focus on building their product instead of worrying about setting up a data infrastructure stack. After their product is established, they can focus on generating insights and collaborating to make those insights actionable.
What will Datacoral build?
Datacoral will leverage AWS services and pluggable, reusable slices to compose the “right stack” for the customer and the level of scale and sophistication that they need right now. As their company grows, we will help them grow their stack accordingly.
Different layers of the data infrastructure stack need to work well with each other. Datacoral will provide modules that can do each of the following:
- Auto-instrumentation: Instrumentation modules in the most popular app development frameworks — web and mobile — to capture every user interaction without the product developer having to explicitly instrument anything.
- Flexible: Developers will have the ability to annotate the automatically generated events with custom attributes specific to the application.
- Efficient: Given the volume of data, we will devise a smart store/compress and forward mechanism to reduce the number of round trips and the amount of data logged
- Provide the ability to get snapshots of data from multiple sources like production databases and external services like Facebook, Mixpanel, and Salesforce. This will allow customers to combine all the data stored in different silos into one, easy-to-query place
- Endpoints will collect the logs sent by the instrumentation modules and make them available for efficient querying
- Database/query engine and BI tool to run queries on raw data
- Standard dashboards for the business health corresponding to their vertical
- Tool to run ad-hoc analyses on raw data
- Tool to make it easy for people to have conversations about the analytics that have been created
- Data mining and machine learning plugins will allow customers to gain more sophisticated insights about their data
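The instrumentation ideas in the list above — auto-generated events, developer-supplied custom attributes, and a store/compress-and-forward buffer that reduces round trips — can be sketched roughly as follows. This is a hypothetical illustration, not an actual Datacoral SDK; all class and method names (`EventBuffer`, `track`, `flush`) are made up for the example.

```python
import json
import time

class EventBuffer:
    """Hypothetical sketch of an auto-instrumentation client:
    every call to track() is wrapped in an auto-generated envelope,
    buffered locally, and flushed in batches to cut round trips."""

    def __init__(self, flush_size=50):
        self.flush_size = flush_size
        self.pending = []          # events not yet sent
        self.flushed_batches = []  # stand-in for the collection endpoint

    def track(self, event_name, **custom_attrs):
        # Auto-generated envelope; the developer only annotates
        # the event with application-specific attributes.
        event = {
            "name": event_name,
            "ts": time.time(),
            "attrs": custom_attrs,  # developer annotations
        }
        self.pending.append(event)
        if len(self.pending) >= self.flush_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        # A real SDK would compress this batch and POST it to a
        # collection endpoint; here we just record the serialized batch.
        self.flushed_batches.append(json.dumps(self.pending))
        self.pending = []

buf = EventBuffer(flush_size=2)
buf.track("page_view", path="/pricing")
buf.track("signup_click", plan="starter")  # reaches flush_size, auto-flushes
```

The key design point is that batching is transparent to the product developer: they call `track()` as if each event were sent immediately, while the buffer decides when data actually leaves the device.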
How will customers use Datacoral?
Customers will have to provide only the following information, and we will take care of everything else:
- A cross-account role so that we can deploy the data infrastructure stack in their AWS account. If they want to just try it out in our account, we could do that too.
- Credentials for the different integrations they need
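To make the cross-account-role step concrete, the customer would attach a trust policy to an IAM role that lets the vendor's AWS account assume it. The sketch below shows the standard shape of such a trust policy; the account ID and external ID are placeholders, not real Datacoral values.

```python
import json

# Hypothetical cross-account IAM trust policy. "111122223333" stands in
# for the vendor's AWS account ID, and the ExternalId condition guards
# against the confused-deputy problem; both values are placeholders.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {"sts:ExternalId": "example-external-id"}
            },
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
```

With a role carrying this trust policy (plus appropriate permissions), the vendor can provision infrastructure in the customer's account without the customer ever sharing long-lived credentials.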
Once they do this, we will be able to set up a stack for them in a separate VPC. Datacoral will give them:
- Client-side instrumentation libraries in multiple languages
- A URL to point to those libraries
- The URL of the collaborative BI tool
We will also provide guidance for modeling their instrumentation data as well as their production databases.
Once everything is set up, Datacoral can also provide operations and maintenance.
- AWS is eating the world. AWS has 10 times more utilized cloud capacity than the other 14 providers combined. Last year, AWS had five times as much as all the others combined, according to Gartner.
- Big Data business model maturity chart
- App data SDKs (from mParticle)