September 2019

September was a big month for marketing data infrastructure at Datacoral, especially in the context of our partnership with Amazon Web Services.  At the beginning of the month we conducted and published our first webinar recording: 

The top 5 Requirements for AWS-Native data pipelines 

This event covered how Datacoral goes beyond popular cloud-based ETL and ELT products to support a cost effective, scalable and compelling data infrastructure platform within AWS using native AWS services. 

Beyond supporting AWS-best practice ELT centered around S3, Redshift and Athena, the five additional requirements are: 

Watch the webinar recording of Top 5 Requirements for AWS-Native data pipelines

Raghu Murthy interviewed by Dan Woods at EarlyAdopter.com

Datacoral’s founder, Raghu Murthy, was featured on Dan Woods’ podcast at EarlyAdopter.com, talking about the origin of Datacoral and the advantages of deploying serverless data integration technology.

Listen to Creating the Right Data Pipelines podcast with Dan and Raghu

Building Serverless Data Pipelines on Amazon Redshift by writing SQL with Datacoral

We are very pleased to see that Amazon Web Services has published our blog about how SQL is the data programming language used to build data pipelines in Datacoral.  

The article originally appeared on the Amazon Partner Network blog here.

This article builds upon our earlier data programming series and does a nice job of illustrating how we use SQL as the programming interface, combining it with header comments: key/value settings that tell Datacoral how to process the query. When we deploy it in our system we call it a Data Programming Language (.dpl) file, but it’s just your query with Datacoral instructions in the headers.
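To make the "query plus header settings" idea concrete, here is a minimal sketch in Python. It does not reproduce Datacoral's actual .dpl syntax: the `-- key: value` comment format, the setting names (`example-schedule`, `example-target`), the table name and the function `split_header_settings` are all hypothetical, chosen only to show how key/value headers can be separated from the SQL that follows them.

```python
def split_header_settings(text):
    """Split leading '-- key: value' comment lines from the SQL query.

    The header format here is an illustrative assumption, not the
    real Datacoral .dpl format.
    """
    settings = {}
    query_lines = []
    in_header = True
    for line in text.splitlines():
        stripped = line.strip()
        if in_header and stripped.startswith("--") and ":" in stripped:
            # Everything before the first colon is the key, the rest is the value.
            key, _, value = stripped[2:].partition(":")
            settings[key.strip()] = value.strip()
        elif in_header and not stripped:
            continue  # ignore blank lines before the query starts
        else:
            in_header = False
            query_lines.append(line)
    return settings, "\n".join(query_lines).strip()


# Hypothetical example file content.
EXAMPLE = """\
-- example-schedule: hourly
-- example-target: analytics.daily_orders
SELECT order_date, COUNT(*) AS orders
FROM orders
GROUP BY order_date
"""

settings, query = split_header_settings(EXAMPLE)
print(settings)
print(query)
```

The point of the sketch is only that the query itself stays plain SQL, while the processing instructions travel with it as comments.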

Data Innovators using Datacoral Webinar

We also just ran a webinar that profiled five of our customers: Greenhouse, Front, Jyve, Swing Education and Cheetah. Characterized as Data Innovators, these fast-growing organizations are inventing transformative business models for the gig economy, logistics, mobile, collaboration, human capital management and artificial intelligence. A recording of this event is available here, and it is the building block for what we will introduce next week.

Watch the Data Innovators using Datacoral Webinar video 

Data Infrastructure for Startups Program

On October 10th, we will unveil our next initiative, the Data Infrastructure for Startups Program, designed to help early stage technologists overcome the inevitable issues in tapping their data resources. The common, cost-conscious mindset is to use a combination of open source and ingenuity to build out initial data infrastructures. While it may be OK to trade an engineer’s time for money saved, fairly quickly that trade becomes a maintenance burden that bogs down the engineer’s productivity.

This is a very common problem for data engineers who are building and healing data pipelines, and this program will help resolve that, just as we did for the startups we featured earlier this week. 

As we roll this out, we will also feature data solutions that we have implemented including: 

Join us for our free webinar on October 10 at 9AM Pacific for details of this program.

And what is the ROI if I don’t have to hire so many team members?

I’ve been at Datacoral for two months.  In that time, I’ve met or corresponded with most of our customers. What impresses me is how they describe the value that our data-infrastructure-as-a-service brings them. More than one says that we have saved them from needing to hire a team of engineers to build out and manage their data infrastructure. Okay, that sounds like pretty big value, but it’s abstract value, because I can’t immediately assess what ‘team’ means in terms of membership size, responsibility, roles, and skillset of each team member. So, I’ve been thinking about how to turn the value of this mysterious engineering team into something that I can explain and everyone else understands.  

In discussing this with colleagues and friends, we concluded the following: 

  1. One person isn’t a team, and that individual will be overwhelmed with data access and availability requests within the first three hours on the job. By the end of the week, they’ll be asking for more resources to share the workload. 
  2. With two people we get doubles tennis or beach volleyball–cooperators certainly, but they are a pair and not quite up to team standards for availability, skill coverage and redundancy. 
  3. My original minimum team size estimate is three people, like half-court basketball, because they can potentially provide seven-days-per-week on-call support coverage without overworking any one individual. Data pipelines break all the time because data is unpredictable, so you do need everyday coverage.
  4. Four-person teams are found in curling, where positions and roles begin to take shape. The roles on a curling team are the Lead, who takes the first two turns at throwing the rock before sweeping the next six throws; the Second and Third, who take two throws each; and the Skip, the skipper and strategist for the team, who takes the last two throws. The throws themselves are important in curling, as they put the rock in motion, allowing the sweepers to work around it. In the data pipeline world, I equate this to three people focusing on work around the data, while the rock represents the data itself. Out of all four team members, only one per turn touches the data; the other three are working the infrastructure around it. So, only 25% of the team touches the rock (data) during any given turn.
  5. Full-court basketball, with five people per side, seems to me to be the first all-purpose team size. Here we begin to have defined roles, performance expectations and a high degree of flexibility. I can see one team member working on future infrastructure additions like mapping data models for new sources, two managing the daily chores of orchestration and data quality checks while one person works on improving the value inside the pipeline by building and testing new SQL transformations. And finally, the point guard directing the action, defining how the overall architecture grows to support future needs and making sure management is happy. I like five. 

What’s disappointing about the five-person team I just laid out is that only one of them is focused on actually deriving value within the data pipeline by building transformations, while, like the curling team above, all the others are working outside and around the data. So, a five-person team’s time invested in data (%TID) is only 20%, which seems really low when you consider there are five of them.

I’m sure that I can keep making the same kinds of analogies with a six-member hockey team, a seven-member water polo team, and so on, where the puck or ball represents the data, and the activities and positioning of players receiving passes are infrastructure. We can get all the way up to the 11-player-per-side sports like soccer and American football and it all still works: lots of coordination to prevent the other team from disrupting the flow of the ball (representing data).

Enough Sports, What about Work Teams?

Searching online turned up some great organizational development and scrum articles like Mark Ridley’s “What’s the perfect team size.” (I confess, I’m biased towards this one because he uses Neo4j, the last product I marketed, to express the total number of relationships among teams of any size.) He ultimately concludes that the ideal team size is between 4 and 9 people.  
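One common way to count the relationships inside a team is to count the distinct pairs of people, which is n(n-1)/2 for a team of n. Whether the cited article counts relationships exactly this way is an assumption here; the sketch below only shows how quickly that pairwise count grows across the team sizes discussed in this post. The function name `pairwise_relationships` is illustrative.

```python
def pairwise_relationships(team_size: int) -> int:
    """Number of distinct two-person relationships in a team of the given size."""
    if team_size < 0:
        raise ValueError("team_size must be non-negative")
    return team_size * (team_size - 1) // 2


# Team sizes mentioned in this post, from a solo engineer to an 11-player side.
for size in (1, 2, 3, 4, 5, 9, 11):
    print(f"{size:>2} people: {pairwise_relationships(size)} relationships")
```

Under this measure a five-person team has 10 relationships to maintain, while a nine-person team already has 36.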

Building the ROI Model

Now that I have my five team members, I’d like to figure out how much I might save if I didn’t need them, or what I might accomplish if I reallocated them and their focus time to the specific tasks to which they are suited. Answering the first question, “how much money would I save if I didn’t need to hire a team?”, is pretty straightforward.

Salary.com tells me how much a data architect can expect to make. (Salary.com does not yet track salaries for Data Engineers, so I use its Data Architect figures.) From the table below, we see that the average of the median salaries is $142.6k, and I’ll estimate that fully loaded costs with benefits and bonuses add another 25%, making the average cost per team member just over $178k. So, if I save three team member hires, that’s over $530k, and a five-member team is almost $900k per year. Wow!

San Francisco             Median Salary
Data Architect I          $98,000
Data Architect II         $129,000
Data Architect III        $144,000
Data Architect IV         $163,000
Data Architect V          $179,000
Average                   $142,600
25% Benefits              $35,650
Avg Team Members’ Cost    $178,250
3-Member Team Savings     $534,750
5-Member Team Savings     $891,250
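The arithmetic behind the table can be reproduced with a few lines of Python. This is a sketch of the calculation as described in this post; the salary figures and the 25% benefits estimate come from the table, while the function and variable names are illustrative.

```python
# Median salaries from the table above (San Francisco, Data Architect I-V).
MEDIAN_SALARIES = {
    "Data Architect I": 98_000,
    "Data Architect II": 129_000,
    "Data Architect III": 144_000,
    "Data Architect IV": 163_000,
    "Data Architect V": 179_000,
}
BENEFITS_RATE = 0.25  # the post's estimate for benefits and bonuses


def loaded_cost_per_member(salaries, benefits_rate):
    """Average salary across levels, plus the benefits uplift."""
    average = sum(salaries.values()) / len(salaries)
    return average * (1 + benefits_rate)


def team_savings(team_size, cost_per_member):
    """Yearly cost avoided by not hiring a team of this size."""
    return team_size * cost_per_member


cost = loaded_cost_per_member(MEDIAN_SALARIES, BENEFITS_RATE)
print(f"Cost per team member: ${cost:,.0f}")
for size in (3, 5):
    print(f"{size}-member team savings: ${team_savings(size, cost):,.0f}")
```

Changing `BENEFITS_RATE` or the salary figures shows how sensitive the savings estimate is to those two assumptions.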

Increasing Team Members’ Percent Time Invested in Data (%TID)

Unlike Uber, I’m not going to lay off my engineers; they are still too valuable. What I’d like to do is increase their time invested in data (%TID), capitalizing on their SQL skills, because the infrastructure no longer needs much attention. That table looks like this:

Team Member % Time Invested in Data    Existing %TID    New %TID    Data Investment
1 Team Member                          20%              70%         +50%
Value per member                       $35,650          $124,775    $89,125
5 Team Members’ Investment in Data                      $623,875    $445,625

So, if they go from 20% of their time invested in data to 70%, that’s a 50-point bump in their data investment, which works out to over $89k per employee or over $445k per five-person team.
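The %TID figures follow from multiplying the fully loaded cost per team member ($178,250, from the salary table) by the share of time spent on data. A minimal sketch, with illustrative names and integer percentages to keep the arithmetic exact:

```python
LOADED_COST_PER_MEMBER = 178_250  # fully loaded yearly cost from the salary table
TEAM_SIZE = 5


def value_invested_in_data(cost_per_member, tid_percent):
    """Dollar value of the time a team member spends working on data."""
    return cost_per_member * tid_percent / 100


existing = value_invested_in_data(LOADED_COST_PER_MEMBER, 20)  # existing %TID
new = value_invested_in_data(LOADED_COST_PER_MEMBER, 70)       # new %TID
gain_per_member = new - existing

print(f"Existing value per member: ${existing:,.0f}")
print(f"New value per member:      ${new:,.0f}")
print(f"Gain per member:           ${gain_per_member:,.0f}")
print(f"New value for the team:    ${TEAM_SIZE * new:,.0f}")
print(f"Gain for the team:         ${TEAM_SIZE * gain_per_member:,.0f}")
```

With these inputs the gain is $89,125 per member and $445,625 for the five-person team.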

We can now look at these benefits in a few ways: 

Those are pretty big returns, and I suspect the benefits of outsourcing my data infrastructure (assuming I make no trade-offs in security and manageability) are even bigger, like: 

I’m hosting a webinar tomorrow morning on the Top 5 requirements for AWS-native Data Pipelines. In it, I’ll explain the latter two benefits more deeply as we discuss why a serverless microservices architecture, which is how we are built, is the optimal deployment model for AWS customers. Then we’ll talk about how orchestration, change awareness and data publishing are the remaining key requirements, beyond the AWS best-practice ELT support that most other vendors tout, for enjoying the customer benefits I’ve described in this post.

Click here to register for tomorrow’s event or to receive its recording afterward.  

Datacoral is headed to New York City on Thursday, July 11 as a Silver sponsor of AWS Summit at the Jacob Javits Center. The event is free and runs from 7 AM to 6:30 PM.

We will be showing off our AWS-based Data Infrastructure as a Service (DIaaS) for data engineers, data scientists, Redshift administrators and BI analysts. Datacoral is a complete, end-to-end data pipeline service that runs securely in your VPC, connects to your cloud data, organizes and orchestrates it in Redshift, and allows users, applications and original sources to harness the results. Data is delivered as materialized views to whatever target you want.

We help customers address the most critical problems in data self-service–building and maintaining their data pipelines. Our customers tell us that we help save them over half a million dollars in resources per year, while giving their data engineers time to actually work with the data, not around it, which results in happy data scientists and consumers.

If you have data pipeline troubles, or are just moving into AWS altogether, then come see us in Booth #149, in the far left corner as you enter the exhibit hall, next to the Dev Lounge. We will have plenty of space and great give-aways including t-shirts, wireless phone chargers, pens, stickers and more. Plus you can meet Datacoral’s founder Raghu Murthy, who cut his teeth building infrastructures at Facebook and Yahoo!.

See you in New York!

Schedule a demo or request information

 
