This post was originally an email which our CEO, Raghu Murthy, shared with Datacoral customers in response to the AWS outage that occurred on Wednesday, November 25th, 2020. It is shared with light edits.
We hope that you and your loved ones had a relaxing holiday break, one that was not too heavily impacted by the AWS outage last week. In case you weren’t aware, AWS had a pretty significant outage of its services in the us-east-1 region on the day before the Thanksgiving break. You can read more about the outage from The Verge and from the Washington Post.
Customers use Datacoral to quickly automate their data pipelines, and we have been working with data pipelines for decades. Below, we share some of the steps we took to make sure that our customers were not impacted by the outage.
Unlike other failures we have seen over the years with AWS, this outage was unprecedented in the sheer number of mission-critical services that were affected and in the length of time they were down. AWS has published a full retrospective, and the reasons behind the issue are a fascinating read. That said, we had a very real and practical problem: making sure that all of our customers’ data pipelines would recover once AWS was back online. Amazon Kinesis was the central service impacted during the outage, and we use Kinesis heavily as a scalable, concurrency-controlled queueing mechanism in our serverless, event-driven architecture.
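To make that a little more concrete, here is a minimal sketch (not our production code; the stream name and task shape are hypothetical) of one common pattern for using Kinesis as a concurrency-controlled queue with boto3: tasks for the same table share a partition key, so they hash to the same shard and are consumed in order, which bounds how much work runs against any one table at a time.

```python
# A minimal sketch (not Datacoral's production code) of using Kinesis as a
# concurrency-controlled queue: tasks for the same table share a partition
# key, land on the same shard, and are consumed in order.
import json
import boto3

kinesis = boto3.client("kinesis")

def enqueue_task(stream_name: str, task: dict) -> None:
    """Publish a pipeline task; the partition key serializes work per table."""
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(task).encode("utf-8"),
        # The same key always hashes to the same shard, bounding concurrency.
        PartitionKey=f"{task['schema']}.{task['table']}",
    )

# Hypothetical stream name and task shape, for illustration only.
enqueue_task("pipeline-tasks", {
    "schema": "salesforce",
    "table": "opportunity",
    "action": "load_increment",
    "interval_start": "2020-11-25T00:00:00Z",
})
```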
One of the hardest things about data pipelines, as opposed to other application software, is that downtime results in a backlog that needs to be recovered. In addition, recovering from the backlog is not just a matter of rerunning everything, since that leads to duplicate work and unnecessary pressure on the source and destination systems. For example, if a connector fetches a snapshot from a source every hour and there is a downtime of 18 hours, you don’t want the connector to fetch a snapshot 18 times in quick succession; that would just result in 18 times the work for the source, the data pipeline, and the data warehouse. On the other hand, if a connector fetches data incrementally, pulling just the deltas every hour, you do want the connector to fetch all the deltas that were missed during the 18-hour downtime.
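To illustrate the difference, here is a minimal sketch (a simplified connector model with hourly intervals, not our actual scheduler) of how catch-up work could be planned after an outage: a snapshot connector collapses the missed intervals into a single run, while an incremental connector replays every missed interval.

```python
# A minimal sketch of catch-up planning after downtime, assuming hourly
# intervals and a simplified Connector model (not Datacoral's actual code).
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

@dataclass
class Connector:
    name: str
    mode: str  # "snapshot" or "incremental"

def plan_catchup(connector: Connector,
                 outage_start: datetime,
                 outage_end: datetime) -> List[datetime]:
    """Return the interval start times the connector should now run for."""
    hours = int((outage_end - outage_start).total_seconds() // 3600)
    missed = [outage_start + timedelta(hours=i) for i in range(hours)]
    if connector.mode == "snapshot":
        # One fresh snapshot covers the whole gap; rerunning 18 times would
        # just produce 18x the load for the same end state.
        return missed[-1:]
    # Incremental connectors must pull every delta they missed.
    return missed

start = datetime(2020, 11, 25, 6)
end = start + timedelta(hours=18)
print(len(plan_catchup(Connector("users_snapshot", "snapshot"), start, end)))       # 1
print(len(plan_catchup(Connector("events_increment", "incremental"), start, end)))  # 18
```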
The same argument around snapshot and incremental processing applies to transformations and publishers as well. Transformations and publishers are trickier since they have their own schedules, independent of the connector schedules. For example, a daily transformation may depend on a table that is updated by a connector every hour. In that case, the daily transformation has to wait for all 24 hourly updates to complete. Finally, when there is a combination of snapshot and incremental processing between connectors, transformations, and publishers that all run on different schedules, and there are hundreds of thousands of such tasks in a single installation, you get a sense of the complexity of the recovery.
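As a rough illustration (the interval model here is hypothetical), the dependency check for such a daily transformation might look like this: it becomes runnable only once all 24 hourly loads of its upstream table have completed.

```python
# A minimal sketch (hypothetical interval model, not Datacoral's scheduler)
# of the dependency check: a daily transformation is runnable only once all
# 24 hourly loads of its upstream table have completed.
from datetime import datetime, timedelta
from typing import Set

def daily_transform_ready(day: datetime,
                          completed_hourly_loads: Set[datetime]) -> bool:
    """True if every hourly interval of `day` has landed in the warehouse."""
    required = {day + timedelta(hours=h) for h in range(24)}
    return required <= completed_hourly_loads

day = datetime(2020, 11, 25)
loads = {day + timedelta(hours=h) for h in range(23)}  # one hour still missing
print(daily_transform_ready(day, loads))                                 # False
print(daily_transform_ready(day, loads | {day + timedelta(hours=23)}))   # True
```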
AWS spent over 18 hours fixing the issue in Kinesis and related services. During this entire time, all pipelines of the affected customers were quiesced. So there was a backlog of hundreds of thousands of tasks (nodes in the pipeline execution DAG) that needed to be caught up while we also handled any new data that was coming in. Our goal was to catch up our customers’ data pipelines before the end of the Thanksgiving week so that their data was up to date when they got back from the holidays.
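Conceptually, catching up means replaying that DAG in dependency order so no task runs before its upstreams. The sketch below (hypothetical task names, using Python’s standard-library graphlib) shows just the ordering; a production scheduler would also dispatch many tasks concurrently and throttle the overall load.

```python
# A minimal sketch (hypothetical task names) of replaying a backlog DAG in
# dependency order; ordering only, with no concurrency or throttling.
from graphlib import TopologicalSorter

# Each task maps to the set of upstream tasks it depends on.
backlog = {
    "load:orders:hour01": set(),
    "load:orders:hour02": set(),
    "transform:daily_orders": {"load:orders:hour01", "load:orders:hour02"},
    "publish:orders_dashboard": {"transform:daily_orders"},
}

def run(task: str) -> None:
    print("running", task)  # stand-in for dispatching the real work

# static_order() yields each task only after all of its upstreams, so loads
# run before transformations, and transformations before publishers.
for task in TopologicalSorter(backlog).static_order():
    run(task)
```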
There is a saying that the culture of a company shows through when the shit hits the fan! And I’m glad to say that we have the people, process, and technology to confront situations like this and recover from them in a systematic fashion. Our entire engineering team showed up and chipped in with ideas. We quickly set up a process to track ownership of each step of the recovery and to communicate progress to our customers along the way. And, technology-wise, while we have significant automation built out, some recovery steps would still have required manual effort, so we ended up doubling down on further automation for recovery. This, coupled with the fact that our architecture is asynchronous and event-driven, meant that our systems recovered fairly rapidly, and the team spent much more of its time making sure that the underlying AWS systems were not overloaded.
I also want to take a moment to thank our team. It was heartening to see how deeply everyone cares about doing right by our customers. While the timing of this outage was less than ideal and pulled team members away from friends and family over the holiday, we have come out stronger, with better processes and automation to handle situations like this in the future.
We really appreciate the opportunity to provide you with a worry-free data pipeline for your analytics. As always, we’d love to hear from you, whether you have questions about the impact of the AWS outage or product feedback to share.