Datacoral provides an end-to-end data engineering platform to extract, load, transform and serve business data. In short, we help data scientists create and manage data pipelines with only a few clicks. As part of our installation process, we deploy a set of microservices within our customers’ VPC to help them with their data engineering workloads. For example, these services might fetch data from a database like Postgres and load it into a warehouse like Redshift, Snowflake or Athena.
Internally, our services communicate via events, which we call “data events”. In this blog post, we will first explore what these data events are, then show how they enable event-based orchestration, which leads to robust data pipelines, and finally, discuss how our customers can use these events to unlock different use cases.
A data event represents a batch of data that was extracted, loaded, transformed or published by our platform. The event consists of a success or failure status, the time period (or “timelabel”) for the batch of data, and other metadata for that batch.
These events are emitted by the compute process (usually an AWS Lambda function) that completes the data “movement”, and are received by any compute process(es) that use the recently moved data. As an example, a Lambda that loads data into AWS Redshift might send a “data event” after a table has been updated.
The Data Event might look like the following:
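As a rough illustration only, the fields described above (status, timelabel and batch metadata) could be modeled as in the sketch below. Every field name and value here, including `event_type` and the contents of `metadata`, is an assumption made for illustration and not Datacoral's actual event schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class DataEvent:
    """Hypothetical shape of a data event; all field names are assumptions."""

    event_type: str   # illustrative label, e.g. "redshift_table_update"
    status: str       # "success" or "failure"
    timelabel: str    # time period that the batch of data covers
    metadata: Dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        # Serialize with sorted keys so the output is stable.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "DataEvent":
        # Rebuild the event from its JSON form.
        return cls(**json.loads(raw))


# Example: an event a loader might emit after updating a table.
event = DataEvent(
    event_type="redshift_table_update",
    status="success",
    timelabel="2020-01-01T09:00:00Z",
    metadata={"schema": "public", "table": "orders"},
)
print(event.to_json())
```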
Data event types include:
- S3 Data Fetch events (from different connectors)
- Redshift, Glue, Snowflake table update events
- Redshift, Athena, Snowflake materialized view update events
- Data Publish events (to destinations such as Postgres, Salesforce)
An example of how data events are used within our architecture is shown below. This figure shows Datacoral fetching data from a database into S3, adding metadata in AWS Glue and loading the data into AWS Redshift.
Any system that has more than one job/task/process/component has an orchestration problem. Orchestration refers to the automated configuration, coordination and management of the different parts of a system. In the context of data engineering, the coordination component of orchestration can be either event-based or schedule-based. In the first case, processes get triggered when they receive an event from another process (such as the Redshift loader process being triggered by an event from the data fetch process). In the second case, processes start and run based on some specified schedule (such as the cron expression 0 * * * *, which runs at minute 0 of every hour).
Data Event-based Orchestration vs Schedule-based Orchestration
In the earlier example, we explored a data event-based architecture for fetching data from a database (such as Postgres) and loading it into a data warehouse like Redshift. If this were schedule-based orchestration, we might have a setup that looks like the following:
- Data starts to be fetched from the source database to a known prefix in S3 at 9 am every day. This usually takes up to 15 minutes to complete.
- At 9:15 am, a process wakes up and loads data from S3 to Redshift. This load usually takes up to 15 minutes.
- At 9:30 am, an external process wakes up and begins reading data from Redshift.
However, there are many ways things can go wrong here. First, what if the data fetch from Postgres to S3 takes more than 15 minutes on some day? In that case, the process that loads data into Redshift will still wake up at 9:15 am and load the previous day’s data. This leads to incorrect results downstream, without an explicit error or failure anywhere in the data pipeline. In an event-driven architecture, processes can check metadata such as status and timelabel to decide whether to load data into Redshift.
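A minimal sketch of that check, assuming events arrive as dictionaries with `status` and `timelabel` keys (both key names are assumptions, as is the helper `timelabel_to_load`). The loader acts only on successful events for batches it has not loaded yet, so a slow fetch merely delays the load instead of causing stale data to be loaded.

```python
from typing import Dict, Optional, Set


def timelabel_to_load(event: Dict[str, str], loaded: Set[str]) -> Optional[str]:
    """Return the timelabel to load, or None if the event should be skipped."""
    # Never load a batch whose upstream step failed.
    if event.get("status") != "success":
        return None
    timelabel = event.get("timelabel")
    # Skip events without a timelabel and batches that were already loaded.
    if not timelabel or timelabel in loaded:
        return None
    return timelabel


loaded: Set[str] = set()
events = [
    {"status": "failure", "timelabel": "2020-01-02"},  # fetch failed
    {"status": "success", "timelabel": "2020-01-02"},  # retry succeeded
    {"status": "success", "timelabel": "2020-01-02"},  # duplicate delivery
]
for event in events:
    label = timelabel_to_load(event, loaded)
    if label is not None:
        loaded.add(label)
        print(f"loading batch {label}")
    else:
        print("skipping event")
```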
Second, what if this data fetch needed to be run again at 4 pm on some day? What if it needed to be run twice a day? What if data loads to Redshift began to take up to 30 minutes (instead of just 15 minutes)? In all these cases, the schedule of every individual process has to be adjusted through manual code changes, which leads to a very brittle data pipeline. In an event-driven architecture, each process only responds to the right “data event”, so there are no schedules to keep in sync.
Third, this approach does not scale as the number of dependency graphs (or DAGs) increases. For example, if data needs to be fetched from a MySQL database, then a separate script needs to be created and managed for this new data flow. We need to figure out whether the fetch from MySQL to S3 takes less or more than 15 minutes and then adjust schedules accordingly. With event-based orchestration, when a process succeeds, the processes that depend on it can be read in (from a config store) and triggered. This approach is better aligned with a microservices architecture, which leads to improved scalability.
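The config-store lookup can be sketched roughly as follows. The in-memory `DEPENDENTS` mapping stands in for a real config store, and the process names, the event keys (`source`, `status`, `timelabel`) and the `trigger` callback are all hypothetical. Supporting a new MySQL flow then means adding one configuration entry, with no schedules to tune.

```python
from typing import Callable, Dict, List

# Hypothetical config store: each process maps to the processes that consume its output.
DEPENDENTS: Dict[str, List[str]] = {
    "postgres_fetch": ["redshift_load"],
    "mysql_fetch": ["redshift_load"],  # a new source is one extra entry
    "redshift_load": ["materialized_view_update"],
}


def on_event(event: Dict[str, str], trigger: Callable[[str, str], None]) -> List[str]:
    """Trigger every dependent of the process that emitted a successful event."""
    if event["status"] != "success":
        return []  # failures do not propagate downstream
    triggered = []
    for name in DEPENDENTS.get(event["source"], []):
        trigger(name, event["timelabel"])
        triggered.append(name)
    return triggered


def print_trigger(process: str, timelabel: str) -> None:
    # Stand-in for invoking the real compute process.
    print(f"triggering {process} for {timelabel}")


on_event(
    {"source": "mysql_fetch", "status": "success", "timelabel": "2020-01-02"},
    print_trigger,
)
```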
While there are other points of comparison we can make, this should be a convincing enough argument for event-driven orchestration.
Exposing our Data Events to Customers
Many customers want to build on top of our orchestration framework to unlock diverse use cases such as business-logic-driven alerting and notifications, specialized monitoring, downstream data processing and other uses within their own products. We are excited to announce that we’ll be making our data events available to our users, which will allow them to build easily on top of the Datacoral platform.
As a first iteration, users will be able to provide us with an SQS queue through our application, and we will push all data events to this queue for downstream consumption. Here’s how to set up an SQS queue to receive “data events” in your Datacoral installation.
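Once events are flowing into the queue, a downstream consumer could be sketched as below. This is an illustration under stated assumptions, not Datacoral's documented integration: it assumes each message body is a JSON-encoded data event, and the helper name `poll_once` is hypothetical. The client is passed in so the function can be exercised without AWS access. `receive_message` and `delete_message` are the standard SQS client operations in boto3.

```python
import json
from typing import Any, Callable, Dict


def poll_once(
    sqs_client: Any,
    queue_url: str,
    handler: Callable[[Dict[str, Any]], None],
) -> int:
    """Receive one batch of messages, handle each event, and return the count."""
    response = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # SQS returns at most 10 messages per call
        WaitTimeSeconds=20,      # long polling to avoid empty responses
    )
    handled = 0
    for message in response.get("Messages", []):
        # Assumption: the message body is a JSON-encoded data event.
        event = json.loads(message["Body"])
        handler(event)
        # Delete only after the handler succeeds, so failed events are redelivered.
        sqs_client.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
        handled += 1
    return handled
```

With boto3 installed and credentials configured, this could be called as `poll_once(boto3.client("sqs"), queue_url, handler)`, where `queue_url` is the URL of the queue you registered and `handler` holds your alerting or monitoring logic.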