Data Programming Interface (DPI) for Analytics Tools (instead of APIs)

A Data Programming Interface (DPI) is more flexible and approachable for data practitioners who just know SQL.
Simplified Anatomy of an Analytics Tool

In the previous post, we introduced the concept of a data program where a data practitioner who knows SQL can declaratively specify data flows as a combination of data functions. The data functions themselves have signatures/types that are based on the data schema — the attributes (and their types) that form the input to the data functions. In this post we will talk a little more about what it means for the data schema itself to be the interface, which we call the Data Programming Interface.

First, let’s discuss how data processing for analytics is typically thought about. A good example is Google Analytics, which is used for the instrumentation and analysis of website activity. The simplified anatomy of such an analytics tool is shown in Figure 1 below.

As far as the user of the analytics tool is concerned, it is just a two-step process:

  1. Developer makes an API call — essentially a function call that requires a standard set of attributes (like userid, timestamp), and some application-level information.
  2. Once the API call is made, pretty graphs appear in a UI provided by the tool (with some latency).

The developer doesn’t have to worry about what kind of processing happened for the data to get cleaned and aggregated. The analytics tool handles it.

In reality, these analytics tools have multiple systems for collecting and transforming the data to prepare it for visualization. But, functionally, every such tool can be thought of as performing the following three steps internally:

Step a. Events that the tool receives through the API call are stored in a raw table

Step b. A data pipeline takes the raw table through a series of transformations — cleaning, enriching and aggregating

Step c. The output of the data pipeline is stored as a set of summary tables from which data is extracted for visualization

The visualizations are made available (Step 2 for the user) because Steps a, b, and c happen within the tool.
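To make Steps a, b, and c a little more concrete, here is a minimal sketch in plain SQL of what such a pipeline might do internally. All table and column names here (raw_events, cleaned_events, daily_event_counts, and their columns) are hypothetical illustrations, not the schema of any particular tool.

-- Step a: events received through the API call land in a raw table,
-- e.g. raw_events(user_id, event_timestamp, event_action, page_url, ...)

-- Step b: clean and enrich the raw events
CREATE TABLE cleaned_events AS
SELECT
  user_id,
  CAST(event_timestamp AS DATE) AS event_date,
  LOWER(event_action) AS event_action
FROM raw_events
WHERE user_id IS NOT NULL;  -- drop malformed events

-- Step c: aggregate into summary tables that drive the visualizations
CREATE TABLE daily_event_counts AS
SELECT
  event_date,
  event_action,
  COUNT(*) AS num_events,
  COUNT(DISTINCT user_id) AS num_users
FROM cleaned_events
GROUP BY event_date, event_action;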

While this level of abstraction is great for getting started with analytics, most data scientists find such tools pretty rigid and constraining. A few problems include:
  1. Since input data is provided through API calls, application code has to be written for each kind of event sent to the tool. For example, if analytics needs to cover events coming from devices as well as ticketing events generated in Zendesk, code has to be written to send both types of events to the analytics tool.
  2. One cannot easily “backfill/replay” events from a historical time period, or redo the analysis after repopulating the events data.

Now, imagine that instead of making an API call, a data scientist could create, in a database, an input events table that has the same attributes and application-level information as columns, but is populated with data from different sources.

In Datacoral’s Data Programming Language, this would look something like below:

-- receive events from an events endpoint
INSERT INTO events.device_events
FROM events-endpoint
-- pull data from zendesk
INSERT INTO zendesk.ticket_events
FROM zendesk-connection-parameters
-- create an events table as input to the analytics tool
-- by combining events from the endpoint and zendesk
INSERT INTO analytics.all_events
SELECT user as user_id, ts as timestamp, 
  'device' as event_category, action as event_action,
  label as event_label, value as event_value, 
  { ip, device_id } as event_context
FROM events.device_events
UNION ALL
SELECT requester_id as user_id, created_at as timestamp, 
  'ticket' as event_category, ticket_event as event_action, 
  subject as event_label, description as event_value, 
  { ticket_id, assignee }  as event_context
FROM zendesk.ticket_events

So, Step 1 (for the user) and Step a (internal to the tool) in Figure 1 can be replaced by the data program above. This leads to the creation of a raw table, such as the one below.

analytics.all_events — Input to the Analytics Tool

A data pipeline similar to the one within the analytics tool can then take the raw table through a series of transformations. This would result in the same output summary tables that can then drive the visualizations that our user is familiar with.
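This setup also addresses the backfill/replay limitation mentioned earlier: since the input is just a table populated by a data program, redoing the analysis for a historical time period conceptually amounts to re-running the same statement scoped to that period. Here is a hypothetical sketch, reusing the device-events query from above with an added time filter (the backfill syntax and the dates are assumptions for illustration, not Datacoral's actual DPL syntax):

-- replay device events from January 2019 into the analytics input table
INSERT INTO analytics.all_events
SELECT user as user_id, ts as timestamp,
  'device' as event_category, action as event_action,
  label as event_label, value as event_value,
  { ip, device_id } as event_context
FROM events.device_events
WHERE ts BETWEEN '2019-01-01' AND '2019-01-31'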

With data programs, instead of using an API to interface with the analytics tool, the data scientist is using the data directly to interface with it. To be more specific, the analytics tool would specify a data interface, i.e., a schema, for its input, rather than an application programming interface (API). This is what we are calling a Data Programming Interface (DPI).
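Concretely, the DPI of the analytics tool above could be published as little more than the schema of its input table. The sketch below simply mirrors the columns produced by the data program earlier; the column types shown are assumptions, and a real tool would advertise its own types as part of its DPI.

analytics.all_events (
  user_id         string,     -- identifier of the user who generated the event
  timestamp       timestamp,  -- when the event occurred
  event_category  string,     -- source/category, e.g. 'device' or 'ticket'
  event_action    string,     -- what happened
  event_label     string,     -- short label for the event
  event_value     string,     -- value associated with the event
  event_context   struct      -- source-specific context, e.g. { ip, device_id }
)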

At Datacoral, we believe that data practitioners who are comfortable with SQL can use a DPI to provide input data more flexibly to tools for analytics (amongst other things). Datacoral also makes it easy for data practitioners to build data pipelines as data programs. These data programs would then describe their inputs via DPIs. However, that is a topic for a future post.

A similar idea has been discussed before. Jonathan Hsu, in his blog, wrote about a raw table that conforms to a specific schema:

(
  user string,   -- user identifier
  dt date,       -- date that the user created value
  inc_amt double  -- value that the user created on the date dt
)

This table can activate sophisticated analyses around growth, churn, and LTV. At Datacoral, we have generalized some of these concepts into the much more powerful notion of Data Programming.
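As a concrete illustration of the kind of analysis such a (user, dt, inc_amt) table activates, here is a minimal sketch in plain SQL; the table name user_value is hypothetical.

-- daily active users and total value created, per day
SELECT
  dt,
  COUNT(DISTINCT "user") AS active_users,  -- "user" quoted: reserved word in many SQL dialects
  SUM(inc_amt) AS total_value
FROM user_value
GROUP BY dt
ORDER BY dt;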

In the next edition of this series of blog posts, we will describe in detail the features of our Data Programming Language.

If you’re interested in learning more about Data Programming, or want to chat about building scalable data flows, reach out to us at hello@datacoral.co or sign up for a demo.
