Before joining Substack as a data engineer, Mike Cohen served in data science and analytics roles at startups like Venmo and Fin. Mike discovered Datacoral while at Fin, and quickly became an advocate for Datacoral at Substack. He found that features like fast data syncs, hard deletes, automatic schema change detection, and the ability to double-click into data lineage and synchronization processes were invaluable to him and his team.
Why Datacoral?
Before deploying Datacoral, Mike’s team at Fin was struggling with several difficult issues, including:
- Data tools added proprietary columns to the dataset that were useful for the tools themselves, but not for Fin’s analytics requirements
- Tooling did not automatically track or account for schema changes
- Data syncs ran on schedules that weren’t fast enough for business requirements
One of Mike’s colleagues had previously worked with the Datacoral team and recommended bringing them in to address these problems. According to Mike, “From then on, it was simple and straightforward. Everything just always worked and support was always top-notch.”
Using Datacoral at Substack
Fin’s data stack implementation was so successful that Mike knew he had to bring it into his next role at Substack. One of his first projects was to set up Substack’s data infrastructure with a data warehouse to centralize all the production data.
Substack’s early data infrastructure included the production database and a separate events database. Mike created a Snowflake database as a centralized data destination. Datacoral was instrumental in establishing functional pipelines and maintaining synchronization between production and the warehouse. “If it’s something that’s of super high importance, we use Datacoral. That includes our production data, our Zendesk data for support tickets, and we’re starting to spin up and use it for our marketing data as well. It’s growing with us.”
What stands out for Substack, in terms of Datacoral’s offering, is the fast data synchronization. “The syncs are fast, frequent, and one of the things Datacoral’s excellent at is capturing hard deletes.” Though several other tools on the market recommend adding a `deleted_at` column and relying on a soft-delete mechanism, that approach can create additional, unnecessary work for the data team.
Furthermore, soft deletes can lead to data mismatches between production and the data warehouse. For GDPR-related data requirements, soft deletes and other data compromises are insufficient. Soft deletes can also lead to row-count discrepancies and miscounts. Cohen adds, “Our production data will say we have 1,000 users and our data warehouse will say we have 1,500 users. The soft deletes didn’t propagate to the data warehouse.”
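The mismatch Cohen describes can be sketched in a few lines. This is a hypothetical illustration, not Datacoral's implementation: production hard-deletes rows, while a soft-delete pipeline only stamps a `deleted_at` column in the warehouse, so raw counts diverge unless every query remembers to filter.

```python
# Hypothetical illustration of the soft-delete mismatch.
# Production hard-deletes rows; a soft-delete pipeline only stamps
# `deleted_at` in the warehouse, so the deleted row lingers there.

production_users = [
    {"id": 1}, {"id": 2}, {"id": 3},  # user 4 was hard-deleted upstream
]

warehouse_users = [
    {"id": 1, "deleted_at": None},
    {"id": 2, "deleted_at": None},
    {"id": 3, "deleted_at": None},
    {"id": 4, "deleted_at": "2024-01-15"},  # still present, merely flagged
]

prod_count = len(production_users)
warehouse_count = len(warehouse_users)  # naive count: inflated

# Every analytics query must filter on deleted_at, or counts disagree:
filtered_count = len([u for u in warehouse_users if u["deleted_at"] is None])

print(prod_count, warehouse_count, filtered_count)  # 3 4 3
```

One forgotten filter anywhere in the analytics stack reproduces exactly the kind of user-count discrepancy quoted above; capturing hard deletes in the sync avoids the problem at the source.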
High Data Quality with Datacoral
With the help of Datacoral, Substack knows its “data is going to be correct, accurate, and up to date.” Mike and his data team don’t worry about where data went or whether a slice of data has been dropped. Datacoral handles all of the orchestration and data quality checks. As a result, Substack need not be concerned with data loss or synchronization issues.
Beyond the bread and butter data pipeline features, Cohen specifically calls out the following “killer features” of Datacoral:
- Snapshot data syncs, even on large tables
- Data quality checks are built into the integration through Datacoral’s SQL transformations
- Double-clickable data elements
Substack uses snapshot data syncs for the freshest data, giving them a near-live copy of what’s in production. Considering the challenges of hard and soft deletes, Cohen especially appreciates the ability to run these syncs as often as he wishes.
Datacoral’s data quality checks offer “really nice peace of mind.” With the help of Datacoral’s support team, Substack has built data checks on its most important tables, giving the team assurance that what’s in production matches what’s being seen in Snowflake.
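A check like this can be pictured as a simple row-count parity comparison between production and the warehouse. The sketch below is an assumption for illustration only (Datacoral builds its checks as SQL transformations; the function name and inputs here are hypothetical):

```python
# Minimal sketch of a row-count parity check between production and the
# warehouse. Hypothetical illustration -- not Datacoral's actual mechanism.

def row_count_check(prod_counts: dict, warehouse_counts: dict) -> list:
    """Return (table, prod_count, warehouse_count) for tables that drift."""
    mismatches = []
    for table, prod_n in prod_counts.items():
        wh_n = warehouse_counts.get(table)
        if wh_n != prod_n:
            mismatches.append((table, prod_n, wh_n))
    return mismatches

# Example: the users table has drifted, echoing the 1,000-vs-1,500 scenario
# Cohen describes when soft deletes fail to propagate.
prod = {"users": 1000, "posts": 5400}
warehouse = {"users": 1500, "posts": 5400}

print(row_count_check(prod, warehouse))  # [('users', 1000, 1500)]
```

Running such a comparison on every sync is what turns “hoping the warehouse is right” into the peace of mind described above.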
When Cohen needs to “dig deeper,” he’s able to “break apart a process into its component parts. When you’re syncing data from Postgres into Snowflake, there’s a piece where you’re syncing out of Postgres and another syncing into Snowflake. Each of those pieces is highlighted and there are helpful UI elements like a Gantt chart that shows you how long things take. It gives you a sense of what’s happening and when.”
Cohen continues, “Some people want to know the answers instantaneously and some can accept some semblance of latency. Datacoral lets us do both.” Tables with tens of millions of records take longer to sync than tables with small datasets. Datacoral manages the synchronization schedules for the lowest latency considering the table size. “For tables [where] we want high fidelity, we can set that and be able to do analytics to answer different questions at different speeds depending on stakeholder requirements.”
Protecting Production Data
Using production data without accessing the production database is one of the first challenges companies face on the journey to becoming data-driven. “When we need to analyze production data, we only access it through Snowflake, our data warehousing tool. That’s by design.” This allows Substack to surface meaningful insights in BI tooling, create denormalized views for analytics, or power intelligent customer-facing features.
Protecting production infrastructure and data allows Substack to keep its data and metadata clean. Keeping production clean allows Datacoral to seamlessly and transparently manage orchestration, maintain high data quality, and ensure data fidelity expectations are met.
Cohen says that without Datacoral, Substack would be left to deploy homegrown or third-party tooling. The proliferation of tooling leads to unsustainable maintenance and custom coding costs.
Datacoral is “superior to other products”
There’s a reason Cohen is happy to talk about Datacoral. After bringing it to Substack, he says, “It’s been smooth sailing ever since. I’ve been a big fan and continue to be an advocate for the product because I think it’s superior to other products I’ve seen and done demos of.”
Fast data syncs, the ability to handle hard deletes, automatic schema change detection, and high data quality are just some of the reasons Substack chooses Datacoral.
To try the product for yourself, you can sign up here.