Series: Building Airbyte’s Data Stack

Simon Späti
September 13, 2022
10 min read

Building a data stack is hard. You start adding tool after tool; you generate copies of data sets in different places—eventually, you lose track of where the data flows through. We want to take a step back and let you join our data journey, tell you the pain we had and how we are trying to overcome this.

We start with how the Airbyte data stack started and how it’s going and will publish follow-up articles with hands-on examples of how we solved the challenges mentioned in this article with open-source tooling. Uses cases such as adding an orchestrator on top of Airbyte, dbt, Metabase, or how adding a metric layer to our data stack. We believe these are universal building blocks, and with a commitment to open-source and transparency, we hope it will help you.

How it started: Goals, OKRs, KPIs

As a fast-growing data startup, we are immensely data-driven. But before we can measure them, we must define what we want to measure. Airbyte has company-wide OKRs (Objectives and Key Results) to follow from day one, such as daily_active_users, installations_per_day, etc. These are not only essential for any company to align on commonly defined goals. But these are usually the start of analytics and dashboards.

In terms of the data stack, it meant using simple tools because not everyone is a data engineer. OKRs are great for building a bridge; they give you non-productive pressure to see how your company is heading—aligning everyone on company-wide goals. 

How Our Data Architecture Grew: Lots of SaaS Tools

Although sharing this early-stage architecture might be intimidating, we want full transparency on how it grew. Many other companies might experience similar data flows, especially in a fast-growing company. Let’s look at what happened at Airbyte during the first year and how our data stack grew. Below is the data flow diagram that shows different SaaS tools stitched together.

Overview of initial Data Flow with diverse SaaS tools
Overview of initial Data Flow with diverse SaaS tools

The power of SaaS tools is to quickly build up knowledge in each technology, perfect for validating your assumptions. As many of these are closed-source tools, we have to use them if we want to make progress fast. In some cases, the tools are not bringing the value you were hoping for; thus, you can stop the subscription and not lose lots of expensive time to build it up and manage it. Especially without a dedicated data team, it's a wise decision to use SaaS tools. If you don’t, it's natural that business users are more likely to choose closed-source as they are less engineering savvy as data engineers build these systems.

Not all tools are related to the (modern) data stack but helping us collect and assess OKRs and monitoring. Below, the above tools and their function are listed, with a separate category for data and analytics.

  • Data and Analytics: DataDog monitors and observes our infrastructure. Metabase and Superset/Preset for dashboards and internal analytics. BigQuery for storing data in a central data warehouse. Airbyte for data integration.
  • User Engagement and Feedback: Orb for credit management. Amplitude for daily active users. HubSpot Growth Marketing. Segment for all custom user interactions. Salesforce for handling customer relationship management. Intercom building customer relationships.
  • Developer Experience: GitHub for engineering workflows and maintaining the core of the Airbyte repo.

How It’s Going: How Our Data Architecture Grew

Fast forward to today, we have around 100 people. And with it, also the need for analytics is growing. That's why we decided to start building a Data Analytics Team, focusing our efforts on building a robust plan to solve the data movement problem, which is at the base of the Data Science Hierarchy of Needs. With ideally two or three data engineers for each data scientist or data analyst, we want to consolidate tools and optimize for our internal analytics. 

To start with, we need to understand the past. We want good monitoring of the present, which requires appropriately reflecting on past events before tackling work on complex models for predicting the future. We want to achieve this with simple overviews, boosting efficiency, and automating repetitive tasks across Airbyte. 

Improved Data Flow Overview

How do we achieve this? In the first step, we align data flow with data integration into a BigQuery data warehouse with our very own ELT Tool. That’s also where we leverage the power of the long-tail integration of Airbyte with 170+ Source connectors.

Data should flow through a central data warehouse hub where transformation logic is centralized and used once. We want not to repeat ourselves, that everything is version controlled and tested.

Improved Data Flow Overview
Improved Data Flow Overview

Improving Access for Our Data Consumers

As these data consumers have different use cases and skills, how do we support them? We have different personas that queried our data today or in the future. We want to ease the query process with the above layers and structure and make data as available as possible. We have business users such as product managers who need to report how the company is doing with their SQL and dashboard tools of choice. Open-source users (you) want to extract data for their data warehouse. Heads or managers wish to pull analytics about specific topics to gain insights and overviews. Data Scientists and other power users want to experiment with data interactively within Notebooks. And internal apps and pipelines access analytical data in an automated fashion.

With these personas and use cases in mind, we are building our Airbyte Data Stack. In this chapter, we will focus on tools that are considered part of the (modern) data stack. You can say from data integration to transofrmation to the BI visualization layer. 

Data Warehouse: BigQuery

As our data warehouse, we use Google BigQuery. As we run our Kubernetes on Google Cloud, it's natural to start with BigQuery as our data warehouse, besides being very elastic and scalable.

Here we land our data from data integration from our engagement and other sources. BigQuery also makes it easy to join lots of diverse data with good query times. 

Start with Data Integration: Airbyte

For connecting all different sources, we use our extensive source connectors that are constantly updated, for example above mentioned event tracking and engagement feedback. With it, we can integrate data in a single tool, and easy to use with our beautiful UI. 

Data Modeling and Transformation (DWH Layers): dbt

The Data Warehouse layers we use to structure our internal data are inspired by the best practices dbt-post on structuring project folders.

1-staging - Technical Layer: Contains the base models where a minimum of modifications is applied. It has the same granularity, but the columns have been renamed, recast, or usefully reconsidered into a consistent format. Each model should bear a one-to-one relationship with the source data table it represents. Each model should maintain a one-to-one relationship with the source data table it means. 

2-marts - Analytical Layer: In more traditional Data Warehouse Models, this layer aims to build dimension and fact tables (see more on Dimensional Modeling) to get closer to business aspects of the data. However, this still requires users to write join-queries, so it might not be the easiest thing to do for useability. 

3-business - Business Layer: One-big-tables focused on delivering a simple table for end-users. These tables should be the ones exposed in the Business Intelligence or Data Virtualization tools for consumption by reports and dashboards.

4-metabase - Reporting Layer: Copies of queries used in the Metabase, our Business Intelligence Tool. Note that because we are using the cloud-hosted or managed Metabase yet. These models should be strict clones of the ones deployed in the data visualization tool, enabling us to build tests and verify that dashboards won't get broken as part of our CI for this repo automatically.

5-reverse-load - Reverse ETL: This is the layer that pushes enriched data back from the data warehouse to source systems such as Salesforce. We started to use Hightouch.

The Self-Serve Layer: Metabase

We use Metabase for company-wide and shareable dashboards. It allows our users to easily create an analytics dashboard without much prior information, even with little SQL knowledge, although that never hurts. We will continually add on top and follow up with an article if possible, especially with a metrics layer, and centralize metrics and dimension.

About Our Open-Stack-Data Project and This Series

We think there are not enough reference projects out there, and what is a better use case for our fast-growing startup story? We want to share internal implementations and examples on an upcoming open-source repo. The goal is to produce a repo where we can share internal implementations from the challenges we face at each stage, along with a blog post.

These are topics we have in mind to tackle in future posts of the series:

  • Automated GitHub PR workflow with Dagster to scan and detect changes in SQL statements using different tools such as dbt, Hightouch, and Metabase.
  • ~Challenge: Multiple places to maintain SQL code without version control and checks of the data team. We are reducing duplicate SQL code and queries.
  • Adding a centralized Metrics Layer for all metrics and dimensions definitions.
  • ~Challenge: Metrics can get complex fast. First, there is the transformation logic. But ad hoc business rules also need to be applied on the fly in different dimensions. You could implement this logic in existing tools (dbt, BI tools, etc.). But it's hard to centralize logic and avoid being locked into it. That's where a metrics layer can help.
  • Analytics on self-deployed Airbyte instance: How many syncs have been running, how long did they take
  • ~Challenge: Running Airbyte OSS with many different syncs, you might lose oversight of what ran successfully or where you had errors. That's where a small operational dashboard on top of the Airbyte Postgres database can help.
  • Adding an orchestrator that oversees all the data tools and orchestrates them.
  • ~Challenge: With a growing team and architecture, tools are mostly used siloed. It's hard to get a single control plane where you see what's going on and maintain the different tools. That's where orchestrators can help.
  • How to solve the data discovery and cataloging issue.
  • ~Challenge: The bigger the company grows, the harder it is to get an overview of existing data sets or dashboards—that is where data catalogs help.

Please provide comments on other recommendations you feel we should cover. As we write new sections, we may add or change topics!

Wrapping up

If you liked what you read so far, make sure you sign up for the Newsletter to get the next one in this series. If you have comments, or critiques, join our Slack Community to network with 6000+ data engineers.

Limitless data movement with free Alpha and Beta connectors
Ready to unlock all your data with the power of 300+ connectors?
Try Airbyte Cloud FREE for 14 days