Building a data stack is hard. You start adding tool after tool; you generate copies of data sets in different places, and eventually, you lose track of where your data flows. We want to take a step back and let you join our data journey: tell you about the pain we had and how we are trying to overcome it.
We start with how the Airbyte data stack began and how it’s going, and we will publish follow-up articles with hands-on examples of how we solved the challenges mentioned in this article with open-source tooling. Use cases include adding an orchestrator on top of Airbyte, dbt, and Metabase, or adding a metrics layer to our data stack. We believe these are universal building blocks, and with a commitment to open source and transparency, we hope they will help you.
As a fast-growing data startup, we are immensely data-driven. But before we can measure anything, we must define what we want to measure. Airbyte has had company-wide OKRs (Objectives and Key Results) from day one, such as daily_active_users, installations_per_day, etc. Not only are these essential for aligning any company on commonly defined goals; they are usually the starting point for analytics and dashboards.
In terms of the data stack, this meant using simple tools, because not everyone is a data engineer. OKRs are great for building a bridge; they apply healthy pressure to see where your company is heading, aligning everyone on company-wide goals.
Although sharing this early-stage architecture might be intimidating, we want full transparency on how it grew. Many other companies might experience similar data flows, especially in a fast-growing company. Let’s look at what happened at Airbyte during the first year and how our data stack grew. Below is the data flow diagram that shows different SaaS tools stitched together.
The power of SaaS tools is that you quickly build up knowledge in each technology, which is perfect for validating your assumptions. Many of these are closed-source tools, but they let us make progress fast. If a tool doesn't bring the value you hoped for, you can cancel the subscription without losing lots of expensive time building and managing it yourself. Especially without a dedicated data team, using SaaS tools is a wise decision. Business users are also naturally more likely to choose closed-source tools, as these are built for users who are less engineering-savvy than the data engineers who build such systems themselves.
Not all of these tools are part of the (modern) data stack; some help us collect and assess OKRs and do monitoring. The tools and their functions are listed below, with a separate category for data and analytics.
Fast forward to today: we have around 100 people, and with that, the need for analytics is growing too. That's why we decided to start building a Data Analytics Team, focusing our efforts on a robust plan to solve the data movement problem, which sits at the base of the Data Science Hierarchy of Needs. With ideally two or three data engineers for each data scientist or data analyst, we want to consolidate tools and optimize for our internal analytics.
To start with, we need to understand the past. We want good monitoring of the present, which requires properly reflecting on past events, before tackling complex models that predict the future. We want to achieve this with simple overviews, boosting efficiency and automating repetitive tasks across Airbyte.
How do we achieve this? As a first step, we consolidate data flows by integrating data into a BigQuery data warehouse with our very own ELT tool. That’s also where we leverage the power of Airbyte's long tail of 170+ source connectors.
As these data consumers have different use cases and skills, how do we support them? Different personas query our data today or will in the future. We want to ease the query process with the layers and structure above and make data as available as possible:

- Business users, such as product managers, who need to report how the company is doing with their SQL and dashboard tools of choice.
- Open-source users (you!) who want to extract data into their own data warehouse.
- Heads and managers who wish to pull analytics about specific topics to gain insights and overviews.
- Data scientists and other power users who want to experiment with data interactively within notebooks.
- Internal apps and pipelines that access analytical data in an automated fashion.
With these personas and use cases in mind, we are building our Airbyte Data Stack. In this chapter, we will focus on tools that are considered part of the (modern) data stack: from data integration to transformation to the BI visualization layer.
As our data warehouse, we use Google BigQuery. Since we run our Kubernetes clusters on Google Cloud, BigQuery was a natural choice, besides being very elastic and scalable.
This is where we land the data we integrate from our engagement tools and other sources. BigQuery also makes it easy to join lots of diverse data with good query times.
To connect all the different sources, we use our own constantly updated source connectors, for example for the event tracking and engagement feedback tools mentioned above. With them, we can integrate data through a single, easy-to-use tool with a beautiful UI.
1-staging - Technical Layer: Contains the base models, where a minimum of modifications is applied. Each model keeps the same granularity as its source, but columns are renamed, recast, or reformatted into a consistent format. Each model should maintain a one-to-one relationship with the source data table it represents.
2-marts - Analytical Layer: As in more traditional data warehouse models, this layer aims to build dimension and fact tables (see more on Dimensional Modeling) to get closer to the business aspects of the data. However, this still requires users to write join queries, which hurts usability.
3-business - Business Layer: One-big-table models focused on delivering simple tables to end users. These are the tables exposed in the Business Intelligence or Data Virtualization tools for consumption by reports and dashboards.
4-metabase - Reporting Layer: Copies of the queries used in Metabase, our Business Intelligence tool. Note that we are using the cloud-hosted, managed Metabase, so these queries live outside of version control. The models in this layer should be strict clones of the ones deployed in the data visualization tool, enabling us to build tests in this repo's CI that automatically verify dashboards won't break.
5-reverse-load - Reverse ETL: This layer pushes enriched data from the data warehouse back to source systems such as Salesforce. We started using Hightouch for this.
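As a minimal sketch of the staging layer, a dbt model might rename and recast raw columns while keeping a one-to-one relationship with its source table. The source, table, and column names below are hypothetical, not our actual schema:

```sql
-- models/1-staging/stg_users.sql
-- Hypothetical staging model: same grain as the raw source table,
-- with columns renamed and recast into a consistent format.
with source as (

    select * from {{ source('raw', 'users') }}

)

select
    id                            as user_id,
    lower(email_address)          as email,
    cast(created_ts as timestamp) as created_at,
    status = 'active'             as is_active
from source
```

Keeping this layer mechanical (no joins, no business logic) means every downstream mart can trust that one staging model maps to exactly one source table.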
We use Metabase for company-wide, shareable dashboards. It allows our users to easily create an analytics dashboard without much prior knowledge, even with little SQL experience, although that never hurts. We will continually build on top of this and follow up with an article where possible, especially on a metrics layer to centralize metrics and dimensions.
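To illustrate how the reporting-layer clones can guard dashboards in CI, here is a small sketch that compares a query checked into the repo against the one deployed in Metabase. The function names and sample queries are hypothetical; in a real pipeline, the deployed SQL would be fetched (e.g. via Metabase's REST API) rather than hard-coded:

```python
import re


def normalize_sql(sql: str) -> str:
    """Collapse whitespace and lowercase so cosmetic edits don't fail the check."""
    return re.sub(r"\s+", " ", sql).strip().lower()


def queries_match(repo_sql: str, deployed_sql: str) -> bool:
    """True if the repo clone and the deployed query are equivalent
    up to whitespace and casing."""
    return normalize_sql(repo_sql) == normalize_sql(deployed_sql)


# Hypothetical clone checked into the dbt repo's 4-metabase layer.
repo_clone = """
SELECT user_id, COUNT(*) AS events
FROM   business.user_events
GROUP  BY user_id
"""

# In CI this string would come from the Metabase instance instead.
deployed = "select user_id, count(*) as events from business.user_events group by user_id"

print(queries_match(repo_clone, deployed))  # prints True
```

A stricter setup could parse both queries into ASTs before comparing, but even this whitespace-insensitive check catches a dashboard query silently drifting from its versioned clone.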
We think there are not enough reference projects out there, and what better use case than our fast-growing startup story? We want to share internal implementations and examples in an upcoming open-source repo. The goal is a repo where we share internal implementations of the challenges we face at each stage, each accompanied by a blog post.
These are topics we have in mind to tackle in future posts of the series:
Please provide comments on other recommendations you feel we should cover. As we write new sections, we may add or change topics!
If you liked what you read so far, make sure you sign up for the newsletter to get the next article in this series. If you have comments or critiques, join our Slack Community to network with 6,000+ data engineers.
Get all your ELT data pipelines running in minutes with Airbyte.