How to build E-commerce Data Pipeline with Airbyte?

Join our newsletter to get all the insights on the data stack

This guide is built for data engineers looking to create a seamless and efficient data pipeline for e-commerce analytics. We'll be diving into the practical implementation of a data workflow using Airbyte for data integration, dbt for transformation, Dagster for orchestration, and Google BigQuery for data warehousing.

What this hands-on tutorial covers:

Integrating Key Data Tools: Learn how to connect Airbyte, dbt, Dagster, and BigQuery to build a cohesive data pipeline.
Hands-On Configuration and Usage: Get your hands dirty with the actual setup and use of these tools, configuring each component to work together effectively.
Data Transformation Techniques: Understand how to leverage dbt for transforming raw e-commerce data into a format ready for analysis.
Workflow Visualization and Management: Explore how Dagster provides visibility and control over your data workflows, making it easier to monitor and adjust processes as needed.

This tutorial is structured to give you a clear, no-nonsense path to setting up a robust e-commerce data pipeline. It's about getting the job done efficiently and understanding the nuances of each tool involved in the process.

Prerequisites

Before beginning this tutorial on E-commerce analytics with Airbyte, dbt, Dagster, and BigQuery, ensure you have the following prerequisites set up:

Python 3.10 or later: If you haven't installed Python yet, download it from the official Python website. Make sure to select the version that matches your operating system.
Docker and Docker Compose (Docker Desktop): Install Docker by following the instructions on the official Docker documentation. Choose the version appropriate for your OS.
Airbyte OSS version: Deploy the open-source version of Airbyte locally by following the installation instructions in the Airbyte Documentation.
Google Cloud account with BigQuery: BigQuery will serve as our data warehouse. If you don’t already have a Google Cloud account, sign up for a free trial account. This amount is more than sufficient for this project.
Terraform (Optional): Terraform is used for infrastructure as code, allowing you to define and manage resources programmatically. This step is optional, as Airbyte resources can also be managed via the UI. If you choose to use Terraform, follow the official Terraform installation guide.

1. Setting Up the Project Environment

In this section, you'll set up the local environment for your E-commerce analytics project. You'll clone the necessary repository, explore the project structure, and prepare a Python environment. Follow these steps:

1. Clone the Repository:

Start by cloning the Airbyte quickstart repository. This repository contains a variety of projects, but we're interested in one specific to E-commerce analytics with BigQuery. Use the following commands in your terminal:

git clone --filter=blob:none --sparse https://github.com/airbytehq/quickstarts.git
cd quickstarts
git sparse-checkout add ecommerce_analytics_bigquery

These commands clone the repository with minimal data and then navigate into the project directory.

2. Navigate to the Project Directory:

Once cloned, move into the project directory:

cd ecommerce_analytics_bigquery

Now is a good time to open this directory in your IDE (like Visual Studio Code). Take a moment to familiarize yourself with the folder structure and files. You'll notice three main folders:

dbt_project: Contains everything related to dbt (Data Build Tool), used for data transformation.
infra: Holds the Airbyte Terraform configuration for infrastructure management.
orchestration: Contains the Dagster code, which will be used for orchestrating our data workflows.

3. Set Up a Virtual Environment:

A virtual environment is used for managing Python dependencies. Create and activate one using these commands:

For Linux and Mac:

python3 -m venv venv
source venv/bin/activate

For Windows:

python -m venv venv
.\venv\Scripts\activate

This isolates our project dependencies from other Python projects on your system.

4. Install Dependencies:

Install all necessary Python dependencies for the project using pip. Run:

pip install -e ".[dev]"

This command installs all packages listed in the project's setup.py file, including any optional development dependencies.

2. Setting up BigQuery

BigQuery, Google's serverless data warehouse, plays a pivotal role in this project for storing and analyzing large datasets efficiently. Follow these steps to set up BigQuery:

1. Create a Google Cloud Project

If you don't have a Google Cloud project yet, you'll need to create one.

Go to the Google Cloud Console.
If you’re totally new to Google Cloud, you will need to sign up. Google typically offers free trials which are excellent for projects like this.
Go to the console, click on the "Select a project" dropdown and select "New Project".
Give your project a name, select an “Organization” and click on “Create”.

2. Create BigQuery Datasets

We'll need two datasets in BigQuery: one for raw data from Airbyte and another for transformed data from dbt.

In the Google Cloud Console, go to BigQuery.
Create two new datasets named raw_data and transformed_data. If you choose different names, remember to update these in your project code.

How to create a dataset:

In the BigQuery UI, click on the three dots next to your project name.
Click “Create Dataset”.
Enter raw_data or transformed_data as the dataset ID.
Click "Create Dataset".

‍

Create two BigQuery datasets: raw_data and transformed_data

3. Create Service Accounts and Assign Roles

Service accounts will allow our application to interact with BigQuery securely.

Go to “IAM & Admin” > “Service accounts” in the Google Cloud Console.
Click “Create Service Account”.
Name your service account (one for Airbyte and one for dbt) and proceed to “Create and continue”.
Assign the “BigQuery Data Editor” and “BigQuery Job User” roles. These roles enable the service account to modify data and execute jobs in BigQuery.
Complete the process by clicking on “Done”.

‍

Create two service accounts, for Airbyte and dbt

Add the necessary roles to the service accounts

4. Generate JSON Keys for Service Accounts

JSON keys are used for authenticating the service accounts.

Navigate back to the “Service accounts” page and find the service accounts you just created.
Click on the account name.
In the “Keys” section, click “Add Key” > “Create new key” and choose JSON format.
The key will download automatically. Securely store this key and keep it confidential.

‍

Create a new JSON key for both service accounts

3. Setting Up Airbyte Connectors

Airbyte connectors are essential for data integration in our project. They enable data synchronization between various platforms. You can set up these connectors either via Terraform for a more automated approach or manually using the Airbyte UI. Choose one of the two following options.

Setting Up Airbyte Connectors with Terraform

Terraform allows you to establish infrastructure as code for the creation and management of your Airbyte connectors. Here's how you can set this up:

1. Navigate to the Airbyte Configuration Directory:

This is where your Terraform configuration files are located.

2. Modify Configuration Files:

Within the infra/airbyte directory, you'll find three Terraform files:

provider.tf: Defines the Airbyte provider, credentials and the server url.
main.tf: Contains the main configuration for creating Airbyte resources. In this case, it creates a “Faker” source, a BigQuery destination, and a connection between them.
variables.tf: Holds various variables, including credentials. You may, or may not use this file. More on this below.

Adjust the configurations in these files to suit your project's needs. You can utilize the variables.tf file to manage the BigQuery connection credentials: You’ll be prompted to enter them when you execute terraform plan and terraform apply. If going for this option, just skip to step 3.

If you don’t want to use variables, remove them from the file, and provide credentials for your BigQuery connection directly in the main.tf file:

dataset_id: The name of the BigQuery dataset where Airbyte will load data. In this case, enter “raw_data”.
project_id: Your BigQuery project ID.
credentials_json: The contents of the service account JSON file. You should input a string, so you need to convert the JSON content to string beforehand.
workspace_id: Your Airbyte workspace ID, which can be found in the webapp url. For example, in this url: http://localhost:8000/workspaces/910ab70f-0a67-4d25-a983-999e99e1e395/ the workspace id would be: 910ab70f-0a67-4d25-a983-999e99e1e395.

3. Initialize Terraform:

This step prepares Terraform to create the resources defined in your configuration files.

terraform init

4. Review the Plan:

Before applying any changes, review the plan to understand what Terraform will do.

terraform plan

5. Apply Configuration:

After reviewing and confirming the plan, apply the Terraform configurations to create the necessary Airbyte resources.

terraform apply

6. Verify in Airbyte UI:

Once Terraform completes its tasks, navigate to the Airbyte UI. Here, you should see your source and destination connectors, as well as the connection between them, set up and ready to go 🎉.

‍

Setting Up Airbyte Connectors Using the UI

Start by launching the Airbyte UI by going to http://localhost:8000/ in your browser. Then:

1. Create a source:

Go to the Sources tab and click on "+ New source".
Search for “faker” using the search bar and select "Sample Data (Faker)".
Adjust the Count and optional fields as needed for your use case. You can also leave as is.
Click on "Set up source".

2. Create a destination:

Go to the Destinations tab and click on "+ New destination".
Search for “bigquery” using the search bar and select "BigQuery".
Enter the connection details as needed.
For simplicity, you can use "Standard Inserts" as the loading method.
In the "Service Account Key JSON" field, enter the contents of the JSON file. Yes, the full JSON.
Click on "Set up destination".

3. Create a connection:

Go to the Connections tab and click on "+ New connection".
Select the source and destination you just created.
Enter the connection details as needed.
For this project, leave the “replication frequency” as “Manual”, since we will orchestrate the syncs with Dagster.
Click on "Set up connection".

That’s it! Your connection is set up and ready to go! 🎉

‍

Establish a connector between Faker and BigQuery

4. Setting Up the dbt Project

dbt (data build tool) allows you to transform your data by writing, documenting, and executing SQL workflows. Setting up the dbt project requires specifying connection details for your data platform, in this case, BigQuery.

1. Navigate to the dbt Project Directory:

Move to the dbt project directory in your project's file structure.

cd ../../dbt_project

This directory contains all the dbt-related configurations and SQL models.

2. Update Connection Details:

Within this directory, you'll find a profiles.yml file. This file holds the configuration for dbt to connect to BigQuery.

profiles.yml: Update the project field with your BigQuery project ID and the keyfile field with the path to your service account JSON key.
models/sources/faker_sources.yml: Update the database field with your BigQuery project ID.

Using environment variables for credentials can enhance security. For example, you can reference an environment variable in the keyfile field. This way, you avoid hardcoding sensitive information directly in the file.

3. Test the Connection:

Verify your dbt setup by testing the connection to BigQuery.

dbt debug

This command checks the connection to your data platform. If everything is set up correctly, this command should report a successful connection to BigQuery 🎉.

Remember, the accurate configuration of profiles.yml and faker_sources.yml is crucial for ensuring dbt can successfully connect and transform your data in BigQuery.

5. Understanding the dbt project

This section, while optional, is important for gaining a deeper understanding of the dbt project's structure and its role in data transformation. Understanding these components will help you comprehend the flow of data and the transformations it undergoes.

In the models directory of the dbt project, you'll find three main folders: sources, staging, and marts.

Sources:

The sources folder defines the starting point of our data transformations. Here, we specify the configuration of our source data.
We assign a name to our source, in this case, “faker”. This name is vital for Dagster to identify dependencies and construct a Directed Acyclic Graph (DAG).
We also define the database (our BigQuery project ID) and the schema (the BigQuery dataset).
Within this folder, we detail the tables and columns present in our source data, establishing a clear blueprint of what our raw data looks like.

Staging:

The staging folder represents the first layer of our dbt models.
This stage is primarily about selecting and refining the columns we're interested in from the source data. Think of it as a preparatory step that cleans and structures data for more complex transformations.

Marts:

In the marts folder, we dive into more sophisticated data transformations.
This folder includes three models that focus on calculating product popularity, purchase patterns, and user demographics.
These models are where your data starts to deliver real insights and value.

Each SQL file in the dbt project corresponds to a table in BigQuery. By the end of the transformation process, your transformed_data dataset in BigQuery will consist of data shaped by both the staging and marts models.

6. Orchestrating with Dagster

Dagster is a modern data orchestrator designed to help you build, test, and monitor your data workflows. In this section, we'll walk you through setting up Dagster to orchestrate Airbyte and dbt.

1. Navigate to the Orchestration Directory:

Switch to the directory containing the Dagster orchestration configurations:

cd ../orchestration

This directory contains all the necessary configurations for Dagster to manage your data workflows.

2. Set Environment Variables:

Dagster requires certain environment variables to be set to interact with other tools like dbt and Airbyte. Set the following variables:

export DAGSTER_DBT_PARSE_PROJECT_ON_LOAD=1
export AIRBYTE_PASSWORD=password

Note: The AIRBYTE_PASSWORD is set to password as a default for local Airbyte instances. If you've changed this during your Airbyte setup, ensure you use the appropriate password here.

3. Launch the Dagster UI:

With the environment variables in place, kick-start the Dagster UI:

dagster dev

4. Access Dagster in Your Browser:

Navigate to http://127.0.0.1:3000 in your browser. Here, you'll see assets for both Airbyte and dbt. To understand how these assets are connected and how data flows between them, click on “View global asset lineage” in the Dagster UI. This visual representation of data lineage is invaluable for understanding and debugging the data pipeline.

Dagster DAG, which hasn't been materialized

5. Materialize Dagster Assets:

In the Dagster UI, click on "Materialize all". This action triggers the entire data pipeline:

First, the Airbyte sync will extract data from the Faker source and load it into BigQuery.
Following that, dbt will transform the raw data, materializing the staging and marts models in BigQuery.

Dagster DAG, materialization in progress

6. Verify the Workflow:

Visit the Airbyte UI to confirm that a sync is running.
After the dbt jobs have completed, check your BigQuery console to ensure the views have been created in the transformed_data dataset.

‍

By following these steps, you'll have a fully functional data pipeline orchestrated by Dagster, efficiently managing data extraction, loading, and transformation processes. 🎉

7. Understanding the Dagster project

As we conclude this tutorial, let's delve into the orchestration aspect of the project to understand how Dagster manages the workflow. The key to this orchestration lies in the interaction between several Python scripts: assets.py, constants.py, and definitions.py.

assets.py:

The assets.py script is where the configurations for both dbt and Airbyte are managed.
It leverages the @dbt_assets decorator to define dbt assets. This involves specifying a dbt manifest file, which contains metadata about the dbt project, and executing dbt commands to build these assets.
For Airbyte, the script configures an instance specifying host, port, and credentials. It then loads assets from this Airbyte instance, applying a filter with the key prefix “faker”. This prefix corresponds to the name of our dbt sources, ensuring that the right data is loaded and processed.

constants.py:

The constants.py script sets up the dbt project directory. It determines the path to the dbt manifest file.
Depending on the environment configuration, this script either generates a new dbt manifest at runtime or utilizes an existing one. This flexibility ensures that the Dagster integration with dbt is always equipped with the correct project metadata.

definitions.py:

Finally, the definitions.py script brings everything together. It configures a Dagster job by combining assets, schedules, and resources.
This script is the backbone of the orchestration, as it integrates dbt and Airbyte workflows. By setting schedules and defining resources, it manages how and when data transformations and syncs occur.

Next Steps

Congratulations on successfully deploying and running the E-commerce Analytics Quickstart! You've taken a significant step in harnessing the power of Airbyte, dbt, Dagster, and BigQuery for data analytics.

Here's what you can do next to deepen your understanding and maximize the value of your project:

Explore the data and insights:

Now that you have your data pipeline set up, it's time to explore the datasets in BigQuery. Run some queries, visualize the data, and start uncovering insights. Look for trends, patterns, or anomalies that could help in making informed business decisions. This exploration is vital to understand your data's story and its implications.

Optimize the dbt models:

Reflect on the transformations you've implemented using dbt. Is there room for efficiency improvements? Consider optimizing your existing models for performance or scalability.
Additionally, think about new models you could create. As your business needs evolve, so too should your data models. Tailor them to extract deeper or more specific insights that can drive strategic initiatives.

Expand the data sources:

Your data ecosystem is not limited to what you've set up so far. With Airbyte's vast array of connectors, explore adding more data sources. Integrating different types of data can enrich your datasets, offering a more comprehensive view of your business landscape.
Think about social media data, web traffic analytics, customer feedback, and other data sources that could provide a fuller picture of your e-commerce business.

Contribute to the community:

As you learn and improve your setup, consider sharing your experiences with the community. Write blog posts, create tutorials, or contribute to forums.
If you've developed enhancements or new ideas for the Quickstarts repository, consider contributing them back to the quickstarts GitHub repository. This not only helps others who are following the same path but also enriches the repository with diverse perspectives and ideas.

About the Author

Thalia Barrera is a data engineer and technical writer at Airbyte. She has over a decade of experience as an engineer in the IT industry. She enjoys crafting technical and training materials for fellow engineers. Drawing on her computer science expertise and client-oriented nature, she loves turning complex topics into easy-to-understand content.

About the Author

Should you build or buy your data pipelines?

Download our free guide and discover the best approach for your needs, whether it's building your ELT solution in-house or opting for Airbyte Open Source or Airbyte Cloud.

Download now

Join our newsletter to get all the insights on the data stack

Should you build or buy your data pipelines?

About the Author

About the Author

Join our newsletter to get all the insights on the data stack

What this hands-on tutorial covers:

Prerequisites

1. Setting Up the Project Environment

1. Clone the Repository:

2. Navigate to the Project Directory:

3. Set Up a Virtual Environment:

4. Install Dependencies:

2. Setting up BigQuery

1. Create a Google Cloud Project

2. Create BigQuery Datasets

How to create a dataset:

3. Create Service Accounts and Assign Roles

4. Generate JSON Keys for Service Accounts

3. Setting Up Airbyte Connectors

Setting Up Airbyte Connectors with Terraform

1. Navigate to the Airbyte Configuration Directory:

2. Modify Configuration Files:

3. Initialize Terraform:

4. Review the Plan:

5. Apply Configuration:

6. Verify in Airbyte UI:

Setting Up Airbyte Connectors Using the UI

1. Create a source:

2. Create a destination:

3. Create a connection:

4. Setting Up the dbt Project

1. Navigate to the dbt Project Directory:

2. Update Connection Details:

3. Test the Connection:

5. Understanding the dbt project

6. Orchestrating with Dagster

1. Navigate to the Orchestration Directory:

2. Set Environment Variables:

3. Launch the Dagster UI:

4. Access Dagster in Your Browser:

5. Materialize Dagster Assets:

6. Verify the Workflow:

7. Understanding the Dagster project

Next Steps

About the Author

About the Author

Should you build or buy your data pipelines?

Similar use cases

Move data from LinkedIn Ads to BigQuery

How to build E-commerce Data Pipeline with Airbyte?