Building an E-commerce Data Pipeline: A Hands-On Guide to Using Airbyte, dbt, Dagster, and BigQuery
Create a seamless and efficient data pipeline for e-commerce analytics. Dive into the practical implementation of a data workflow using Airbyte, dbt, Dagster, and Google BigQuery.
About the Author
Thalia Barrera is a data engineer and technical writer at Airbyte. She has over a decade of experience as an engineer in the IT industry. She enjoys crafting technical and training materials for fellow engineers. Drawing on her computer science expertise and client-oriented nature, she loves turning complex topics into easy-to-understand content.
This guide is built for data engineers looking to create a seamless and efficient data pipeline for e-commerce analytics. We'll be diving into the practical implementation of a data workflow using Airbyte for data integration, dbt for transformation, Dagster for orchestration, and Google BigQuery for data warehousing.
What this hands-on tutorial covers:
Integrating Key Data Tools: Learn how to connect Airbyte, dbt, Dagster, and BigQuery to build a cohesive data pipeline.
Hands-On Configuration and Usage: Get your hands dirty with the actual setup and use of these tools, configuring each component to work together effectively.
Data Transformation Techniques: Understand how to leverage dbt for transforming raw e-commerce data into a format ready for analysis.
Workflow Visualization and Management: Explore how Dagster provides visibility and control over your data workflows, making it easier to monitor and adjust processes as needed.
This tutorial is structured to give you a clear, no-nonsense path to setting up a robust e-commerce data pipeline. It's about getting the job done efficiently and understanding the nuances of each tool involved in the process.
Prerequisites
Before beginning this tutorial on E-commerce analytics with Airbyte, dbt, Dagster, and BigQuery, ensure you have the following prerequisites set up:
Python 3.10 or later: If you haven't installed Python yet, download it from the official Python website. Make sure to select the version that matches your operating system.
Docker and Docker Compose (Docker Desktop): Install Docker by following the instructions on the official Docker documentation. Choose the version appropriate for your OS.
Google Cloud account with BigQuery: BigQuery will serve as our data warehouse. If you don’t already have a Google Cloud account, sign up for a free trial; the trial credits are more than sufficient for this project.
Terraform (Optional): Terraform is used for infrastructure as code, allowing you to define and manage resources programmatically. This step is optional, as Airbyte resources can also be managed via the UI. If you choose to use Terraform, follow the official Terraform installation guide.
1. Setting Up the Project Environment
In this section, you'll set up the local environment for your E-commerce analytics project. You'll clone the necessary repository, explore the project structure, and prepare a Python environment. Follow these steps:
1. Clone the Repository:
Start by cloning the Airbyte quickstart repository. This repository contains a variety of projects, but we're interested in one specific to E-commerce analytics with BigQuery. Use the following commands in your terminal:
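For example, a sparse, blob-filtered clone keeps the download small (the URL below assumes the Airbyte quickstarts repository on GitHub):

git clone --filter=blob:none --sparse https://github.com/airbytehq/quickstarts.git
cd quickstarts
git sparse-checkout add ecommerce_analytics_bigquery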
These commands clone the repository with minimal data and check out only the E-commerce analytics project.
2. Navigate to the Project Directory:
Once cloned, move into the project directory:
cd ecommerce_analytics_bigquery
Now is a good time to open this directory in your IDE (like Visual Studio Code). Take a moment to familiarize yourself with the folder structure and files. You'll notice three main folders:
<span class="text-style-code">dbt_project</span>: Contains everything related to dbt (Data Build Tool), used for data transformation.
<span class="text-style-code">infra</span>: Holds the Airbyte Terraform configuration for infrastructure management.
<span class="text-style-code">orchestration</span>: Contains the Dagster code, which will be used for orchestrating our data workflows.
3. Set Up a Virtual Environment:
A virtual environment is used for managing Python dependencies. Create and activate one using these commands:
For Linux and Mac:
python3 -m venv venv
source venv/bin/activate
For Windows:
python -m venv venv
.\venv\Scripts\activate
This isolates our project dependencies from other Python projects on your system.
4. Install Dependencies:
Install all necessary Python dependencies for the project using pip. Run:
pip install -e ".[dev]"
This command installs all packages listed in the project's <span class="text-style-code">setup.py</span> file, including any optional development dependencies.
2. Setting up BigQuery
BigQuery, Google's serverless data warehouse, plays a pivotal role in this project for storing and analyzing large datasets efficiently. Follow these steps to set up BigQuery:
1. Create a Google Cloud Project
If you don't have a Google Cloud project yet, you'll need to create one.
2. Create BigQuery Datasets
Create two new datasets named <span class="text-style-code">raw_data</span> and <span class="text-style-code">transformed_data</span>. If you choose different names, remember to update these in your project code.
How to create a dataset:
In the BigQuery UI, click on the three dots next to your project name.
Click “Create Dataset”.
Enter <span class="text-style-code">raw_data</span> or <span class="text-style-code">transformed_data</span> as the dataset ID.
Click "Create Dataset".
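Alternatively, if you prefer the command line, the <span class="text-style-code">bq</span> tool from the Google Cloud SDK can create the same datasets (replace the project ID placeholder and pick the location you actually use):

bq --location=US mk --dataset your-project-id:raw_data
bq --location=US mk --dataset your-project-id:transformed_data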
3. Create Service Accounts and Assign Roles
Service accounts allow Airbyte and dbt to interact with BigQuery securely.
Go to “IAM & Admin” > “Service accounts” in the Google Cloud Console.
Click “Create Service Account”.
Name your service account (you’ll create one for Airbyte and one for dbt) and proceed to “Create and continue”.
Assign the “BigQuery Data Editor” and “BigQuery Job User” roles. These roles enable the service account to modify data and execute jobs in BigQuery.
Complete the process by clicking on “Done”.
4. Generate JSON Keys for Service Accounts
JSON keys are used for authenticating the service accounts.
Navigate back to the “Service accounts” page and find the service accounts you just created.
Click on the account name.
In the “Keys” section, click “Add Key” > “Create new key” and choose JSON format.
The key will download automatically. Securely store this key and keep it confidential.
3. Setting Up Airbyte Connectors
Airbyte connectors are essential for data integration in our project. They enable data synchronization between various platforms. You can set up these connectors either via Terraform for a more automated approach or manually using the Airbyte UI. Choose one of the two following options.
Setting Up Airbyte Connectors with Terraform
Terraform allows you to establish infrastructure as code for the creation and management of your Airbyte connectors. Here's how you can set this up:
1. Navigate to the Airbyte Configuration Directory:
This is where your Terraform configuration files are located.
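From the project root, that directory is <span class="text-style-code">infra/airbyte</span>:

cd infra/airbyte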
2. Modify Configuration Files:
Within the <span class="text-style-code">infra/airbyte</span> directory, you'll find three Terraform files:
<span class="text-style-code">provider.tf</span>: Defines the Airbyte provider, credentials, and the server URL.
<span class="text-style-code">main.tf</span>: Contains the main configuration for creating Airbyte resources. In this case, it creates a “Faker” source, a BigQuery destination, and a connection between them.
<span class="text-style-code">variables.tf</span>: Holds various variables, including credentials. You may or may not use this file; more on this below.
Adjust the configurations in these files to suit your project's needs. You can use the <span class="text-style-code">variables.tf</span> file to manage the BigQuery connection credentials: you’ll be prompted to enter them when you execute <span class="text-style-code">terraform plan</span> and <span class="text-style-code">terraform apply</span>. If you go with this option, skip ahead to step 3.
If you don’t want to use variables, remove them from the file and provide the credentials for your BigQuery connection directly in the <span class="text-style-code">main.tf</span> file (a rough sketch follows the list below):
<span class="text-style-code">dataset_id</span>: The name of the BigQuery dataset where Airbyte will load data. In this case, enter “raw_data”.
<span class="text-style-code">project_id</span>: Your BigQuery project ID.
<span class="text-style-code">credentials_json</span>: The contents of the service account JSON file. This must be provided as a string, so convert the JSON content to a string beforehand.
<span class="text-style-code">workspace_id</span>: Your Airbyte workspace ID, which can be found in the webapp URL. For example, in this URL: http://localhost:8000/workspaces/910ab70f-0a67-4d25-a983-999e99e1e395/ the workspace ID would be: <span class="text-style-code">910ab70f-0a67-4d25-a983-999e99e1e395</span>.
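As a rough sketch, the BigQuery destination block in <span class="text-style-code">main.tf</span> ends up looking something like this (the values are placeholders, and the exact resource schema depends on your Airbyte Terraform provider version):

resource "airbyte_destination_bigquery" "bigquery" {
  name         = "BigQuery"
  workspace_id = var.workspace_id                # or paste the ID from the webapp URL
  configuration = {
    dataset_id       = "raw_data"
    dataset_location = "US"                      # pick the region you use
    project_id       = "your-gcp-project-id"     # placeholder
    credentials_json = file("path/to/airbyte-service-account.json")  # file() returns the JSON as a string
  }
}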
3. Initialize Terraform:
This step prepares Terraform to create the resources defined in your configuration files.
terraform init
4. Review the Plan:
Before applying any changes, review the plan to understand what Terraform will do.
terraform plan
5. Apply Configuration:
After reviewing and confirming the plan, apply the Terraform configurations to create the necessary Airbyte resources.
terraform apply
6. Verify in Airbyte UI:
Once Terraform completes its tasks, navigate to the Airbyte UI. Here, you should see your source and destination connectors, as well as the connection between them, set up and ready to go 🎉.
Setting Up Airbyte Connectors Using the UI
Start by opening the Airbyte UI at http://localhost:8000/ in your browser. Then:
1. Create a source:
Go to the Sources tab and click on "+ New source".
Search for “faker” using the search bar and select "Sample Data (Faker)".
Adjust the Count and optional fields as needed for your use case, or leave them at their defaults.
Click on "Set up source".
2. Create a destination:
Go to the Destinations tab and click on "+ New destination".
Search for “bigquery” using the search bar and select "BigQuery".
Enter the connection details as needed.
For simplicity, you can use "Standard Inserts" as the loading method.
In the "Service Account Key JSON" field, enter the contents of the JSON file. Yes, the full JSON.
Click on "Set up destination".
3. Create a connection:
Go to the Connections tab and click on "+ New connection".
Select the source and destination you just created.
Enter the connection details as needed.
For this project, leave the “replication frequency” as “Manual”, since we will orchestrate the syncs with Dagster.
Click on "Set up connection".
That’s it! Your connection is set up and ready to go! 🎉
4. Setting Up the dbt Project
dbt (data build tool) allows you to transform your data by writing, documenting, and executing SQL workflows. Setting up the dbt project requires specifying connection details for your data platform, in this case, BigQuery.
1. Navigate to the dbt Project Directory:
Move to the dbt project directory in your project's file structure (the command below assumes you're still in <span class="text-style-code">infra/airbyte</span>; adjust the path if you're starting from the project root):
cd ../../dbt_project
This directory contains all the dbt-related configurations and SQL models.
2. Update Connection Details:
Within this directory, you'll find a <span class="text-style-code">profiles.yml</span> file. This file holds the configuration for dbt to connect to BigQuery.
<span class="text-style-code">profiles.yml</span>: Update the <span class="text-style-code">project</span> field with your BigQuery project ID and the <span class="text-style-code">keyfile</span> field with the path to your service account JSON key.
<span class="text-style-code">models/sources/faker_sources.yml</span>: Update the <span class="text-style-code">database</span> field with your BigQuery project ID.
Using environment variables for credentials can enhance security. For example, you can reference an environment variable in the <span class="text-style-code">keyfile</span> field. This way, you avoid hardcoding sensitive information directly in the file.
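For reference, a BigQuery service-account profile with the keyfile read from an environment variable looks roughly like this (the profile name and the variable name are placeholders; keep whatever names the project already uses):

dbt_project:                  # placeholder profile name; match the one referenced in dbt_project.yml
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: your-gcp-project-id                       # your BigQuery project ID
      dataset: transformed_data
      threads: 4
      keyfile: "{{ env_var('DBT_BIGQUERY_KEYFILE') }}"   # path to the dbt service account JSON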
3. Test the Connection:
Verify your dbt setup by testing the connection to BigQuery.
dbt debug
This command checks the connection to your data platform. If everything is set up correctly, this command should report a successful connection to BigQuery 🎉.
Remember, the accurate configuration of <span class="text-style-code">profiles.yml</span> and <span class="text-style-code">faker_sources.yml</span> is crucial for ensuring dbt can successfully connect and transform your data in BigQuery.
5. Understanding the dbt project
This section, while optional, is important for gaining a deeper understanding of the dbt project's structure and its role in data transformation. Understanding these components will help you comprehend the flow of data and the transformations it undergoes.
In the models directory of the dbt project, you'll find three main folders: <span class="text-style-code">sources</span>, <span class="text-style-code">staging</span>, and <span class="text-style-code">marts</span>.
Sources:
The <span class="text-style-code">sources</span> folder defines the starting point of our data transformations. Here, we specify the configuration of our source data.
We assign a name to our source, in this case, “faker”. This name is vital for Dagster to identify dependencies and construct a Directed Acyclic Graph (DAG).
We also define the <span class="text-style-code">database</span> (our BigQuery project ID) and the <span class="text-style-code">schema</span> (the BigQuery dataset).
Within this folder, we detail the tables and columns present in our source data, establishing a clear blueprint of what our raw data looks like.
Staging:
The <span class="text-style-code">staging</span> folder represents the first layer of our dbt models.
This stage is primarily about selecting and refining the columns we're interested in from the source data. Think of it as a preparatory step that cleans and structures data for more complex transformations.
Marts:
In the <span class="text-style-code">marts</span> folder, we dive into more sophisticated data transformations.
This folder includes three models that focus on calculating product popularity, purchase patterns, and user demographics.
These models are where your data starts to deliver real insights and value.
Each SQL file in the dbt project corresponds to a model materialized in BigQuery (a view by default, or a table, depending on the configured materialization). By the end of the transformation process, your <span class="text-style-code">transformed_data</span> dataset in BigQuery will contain data shaped by both the staging and marts models.
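To make that concrete, a product-popularity mart might look roughly like the following. The staging model and column names here are assumptions based on the Faker dataset, not the project's exact code:

-- marts/product_popularity.sql (illustrative sketch)
select
    p.id as product_id,
    p.make,
    p.model,
    count(pu.id) as purchase_count
from {{ ref('stg_products') }} as p
left join {{ ref('stg_purchases') }} as pu
    on pu.product_id = p.id
group by 1, 2, 3
order by purchase_count desc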
6. Orchestrating with Dagster
Dagster is a modern data orchestrator designed to help you build, test, and monitor your data workflows. In this section, we'll walk you through setting up Dagster to orchestrate Airbyte and dbt.
1. Navigate to the Orchestration Directory:
Switch to the directory containing the Dagster orchestration configurations:
cd ../orchestration
This directory contains all the necessary configurations for Dagster to manage your data workflows.
2. Set Environment Variables:
Dagster requires certain environment variables to be set to interact with other tools like dbt and Airbyte. Set the following variables:
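At minimum, export the password for your local Airbyte instance; the second variable below is an assumption, based on the <span class="text-style-code">constants.py</span> behavior described later, that tells Dagster to re-parse the dbt project at load time:

export AIRBYTE_PASSWORD=password
export DAGSTER_DBT_PARSE_PROJECT_ON_LOAD=1   # assumption: regenerate the dbt manifest when Dagster loads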
Note: <span class="text-style-code">AIRBYTE_PASSWORD</span> is set to <span class="text-style-code">password</span> by default for local Airbyte instances. If you've changed this during your Airbyte setup, use that password here.
3. Launch the Dagster UI:
With the environment variables in place, kick-start the Dagster UI:
dagster dev
4. Access Dagster in Your Browser:
Navigate to http://127.0.0.1:3000 in your browser. Here, you'll see assets for both Airbyte and dbt. To understand how these assets are connected and how data flows between them, click on “View global asset lineage” in the Dagster UI. This visual representation of data lineage is invaluable for understanding and debugging the data pipeline.
5. Materialize Dagster Assets:
In the Dagster UI, click on "Materialize all". This action triggers the entire data pipeline:
First, the Airbyte sync will extract data from the Faker source and load it into BigQuery.
Following that, dbt will transform the raw data, materializing the <span class="text-style-code">staging</span> and <span class="text-style-code">marts</span> models in BigQuery.
6. Verify the Workflow:
Visit the Airbyte UI to confirm that a sync is running.
After the dbt jobs have completed, check your BigQuery console to ensure the views have been created in the <span class="text-style-code">transformed_data</span> dataset.
By following these steps, you'll have a fully functional data pipeline orchestrated by Dagster, efficiently managing data extraction, loading, and transformation processes. 🎉
7. Understanding the Dagster project
As we conclude this tutorial, let's delve into the orchestration aspect of the project to understand how Dagster manages the workflow. The key to this orchestration lies in the interaction between several Python scripts: <span class="text-style-code">assets.py</span>, <span class="text-style-code">constants.py</span>, and <span class="text-style-code">definitions.py</span>.
assets.py:
The <span class="text-style-code">assets.py</span> script is where the configurations for both dbt and Airbyte are managed.
It leverages the <span class="text-style-code">@dbt_assets</span> decorator to define dbt assets. This involves specifying a dbt manifest file, which contains metadata about the dbt project, and executing dbt commands to build these assets.
For Airbyte, the script configures an instance specifying host, port, and credentials. It then loads assets from this Airbyte instance, applying a filter with the key prefix “faker”. This prefix corresponds to the name of our dbt sources, ensuring that the right data is loaded and processed.
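A minimal sketch of that pattern is below; the import of the manifest path and the connection details are assumptions, and the actual file in the repository may differ:

from dagster import AssetExecutionContext, EnvVar
from dagster_airbyte import AirbyteResource, load_assets_from_airbyte_instance
from dagster_dbt import DbtCliResource, dbt_assets

from .constants import dbt_manifest_path  # path resolved in constants.py (assumed name)

# dbt models become Dagster assets, built by running `dbt build`
@dbt_assets(manifest=dbt_manifest_path)
def dbt_ecommerce_assets(context: AssetExecutionContext, dbt: DbtCliResource):
    yield from dbt.cli(["build"], context=context).stream()

# Point Dagster at the local Airbyte instance
airbyte_instance = AirbyteResource(
    host="localhost",
    port="8000",
    username="airbyte",
    password=EnvVar("AIRBYTE_PASSWORD"),
)

# Only load assets whose keys fall under the "faker" prefix, matching the dbt source name
airbyte_assets = load_assets_from_airbyte_instance(
    airbyte_instance,
    key_prefix="faker",
)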
constants.py:
The <span class="text-style-code">constants.py</span> script sets up the dbt project directory. It determines the path to the dbt manifest file.
Depending on the environment configuration, this script either generates a new dbt manifest at runtime or utilizes an existing one. This flexibility ensures that the Dagster integration with dbt is always equipped with the correct project metadata.
definitions.py:
Finally, the <span class="text-style-code">definitions.py</span> script brings everything together. It configures a Dagster job by combining assets, schedules, and resources.
This script is the backbone of the orchestration, as it integrates dbt and Airbyte workflows. By setting schedules and defining resources, it manages how and when data transformations and syncs occur.
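In sketch form, with illustrative schedule and resource names (the imported names are assumptions):

import os

from dagster import AssetSelection, Definitions, ScheduleDefinition, define_asset_job
from dagster_dbt import DbtCliResource

from .assets import airbyte_assets, dbt_ecommerce_assets
from .constants import dbt_project_dir  # assumed to be exposed by constants.py

# One job that materializes every asset, run on a daily schedule (illustrative)
daily_refresh = ScheduleDefinition(
    job=define_asset_job("daily_refresh", selection=AssetSelection.all()),
    cron_schedule="@daily",
)

defs = Definitions(
    assets=[airbyte_assets, dbt_ecommerce_assets],
    schedules=[daily_refresh],
    resources={"dbt": DbtCliResource(project_dir=os.fspath(dbt_project_dir))},
)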
Next Steps
Congratulations on successfully deploying and running the E-commerce Analytics Quickstart! You've taken a significant step in harnessing the power of Airbyte, dbt, Dagster, and BigQuery for data analytics.
Here's what you can do next to deepen your understanding and maximize the value of your project:
Explore the data and insights:
Now that you have your data pipeline set up, it's time to explore the datasets in BigQuery. Run some queries, visualize the data, and start uncovering insights. Look for trends, patterns, or anomalies that could help in making informed business decisions. This exploration is vital to understand your data's story and its implications.
Optimize the dbt models:
Reflect on the transformations you've implemented using dbt. Is there room for efficiency improvements? Consider optimizing your existing models for performance or scalability.
Additionally, think about new models you could create. As your business needs evolve, so too should your data models. Tailor them to extract deeper or more specific insights that can drive strategic initiatives.
Expand the data sources:
Your data ecosystem is not limited to what you've set up so far. With Airbyte's vast array of connectors, explore adding more data sources. Integrating different types of data can enrich your datasets, offering a more comprehensive view of your business landscape.
Think about social media data, web traffic analytics, customer feedback, and other data sources that could provide a fuller picture of your e-commerce business.
Contribute to the community:
As you learn and improve your setup, consider sharing your experiences with the community. Write blog posts, create tutorials, or contribute to forums.
If you've developed enhancements or new ideas for the Quickstarts repository, consider contributing them back to the quickstarts GitHub repository. This not only helps others who are following the same path but also enriches the repository with diverse perspectives and ideas.