Creating Data Pipeline with dbt & DuckDB Using Airbyte

Learn to build efficient data pipelines using Airbyte, dbt, and DuckDB. A comprehensive guide for data engineers with practical implementation steps.

Your organization’s data might be spread across dispersed locations and varied formats. Integrating this data into a single repository, like a data warehouse, makes it far more accessible across your organization.

However, ensuring data compatibility with the destination schema can be challenging. For this reason, you must first apply transformation logic to the data to enhance its quality.

One prevalent approach to handling this data workflow is using tools like dbt and DuckDB. DuckDB is a lightweight SQL database system for temporary and persistent data storage in the development environment. dbt, on the other hand, is a transformation tool that offers adapters for prominent databases like DuckDB, letting you orchestrate your transformation logic.

This article will walk you through the process of integrating dbt DuckDB with Airbyte, one of the best data pipeline tools.

DuckDB Overview

DuckDB is an in-process relational database management system (RDBMS) that offers online analytical processing (OLAP) functionality. This makes it a prominent option for handling analytical query workloads. Using DuckDB, you can perform operations like aggregation and joins on large-scale data tables.

Key Features

  • Highly Portable: DuckDB is extremely portable, as it has no dependencies. It supports all the major operating systems, including Windows, Linux, and macOS, as well as CPU architectures like ARM and x86.
  • Fast Analytical Processing: Instead of querying each row sequentially, DuckDB relies on a columnar vectorized query execution engine, which enables faster execution of complex queries. This engine allows the processing of large batches of data in one go.
  • Ease of Use: DuckDB is one of the simplest data management solutions to adopt, with straightforward installation and in-process embedding. It doesn’t require the configuration of any dependencies at compile or run time.
  • Open Source: DuckDB's entire source code is freely available on GitHub. It is actively maintained by a growing community of developers and is released under the permissive MIT License.
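
To make the in-process model above concrete, here is a minimal sketch of querying data with DuckDB from Python; the table name and values are purely illustrative:

import duckdb

# Open an in-memory DuckDB database (pass a file path instead for persistent storage).
con = duckdb.connect()

# Create a small sample table and run an analytical aggregation on it.
con.execute("CREATE TABLE trades (ticker VARCHAR, price DOUBLE)")
con.execute("INSERT INTO trades VALUES ('AAPL', 191.2), ('AAPL', 193.5), ('MSFT', 415.0)")
print(con.execute("SELECT ticker, AVG(price) FROM trades GROUP BY ticker").fetchall())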

dbt Overview

dbt, or data build tool, is an open-source transformation solution that allows you to prepare data for analysis. Using dbt, you can modify your source data into a format compatible with the downstream applications. To accomplish this, dbt supports data modeling capabilities, allowing you to create custom transformation models using SQL. Eventually, you can analyze the transformed data and generate actionable insights that can enhance business performance.

dbt offers two products: dbt Cloud and dbt Core. dbt Cloud is a fully managed service, while dbt Core is a self-managed, open-source option.

Key Features

  • Job Scheduler: The dbt Cloud job scheduler assists you in defining cron-based and event-driven execution of dbt commands. Cron-based runs occur at specified intervals using cron syntax, while event-driven runs are triggered after the completion of certain actions, like API calls.
  • Discovery API: With the dbt Cloud Discovery API, you can query and retrieve comprehensive information about your project, such as details about its models, sources, and more.
  • Project Monitoring: dbt Cloud offers the Run history dashboard within its UI, so you can monitor all your dbt projects. This dashboard offers a detailed overview of the jobs executed in your project and provides you with filters to narrow down the focus to specific aspects.
  • Automated Documentation: dbt automatically generates project documentation, which is updated whenever a dbt job completes successfully. You can refer back to the documentation to review transformation tasks and dependencies.

Airbyte Overview

Airbyte is an AI-powered, no-code data replication tool that enables you to move data from various sources to your preferred destination. It offers an extensive library of 550+ pre-built connectors, allowing you to develop data pipelines to extract and load structured, semi-structured, and unstructured data. In case a connector is unavailable, Airbyte provides a suite of Connector Development Kits (CDKs) and a Connector Builder for custom connector development.

Key Features

  • Custom Transformation with dbt: You can effortlessly integrate Airbyte with dbt and execute transformations immediately after performing data synchronization between platforms. Using the transformation capabilities offered by dbt, you can clean and enrich data for enhanced analytics.
  • Developer-Friendly Pipelines: PyAirbyte—a Python library—enables you to use Airbyte connectors in a Python development environment. Utilizing PyAirbyte, you can extract data from multiple sources into popular SQL caches, including DuckDB, Postgres, and BigQuery. These caches are compatible with Python libraries like Pandas.
  • Change Data Capture (CDC): CDC allows you to automatically identify incremental changes made to the source data and replicate them into the destination. This feature supports you in tracking data updates and maintaining data consistency between various platforms within your organization.
  • Automated RAG Transformation: Airbyte supports automated chunking, embedding, and indexing operations that let you transform raw data into vector embeddings and store them in a vector database. You can further use these embeddings to develop robust AI applications.

You can choose the Airbyte Cloud version for building the dbt DuckDB pipeline. Its intuitive user interface and pre-built connectors simplify extracting data from your preferred sources into DuckDB or MotherDuck. The Airbyte DuckDB destination connector is compatible with MotherDuck, an analytics database, for persistent storage.

To perform custom transformations on the consolidated data, you can integrate Airbyte with dbt. All you have to do is navigate to the Settings tab and configure the dbt Cloud integration by providing the Service Token.

However, if you intend to use DuckDB local storage, you can leverage PyAirbyte to develop a custom data pipeline.

Prerequisites for Building a dbt DuckDB Pipeline Using PyAirbyte

When integrating dbt DuckDB using PyAirbyte, it is essential to follow a structured procedure. This involves establishing a data pipeline architecture. In this architecture, PyAirbyte aids you in extracting data to your local machine as a DuckDB cache. Following this, you can create a dbt DuckDB connection that allows data transformation. Eventually, you can store the data in a data warehouse or a database based on your requirements.

Before getting into the steps, you must ensure that all the prerequisites are satisfied. You must have:

  • Access to a code editor. This tutorial uses Jupyter Notebook.
  • Python installed on your machine.
  • GitHub Desktop configured for version control.

Building Data Pipeline with DuckDB, dbt, and PyAirbyte

Let’s explore how PyAirbyte enables you to build data pipelines using dbt and DuckDB.

Step 1: Installing Necessary Dependencies

Create a virtual environment to isolate all your dependencies in Jupyter Notebook. For a Unix-based system, you can use:

python3 -m venv env
source env/bin/activate

Install all the necessary libraries, including PyAirbyte, DuckDB, dbt Core, and the dbt DuckDB adapter plugin, by executing:

pip install airbyte duckdb dbt-core dbt-duckdb
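
As an optional sanity check, you can confirm the installed versions before moving on; this sketch uses the standard library and the package names from the pip command above:

from importlib.metadata import version

# Print the installed version of each required package.
for package in ("airbyte", "duckdb", "dbt-core", "dbt-duckdb"):
    print(package, version(package))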

Step 2: Extract Data from a Source

After installing the required libraries, import PyAirbyte and DuckDB into the editor to begin the data movement. dbt Core and the dbt DuckDB adapter are used later through the dbt command-line interface, so they don't need to be imported here.

import airbyte as ab
import duckdb

To list the connectors available through PyAirbyte, run the following:

ab.get_available_connectors()

In this tutorial, we will use Polygon.io—a stock market API—as a source. To extract data from this API, you must first log in to the dashboard and access the API key. After accessing the key, execute:

source = ab.get_source(
    "source-polygon-stock-api",
    install_if_missing=True,
    config={
        "apiKey": ab.get_secret("POLYGON_API_KEY"),
        "stocksTicker": "AAPL",
        "multiplier": 1,
        "timespan": "day",
        "start_date": "2024-01-01",
        "end_date": "2024-12-31",
        "adjusted": "true",
    },
)

Output:

Enter Polygon API Key

Provide the API key in the terminal. This configures the source to retrieve daily records for Apple (AAPL) stock from 1st January 2024 to 31st December 2024. You can now verify the connection by running the check method:

source.check()

Output:

Successful Connection Message

After successfully establishing a connection, select all of the source's streams and load data into the internal DuckDB cache:

source.select_all_streams()
read_result: ab.ReadResult = source.read()

Output:

Data Synchronization Details

By default, PyAirbyte stores this cache in the .cache folder relative to the current working directory.

Cached Data Directory
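
If you want to confirm where the cache file lives before connecting to it in the next step, a small sketch like this lists any DuckDB files under the default .cache folder:

from pathlib import Path

# List the DuckDB files created by the default PyAirbyte cache.
for db_file in Path(".cache").rglob("*.duckdb"):
    print(db_file.resolve())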

Step 3: Create a DuckDB Connection

To effectively work with the DuckDB adapter for dbt, create a connection to the dataset. Replace the file_location placeholder with the exact location where the default_cache.duckdb file is available:

conn = duckdb.connect(r"file_location")

Verify the successful connection to the dataset by executing:

df = conn.execute("SELECT * FROM stock_api").fetchdf()

Using this df variable, you can print the Apple stock prices to check the format of the dataset. For example, to print the first five rows in the dataset, use:

df[:5]

Output:

First Five Data Rows
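
Beyond previewing rows, a few quick checks help you confirm the shape and completeness of the cached stream without assuming any specific column names:

# Quick sanity checks on the extracted stream.
print(df.shape)          # number of rows and columns
print(df.dtypes)         # column names and data types
print(df.isna().sum())   # null counts per column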

Step 4: Set Up a dbt Project

To initialize a new dbt project, execute the following command:

dbt init dbt_duckdb_project

The terminal will prompt you to select the adapter database. Here, you can choose the available duckdb option. This step will create a new folder called dbt_duckdb_project. Navigate into the project folder using the command:

cd dbt_duckdb_project

In the .dbt folder (usually located in your home directory), locate the profiles.yml file. This file stores the connection profiles dbt uses to reach your databases.

Ensure the profiles.yml file has a DuckDB connection pointing to the correct dataset location. If not, edit the file using this format:

dbt_duckdb_project:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: /path/to/default_cache.duckdb

Replace the path with the actual location of the dataset.

Step 5: Create dbt Models

You can add custom SQL statements to define models that modify the Polygon stock market data stored in DuckDB, removing inconsistencies and unnecessary values. The models folder lets you specify how you intend to modify the dataset. If you wish to load this dataset in a target system, transform it according to the destination schema. Another option is to use the same Python environment to analyze and visualize the dataset based on the business requirements.
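
As a minimal sketch of such a model, the snippet below writes an illustrative staging model into the project's models folder from the same Python environment. The model name (stg_stock_api) is hypothetical, the SQL simply deduplicates the raw stock_api table, and the path assumes you run the code from the directory containing dbt_duckdb_project, so adapt all three to your setup:

from pathlib import Path

# Illustrative staging model: deduplicate the raw stock_api table loaded by PyAirbyte.
model_sql = "select distinct *\nfrom stock_api\n"

# Write the model file into the dbt project's models folder (path is an assumption).
model_path = Path("dbt_duckdb_project/models/stg_stock_api.sql")
model_path.write_text(model_sql)
print(f"Created {model_path}")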

Step 6: Test and Run the Project

Finally, you can test and execute the dbt project.

To validate your dbt project configuration, execute:

dbt debug

You will see a success message if everything works as expected. After this, run the dbt project using:

dbt run
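
Once the run succeeds, you can verify the transformed output by querying the DuckDB file from Python again. The table name below matches the hypothetical stg_stock_api model from Step 5, so replace it (and the file path) with your own values:

import duckdb

# Connect to the DuckDB file configured in profiles.yml and preview the model output.
conn = duckdb.connect(r"file_location")  # replace with the path to default_cache.duckdb
print(conn.execute("SELECT COUNT(*) FROM stg_stock_api").fetchone())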

Use Cases for DuckDB dbt Data Pipeline

Here are some of the use cases of integrating dbt with DuckDB:

Developing Data-Driven Applications

DuckDB assists you in building applications that provide faster data access to end users. dbt can play an essential role in supporting data transformation for this system without requiring an external data storage solution. Integrating a DuckDB and dbt pipeline with WebAssembly (WASM), a binary instruction format for virtual machines, is also beneficial, as it enables a high-performance analytics engine to run directly in the browser.

Cost Optimization

Directly storing data available from various sources in a warehouse can increase storage and computation costs. With dbt DuckDB integration, you can clean the data, remove unnecessary information, and enrich the dataset before loading it into the target system. This can help you cut down additional costs occurring due to redundancy.

Data Quality Testing

Working with dbt and DuckDB in your local environment enables you to implement data quality checks during the development stage, helping ensure the generation of high-quality data. After aligning these checks with the incoming data stream, you can deploy them to the production environment. This is especially helpful for scenarios like schema change management and performance optimization of existing data infrastructure.

Conclusion

This article comprehensively demonstrates how to develop a dbt DuckDB data pipeline that enables you to transform, manage, and analyze data using SQL-based modeling and automation.

Airbyte Cloud simplifies this process by offering pre-built connectors, along with seamless dbt integration. This approach allows you to develop pipelines using a no-code user interface.

On the other hand, PyAirbyte provides a flexible approach. It enables you to extract data from various platforms and load it into a DuckDB cache. The cached data can then be used with the dbt DuckDB adapter, allowing you to create dbt models and apply business logic. Depending on your use case, you can either analyze this transformed data or load it into a database or a data warehouse.

To automate data migration workflows within your organization, sign up for Airbyte.
