Building an ETL Pipeline with Python, Docker, & Airbyte

Learn how to build robust ETL pipelines using Python, Docker, and Airbyte. A guide for data engineers covering setup, implementation, & best practices.

ETL (extract, transform, and load) pipelines are key components of modern data workflows. These pipelines facilitate seamless data movement across various platforms. However, manually performing the migration steps can be complex and time-consuming. To address this issue, incorporating certain best practices, such as using data integration tools like Airbyte, can help automate data replication.

Along with automation, another crucial practice includes containerizing the code and dependencies used in data pipeline development. This makes deployment easier and improves scalability. Docker, one of the most widely used containerization tools, enables you to package software so that it can run smoothly in various environments.

In this article, you will discover how to perform Python ETL in Docker.

Docker Overview

Docker is a software development tool that helps build, test, and deploy applications. It uses containers, which are lightweight, self-contained, executable packages of software. These containers are created from Docker images. A Docker image represents an application’s configurations and works as a blueprint for containers. Images include every component, including libraries and code, required to run the software.

You can use a Docker image on any machine with Docker installed, so you do not have to worry about platform compatibility. By bundling dependencies, Docker simplifies application deployment and ensures consistent, standardized operations.

Airbyte Overview

Airbyte is an AI-powered data integration tool. Its readily available 550+ pre-built connectors allow you to move structured, semi-structured, and unstructured data between various platforms. If the connector you need is unavailable, Airbyte provides Connector Development Kits (CDKs) and a Connector Builder to create custom connectors.

Here are some amazing features of Airbyte:

  • Airbyte’s OSS version allows you to run the Airbyte instance on your local machine. Just by using Docker and a few CLI commands, you can deploy Airbyte and initiate the development of ELT pipelines.
  • PyAirbyte is a Python library that enables you to use Airbyte connectors in a developer environment. It aids in pulling data from diverse sources into popular SQL caches, including DuckDB, Postgres, and BigQuery.
  • The Connector Builder has an AI assistant that reads through your platform’s docs to auto-fill configuration fields. This simplifies your connector development journey.
  • With Change Data Capture (CDC), you can identify incremental changes made to the source data and replicate them in the target system. This feature allows you to track data updates and maintain data consistency.
  • Airbyte’s automated transformation techniques enable you to convert raw data into vector embeddings. You can generate embeddings using its pre-built LLM providers, with compatibility across OpenAI, Cohere, and Anthropic.
  • Airbyte supports prominent vector databases, including Pinecone, Milvus, and Weaviate. You can store the embeddings in the vector stores using indexing. This can aid in the creation of AI applications.

To build an ETL pipeline using Airbyte & Python, we will use PyAirbyte in the following sections. The steps include:

  1. Configuring a Docker project.
  2. Creating the ETL script with PyAirbyte.
  3. Defining container properties.
  4. Building and running the container.

How to Build an ETL Pipeline Using Python, Docker, & Airbyte

In this section, you will learn to create an application that performs Python ETL in Docker, moving data from GitHub to Snowflake.

Prerequisites

  • Docker Desktop and Python installed on your machine.
  • A code editor such as VS Code.
  • Snowflake account credentials.
  • A GitHub personal access token.

Step 1: Initial Set-Up for the Project

After satisfying all the requirements, you can start setting up a project environment. First, create a virtual environment to isolate the dependencies.

python -m venv env
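
Activate the virtual environment before installing any packages into it. The activation command depends on your operating system; on macOS or Linux, for example:

source env/bin/activate

On Windows (PowerShell), the equivalent is env\Scripts\Activate.ps1.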

Create a directory with the following structure:


├── Dockerfile
├── app.py
├── env
└── requirements.txt

Here,

  • The Dockerfile contains the container configuration properties.
  • The app.py file holds the application logic, including the PyAirbyte code for the pipeline that performs Python ETL in Docker.
  • The requirements.txt file lists the additional libraries imported in the app.py file.

Step 2: Creating ETL Script with PyAirbyte

Navigate to the app.py file. Inside this Python file, import all the necessary libraries.

import os
import airbyte as ab
import pandas as pd
import snowflake.connector as snow
from snowflake.connector.pandas_tools import write_pandas

Create and configure the GitHub source by executing:

source = ab.get_source(
    "source-github",
    install_if_missing=True,
    config={
        "repositories": ["airbytehq/quickstarts"],
        "credentials": {
            "personal_access_token": ab.get_secret("GITHUB_PERSONAL_ACCESS_TOKEN"),
        },
    },
)

This code uses Airbyte’s quickstarts repository (airbytehq/quickstarts) as an example. The ab.get_secret() call retrieves your GitHub personal access token from the GITHUB_PERSONAL_ACCESS_TOKEN environment variable (or prompts for it if the variable is not set); you will pass this variable to the container later. To verify the connection, run the .check() method:

source.check()

If the connection is configured correctly, this returns a success message, confirming that you can access the GitHub data.
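
If you want to test the script locally before containerizing it, you can export the token in your shell so that ab.get_secret() can pick it up; the variable name below matches the one referenced in the source configuration:

export GITHUB_PERSONAL_ACCESS_TOKEN=your_token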

List the available data streams for the source by executing the following:

source.get_available_streams()

Select the data streams that you intend to store. For example, to save the pull requests, issues, reviews, and stargazers data streams, use:

source.set_streams(["pull_requests", "issues", "reviews", "stargazers"])

Read the data into PyAirbyte’s default local DuckDB cache:

cache = ab.get_default_cache()
result = source.read(cache=cache)

You can read from the cache into a Pandas DataFrame using the to_pandas() method:

reviews = cache["reviews"].to_pandas()
stargazers = cache["stargazers"].to_pandas()
pull_requests = cache["pull_requests"].to_pandas()
issues = cache["issues"].to_pandas()

After storing the data in DataFrames, you can apply transformations to make it compatible with the destination schema and align it with your business logic. Once the data is transformed, you can load it into a destination like Snowflake.
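
For illustration, here is a minimal transformation sketch for the issues DataFrame. It assumes the stream contains columns such as id, title, state, and created_at (adjust the names to match your actual schema) and uppercases the column names so they align with Snowflake’s default identifier casing:

# Keep only the columns needed downstream (illustrative column names)
issues = issues[["id", "title", "state", "created_at"]]

# Parse timestamps so they load as proper datetime values
issues["created_at"] = pd.to_datetime(issues["created_at"])

# Uppercase column names to match Snowflake's default identifier casing
issues.columns = [col.upper() for col in issues.columns]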

To load the transformed data into Snowflake, establish a connection:

# Read Snowflake connection settings from environment variables
conn = snow.connect(
    user=os.environ.get("USERNAME"),
    password=os.environ.get("SNOWFLAKE_PASSWORD"),
    account=os.environ.get("ACCOUNT"),
    warehouse=os.environ.get("WAREHOUSE"),
    database=os.environ.get("YOUR_DATABASE_NAME"),
    schema=os.environ.get("YOUR_SCHEMA_NAME"),
)

Set the USERNAME, ACCOUNT, WAREHOUSE, YOUR_DATABASE_NAME, and YOUR_SCHEMA_NAME environment variables to your Snowflake connection details; they are passed to the container in Step 4. The SNOWFLAKE_PASSWORD variable is supplied when you run the container.

After creating a connection, write the DataFrame to a table in Snowflake. For example, to store the issues DataFrame in Snowflake, run:

write_pandas(conn, issues, "YOUR_TABLE_NAME", auto_create_table=True)

Replace the YOUR_TABLE_NAME placeholder with the Snowflake table name.
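
To load the remaining DataFrames, you can write each one to its own table in the same way and then close the connection; the table names below are placeholders:

# Write each DataFrame to its own Snowflake table (placeholder table names)
write_pandas(conn, pull_requests, "PULL_REQUESTS", auto_create_table=True)
write_pandas(conn, reviews, "REVIEWS", auto_create_table=True)
write_pandas(conn, stargazers, "STARGAZERS", auto_create_table=True)

# Close the connection once all tables are loaded
conn.close()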

Step 3: Defining the Container Properties

Open the Dockerfile, a text file that contains the commands required to assemble the Docker image from which containers are created.

Pull a base image from the official Python repository on Docker Hub. PyAirbyte requires a recent Python version, so a 3.10 image is used here:

FROM python:3.10

Copy the app.py file from the host machine to the working directory (.) in the image.

ADD app.py .

Install the Python dependencies in the image:

RUN pip install airbyte "snowflake-connector-python[pandas]" pandas

This instruction runs during the image build and installs the PyAirbyte, Snowflake connector, and Pandas libraries that app.py depends on. The pandas extra of snowflake-connector-python is needed for the write_pandas helper used in the script.

Note: You can also list the dependencies in a separate requirements.txt file and install them from the Dockerfile, as shown below.
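
For example, the requirements.txt could contain:

airbyte
snowflake-connector-python[pandas]
pandas

and the Dockerfile would then copy the file into the image and install from it:

COPY requirements.txt .
RUN pip install -r requirements.txt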

Specify the entry command, which is executed when the container starts:

CMD ["python", "app.py"]

Step 4: Building and Running the Container

After configuring the Dockerfile, open the terminal and enter the following command:

docker build -t python-pyairbyte .

This command builds a Docker image and tags it python-pyairbyte. The '.' at the end tells Docker to use the current directory as the build context, where it looks for the Dockerfile.

You can now create a container from this image with the docker run command:

docker run -e GITHUB_PERSONAL_ACCESS_TOKEN=your_token -e SNOWFLAKE_PASSWORD=your_password python-pyairbyte

The -e GITHUB_PERSONAL_ACCESS_TOKEN=your_token and -e SNOWFLAKE_PASSWORD=your_password flags set the GitHub and Snowflake credentials as environment variables inside the container. The remaining Snowflake settings can be passed with additional -e flags in the same way, as shown below. This command runs the app.py file, initiating the Python ETL in Docker.
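
For instance, a complete run command might look like the following (all values are placeholders; alternatively, you can collect the variables in a file and pass it with Docker's --env-file option):

docker run \
  -e GITHUB_PERSONAL_ACCESS_TOKEN=your_token \
  -e SNOWFLAKE_PASSWORD=your_password \
  -e USERNAME=your_user \
  -e ACCOUNT=your_account \
  -e WAREHOUSE=your_warehouse \
  -e YOUR_DATABASE_NAME=your_database \
  -e YOUR_SCHEMA_NAME=your_schema \
  python-pyairbyte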

By using PyAirbyte, you can simplify the data migration logic in the Python script. Alternatively, you can also use Airbyte Cloud or the OSS version to develop ELT pipelines.

Conclusion

By following these steps, you can build and containerize a data pipeline that performs Python ETL within a Docker container. The process begins with PyAirbyte, which lets you extract data from the source of your choice into a local SQL cache. Once the data is in a Pandas DataFrame, you can apply business logic and transform the data to make it analysis-ready. Finally, the transformed data can be loaded into a destination like Snowflake.
