Learn how to build robust ETL pipelines using Python, Docker, and Airbyte. A guide for data engineers covering setup, implementation, and best practices.
ETL (extract, transform, and load) pipelines are key components of modern data workflows, enabling seamless data movement across platforms. However, performing the migration steps manually can be complex and time-consuming. Adopting certain best practices, such as using a data integration tool like Airbyte, helps automate data replication.
Along with automation, another crucial practice is containerizing the code and dependencies used in data pipeline development. This makes deployment easier and improves scalability. Docker, one of the most widely used containerization tools, enables you to package software so that it runs smoothly in different environments.
In this article, you will discover how to perform Python ETL in Docker.
Docker is a software development tool that helps you build, test, and deploy applications. It uses containers, which are lightweight, self-contained, executable packages of software. Containers are created from Docker images. A Docker image captures an application’s configuration and works as a blueprint for containers, bundling every component, including the libraries and code, required to run the software.
You can use a Docker image on any machine with Docker installed, so you do not have to worry about platform compatibility. By bundling dependencies, Docker simplifies application deployment and ensures consistent, standardized operation.
Airbyte is an AI-powered data integration tool. Its readily available 550+ pre-built connectors allow you to move structured, semi-structured, and unstructured data between various platforms. If the connector you need is unavailable, Airbyte provides Connector Development Kits (CDKs) and a Connector Builder to create custom connectors.
To build an ETL pipeline using Airbyte and Python, we will use PyAirbyte in the following sections. The steps include setting up the project environment, extracting GitHub data with PyAirbyte, transforming it with Pandas, loading it into Snowflake, and containerizing the pipeline with Docker.
In this section, you will learn to create an application that performs Python ETL in Docker, moving data from GitHub to Snowflake.
After satisfying all the requirements, you can start setting up a project environment. First, create a virtual environment to isolate the dependencies.
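A minimal way to do this is with Python's built-in venv module (the environment name venv below is just an example):

```bash
# Create and activate an isolated virtual environment
python -m venv venv
source venv/bin/activate   # on Windows: venv\Scripts\activate
```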
Create a directory with the following structure:
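A minimal layout could look like the following (the project folder name is arbitrary; the two files are the ones used throughout this guide):

```
python-etl-docker/
├── app.py        # Python ETL script
└── Dockerfile    # container image definition
```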
Here, app.py contains the Python ETL logic, and the Dockerfile holds the instructions for building the container image.
Navigate to the app.py file. Inside this Python file, import all the necessary libraries.
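A minimal set of imports, assuming PyAirbyte (the airbyte package), Pandas, and the Snowflake Connector for Python are the libraries used in the steps below:

```python
import os

import airbyte as ab                      # PyAirbyte
import pandas as pd                       # DataFrame transformations
import snowflake.connector                # Snowflake destination
from snowflake.connector.pandas_tools import write_pandas
```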
Create and configure the GitHub source by executing:
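(A sketch assuming the airbytehq/airbyte repository and the config keys of a recent source-github connector; adjust the keys to match your connector version.)

```python
source = ab.get_source(
    "source-github",
    install_if_missing=True,
    config={
        "repositories": ["airbytehq/airbyte"],
        "credentials": {
            # Read from the GITHUB_PERSONAL_ACCESS_TOKEN environment variable
            # if set; otherwise PyAirbyte prompts for the token.
            "personal_access_token": ab.get_secret("GITHUB_PERSONAL_ACCESS_TOKEN"),
        },
    },
)
```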
This code uses Airbyte’s GitHub repository as an example. When executed, it will prompt you for your GitHub access credential unless the credential is already available as an environment variable, which is how you will supply it when you create a container later. To verify the connection, run the .check() method:
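```python
# Raises an error if the connector cannot reach GitHub with the given credentials
source.check()
```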
This call should return a success message, confirming that you can access the GitHub data.
List the available data streams for the source by executing the following:
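```python
# Print the names of all streams exposed by the GitHub source
print(source.get_available_streams())
```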
Select the data streams that you intend to store. For example, to save the pull requests, issues, reviews, and stargazers data streams, use:
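```python
# Keep only the streams we want to replicate
source.select_streams(["pull_requests", "issues", "reviews", "stargazers"])
```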
Store the data in a temporary local DuckDB cache:
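```python
# PyAirbyte's default cache is a local DuckDB database
cache = ab.get_default_cache()
result = source.read(cache=cache)
```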
You can read from the cache into a Pandas DataFrame using the to_pandas() method:
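(Shown here for the issues stream; the other selected streams can be read the same way.)

```python
issues_df = result["issues"].to_pandas()
```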
After storing the data in a DataFrame, you can perform transformations and make the data compatible with the destination schema.
In the transformation step, you can apply custom transformations that align the data with your business logic. Once the data is transformed, you can load it into a destination like Snowflake.
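As an illustration, here is a small, hypothetical transformation of the issues DataFrame; the columns kept and the derived flag are examples, not requirements of the GitHub schema or of your destination:

```python
# Keep a few columns, normalize the timestamp, and derive a simple flag
issues_df = issues_df[["id", "title", "state", "created_at"]]
issues_df["created_at"] = pd.to_datetime(issues_df["created_at"])
issues_df["is_open"] = issues_df["state"] == "open"
```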
To accomplish this, establish a Snowflake connection:
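```python
# The password is read from the SNOWFLAKE_PASSWORD environment variable,
# which is supplied when the container is created; the other values are
# placeholders for your own Snowflake details.
conn = snowflake.connector.connect(
    user="USERNAME",
    password=os.environ["SNOWFLAKE_PASSWORD"],
    account="ACCOUNT",
    warehouse="WAREHOUSE",
    database="YOUR_DATABASE_NAME",
    schema="YOUR_SCHEMA_NAME",
)
```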
Replace the USERNAME, ACCOUNT, WAREHOUSE, YOUR_DATABASE_NAME, and YOUR_SCHEMA_NAME placeholders with your own Snowflake details (or read them from environment variables). The PASSWORD is supplied as an environment variable when you create a new container.
After creating a connection, write the DataFrame to a table in Snowflake. For example, to store the issues DataFrame in Snowflake, run:
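```python
# One way to load the DataFrame (an assumption on our part; pandas' to_sql via
# SQLAlchemy would also work): the Snowflake connector's write_pandas helper.
write_pandas(conn, issues_df, "YOUR_TABLE_NAME", auto_create_table=True)
```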
Replace the YOUR_TABLE_NAME placeholder with the Snowflake table name.
Open the Dockerfile, a text-based image configuration file that contains all the commands required to assemble the image from which containers are created.
Pull the base image, the official Python image from Docker Hub, using:
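```dockerfile
# Any recent Python base image works; 3.10-slim is just an example
FROM python:3.10-slim
```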
Copy the app.py file from the host machine to the current directory (.) in the container:
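```dockerfile
COPY app.py .
```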
Next, execute the pip command to install the dependencies:
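```dockerfile
# Package names are assumptions based on the imports in app.py; pin versions as needed
RUN pip install --no-cache-dir airbyte pandas "snowflake-connector-python[pandas]"
```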
When you build the image, this command installs the PyAirbyte, Snowflake, and Pandas libraries that support the code inside the app.py file.
Note: You can also list the dependencies in a separate requirements.txt file and reference that file from the Dockerfile.
Finally, specify the entry command, which will be executed when the container starts:
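```dockerfile
# Using CMD (rather than ENTRYPOINT) is a choice; either can launch the script
CMD ["python", "app.py"]
```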
After configuring the Dockerfile, open the terminal and enter the following command:
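```bash
docker build -t python-pyairbyte .
```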
This command builds a Docker image tagged python-pyairbyte. The ‘.’ at the end of the command tells Docker to look for the Dockerfile in the current directory.
You can now create a container from this image with the docker run command:
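```bash
docker run \
  -e GITHUB_PERSONAL_ACCESS_TOKEN=your_token \
  -e SNOWFLAKE_PASSWORD=your_password \
  python-pyairbyte
```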
The -e GITHUB_PERSONAL_ACCESS_TOKEN=your_token and -e SNOWFLAKE_PASSWORD=your_password flags set environment variables for the GitHub and Snowflake credentials inside the container. This command runs the app.py file, initiating the Python ETL in Docker.
By using PyAirbyte, you can simplify the data migration logic in the Python script. Alternatively, you can use Airbyte Cloud or the open-source version to develop ELT pipelines.
By following these steps, you can successfully build and containerize a data pipeline that performs Python ETL within a Docker container. The process begins with PyAirbyte, which allows you to extract data from the source of your choice into a SQL cache. Once the data is in a Pandas DataFrame, you can apply business logic and transform the data to make it analysis-ready. Finally, the transformed data can be stored in a destination like Snowflake.
Download our free guide and discover the best approach for your needs, whether it's building your ELT solution in-house or opting for Airbyte Open Source or Airbyte Cloud.