3 Steps to Orchestrate dbt Core Jobs Effectively
With the increase in data sources such as IoT devices, marketing applications, and ERP solutions, data is being generated at a large scale. However, raw data rarely produces useful insights without some preparation, which is why most data professionals transform it into a form that suits their analysis techniques.
Tools like dbt Core (data build tool) can help you enhance productivity by providing free, open-source data transformation capabilities. Automating your dbt project reduces manual effort by running transformations as scheduled tasks.
This article will guide you through the steps to orchestrate dbt Core jobs with the help of Airbyte and Airflow.
What Is dbt?
dbt is one of the most popular data transformation tools, enabling you to turn raw data into analysis-ready datasets. It allows you to centralize your business logic by defining transformations based on rules your organization sets. For example, fields that are not necessary for business insights can be eliminated to keep the focus on key metrics. With its modular nature, dbt lets you break down complex transformations into smaller, manageable models.
dbt offers two different versions:
- dbt Cloud: This browser-based platform provides an interface that centralizes data model development and lets you schedule, test, modify, and document your dbt projects.
- dbt Core: This open-source version lets you build and execute your dbt projects through a command-line interface.
How to Orchestrate dbt Core Jobs Using Airbyte and Airflow
Orchestrating dbt Core jobs lets you automate the execution of SQL scripts against data stored in a remote location, such as a cloud data warehouse. With this approach, you can reduce manual intervention in your daily data analytics tasks.
Step 1: Extracting Data Using Airbyte
The first step in orchestrating dbt Core jobs is making sure the data is available on the platform where you will perform the transformations. In some organizations, developers prefer to build custom connectors to migrate data from various tools into a data warehouse. However, this method is limited in terms of flexibility and the infrastructure you must maintain.
To overcome this challenge, you can use SaaS-based tools like Airbyte to automate data replication tasks for your organization. Airbyte is a no-code data integration tool that allows you to connect various data sources to a destination of your preference. With 400+ pre-built connectors, it enables you to automate data movement between numerous tools.
If the connector you need is unavailable, you can use Airbyte’s Connector Builder or Connector Development Kit (CDK) to build a custom connector within minutes. The Connector Builder provides an AI-assist feature that reads through your connector’s API documentation and automatically fills out most of the required fields. These capabilities allow you to develop connectors without getting into the underlying technical complexities.
Let’s look at an example of moving data from Shopify to Snowflake, which you can then use to orchestrate dbt Core jobs. The following steps show how to load data from your Shopify store into a data warehouse like Snowflake and automate dbt jobs on top of it.
Configure Shopify as a Source
- Log in to your Airbyte account. After submitting your login credentials, you will be redirected to the Airbyte dashboard.
- Click on the Sources tab from the left panel.
- Enter Shopify in the Search Airbyte Connectors… box on the Set up a new source page.
- Select the available Shopify connector.
- A new Create a source page will appear on your screen.
- On that page, enter the name of your Shopify store and authenticate your Shopify account using OAuth2.0 or API password.
Configure Snowflake as a Destination
- Click on the Destinations tab on the left panel.
- On the Set up a new destination page, enter Snowflake in the Search Airbyte Connectors… box.
- Select the available Snowflake connector option.
- A new Create a destination page will appear on your screen. On this page, provide the necessary information, including Host, Role, Warehouse, Database, Default Schema, and Username. These fields determine where your Shopify data will be loaded.
- Authenticate your Snowflake account using Key Pair Authentication or Username and Password.
- Click Set up destination to configure it as a destination for your data pipeline.
Set up a Connection between Shopify and Snowflake
- Click on the Connections tab on the left panel of the Airbyte dashboard and select + New Connection.
- Select Shopify and Snowflake from the source and destination dropdown lists.
- Specify the frequency of the data syncs according to your requirements. With Airbyte, you can choose manual or automatic scheduling for your data refreshes.
- You can select the specific data objects you want to replicate from Shopify to Snowflake.
- Choose the sync mode for each stream: full refresh or incremental sync.
- Click on Test Connection to ensure the setup works as expected.
- Finally, if the test passes successfully, click Set Up Connection to initiate the data movement.
By integrating the platforms this way, you ensure proper data synchronization between the source and the destination. Airbyte also offers change data capture (CDC) functionality for supported sources, enabling you to automatically identify source data changes and replicate them to the destination system. This feature lets you perform dbt transformations on your organization's latest data.
Step 2: Configure and Execute a dbt Project
This step involves setting up the dbt environment and then organizing, compiling, and executing the dbt project. Let’s explore each part in depth.
Basic Configuration
- In the first step, create and activate a Python virtual environment for your project from the command prompt (see the example commands after this list).
- Install and configure dbt Core in your local environment; for Snowflake, this typically means installing the dbt-snowflake adapter.
- After setting up dbt Core, you can create a dbt orchestration project to perform transformation tasks on the Snowflake dataset. To do so, first create a new GitHub repository for version control. You can choose a name and decide whether to keep the repo public or private.
- To create a project, run the dbt init command followed by the project name, and use the cd and pwd commands to navigate into the project folder. It is good practice to open the project in a code editor like VSCode, whose extension marketplace includes version control tools like Git that help you manage the project effectively. After running dbt init, the project folder should contain ‘.yml’ and ‘.sql’ files.
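The exact commands depend on your operating system and warehouse adapter, but a minimal sketch for a macOS or Linux shell, assuming Snowflake as the destination and a hypothetical project name of dbt_orchestration, could look like this:

```bash
# Create and activate an isolated Python environment for dbt
python3 -m venv dbt-env
source dbt-env/bin/activate        # on Windows: dbt-env\Scripts\activate

# Install dbt Core together with the Snowflake adapter
pip install dbt-snowflake
dbt --version                      # verify the installation

# Scaffold a new dbt project (hypothetical name) and move into it
dbt init dbt_orchestration
cd dbt_orchestration
pwd
```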
Organize, Compile, and Execute the Project
- Organizing the project folders is a best practice that helps you identify and resolve issues while orchestrating dbt Core jobs.
To organize your project on GitHub, you can keep all your dbt projects under the dags directory. For example, move the dbt folder into the ‘airflow-dbt-orchestrate-project’ repository you created in the previous step.
- After organizing the files, configure a connection between the project and the Snowflake dataset so that your dbt logic runs against the replicated data. To achieve this, create a ‘profiles.yml’ file in your ‘.dbt’ folder and fill it with your Snowflake connection details, replacing the placeholder values with your own credentials (a sample profile is shown below).
By performing this step, you will be able to see the update in your GitHub repository.
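A minimal profiles.yml sketch for Snowflake is shown below; the profile name dbt_orchestration, the account identifier, and all credentials are placeholders to replace with your own values, and the top-level key must match the profile name set in dbt_project.yml:

```yaml
dbt_orchestration:                 # must match 'profile' in dbt_project.yml
  target: dev
  outputs:
    dev:
      type: snowflake
      account: your_account_identifier   # e.g. abc12345.us-east-1
      user: your_username
      password: your_password
      role: your_role
      warehouse: your_warehouse
      database: your_database
      schema: your_schema
      threads: 4
```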
- Now, navigate to your project folder and define the transformation operations as SQL commands in a ‘.sql’ model file. These commands are compiled once you run the dbt compile command (a sample model is shown after this list).
- After the SQL commands execute successfully, you will see their status in the terminal. If the last line displays “Done,” the compiled SQL appears in the target/compiled directory, and you can query the actual Snowflake table to confirm that dbt and the database are communicating correctly.
- Run the dbt run command. The dbt Core engine executes the SQL available in the compiled folder and, with the help of an INSERT INTO statement, writes the transformed data from a temporary table into your preferred location in Snowflake.
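To illustrate what such a model might contain, here is a hypothetical example that summarizes the Shopify orders Airbyte replicated into Snowflake; the model name, table reference, and column names are assumptions you should adapt to your own schema:

```sql
-- models/shopify_orders_summary.sql
-- Hypothetical model: summarize raw Shopify orders replicated by Airbyte.
-- Adjust the database.schema.table reference to match your Airbyte destination settings.
select
    customer_id,
    count(*)         as total_orders,
    sum(total_price) as lifetime_value
from analytics.airbyte_raw.orders
group by customer_id
```

Running dbt compile renders this model into plain SQL under the target/compiled directory, and dbt run then materializes the result in Snowflake.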
Step 3: Use Airflow to Orchestrate dbt Core Job
The second step showed how to create a dbt Core job and work with your Snowflake data. Although you can run the transformations by executing the job manually, the orchestration part, namely scheduling and automation, is still pending. You can use Airflow, a workflow scheduling and monitoring tool, to orchestrate dbt Core jobs. It offers the BashOperator, a convenient component for automating dbt processes.
The BashOperator is an Airflow operator that runs Bash commands as tasks within your directed acyclic graphs (DAGs). Using the BashOperator, you can call the dbt command-line interface from your workflow without writing a custom operator.
A DAG represents all the operations you want to perform on your data, along with the schedule they follow. You can create custom DAGs for your specific requirements by specifying schedules, descriptions, and default arguments. Read Airflow’s official documentation to learn how to create a DAG.
Airflow parses these acyclic graphs to build the workflow. For this use case, the workflow consists of three tasks: a dummy start task, the dbt model run, and a dummy end task.
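A minimal DAG sketch with these three tasks might look like the following; it assumes a recent Airflow 2.x installation with dbt available in the same environment, and the dag_id, file paths, and model name are hypothetical placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.empty import EmptyOperator  # formerly DummyOperator

# Hypothetical locations; adjust these to your own setup
DBT_PROJECT_DIR = "/opt/airflow/dags/dbt/dbt_orchestration"
DBT_PROFILES_DIR = "/opt/airflow/dags/dbt"

with DAG(
    dag_id="dbt_orchestration_dag",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    start = EmptyOperator(task_id="start")

    # Run a single dbt model; --select limits the run to that model
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=(
            f"dbt run --select shopify_orders_summary "
            f"--project-dir {DBT_PROJECT_DIR} "
            f"--profiles-dir {DBT_PROFILES_DIR}"
        ),
    )

    end = EmptyOperator(task_id="end")

    start >> dbt_run >> end
```

Because the BashOperator simply shells out to the dbt CLI, any dbt command (run, test, build) can be scheduled the same way.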
- You must provide the project directory, the location of the profiles.yml file, and the name of the model you wish to execute, as in the bash_command of the sketch above.
- You might notice a reduction in the number of steps needed to schedule tasks once your SQL models are tested. In this setup, dbt’s log output is surfaced in the Airflow UI, reducing the manual effort involved in transforming the data.
- Finally, you can review the changes by running a simple query against your target Snowflake table. This confirms that all the steps were performed correctly.
Key Takeaways
Understanding the key concepts of dbt Core jobs and how to orchestrate them gives you the information you need to automate your data workflows. Although manual techniques exist for transforming data, combining a data integration tool like Airbyte with dbt Core and Airflow saves time. With these tools, you can orchestrate almost all of your data transformation requirements and extract impactful insights from your data.