Everything as Code (EaC) is a development approach aiming to express not only software but also its infrastructure and configuration in code. Changes to resources are managed programmatically using a Git workflow and a code review process rather than deployed manually. This post examines how to apply the same development philosophy to data infrastructure.
Why Everything as Code for Data Infrastructure?
Configuring data pipelines and related resources from a UI might be convenient. However, manual deployments have several drawbacks and risks. EaC can help you avoid these downsides using proven engineering methods.
- EaC makes it easier to reproduce environments and keep them in sync.
- With EaC, changing a resource is as simple as editing its code, rather than manually deleting the resource and reconfiguring it from scratch.
- You can store your resource configuration in a Version Control System and maintain a history for auditability and rollback. If you stumble upon any issues, you can troubleshoot by reading the commit log and resolve these issues simply by reverting the change.
- Maintaining all resources in code allows collaboration via a pull request to ensure a proper review and approval process in your team.
- Code can be easily formatted and validated, helping to detect issues early on. Simply run <span class="text-style-code">terraform fmt</span> and <span class="text-style-code">terraform validate</span> to ensure your resources are formatted and configured properly.
- The best documentation is one that you don’t need to write or read because the code already explains the state of your resources.
- Code can be reused. Instead of clicking through UI components hundreds of times, you can declare your resource once and reuse its configuration in code for other similar resources.
- Finally, defining resources via code can automate many manual processes and save you time.
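The formatting and validation checks mentioned in the list above can be bundled into one step that runs before every pull request. The following is a minimal sketch, not part of the demo repository: the function name `tf_check` is hypothetical, and it assumes Terraform is installed and the working directory has already been initialized.

```shell
# Hypothetical helper: run Terraform's built-in checks before opening a pull request.
tf_check() {
  # -check reports files that need reformatting (non-zero exit) instead of rewriting them;
  # -recursive also covers subdirectories.
  terraform fmt -check -recursive || return 1
  # validate checks that the configuration is syntactically and internally consistent.
  terraform validate
}
```

Calling `tf_check` in a CI job or a pre-commit hook is one way to catch formatting drift and invalid configuration before a reviewer ever sees the change.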
Airbyte Terraform provider
Airbyte is a data integration platform that simplifies and standardizes the process of replicating data from source systems to desired destinations, such as a data warehouse or a data lake. It provides a large number of pre-built connectors for various source systems, such as databases, APIs, and files, as well as a framework for creating new custom connectors.
Airbyte can be self-hosted or used as a managed service. This post will focus on the latter — Airbyte Cloud. One feature of this managed service is the recently launched Terraform provider, making it easy to define your data sources, destinations, and connections in code.
Airbyte ingestion, dbt & Python transformation, Kestra orchestration
This post will dive into the following:
- Using Airbyte’s Terraform provider to manage data ingestion
- Orchestrating multiple Airbyte syncs in parallel using Kestra
- Adding data transformations with dbt and Python
- Scheduling the workflow using Kestra’s triggers
- Managing changes and deployments of all resources using Terraform
The code for the demo is available in the examples repository.
Let’s get started.
First, you need to sign up for an Airbyte Cloud account. Once you have an account, save your workspace ID. It’s a UUID that appears in the main URL of your workspace.
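If you prefer not to copy the ID by hand, a small shell snippet can extract it from the address bar contents. This is only a sketch: the URL shape and the UUID below are made-up placeholders, so paste your own workspace URL instead.

```shell
# Placeholder URL: both its shape and the UUID are assumptions for illustration.
workspace_url="https://cloud.airbyte.com/workspaces/123e4567-e89b-12d3-a456-426614174000/connections"

# Pull out the first UUID-shaped token (8-4-4-4-12 hexadecimal digits).
workspace_id=$(printf '%s\n' "$workspace_url" \
  | grep -oE '[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}' \
  | head -n 1)

echo "$workspace_id"
```

Matching on the UUID pattern rather than on a fixed position in the path keeps the snippet working even if the surrounding URL segments differ.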
Then, navigate to Airbyte’s developer portal and generate your API key. Store both the Workspace ID and API key in a <span class="text-style-code">terraform.tfvars</span> file.
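One way to create that file from the terminal is sketched below. The variable names `airbyte_api_key` and `workspace_id` are assumptions, as are the placeholder values; the names must match whatever is declared in the repository's variables.tf.

```shell
# Variable names are assumptions; they must match the declarations in variables.tf.
# Replace the placeholder values with your own credentials.
cat > terraform.tfvars <<'EOF'
airbyte_api_key = "REPLACE_WITH_YOUR_API_KEY"
workspace_id    = "REPLACE_WITH_YOUR_WORKSPACE_ID"
EOF

# The file holds a secret, so restrict it to the current user.
chmod 600 terraform.tfvars
```

Since the file contains an API key, it is worth keeping it out of version control, for example by listing it in .gitignore.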
Download Kestra’s docker-compose file, for example, using curl.
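A download step along these lines should work. Treat the URL as an assumption about where the compose file lives in the kestra-io/kestra repository and check the Kestra documentation for the current location; the function name `download_kestra_compose` is hypothetical.

```shell
# Assumed location of the compose file in the kestra-io/kestra repository;
# verify it against the Kestra documentation before relying on it.
KESTRA_COMPOSE_URL="https://raw.githubusercontent.com/kestra-io/kestra/develop/docker-compose.yml"

# Hypothetical helper that saves the compose file into the current directory.
download_kestra_compose() {
  # -f: fail on HTTP errors, -sS: quiet but still print errors, -L: follow redirects
  curl -fsSL -o docker-compose.yml "$KESTRA_COMPOSE_URL"
}
```

Running `download_kestra_compose` leaves a `docker-compose.yml` in the current directory. The `-f` flag matters here: without it, curl would save an HTTP error page as if it were the compose file.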
Then, run <span class="text-style-code">docker compose up -d</span> and navigate to the UI. You can start building your first flows using the integrated code editor in the UI. In this demo, we’ll do that using Terraform.
Clone the GitHub repository with the example code (kestra-io/examples) and navigate to the airbyte directory. Then, run <span class="text-style-code">terraform init</span> to initialize the working directory.
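Put together, the clone and setup steps might look like the sketch below. The function name `setup_example` is hypothetical, and the sketch assumes that `terraform init` is the setup command needed at this point, since Terraform has to download the declared providers before anything can be validated or applied.

```shell
# Hypothetical helper: fetch the example code and prepare the Terraform working directory.
setup_example() {
  git clone https://github.com/kestra-io/examples.git || return 1
  cd examples/airbyte || return 1
  # Downloads the providers declared in the configuration.
  terraform init
}
```

After `setup_example` returns, the shell is inside the airbyte directory, ready for the remaining steps.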
Add the previously created <span class="text-style-code">terraform.tfvars</span> file (containing your API key and Workspace ID) into the same directory.
Finally, run <span class="text-style-code">terraform validate</span> to validate your existing configuration.
Deploy all resources with Terraform
Now, you can watch the magic of the EaC approach. Run <span class="text-style-code">terraform apply</span>.
Airbyte sources, destinations, connections, and a scheduled workflow will get automatically provisioned. In the end, you’ll see a URL to a Kestra flow in the console. Following that URL, you can trigger a workflow that will orchestrate Airbyte syncs, along with dbt and Python transformations.
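For a more cautious deployment, you can review the planned changes before applying them. The sketch below is one possible variant rather than the demo's exact command: `deploy_all` and the plan file name `tfplan` are hypothetical.

```shell
# Hypothetical helper: plan, apply the reviewed plan, then show the outputs.
deploy_all() {
  # Save the proposed changes so that exactly what was reviewed gets applied.
  terraform plan -out=tfplan || return 1
  terraform apply tfplan || return 1
  # Re-print the outputs declared in outputs.tf, including the flow URL.
  terraform output
}
```

Applying a saved plan skips the interactive confirmation prompt, and `terraform output` lets you retrieve the Kestra flow URL again later without redeploying.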
How does it all work?
Having seen the process in action, you can now dive deeper into the code to understand how this works behind the scenes. The repository includes the following files:
- sources.tf - includes configuration of three example source systems: PokeAPI, Faker (sample data), and DockerHub
- destinations.tf - configures the destination for your synced data (for reproducibility, we use a Null destination that will not load data anywhere)
- variables.tf - declares input variables such as the Airbyte API key and workspace ID, whose values are supplied in terraform.tfvars
- outputs.tf - after you run terraform apply, the output defined here returns the URL to the flow page in Kestra UI (from which you can run the workflow)
- main.tf - the main Terraform configuration file specifying required providers, Airbyte connections, along with the end-to-end scheduled workflow. Note how Terraform lets one resource reference the attributes of another, so IDs are never copied by hand. This way, you can ensure that your orchestrator (here, Kestra) always uses the right connection ID and that your data ingestion jobs, data pipelines, IAM roles, database schemas, and cloud resources stay in sync.
How does Kestra help?
While you could schedule data ingestion jobs directly from Airbyte, integrating those within an end-to-end orchestration pipeline gives you several advantages:
- An easy way of parallelizing your data ingestion tasks just by wrapping them in a Parallel task
- Highly customizable scheduling features, such as adding multiple schedules to the same flow, temporarily disabling schedules (without redeployments), and adding custom conditions or event triggers. You can schedule your flows to run as often as every minute.
- Integrating data ingestion with subsequent data transformations, quality checks, and other processes — the DAG view shown below demonstrates a possible end-to-end workflow.
This post covered the benefits of applying the Everything as Code approach to data infrastructure. We used Terraform with the Airbyte and Kestra provider plugins to manage data ingestion, transformation, orchestration, and scheduling, all via code. By embracing the EaC philosophy, you can adopt software engineering and DevOps best practices to make your data operations more resilient. If you encounter anything unexpected while reproducing this demo, you can open a GitHub issue or ask in the Kestra Community Slack. Lastly, give us a GitHub star if you like the project.