The Art of Abstraction in ETL: Keeping The Good Things Going

May 3, 2023


Building robust data ingestion is riddled with countless pitfalls to dodge and decisions to make, as outlined in our Art of Abstraction post series. Just like inventing a new recipe from scratch, perfecting these two steps may require some iteration as data teams discover which patterns and processes work best in their context.

The work does not stop once that initial groundwork has been laid. Data teams will want to ensure that their data ingestion processes and broader data pipelines continue to operate reliably and deliver expected results. Maintaining the status quo may sound easy, but staying consistent in a dynamic environment is not a trivial task. Doing so successfully requires adopting best practices from software engineering and DevOps.

Repeating a grocery-trip-to-final-meal process would require many elements. You’d need a trigger to go to the store, knowledge of what ingredients to buy, knowledge of where to find them, assurance that the store hadn’t reorganized their aisles, a recovery plan if any ingredients are unavailable, and finally a recipe to reference as you cook.

Similarly, to keep an ELT pipeline running, we need a stable environment, clear triggers to kick off processes, and a way to account for failures while protecting downstream systems. These intuitive needs map to the software engineering concepts of reproducibility, automation and orchestration, and observability.

Reproducibility affords the ability to recreate the exact code and environment (e.g., versions of installed dependencies) across sessions. Automation and orchestration provide increasingly complex and powerful means to execute jobs when (and only when) they should be run. Observability allows us to monitor those jobs to ensure that they ran as intended.

In this post, we’ll explore each of these in more detail.

Resilience & Reproducibility

Intuitively, if data teams want consistent results, they should provide consistent inputs to their data jobs. Teams must be confident that their automated data ingestion job is “the same” as the one that they developed and tested. The most important aspect of this is ensuring that they can effectively govern both changes to the explicit logic of their pipeline and the background context (e.g. versions of other packages) in which it is run.

Version Control

Version control is an essential best practice for any engineering effort. Version control tools like Git and platforms like GitHub allow teams to track changes to their project code and configurations over time and to enhance the pipeline in thoroughly documented and reversible ways.

Airbyte’s new Octavia CLI uses a Configurations as Code paradigm to allow complete specification of data ingestion jobs in YAML plain-text. This enables easy version controlling and code reviewing of changes.

Environment Management

Data pipeline code may call a range of other open-source tools and packages under the hood. While such tooling allows data teams to accomplish great things, these dependencies can also become a liability. Even if the data team’s code stays the same, breaking changes in any of the upstream dependencies can ricochet through the pipeline and cause previously well-established processes to break or behave differently.

To ensure the system keeps running as intended, it’s important to stay consistent with the package versions being used. Airbyte helps this process in multiple ways. Within each sync, it imposes self-discipline on its own connectors with clear versioning and changelogs (for example, see the changelog for the Google Analytics connector). At the broader project level, its Docker-driven workflow establishes a good framework for explicitly characterizing and recreating the desired production environment.
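As an illustration of staying consistent with package versions, a data team could add a small check that compares what is installed against what was pinned during development. This is a minimal sketch, not part of Airbyte: the function name find_version_drift and the shape of the pinned mapping are assumptions made for this example.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple


def find_version_drift(pinned: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return the packages whose installed version differs from the pinned one.

    The result maps package name -> (pinned version, installed version),
    where the installed version is None if the package is missing.
    """
    drift = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            drift[name] = (expected, installed)
    return drift
```

Running a check like this at the start of a job lets the pipeline stop early, before a silently upgraded dependency changes its behavior. In a Docker-driven workflow, the pinned versions would typically come from the same requirements file used to build the image.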

Pipeline Automation

Once all the ‘pipes’ are built, a data team has higher value tasks to undertake than manually running them every day. Multiple options exist to run data ingestion processes precisely when needed to support a broader data pipeline.

Scheduling

At the most basic level, jobs need to be triggered to run on a schedule. Doing this requires a scheduler to tell a server to execute the code necessary to set up an environment (including installing needed packages, accessing credentials, and forming connections to relevant systems) and then to execute the ingestion code.
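The scheduling logic described above can be sketched in a few lines. This is a simplified illustration rather than how Airbyte's scheduler works: is_due, run_if_due, setup_steps, and ingest are hypothetical names, and a real scheduler would also handle retries, time zones, and concurrency.

```python
from datetime import datetime, timedelta
from typing import Callable, Iterable, Optional


def is_due(last_run: Optional[datetime], interval: timedelta, now: datetime) -> bool:
    """A job is due if it has never run or a full interval has elapsed."""
    return last_run is None or now - last_run >= interval


def run_if_due(
    last_run: Optional[datetime],
    interval: timedelta,
    now: datetime,
    setup_steps: Iterable[Callable[[], None]],
    ingest: Callable[[], None],
) -> Optional[datetime]:
    """Run the setup steps and then the ingestion code when the job is due.

    Returns the updated last-run time (unchanged if the job was skipped).
    """
    if not is_due(last_run, interval, now):
        return last_run
    for step in setup_steps:  # e.g. install packages, load credentials, connect
        step()
    ingest()
    return now
```

Keeping the "is it time?" decision separate from the "set up, then run" sequence makes each piece easy to test on its own.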

For this task, Airbyte Cloud offers a built-in scheduler.

Orchestrating with Airbyte

However, scheduling alone can often be myopic. Data teams may want to kick off ingestion only when certain upstream sources have finished generating data or, alternatively, let downstream transformation tools know when their load has finished. This requires a move from time-based scheduling to orchestration, the triggering of downstream tasks based on the completion of upstream tasks. This ensures that processes will not run out of order, like a transformation job running on partially loaded data.
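To make the difference from time-based scheduling concrete, here is a minimal sketch of completion-based triggering using Python's standard-library graphlib module (Python 3.9+). The task names and the run_pipeline helper are hypothetical; dedicated orchestrators offer far more (retries, parallelism, partial reruns), but the core idea is the same.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_pipeline(
    tasks: Dict[str, Callable[[], None]],
    dependencies: Dict[str, Set[str]],
) -> List[str]:
    """Run each task only after all of its upstream tasks have completed.

    `dependencies` maps a task name to the set of tasks it waits on.
    Returns the order in which tasks ran. If a task raises, the exception
    propagates and nothing downstream of it is started.
    """
    completed = []
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()
        completed.append(name)
    return completed
```

For example, with dependencies = {"ingest": {"source_export"}, "transform": {"ingest"}}, the transformation never starts unless ingestion has finished without raising an error.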

Simple orchestration tasks can be configured within Airbyte. For example, the completion of an Airbyte job can trigger a downstream dbt Cloud job to begin running.

Orchestrating Airbyte

As pipelines mature, it makes sense to have a single home for orchestration versus a loose federation of tools triggering one another. Data teams may ultimately switch to using a dedicated orchestration tool like Prefect or Airflow to centralize their entire pipeline, provide a single home for monitoring jobs, and allow more flexible triggering of parts of the execution graph when needed.

Even when moving to more holistic infrastructure, industry-standard tools like Airbyte have some advantages. Custom Prefect tasks and Airflow operators reduce the boilerplate code data teams must write to incorporate their data ingestion jobs into bigger pipelines.

Observability

Of course, no EL solution is ever perfect. Inevitably, a pipeline will sometimes fail for unforeseen reasons. When this happens, a data team will want to be notified as soon as possible and to receive as much context as possible to help them rapidly resolve the issue. This requires multiple types of observability.

Notifications

Data teams will want to learn immediately if their pipeline has broken. Airbyte Cloud customers can easily configure this with notification webhooks in conjunction with a system like Slack. Notifications can be customized depending on whether teams want to know about failures only or all pipeline runs.
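For teams wiring up their own alerts, the "failures only or all runs" choice can be sketched as follows. This is an illustrative example and not Airbyte's built-in notification feature: build_notification and send_to_slack are hypothetical helpers, and the webhook URL is something the team would obtain from Slack. The only external contract assumed is that a Slack incoming webhook accepts a JSON body with a text field.

```python
import json
import urllib.request
from typing import Optional


def build_notification(
    connection_name: str, succeeded: bool, failures_only: bool = True
) -> Optional[dict]:
    """Build a Slack message for a sync run, or None if no alert is wanted."""
    if succeeded and failures_only:
        return None
    status = "succeeded" if succeeded else "FAILED"
    return {"text": f"Sync for {connection_name} {status}"}


def send_to_slack(webhook_url: str, payload: dict) -> int:
    """POST the payload to a Slack incoming webhook; returns the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Separating message construction from delivery means the filtering rule can be tested without making any network calls.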

Beyond basic run notifications, teams will likely also wish to monitor the ingested data for quality issues. While this is out of Airbyte’s scope, similar processes can be set up with tools like dbt and re_data to validate incoming data and centralize notifications on Slack.

Logging

Once teams are informed that something went wrong, they will want to immediately start problem-solving. Robust logging infrastructure makes this possible by capturing console outputs such as messages, warnings, and errors that arose as scheduled code was executed. Since cloud-based pipelines are so often deployed in ephemeral environments like Docker containers, teams must ensure they set themselves up to fail gracefully by finding a way to persist their logs for on-demand inspection.
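As one way to persist logs from an ephemeral environment, custom pipeline code can write them to a directory that outlives the container, such as a mounted volume. The sketch below uses only Python's standard logging module; get_persistent_logger and the log_dir argument are assumptions for this example, and it does not describe Airbyte's own logging configuration.

```python
import logging
from pathlib import Path


def get_persistent_logger(log_dir: str, name: str = "ingestion") -> logging.Logger:
    """Return a logger that appends to <log_dir>/<name>.log.

    `log_dir` is assumed to point at storage that survives the container,
    for example a mounted volume.
    """
    directory = Path(log_dir)
    directory.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeat calls
        handler = logging.FileHandler(directory / f"{name}.log")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Messages, warnings, and errors written through this logger remain available for inspection after the job's container has been removed.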

Once again, Airbyte has a number of configurable solutions for logging with the Cloud offering or the open-source core project.

Despite data teams’ best efforts at dodging pain points and making the best design decisions for their pipeline, excellent strategy is in vain when paired with flawed execution. By using tooling that embraces DevOps best practices, data teams can continue to reliably deliver high-quality data with lower maintenance overhead and fewer unforeseen issues.

About the Author

Emily is a Senior Manager at Capital One where she has led teams focused on full-stack data products, measurement and analytics, and modeling solutions. She is particularly passionate about delivering high-impact data science projects by integrating a strategic business perspective and software engineering best practices. Outside of her day job, Emily also enjoys supporting the data community by writing (published in The R Markdown Cookbook, 97 Things Every Data Engineer Should Know, and her blog), working on pro bono projects, and serving on the editorial board for rOpenSci and as a technical reviewer for the publisher Routledge. You can find Emily on github.com/emilyriederer and emilyriederer.com.