It’s not always about the destination but the journey. That summarizes the theme of our first Art of Abstraction post, which describes many pitfalls in extracting data from APIs. However, technically successful extraction is only part of the story.
The previous post discussed how to avoid problems when working with source systems outside of a data team’s control. However, once that data enters internal systems, teams face many decisions on how to process and load that data into their data platform. Each decision has implications for performance, cost, and reliability.
Continuing the grocery shopping analogy, imagine that you are buying ingredients for a cake. You might need sugar, milk, and eggs. If the store is out of eggs, you could decide to buy the available ingredients and return for eggs (and remember to only buy eggs on the next trip!). When unpacking those eggs, you could also decide whether to leave them in the carton that states their actual expiration date, put them in a shared container with the oldest at the front, or simply toss out the old eggs in favor of the new.
Similarly, for ELT, we will explore different decisions and how they interact. A good Extract-Load tool helps to highlight these ‘menus of options’, to empower data teams to consider the best options for their use-case, and to implement them with ease.
Extract-to-Load Basics
Some of the key decisions teams will face at this stage include how to format the extracted data, where to load the data, when and how to refresh the data, and sometimes how to unload it to other operational systems.
In the following sections, we will explore some of these key conceptual decision points and how they interrelate. Along the way, we will see how good EL tools can provide an abstraction over the implementation so data teams can focus on making the best design decisions.
What You Extracted
Before thinking about loading, let’s recap what one actually extracts from an API. Most APIs return data as JSON blobs, a semi-structured data format which represents an entity as a set of key-value pairs, similar to a Python dictionary. For example, a JSON blob representing this post might look something like this:
{
  "post_id": 1234,
  "name": "The Art of Abstraction in ETL: Making Sound Loading Decisions",
  "date_published": "2023-04-07",
  "author": {
    "name": "Emily Riederer",
    "twitter": "emilyriederer"
  }
}
Relative to tabular data, JSON has a few nuances. A few key issues for data teams are:
- The data does not carry types, so types must be inferred
- Content can be nested, such as the information in the author key
- Key names may be repeated, such as both the post and the author having a name field
- Nulls may be implicit; if the author did not have a Twitter account, that key-value pair might not be sent at all

As an analytics engineer makes decisions about loading, they will have to decide the best way and time to deal with each of these issues while converting the data into a downstream, analytics-ready data product.
Where to Load
Data Lakes versus Data Warehouses
Data must be loaded to a destination. Popular choices include data lakes (more generic object storage, like the hard drive on your computer) and data warehouses (more structured data adhering to a strict schema). These choices offer many trade-offs; for example, lakes allow you to preserve the unmodified results extracted from an API and adapt to changing data, while warehouses enforce consistency and structure that can improve downstream performance.
A standard abstraction over potential load destinations (like Airbyte Destinations) provides a unified interface across these arbitrarily unique platforms. This means data teams can more readily publish data to any environment, swap out destinations in the case of a platform migration, or even send their data to multiple destinations simultaneously.
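For intuition, the value of such an abstraction can be sketched as a minimal interface: extraction code depends only on a common write contract, so swapping destinations becomes a configuration change rather than a rewrite. The Destination class and its methods below are hypothetical simplifications for illustration, not Airbyte's actual protocol.

```python
import json
import pathlib
from abc import ABC, abstractmethod
from typing import Iterable


class Destination(ABC):
    """Hypothetical, simplified destination interface (not Airbyte's actual protocol)."""

    @abstractmethod
    def check_connection(self) -> bool:
        """Verify credentials and connectivity before syncing."""

    @abstractmethod
    def write(self, stream: str, records: Iterable[dict]) -> None:
        """Persist a batch of records for a named stream."""


class ObjectStorageDestination(Destination):
    """Illustrative data lake destination: append newline-delimited JSON files."""

    def __init__(self, base_path: str):
        self.base_path = pathlib.Path(base_path)

    def check_connection(self) -> bool:
        return True  # a real implementation might attempt a small test write

    def write(self, stream: str, records: Iterable[dict]) -> None:
        self.base_path.mkdir(parents=True, exist_ok=True)
        with (self.base_path / f"{stream}.jsonl").open("a") as f:
            for record in records:
                f.write(json.dumps(record) + "\n")


# Because the pipeline only depends on the Destination interface,
# swapping in a warehouse-backed implementation is a one-line change.
destination: Destination = ObjectStorageDestination("/tmp/lake")
destination.write("posts", [{"post_id": 1234, "name": "The Art of Abstraction in ETL"}])
```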
Which Data Lake and/or Warehouse?
Many vendors provide data lakes and data warehouses, including Google (e.g. Cloud Storage, BigQuery) and Amazon (e.g. S3, Redshift). However, there are countless nuances in how they handle object management, such as tagging and versioning. The same abstraction benefits described above can help teams move more seamlessly between solutions offered by different vendors and apply pre-existing knowledge of one stack to another.
What file format?
Where the data is loaded is also intimately tied to how it should be formatted. For example, in the case of a data lake, file formats like Avro and Parquet can be optimized for compact storage (through compression) and more efficient retrieval (with partitioning and Parquet’s column-based access). Using the best file format for a given destination and use case can boost performance and cut compute and storage costs.
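As a rough sketch of why format matters, the snippet below writes the same records once as raw JSON lines and once as a compressed, partitioned Parquet dataset using pandas with the pyarrow engine. The column names echo the example blob above, and partitioning on date_published is purely an assumption for illustration.

```python
import json

import pandas as pd

records = [
    {"post_id": 1234, "name": "The Art of Abstraction in ETL", "date_published": "2023-04-07"},
    {"post_id": 1235, "name": "Another Post", "date_published": "2023-04-14"},
]

# Option 1: raw JSON lines -- simple and lossless, but row-oriented and uncompressed
with open("posts.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")

# Option 2: Parquet -- columnar, compressed, and partitionable for cheaper scans
df = pd.DataFrame(records)
df.to_parquet(
    "posts_parquet",
    engine="pyarrow",
    compression="snappy",               # smaller files in object storage
    partition_cols=["date_published"],  # lets query engines prune partitions by date
)
```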
What to Load
Structured or unstructured?
Data can be converted from JSON to a more structured format either before or after loading. Leaving the data in raw JSON format preserves the full lineage of the data pipeline for a more reproducible solution; however, it also puts the onus on a downstream process to add structure.
If data teams prefer to go the latter route, Airbyte provides automated basic normalization to convert JSON into tabular data. This approach addresses each of the key nuances of JSON described above with strategies for inferring column types, unnesting data, avoiding name ‘collisions’, and more. If teams wish to override Airbyte’s “best guess”, they can also extend this functionality with custom transformations.
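To make that normalization step concrete, here is a rough illustration (using pandas, not Airbyte's implementation) of how a nested blob with an implicit null can be flattened into typed, collision-free columns:

```python
import pandas as pd

raw = [
    {"post_id": 1234, "name": "The Art of Abstraction in ETL",
     "author": {"name": "Emily Riederer", "twitter": "emilyriederer"}},
    # Second record: the author has no Twitter account, so the key is simply absent
    {"post_id": 1235, "name": "Another Post", "author": {"name": "Another Author"}},
]

# Unnest the author object; prefixing with the parent key avoids a 'name' collision
df = pd.json_normalize(raw, sep="_")
print(df.columns.tolist())
# e.g. ['post_id', 'name', 'author_name', 'author_twitter']

# The implicit null surfaces as a missing value, and column types are inferred
df = df.convert_dtypes()
print(df.dtypes)
```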
How to track the data about your data
Whether data is loaded in structured or unstructured form, the data extracted from a source system is not the only data of interest. Extraction itself creates information: when your data was extracted and loaded may be useful both operationally (to keep your pipeline running, as we will see shortly) and analytically.
Recognizing this, teams will want to preserve additional metadata fields along with their data. What data should be tracked? Airbyte appends metadata fields that reflect when data was extracted and loaded, and (depending on the load strategy) may also reflect for what period of time a specific record is “current” before it is superseded by more recent information. The near-universal utility of such fields makes them ripe for automation and abstraction.
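A minimal sketch of the idea is simply to stamp every record at load time. The column names below are hypothetical placeholders for illustration; Airbyte's own metadata fields differ.

```python
from datetime import datetime, timezone

import pandas as pd


def add_load_metadata(df: pd.DataFrame, extracted_at: datetime) -> pd.DataFrame:
    """Append illustrative metadata columns (hypothetical names) to a batch of records."""
    out = df.copy()
    out["_extracted_at"] = extracted_at              # when the source was read
    out["_loaded_at"] = datetime.now(timezone.utc)   # when the batch landed in the destination
    return out


df = pd.DataFrame([{"post_id": 1234, "name": "The Art of Abstraction in ETL"}])
df = add_load_metadata(df, extracted_at=datetime(2023, 4, 7, tzinfo=timezone.utc))
```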
How to Load
Once teams have a plan for what data to load to which location, the next natural question is how often to execute this process. Unlike the prior decisions, this one is high cardinality: multiple decision points exist, and the best answer varies with the nature of each data source being ingested. To let data teams focus on these decisions, Airbyte encapsulates the relevant implementation logic in its sync configurations.
Full versus Incremental Loads
The previous discussion about querying an API assumed that we would want all the records; however, routinely extracting all records from an upstream source is both unnecessary and prohibitively inefficient. For example, imagine tracking the performance of a highly popular ad using the Facebook Ads connector. It would be far more efficient to pull data on any new impressions since the last sync instead of repeatedly retrieving every impression.
Full loads sound superficially easy because we don’t have to worry about missing any records; however, they can be incredibly costly, inefficient, and sometimes even destructive of critical historical records. Incremental loads require intricate accounting to only extract, insert, or update records appropriately based on the current state of the data from prior loads. Fortunately, the metadata discussed previously enables more efficient incremental extractions by providing accessible documentation of what data was already loaded.
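A minimal sketch of that accounting is shown below, assuming a source that can filter on an updated_at cursor field and a simple JSON file for state. Both are assumptions for illustration, not any particular connector's behavior.

```python
import json
import pathlib

STATE_PATH = pathlib.Path("state.json")


def load_cursor() -> str:
    """Read the high-water mark left by the previous sync (if any)."""
    if STATE_PATH.exists():
        return json.loads(STATE_PATH.read_text())["max_updated_at"]
    return "1970-01-01T00:00:00Z"  # no prior state: the first sync is effectively a full load


def save_cursor(records: list[dict]) -> None:
    """Persist the new high-water mark so the next sync only fetches newer rows."""
    if records:
        max_cursor = max(r["updated_at"] for r in records)
        STATE_PATH.write_text(json.dumps({"max_updated_at": max_cursor}))


def incremental_sync(fetch_page) -> list[dict]:
    """fetch_page is any callable that returns only records updated after the cursor."""
    cursor = load_cursor()
    new_records = fetch_page(updated_after=cursor)
    save_cursor(new_records)
    return new_records
```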
Overwrite versus Append
After establishing what subset of data is to be loaded, data teams must also decide how to load this data – particularly in the case of slowly changing dimensions. For example, consider an order fulfillment data source for an e-commerce platform. The same order entity will at one point in time have an ‘In Fulfillment’ status and later change to a ‘Shipped’ status.
Depending on the intended use case, the data could be represented either with one record at the order x status grain or with a single row at the order grain reflecting the current status. The former is useful for conducting analyses such as “What is the average time to fulfillment?” whereas the latter is sufficient (and efficient) for operational monitoring such as “What percent of current orders are outstanding?”
Data teams can adapt their load strategies to meet their specific business needs with Airbyte’s options for overwriting, appending, or deduplicating loads. These strategies are implemented based on unique identifying keys and supporting metadata to select the correct records.
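To illustrate the difference between those grains, the sketch below keeps an append-only history of status events and then deduplicates on the order's key (keeping the most recent event) to derive a current-status table. The column names are assumptions for this example, not a prescribed schema.

```python
import pandas as pd

# Append-only history: one row per order x status change (supports "time to fulfillment")
history = pd.DataFrame([
    {"order_id": 1, "status": "In Fulfillment", "updated_at": "2023-04-01"},
    {"order_id": 1, "status": "Shipped",        "updated_at": "2023-04-03"},
    {"order_id": 2, "status": "In Fulfillment", "updated_at": "2023-04-02"},
])

# Deduplicated view: one row per order reflecting only its current status
current = (
    history.sort_values("updated_at")
           .drop_duplicates(subset=["order_id"], keep="last")
)
print(current)
```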
Wrapping up
This post explores just a few of the countless decisions that data teams may encounter while loading their data. Clearly, each of these decisions has substantial implications for how downstream users can consume the data – including where it can be accessed, how much further processing is required, and what use cases it can address.
No one is better equipped than data teams to make these important strategic decisions, but when it comes to implementing them, adopting a tool like Airbyte can reduce the cognitive load of rote tasks like ingestion and state tracking.