The Art of Abstraction in ETL: Dodging Data Extraction Errors
Cooking a meal versus grocery shopping. Interior decorating versus loading the moving van. Transformation versus Extract-Load. It’s human nature to get excited by flashy outcomes and, consequently, the most proximate processes that evidently created them.
This pattern repeats in the data world. Conferences, blog posts, corporate roadmaps, and even budgets focus on data transformation and the allure of “business insights” that might follow. The steps to extract and load data are sometimes discounted as a trivial exercise of scripting and scheduling a few API calls.
However, the elegance of Extract-Load is not just the outcome but the execution – the art of things not going wrong. Just as interior decorating cannot salvage a painting damaged in transit or a carefully planned menu cannot be prepared if half of the ingredients are out-of-stock, the Extract-Load steps of data processing have countless pitfalls which can sideline data teams from their ambitious agendas and aspirations.
This post is the first in a series to explain some of the underlying complexity of Extract-Load. Understanding this complexity illuminates how data integration tools like Airbyte can unburden data teams by reducing their cognitive load, speeding time-to-development, reducing the risk of future errors and outages, and letting them focus on organization-specific issues.
This post will begin at the beginning: extracting data from upstream source systems. Just like thriftily filling your grocery cart with the weekly deals or cautiously loading up the moving van to secure your items, we will explore the strategy required to extract data both comprehensively but efficiently.
Data Extraction Basics
All EL pipelines begin by extracting data from some source system. Airbyte provides a convenient abstraction layer for extracting data from many types of systems. To focus this discussion, we will focus on REST APIs since these are one of the most common sources when integrating with upstream vendor tools (e.g. Facebook Ads, Salesforce).
Successful data extraction from an API involves formulating precise API queries and executing that query at scale. Each of these steps can present many different challenges which can either cost data teams precious time writing boilerplate code or jeopardize the accuracy and completeness of the results. We will examine some of these key issues one-at-a-time.
Formulating an API Query
Despite the ubiquity of REST APIs, there’s much arbitrary uniqueness in how each is structured. Without proper care, it’s easy to run into error messages or retrieve misleading results that are exactly what you asked for and, yet, not at all what you wanted.
APIs secure confidential data behind an authentication layer. There are many different authentication strategies that an API developer might enable. These include basic authentication (unencrypted credential passing), API keys, and modern OAuth2 methods which often require retrieving an ephemeral refresh token multiple times during a long-running session.
Authentication, then, requires data teams to write rote API-specific code, and in the case of OAuth2 can require iterating between retrieving fresh refresh tokens and making your actual data-collecting API calls. Without building in the right layer of robustness and iteration, such timeouts can cause unexpected failures in long-running or sequential calls to an API.
Regardless which authentication pattern you use, you will need to pass some type of credential to the API. As with personal passwords, such credentials must be handled with care. Security can take a variety of different approaches such as storing credentials as environment variables versus hardcoding.
However, while most technologists know not to leave passwords in plain text, this is not the only way that you might inadvertently jeopardize cybersecurity when making an API call. Simpler authentication methods may reveal your credentials unless you’re sure to submit calls over secured connections; other times, rigorous logging may accidentally unmask credentials and introduce a leak that could allow bad actors to access your systems.
Data teams should play an active role in understanding what data they aim to retrieve from an API and what query parameters to apply. However, they may often be faced with a large number of options and share connectors allowing for the “wisdom of the crowd” to highlight the most common endpoints and parameters of interest.
Although it’s a cliche in data that “we can’t even define a customer”, the same types of ambiguity that exist within organizations are magnified when collecting external data that may use different nomenclature. Using common tools helps highlight the most important information and could help catch corner cases that a data team might not anticipate. For example, did you know that the GitHub API considers pull requests to be a type of issue?
Structuring Queries & Payloads
So, once you have proven that you have a right to access information (authentication) and determined what information you wish to access (endpoints), how do you effectively ask the API to retrieve it?
Even a task as commonplace as querying an API can include arbitrary uniqueness and introduce the risk of a data team retrieving the wrong results. For example, passing multiple possible values to a single query argument can be non-trivial and API-specific. Some APIs prefer a format like ?key=value1&key=value2 and others expecting ?key=value1,value2.
Using an unanticipated format might result either in API errors or, even worse, silent failures when the query is misinterpreted. For those coming from SQL-type languages, some of these structures might be unintuitive; for example, despite the ampersand in the first option shown above, both syntaxes are requesting “either or” not “both and” values in the results.
Querying APIs at Scale
So, you have a working API query! That’s great, but making a single successful call is far different than querying an API at scale to regularly extract large amounts of data. Beyond the business logic in the API query, an analytics engineer may find themselves needing to adjust their code to address challenges with:
Often, the result of your API query may be a large number of records. To transmit this data more efficiently, APIs use pagination to return discrete subsets of results. Essentially, pagination is the process of breaking a set of records returned by the query into distinct blocks and only returning one per API request – similar to how Google Search breaks results over multiple pages. For example, if your query has 120 relevant records, your first API call might return the top 50. To obtain the next 50 (on the second ‘page’) would require another API call.
While this makes retrieving query results more efficient, data engineers often have to write boilerplate code to determine whether more records exist and repeatedly retrieve the next set. Otherwise, data teams may risk extracting incomplete records.
Even a correct API query may fail due to ephemeral issues such as system outages and network issues. After an API query is made, the returned status code needs to be checked to confirm whether or not the hit was successful. If it wasn’t, the system needs to decide how to handle this error and whether to attempt the call again.
The intersection of retries and pagination creates further risk of missed data collection. For example, if you are iteratively calling an API to retrieve the next ‘page’ of results so long as a next page exists, a naive for-loop might proceed to query “page 4” even if “page 3” failed.
Both retries and pagination create situations where an API needs to be called many times to succeed in extraction. However, as a safeguard against brute-force account takeover and Denial of Service attacks, APIs often will have a rate limit which determines the maximum number of queries allowed in a given period of time. Resilient extraction processes should be conscientious of rate limits and implement approaches like backoff for retries.
As this post illustrates, data collection can require a lot of specialized knowledge – both about querying data generally and each API’s nuances specifically. Obtaining and deploying this knowledge can take substantial bandwidth from a data team. Instead of dedicating a team member to reinvent the wheel, data teams can “hire” an extraction tool like Airbyte. Airbyte’s source Connectors encapsulate both engineering best-practices and domain-specific knowledge to help navigate the risks highlighted above. Taking a peak at the documentation for a single connector (e.g. Amazon Ads), you will find detailed discussions of many of the issues described above, like how to provide credentials that the connector will appropriately pass to the API and which endpoints contain the most generally useful information across industries.
Even if Airbyte doesn’t already have an existing connector for your use-case, it provides useful, boilerplate frameworks for data teams to create new connectors that apply some of these same best-practices. For example, different pagination strategies are easily configurable without reinventing the wheel with boilerplate code.
Just because you can pack up a moving van yourself, no one would ever turn down the help of friends or movers to make the job at hand to make the job safer, faster, and more pleasant. By providing a unified abstraction layer over the countless decisions, pain points, and pitfalls described above, a good EL tool like Airbyte can bring that same value to your data team.