Using an ETL Framework vs Writing Yet Another ETL Script
The first time your engineering team needs to pull data from an external source, you may be tempted to write an ETL script in Python or Java that extracts data from the source, transforms it in memory and loads it to the destination. Five years ago, that's what I would have done too! In 2021, however, tooling has progressed such that this should be considered an antipattern and you should instead use an ETL/ELT framework.
Soon enough this simple ETL script turns into a more complex project when you make it production ready. Think for example of all the work needed to add scheduling, monitoring and debugging capabilities so you can fix it as soon as it breaks. And trust me, ETL scripts break often because you inherently don’t control the source data and API changes.
In this article, we share a common story of how this little ETL script evolves as you need to add more features to make it production ready and finally becomes an internal ETL framework to maintain. I've watched this narrative unfold in multiple high-performance engineering organizations.
At Airbyte, we believe that the future of ETL is EL(T) where the EL steps happen first and are decoupled from the T step. We’ve built an open-source EL(T) framework so you don’t have to code everything yourself. An EL(T) framework provides built-in connectors to extract and load data, a connector development kit, integrations with data transformation tools like dbt, scheduling, monitoring and a better developer experience than writing an ETL script from scratch.
Building your first ETL script
"We are doing a prototype on our new mapping product, we need to write a one-off script to pull in some map data."
Your first ETL script will normally take the form of a CLI that calls an external API to extract some data, normalize it and load it to destination. You run it once and you think it's over...
1 week later.
"For our prototype, we now need to pull the data daily, let's build a cron."
A cron job may be a good enough solution if you only need to schedule one ETL task at a time. But the moment you have one ETL job that runs long and is still running when the cron tries to schedule the next one, you’ll need to add more tooling.
For example, you may be tempted to use Airflow to schedule your ETL scripts with dependencies, or directly use Airflow transfer and transformation operators instead. While Airflow shines as a workflow orchestrator, soon enough you will need extra logic to add incremental loads and integrate data from business applications. You can read more about the challenges of using Airflow for ETL pipelines here.
1 month later.
"Wait a second, why is all of our traffic data a month old?”
As time has passed, the prototype that this data supports has matured into a production system with real customers. Sure enough, the original implementation using cron did not include any ETL monitoring to make sure the script completes successfully. It was just a "little script" after all! So the team makes sure that the ETL script has monitoring and reports whether it has succeeded or not.
1 month later.
"We are pulling data at the wrong cadence. The base map data we only want to pull once a month. The traffic data we need to pull every 5 minutes."
More engineering work goes into making the "little script" more configurable. Now there are two crons running at different cadences, and the script decides which data to pull based on some arguments that are passed in at runtime.
This is just one example of the configuration that might need to be grafted on to the script. Additional ones that crop up are: selecting subsets of columns, handling transformations for the different data types separately, or grouping the data into the right schema or directory.
Adding incremental syncs
1 month later.
"Our traffic data is falling behind, because we are trying to resend all of it every five minutes. We need to send only the new data".
If you are building an ETL script that needs to run frequently, an engineer has to store state for the cron job, so that on each run, the cron can query for only the new data by remembering how far it got last time or query directly the destination data store.
Fixing a broken connector
1 month later.
“The data provider changed the schema of the data.”
Maintaining data integration pipelines is hard because you cannot control what the data provider is going to do. This means you have to be on call for changes they made. With the right monitoring you can reduce the disruption, but you still need to put in the time.
Adding a new source
1 month later.
"We are switching map data providers, we need a 'little script' to pull data from Y"
Either way, the little ETL script now has its own database and is starting to look like an ETL framework of its own. Sound familiar?
This trajectory is commonplace among good engineering teams. At each step, they find the smallest valuable deliverable instead of over engineering for unknown future requirements (very scrum!). This trade off has historically made sense. Building in-house a fully fledged ETL pipeline with monitoring, schema evolution, etc, is hard. Thus, it did not make sense to put in all of that upfront cost, relative to just sinking a few hours to get an initial script that gets the project unblocked.
But in 2021, there is no reason to make this trade off anymore. Instead you can use an ETL framework to build your ETL pipelines.
Why use an open-source EL(T) framework?
The CDK provides an improved developer experience by providing basic implementation structure and abstracting away low-level glue boilerplate. This includes packaging, code structure, a test suite, setting up the release pipeline and several helper methods.
The benefit of writing the connector using a framework is that the monitoring features that are needed to make the data pipeline a reliable piece of a production system require no additional engineering time. Then for all of those future requirements (e.g. frequency changes, incremental syncs), they are just configuration changes in a UI!
Airbyte is focused on the OSS approach because we believe that fundamentally it is the only way to solve the problems of data integration. It is not possible for a single company to write and support every single connector themselves. This is fundamental to how the data market has changed. Every year thousands of new tech companies are created, and each one is producing data. Given this growth, only the OSS community can keep up. The Airbyte and OSS community own patching connectors, which keeps the ongoing upkeep of the connector low.
ETL/ELT frameworks that do not leverage OSS, leave their customers in a difficult position. Those users often have to choose a tool or framework that only supports 3 of the 6 connectors they need. Thus, they still have to maintain a side ETL/ELT pipeline for those other 3 data sources. They are stuck in the "little ETL script" antipattern! With an OSS approach, users can get all of the benefits of a SaaS ETL tool and also have it support all of its connectors. If the connector is not built in, it is easy to add using our EL(T) framework, so that it is all running through the same system.
Any ETL/ELT framework that cannot cover all of the sources that a user needs is only fulfilling a fraction of that promise. The developer time savings really comes when the engineer does not have to maintain any side pipelines and can put the "little ETL script" antipattern behind them. With Airbyte, our focus is making connector writing so easy, so that teams can use a framework that supports all the connectors they need and gives them all of the ETL/ELT features they deserve.
Say goodbye to the "little ETL script" antipattern!