ETL vs ELT: The Key Differences
ETL (Extract, Transform, andLoad) and ELT (Extract, Load, and Transform) are two data integration methods used to consolidate data from multiple data sources into a central repository. ETL has been the traditional method for this process, but recently, ELT solutions have gained so much popularity it has become the new standard in the industry.
In this article, we will explore the arguments for ETL vs ELT, and why ELT has been replacing ETL in the latest years.
What is ETL (Extract, Transform, Load)?
ETL involves extracting data from one or more sources, transforming it to meet the target system's requirements, and loading it into a target data store - usually a target database or data warehouse -. The ETL process has been the standard data integration method for integrating data for many years. It typically involves a series of discrete steps, with data being moved from one step to the next in a predefined sequence.
The ETL process can be quite complex and time-consuming, especially when working with large datasets. It requires a lot of upfront planning to ensure that the data is transformed correctly and loaded into the target system in the correct format.
What is ELT (Extract, Load, Transform)?
ELT involves extracting data from one or more data sources, loading it into a target data repository - usually a data warehouse, data lake or database -, and then transforming it to meet the target system's requirements. Unlike ETL, ELT does not require a predefined sequence of steps. Instead, the extracted data is loaded into the target system as quickly as possible, and then the data transformation process is applied to it in-place.
ELT solutions use modern cloud-based data warehouses that allow for massive parallel processing. This means that ELT solutions can process large amounts of data much faster than traditional ETL solutions. Additionally, ELT solutions are more flexible than ETL solutions, as they allow for data transformation to be performed on the destination system in-place.
It is the advent of the modern data warehousing technologies with Snowflake that enabled the emergence of ELT processes.
A brief history of data integration
Pre-2000: Data pipelines? Definitely an IT problem.
2000, data integration was mainly an IT problem and data pipelines were managed by dedicated IT teams. The process was all about ETL (Extract-Transform-Load), which involved acquiring data, normalizing it, and moving it to a destination.
2000 to 2010: Social networks usher in a rebirth of data.
From 2000 to 2010, social networks led to an increase in available data, but the tools were not reliable, and analytics were still being done at a small scale. Hadoop emerged for parallelizing computational workloads, but it was not until 2008 that it took hold. Product analytics tools emerged for low-volume products, but they were restricted to events coming from user behaviors.
2010 to 2015: The advent of cloud computing and ETL solutions.
From 2010 to 2015, cloud tools and SaaS solutions became more widespread, and the adoption of ETL solutions quickly followed. Building and maintaining pipelines in-house became less feasible, and compute was expensive.
2015 to 2020: The rise of the modern cloud data warehouse and ELT.
From 2015 to 2020, modern cloud data warehouses such as BigQuery, Redshift, and Snowflake emerged, offering data computation and storage at lower costs than traditional data warehouses. With their processing power, these warehouses enabled transformation to be done after loading, which allowed for the more powerful ELT paradigm. Furthermore, the number of heterogeneous data sources grew even further with Cloud tools becoming the standard.
ETL vs ELT: The similarities
Apart from replicating data from a data source - database, API, file, etc. - to a target destination system with transformation possibilities, so you see the data in the format you need at the destination level, the two data integration methods do offer a few similarities.
Both ETL and ELT involve automating the data integration process to reduce manual effort and improve efficiency. They can both offer an API or CLI, and automated scheduling. However, as we'll see below, ELT offers a lot more automation possibilities.
Both processes can offer to ensure that the data being integrated is of high quality, accurate, and consistent. They can either include that offer or support an external tool for that.
Both processes can support robust data governance tools and policies to ensure that data is secure and compliant with regulations. They can both be compliant with SOC 2, ISO, GDPR, HIPAA and more.
That said, there are several significant reasons why companies should consider using ELT solutions instead of ETL solutions.
ETL vs ELT: The key differences
Here’s a summary of everything that ELT brings that ETL doesn't cover:
Accessibility of data
- Easy access to source data through off-the-shelf no-code connectors
- Automatic schema migration to lower technical maintenance
- Standardized models throughout all pipelines
- Support both structured or unstructured data
Data integrity and reliability
- Access to raw data, insulated from transformation failure
- Create new data model without rebuilding data pipeline
- Fast processing times for short load times
- Support fully managed services to avoid building and managing infrastructure
- Lower costs, as less hardware resources
Let's go into more details.
Faster processing times and loading speed
One of the main advantages of ELT solutions is that they can process large amounts of data much more quickly than traditional ETL solutions. This is because ELT solutions load data directly in the data warehouse without transforming it, and data warehouses offer massive parallel processing to transform it fast.
Better scalability at a lower cost
As the ELT process enables to extract and load data more quickly in the cloud data warehouses or cloud data lakes, it allows for higher data replication frequencies and thus lower data size per sync. This enables data pipelines to be much more scalable. Alternatively, the ETL process will have slower syncs at lower frequency, thus high volume for each sync, which makes it more prone to scalability issues.
This also makes ELT less expensive, as they require fewer hardware resources.
Support of a lot more data sources
Building ELT connectors is much simpler. You already have a skeleton of a data pipeline with only the Extract and Load, without transforming data. An ETL pipeline requires the building and maintenance of the transformation process, and therefore is more costly engineering-wise. That explains why ETL solutions have never surpassed more than 200 connectors in their offerings even though they have been around for several decades. Alternatively, Airbyte - the fastest-growing ELT solution - has offered more than 350 connectors in less than 2.5 years after its inception. Eventually, we can imagine that ELT solutions will offer more than ten times more connectors than their ETL counterparts.
More flexibility and autonomy for data analysts with lower maintenance
With ETL, transformations are applied to the data before it is loaded into the target data warehouse or target database. This can be limiting because it means that all transformations must be predefined and specified in advance. This forces data analysts to know beforehand every way they are going to use the data, every report they are going to produce. Any change they make can be costly and requires engineering help, as data analysts don’t have the technical abilities or even access to the data pipelines. Those changes and issues can potentially affect data consumers downstream of the initial extraction.
With ELT, transformations can be performed on the data at any time at the destination level. This allows for greater flexibility and adaptability from the data analyst standpoint. They don’t have to determine beforehand exactly what insights they want to generate before deciding on the exact schema they need to get. This makes the business intelligence process incomparably more flexible.
This also lowers the burden for data engineers who don't need to edit the pipeline for every change too. Removing sensitive data to be extracted is also easier for them.
Better data integrity and reliability for safer insights
With ETL, every transformation performed on the data obscures some of the underlying information. Analysts won’t see all the data in the warehouse–only the data that was kept during the transformation phase. This is risky, as conclusions might be drawn based on data that hasn’t been properly sliced. Also, any transformation issue impacts the data pipeline and prevents data analysts from accessing any data.
Instead, with ELT, the underlying source data is directly replicated to a data warehouse, comprising a “single source of truth.” Analysts can then perform transformations on the data as needed. They will always be able to go back to the original raw data and won’t suffer from transformations that might have compromised the integrity of the data, giving them a free hand. The pipelines are also insulated from any transformation issues.
All this makes the business intelligence process incomparably more reliable.
Support of a lot more automations
Although you can programmatically control your ETL pipelines and create some automations around their scheduling, there are a few essential automations you can’t put in place, which are available with ELT:
- Standardized data models across all pipelines
- No-code or low-code data integration with connectors that are usable out-of-the-box
- Automatic schema migration (for instance, when the API source changes their endpoints)
- Fully managed services, so you don’t need to host and operate
Promotion of data literacy throughout the organization
When used in combination with a semantic layer tool and cloud-based business intelligence systems such as Looker, Mode, and Tableau, the ELT approach also broadens access to a common set of metrics and analytics across organizations. This is the foundation of what we call self-serve analytics, and business intelligence dashboards become accessible even to relatively non-technical users.
More data formats: both structured data and unstructured data
Another advantage of ELT solutions is that they can support both structured data and unstructured data, such as social media feeds, images, and videos. This is because ELT solutions can load this data into the data lake as-is. With ETL, unstructured data would need to be preprocessed and transformed before it could be loaded into the target system.
Top ETL and ELT tools
Let's look at the ETL and ELT tools that have gained most popularity in the last 2 years.
Most popular data integration tools - both ETL and ELT
Airbyte has become the open-source data integration standard in the last 2 years with more than 100 thousands deployments. It offers the largest catalog of data connectors - both sources and destinations - and provides a user-friendly interface for building data pipelines within 30 minutes with their Connector Development Kit. Airbyte is designed to be highly scalable and can run on a variety of infrastructure options, including cloud-based (Airbyte Cloud) and on-premise. Airbyte integrates with dbt for its transformation process.
Fivetran was the first cloud-based ELT solution and have grown significantly in popularity with the success of data warehouses as Snowflake. Fivetran differs from Airbyte with its closed-source approach and by offering dbt transformation packages for all their connectors.
Matillion is the self-hosted ELT alternative to Fivetran. Its verticalized approach offers its own transformation process.
What is interesting is that the top 3 tools that were the most adopted are all 3 ELT tools, no wonder given the added value of the ELT process!
Top 3 ETL Tools
It still feels important to mention the most popular ETL solutions out there, so you can visualize by yourself the differences between the ETL and ELT tools.
Talend Data Integration was initially an open-source ETL tool and is to be implemented in the customer's own infrastructure. This constraint makes it that it always had an Enterprise focus. It supports a wide range of data sources and provides built-in transformations for data cleansing and transformation.
Informatica PowerCenter was a widely-used ETL tool that supported data profiling, in addition to data cleansing and data transformation processes. It was also implemented in their customers' infrastructure,
Microsoft SQL Server Integration Services (SSIS)
Microsoft SQL Server Integration Services is the Microsoft alternative from within their Microsoft infrastructure.
No ETL tools had a cloud infrastructure with a cloud-hosted offer, if we consider Microsoft to be a private cloud platform. They all had an Enterprise focus, as they were much slower and more difficult to implement for faster-pace companies by being self-hosted.
It is to be noted that some workflow orchestrators, such as Apache Airflow, have been used to orchestrate data integration jobs, but they require a lot more engineering work for every connector you use, so this wasn’t the focus of this article.
ELT solutions offer several significant advantages over ETL solutions, including faster processing times, more flexibility and automations, better scalability, lower cost, easier maintenance, better data integrity and reliability.
These advantages have made ELT solutions an essential part of the data strategy for companies looking to integrate data from multiple sources into a central repository. If your company is considering an ETL solution, you might want to reconsider.