ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are two data integration methods used to consolidate data from multiple sources into a central repository. ETL has long been the traditional method, but ELT solutions have recently gained so much popularity that ELT has become the new industry standard.
In this article, we will explore the arguments for ETL vs ELT, and why ELT has been replacing ETL in recent years.
What is ETL (Extract, Transform, Load)? ETL involves extracting data from one or more sources, transforming it to meet the target system's requirements, and loading it into a target data store, usually a database or data warehouse. ETL has been the standard data integration method for many years. It typically involves a series of discrete steps, with data being moved from one step to the next in a predefined sequence.
The ETL process can be quite complex and time-consuming, especially when working with large datasets. It requires a lot of upfront planning to ensure that the data is transformed correctly and loaded into the target system in the correct format.
What is ELT (Extract, Load, Transform)? ELT involves extracting data from one or more data sources, loading it into a target data repository, usually a data warehouse, data lake, or database, and then transforming it to meet the target system's requirements. Unlike ETL, ELT does not require a predefined sequence of transformation steps. Instead, the extracted data is loaded into the target system as quickly as possible, and the transformations are then applied to it in place.
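The difference in ordering can be sketched with a toy example. Everything here is illustrative (the row shape and the function names are made up for the sketch, not any particular tool's API); the point is only where the transform step sits relative to the load step:

```python
# Toy ETL vs ELT sketch: in-memory lists stand in for a source system
# and a warehouse. All names here are illustrative.

def extract():
    # Pretend this pulls raw rows from a source API or database.
    return [{"name": " Ada ", "signups": "3"},
            {"name": "Linus", "signups": "5"}]

def transform(rows):
    # Normalize types and trim whitespace.
    return [{"name": r["name"].strip(), "signups": int(r["signups"])}
            for r in rows]

def load(rows, warehouse):
    warehouse.extend(rows)

# ETL: the warehouse only ever sees transformed data.
etl_warehouse = []
load(transform(extract()), etl_warehouse)

# ELT: raw rows land first, untouched; the transformation runs later,
# at the destination, and the raw data stays available.
elt_warehouse = []
load(extract(), elt_warehouse)
elt_transformed = transform(elt_warehouse)
```

Both paths end up with the same transformed rows, but only the ELT path keeps the raw extract around to re-transform later.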
ELT pipelines rely on modern cloud-based data warehouses that allow for massively parallel processing. This means that ELT solutions can process large amounts of data much faster than traditional ETL solutions. Additionally, ELT solutions are more flexible than ETL solutions, as they allow data transformations to be performed in place on the destination system.
It is the advent of modern data warehousing technologies such as Snowflake that enabled the emergence of ELT processes.
A Brief History of Data Integration Take a stroll through the story of data integration, uncovering its evolution and impact on how we manage data today, all while exploring the age-old question: ETL vs. ELT.
Pre-2000: Data pipelines? An IT problem.
Before 2000, data integration was mainly an IT problem, and data pipelines were managed by dedicated IT teams. The process was all about ETL (Extract-Transform-Load), which involved acquiring data, normalizing it, and moving it to a destination.
2000 to 2010: Social networks usher in a rebirth of data.
From 2000 to 2010, social networks led to an increase in available data, but the tools were not reliable, and analytics were still done at a small scale. Hadoop emerged for parallelizing computational workloads, but it was not until 2008 that it took hold. Product analytics tools emerged for low-volume products, but they were restricted to events generated by user behavior.
2010 to 2015: The advent of cloud computing and ETL solutions.
From 2010 to 2015, cloud tools and SaaS solutions became more widespread, and the adoption of ETL solutions quickly followed. Building and maintaining pipelines in-house became less feasible, as computing was expensive.
2015 to 2020: The rise of the modern cloud data warehouse and ELT.
From 2015 to 2020, modern cloud data warehouses such as BigQuery, Redshift, and Snowflake emerged, offering data computation and storage at lower costs than traditional data warehouses. With their processing power, these warehouses enabled transformation to be done after loading, which allowed for the more powerful ELT paradigm. Furthermore, the number of heterogeneous data sources grew even further as cloud tools became the standard.
ETL vs ELT: Similarities Both methods replicate data from a source (database, API, file, etc.) to a target destination system, with transformation capabilities so that the data arrives at the destination in the format you need. Beyond that, the two data integration methods share a few similarities. However, exploring the nuances of ETL vs. ELT reveals the distinctions in their execution and efficiency.
Automation Both ETL and ELT involve automating the data integration process to reduce manual effort and improve efficiency. They can both offer an API or CLI, and automated scheduling. However, as we'll see below, ELT offers a lot more automation possibilities.
Data Quality Both processes can ensure that the data being integrated is of high quality, accurate, and consistent, either through built-in features or by supporting an external data quality tool.
Data governance Both processes can support robust data governance tools and policies to ensure that data is secure and compliant with regulations. They can both be compliant with SOC 2, ISO, GDPR, HIPAA, and more.
That said, there are several significant reasons why companies should consider using ELT solutions instead of ETL solutions.
ETL vs ELT: Key Differences Here’s a summary of everything that ELT brings that ETL doesn't cover, highlighting the distinctive advantages and functionalities offered by each approach in the ongoing debate of ETL vs. ELT.
Accessibility of data
- Easy access to source data through off-the-shelf no-code connectors
- Automatic schema migration to lower technical maintenance
- Standardized models throughout all pipelines
- Support for both structured and unstructured data

Data integrity and reliability
- Access to raw data, insulated from transformation failures
- The ability to create a new data model without rebuilding the data pipeline

Scalability
- Fast processing times for short load times
- Fully managed services, to avoid building and managing infrastructure
- Lower costs, as fewer hardware resources are required

Let's go into more detail.
Faster processing times & loading speed One of the main advantages of ELT solutions is that they can process large amounts of data much more quickly than traditional ETL solutions. This is because ELT solutions load data directly into the data warehouse without transforming it first, and data warehouses offer massively parallel processing to transform it quickly.
Better scalability at a lower cost Because the ELT process extracts and loads data into cloud data warehouses or cloud data lakes more quickly, it allows for higher data replication frequencies and thus a smaller data volume per sync. This makes data pipelines much more scalable. By contrast, the ETL process has slower syncs at a lower frequency, and thus a higher volume per sync, which makes it more prone to scalability issues.
This also makes ELT less expensive, as it requires fewer hardware resources.
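The "transform at the destination" pattern can be sketched with sqlite3 standing in for a cloud warehouse; in practice the same pattern applies with a Snowflake or BigQuery client, and the table and column names here are made up for the example:

```python
import sqlite3

# An in-memory SQLite database stands in for the warehouse.
con = sqlite3.connect(":memory:")

# Load: raw rows land in the warehouse untransformed.
con.execute("CREATE TABLE raw_events (user_id INTEGER, amount REAL)")
con.executemany("INSERT INTO raw_events VALUES (?, ?)",
                [(1, 10.0), (1, 5.0), (2, 7.5)])

# Transform: a SQL statement executed *inside* the warehouse derives a
# new model from the raw table, which itself stays untouched.
con.execute("""
    CREATE TABLE user_totals AS
    SELECT user_id, SUM(amount) AS total
    FROM raw_events
    GROUP BY user_id
""")

totals = dict(con.execute(
    "SELECT user_id, total FROM user_totals ORDER BY user_id"))
```

Because `raw_events` is never modified, a new derived model can be created later with another SQL statement, without re-running the extract and load.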
Support of a lot more data sources Building ELT connectors is much simpler: the pipeline skeleton only needs Extract and Load, without any transformation logic. An ETL pipeline requires building and maintaining the transformation process, and is therefore more costly engineering-wise. That explains why ETL solutions have never surpassed 200 connectors in their offerings, even though they have been around for decades. By contrast, Airbyte, the fastest-growing ELT solution, offered more than 350 connectors less than 2.5 years after its inception. Eventually, we can imagine ELT solutions offering more than ten times as many connectors as their ETL counterparts.
More flexibility & autonomy for data analysts with lower maintenance With ETL, transformations are applied to the data before it is loaded into the target data warehouse or target database. This can be limiting because it means that all transformations must be predefined and specified in advance. This forces data analysts to know beforehand every way they are going to use the data and every report they are going to produce. Any change they make can be costly and require engineering help, as data analysts don’t have the technical abilities or even access to the data pipelines. Those changes and issues can potentially affect data consumers downstream of the initial extraction.
With ELT, transformations can be performed on the data at any time at the destination level. This allows for greater flexibility and adaptability from the data analyst's standpoint. They don’t have to determine beforehand exactly what insights they want to generate before deciding on the exact schema they need to get. This makes the business intelligence process incomparably more flexible.
This also lowers the burden on data engineers, who no longer need to edit the pipeline for every change. Excluding sensitive data from extraction is also easier for them.
Better data integrity & reliability for safer insights With ETL, every transformation performed on the data obscures some of the underlying information. Analysts won't see all the data in the warehouse, only the data that was kept during the transformation phase. This is risky, as conclusions might be drawn from data that hasn't been properly sliced. Also, any transformation issue impacts the data pipeline and prevents data analysts from accessing any data at all.
Instead, with ELT, the underlying source data is directly replicated to a data warehouse, comprising a “single source of truth.” Analysts can then perform transformations on the data as needed. They will always be able to go back to the original raw data and won’t suffer from transformations that might have compromised the integrity of the data, giving them a free hand. The pipelines are also insulated from any transformation issues.
All this makes the business intelligence process incomparably more reliable.
Support of a lot more automation Although you can programmatically control your ETL pipelines and create some automations around their scheduling, there are a few essential automations you can't put in place with ETL that are available with ELT. Understanding the nuances of ETL vs. ELT can shed light on these key differences and help you leverage the most effective automation strategies.
- Standardized data models across all pipelines
- No-code or low-code data integration with connectors that are usable out-of-the-box
- Automatic schema migration (for instance, when an API source changes its endpoints)
- Fully managed services, so you don't need to host and operate the infrastructure

Promotion of data literacy throughout the organization When used in combination with a semantic layer tool and cloud-based business intelligence systems such as Looker, Mode, and Tableau, the ELT approach also broadens access to a common set of metrics and analytics across the organization. This is the foundation of what we call self-serve analytics, and business intelligence dashboards become accessible even to relatively non-technical users.
More data formats: both structured data and unstructured data Another advantage of ELT solutions is that they can support both structured data and unstructured data, such as social media feeds, images, and videos. This is because ELT solutions can load this data into the data lake as-is. With ETL, unstructured data would need to be preprocessed and transformed before it could be loaded into the target system.
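Loading unstructured data "as-is" just means writing the raw payload to the lake without imposing a schema. A minimal sketch, using a local temporary directory as a stand-in for object storage (the file name and payload are made up for the example):

```python
import json
import pathlib
import tempfile

# A local directory stands in for a data lake bucket.
lake = pathlib.Path(tempfile.mkdtemp())

# Raw payload from an unstructured source, e.g. a social media feed.
raw_payload = b'{"post": "hello", "media": "video.mp4"}'

# ELT load step: write the bytes verbatim; no schema is imposed here.
(lake / "social_feed.json").write_bytes(raw_payload)

# Transform step, run later and only if needed: parse at read time.
parsed = json.loads((lake / "social_feed.json").read_bytes())
```

An ETL pipeline would instead have to parse and reshape the payload before it could be loaded at all, which is why unstructured sources fit the ELT model more naturally.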
ETL vs ELT Architecture Difference Here are some key differences between ETL and ELT architectures:
| Feature | ETL Architecture | ELT Architecture |
| --- | --- | --- |
| Transformation | Extensive data transformation before loading. | Minimal transformation before loading; primary transformations happen within the data warehouse. |
| Data Staging | Requires a separate staging area for data transformation. | May or may not use a staging area, since transformation often happens in the data warehouse itself. |
| Data Warehouse Schema | Strict schema enforcement before loading. | Flexible schema; data can be transformed to fit the warehouse schema later. |
| Processing Time | Generally slower due to extensive upfront transformations. | Potentially faster, as data loads quicker and transformations can be optimized for the data warehouse. |
| Data Quality | Ensures high data quality before loading. | May require additional data cleansing after loading into the warehouse. |
| Scalability | Can be less scalable with large and complex datasets. | Generally more scalable, as transformations leverage the processing power of the data warehouse. |
| Flexibility | Less flexible for adapting to changing data formats. | Schema changes can be accommodated more easily. |
ELT vs ETL vs ETLT
| Feature | ELT | ETL | ETLT |
| --- | --- | --- | --- |
| Data Processing Order | Extract, Load, Transform | Extract, Transform, Load | Extract, Transform, Load, Transform |
| Transformation | Transformation occurs within the target system, after data loading. | Transformation occurs before data is loaded into the target system. | Data undergoes an initial transformation phase before loading, followed by additional transformations within the target system. |
| Performance | Reduces processing time significantly, especially with big datasets. | Can have longer processing times, especially with large datasets. | Balancing pre-load and in-target processing can optimize performance. |
| Processing | Processing takes place within the target environment. | Processing happens outside the target environment. | Processing happens both outside and inside the target environment. |
| Loading | Data is first loaded into the target system and then transformed. | Data is sent to the target system after being transformed. | Data is loaded into the target system after an initial transformation phase. |
| Use Cases | Ideal for handling large volumes of raw data rapidly, such as in big data analytics. | Suitable for scenarios requiring complex transformations and data quality controls. | Offers flexibility for organizations needing both upfront and in-target transformations. |
Top ETL and ELT tools Let's look at the ETL and ELT tools that have gained the most popularity in the last 2 years.
Most popular data integration tools - both ETL and ELT Airbyte Airbyte has become the open-source data integration standard in the last two years, with more than 100,000 deployments. It offers the largest catalog of data connectors, both sources and destinations, and provides a user-friendly interface for building data pipelines, with a Connector Development Kit for creating new connectors within 30 minutes. Airbyte is designed to be highly scalable and can run on a variety of infrastructure options, both cloud-based (Airbyte Cloud) and on-premises. Airbyte integrates with dbt for its transformation process.
Fivetran Fivetran was the first cloud-based ELT solution and has grown significantly in popularity with the success of data warehouses such as Snowflake. Fivetran differs from Airbyte in its closed-source approach and by offering dbt transformation packages for all its connectors.
Matillion Matillion is the self-hosted ELT alternative to Fivetran. Its verticalized approach offers its own transformation process.
What is interesting is that the three most adopted tools are all ELT tools, which is no wonder given the added value of the ELT process!
Top 3 ETL Tools It still feels important to mention the most popular ETL solutions out there, so you can visualize by yourself the differences between the ETL and ELT tools.
Talend Talend Data Integration was initially an open-source ETL tool and must be deployed in the customer's own infrastructure. This constraint meant it has always had an enterprise focus. It supports a wide range of data sources and provides built-in transformations for data cleansing and transformation.
Informatica PowerCenter Informatica PowerCenter was a widely used ETL tool that supported data profiling in addition to data cleansing and data transformation. It was also deployed in its customers' own infrastructure.
Microsoft SQL Server Integration Services (SSIS) Microsoft SQL Server Integration Services is Microsoft's alternative, designed to run within the Microsoft infrastructure.
None of these ETL tools offered a cloud-hosted option, if we consider Microsoft's to be a private cloud platform. They all had an enterprise focus: being self-hosted, they were much slower and more difficult to implement for faster-paced companies.
It is to be noted that some workflow orchestrators, such as Apache Airflow, have been used to orchestrate data integration jobs, but they require a lot more engineering work for every connector you use, so this wasn’t the focus of this article.
Conclusion ELT solutions offer several significant advantages over ETL solutions, including faster processing times, more flexibility and automation, better scalability, lower costs, easier maintenance, and better data integrity and reliability.
These advantages have made ELT solutions an essential part of the data strategy for companies looking to integrate data from multiple sources into a central repository. If your company is considering an ETL solution, you might want to reconsider.