Data teams increasingly use ETL pipelines to move data from a warehouse into business applications. The concept, known as "Reverse ETL," is rapidly becoming a standard component of data stacks.
As a data engineer, you may wonder, "Why would I want to move data out of the warehouse if I just spent so much time putting it into the warehouse?"
Without Reverse ETL, the data warehouse may become a silo, as the data inside it is only available to technical users who understand SQL. Furthermore, data living in dashboards or reports is intended to analyze the past rather than to drive future actions. Data engineering teams are looking for ways to empower non-technical business users by providing them with actionable data within the applications they already use, known as data operationalization.
Keep reading to learn more about Reverse ETL, why it’s technically different from traditional ETL, and the most prevalent use cases.
A Brief Story of Data Integration: ETL vs. ELT
The data engineering world is constantly evolving, especially regarding data integration implementations.
First, we had the ETL (extract, transform, load) approach to data pipelines, which essentially involves extracting data from different sources, transforming it, and loading it into – most commonly – a data warehouse where it can be used as a source of dashboards and reports.
ETL is a classic paradigm (it emerged in the 1970s!). Recently, it has shifted to ELT (extract, load, transform). The ELT philosophy dictates that data should be untouched – apart from minor cleaning and filtering – as it moves through the extraction and loading stages so that the raw data is always accessible in the data warehouse.
In an ELT pipeline, data enrichment and transformation occur inside the data warehouse, not in a processing server like in the case of ETL pipelines. The shift has been primarily made possible thanks to the appearance of cloud-based data warehouses like Redshift, BigQuery, or Snowflake.
The qualities described above make ELT pipelines more adaptable, efficient, and scalable than ETL pipelines, particularly when integrating massive amounts of data, processing structured and unstructured data, and enabling different types of data analytics and data science.
But the rise of the ELT paradigm does not mean that ETLs have been completely replaced. Some situations require transformations before data lands anywhere, for example, to handle personal identifiable information (PII) (which often involves removing, masking, or encrypting data fields).
So, What is a Reverse ETL?
Reverse ETL is the flip side of the ETL/ELT. With Reverse ETL, the data warehouse becomes the source rather than the destination. Data is taken from the warehouse, transformed to match the destination's data formatting requirements, and loaded into an application – for example, a CRM like Salesforce – to enable action.
Reverse ETL is the process of syncing data from a source like a data warehouse to a business application so it can be used by marketing, sales, support, and other teams in the tools they use.
Hence, a Reverse ETL “operationalizes” data throughout an organization by putting data back into business applications. In that sense, a Reverse ETL is not something that comes to replace the ETL or ELT data pipelines; it’s instead used in conjunction as part of a data stack.
In a way, the Reverse ETL concept is not new to data engineers, who have been enabling data movement warehouses to business applications for a long time. As Maxime Beauchemin mentions in his article, Reverse ETL “appears to be a modern new means of addressing a subset of what was formerly known as Master Data Management (MDM).”
But why can't data engineers use conventional solutions to move data out of the warehouse into business applications? Isn’t it as easy as reversing the arrow from application -> warehouse to warehouse -> application? Well, it turns out that Reverse ETL is way more challenging to get right than conventional ETL/ELT. Let’s see why.
Technical Differences Between ETL and Reverse ETL
It feels like ETL approaches should work for Reverse ETL, right? Both tackle transferring data from and to systems like Salesforce, Mailchimp, Snowflake, or BigQuery. But, as Brian Leonard – founder of Grouparoo, a Reverse ETL company recently acquired by Airbyte – shared with us, “To read from Zendesk, you have to know about Zendesk. To write to Zendesk, you have to know about Zendesk. But generally, writing to an API is harder than reading from it.”
That’s where the real challenge of Reverse ETL arises: Sending data to a business application via an API is hard, and companies today use dozens of business applications for their day-to-day operations. “Handling the one-by-one nature of it is challenging,” says Brian about the challenges of implementing Reverse ETL for several destinations.
But there’s more, so let’s unpack the technical differences between ETL and Reverse ETL and categorize them into different areas.
In this context, we refer to the synchronization mode as the way to extract and load data from the source to the destination.
In an ETL/ELT, there can be full or incremental data extraction. In a full data extraction, as the name suggests, all data from the source is extracted every time. On the other hand, in incremental data extraction, only the new or updated records since the last sync are taken into account.
When dealing with significant amounts of data, incremental extraction is an excellent choice to increase performance. Change Data Capture (CDC) techniques identify the new and updated data in an ETL pipeline. But CDC is difficult to apply in Reverse ETL, as the warehouse typically doesn’t provide a transaction log or “updated_at” columns.
Furthermore, in an ETL pipeline, you can use the UPSERT operation to merge data into a database or data warehouse, as most support it. The UPSERT ensures that if the data already exists in the destination, it won’t be inserted again. But most APIs don’t have that; instead, they have different endpoints for creating and updating records.
Not having upserting mechanisms for most APIs means that, in a Reverse ETL, you must keep track of what needs to be updated and what needs to be created.
“We would often search for a set of, for example, email addresses. So, we search for 50 email addresses and get them back with their IDs; for 30 of those, we might do individual updates. Then for 20 of those, we might do individual creates, or if there's a “batch create” feature, we create them at once. So, in general, there's just a lot more bookkeeping”, shares Brian.
Data quality and transformations
With ETL, you go from specific to general, meaning that you extract data from different specific sources to then integrate it into a common destination. Writing to a data warehouse is relatively easy since warehouses are designed to store massive volumes of data from several distinct sources. On the other hand, with Reverse ETL, you go from general to specific, having to conform to each business application API.
Because each API has a somewhat different interface, each Reverse ETL pipeline requires specific code. For example, the transformations necessary to transfer data to Mailchimp differ significantly from the ones needed for Salesforce.
Additionally, Reverse ETL pipelines must ensure that data conforms to the API specification, which, in turn, may require more data transformations than traditional ETL.
“For example, in Mailchimp, if you try to write an invalid email address, even if it looks like an email address, it will error. For example, Mailchimp will reject it because it knows it's not a real domain. So there's so much more validation and knowledge of that destination”, says Brian. “You're keeping track of every record a little more than you are in the ELT model. These data validation issues are much less common in the ELT space.”
Failures and job re-execution
ETL pipelines write to data warehouse tables, which means that if there’s any failure, you can technically drop the tables and simply re-execute the ETL job to recover or correct corrupted data.
In Reverse ETL, it might not be as easy. As Brian explains, “Normally, ETL batch jobs are what we call in Computer Science: idempotent or functional, which means that no matter how often you run them, they should produce the same results. The real question is, are the Reverse ETL destinations idempotent? If I add this person to MailChimp, and then I add them again, will the request fail, or will it send another welcome email? Just because somebody had to re-run something, you don't want to send this person another email; that's not good.”
Hence, job re-executions need to be handled with more care in Reverse ETL as there may be unwanted side effects.
What About the Difference Between Customer Data Platform and Reverse ETL?
A Customer Data Platform (CDP) is a system that combines and stores customer data from various sources. As previously described, Reverse ETL refers to moving data from a warehouse to business applications. In short, a CDP is a platform, while Reverse ETL is a process.
CDPs, in particular, are tools that assist in developing a single view of a customer by aggregating data from numerous sources. CDPs are designed with a specific goal, making them beneficial primarily to marketing and sales teams. In contrast, reverse ETL can be applied to any form of data, not simply customer data.
Because of the data architecture that enables Reverse ETL, all your data is centralized in a data warehouse, allowing you to combine it with data from other sources for more complex analysis. Meanwhile, CDPs only provide you access to consumer data, which limits the kind of analysis you can run.
Reverse ETL is merely a process, so it has the disadvantage of not providing an interface for data activation, instead relying on the interface of third-party applications it loads data into. This is where a CDP holds an advantage and why it often makes sense to have both Reverse ETL and a CDP as part of a data stack.
Typical Reverse ETL Use Cases
When building an ETL pipeline, data engineers usually don’t need to know precisely how stakeholders will use the data. Their main concern is to get all the required data to the warehouse in a performant and scalable way.
With Reverse ETL, engineers need to be more familiar with the use-case of the data. “When writing to Braze, you have some use cases in mind. You're often writing a likelihood to churn score, lifetime value score, a segment the client should be in, or the latest product they bought. So, Reverse ETL generally is more collaborative with marketing, sales, and support”, says Brian.
Let’s talk about real-life examples illustrating how business teams use Reverse ETL. We got common use cases in marketing, sales, and support.
Personalization might be the most prevalent use case for B2C companies. Specifically, Reverse ETL helps with email personalization. Several companies are using newsletters to communicate with potential and existing customers, usually by creating data-driven email flows, which can vary in complexity. “Starting with ‘hi, first name’ up to complicated workflows,” shares Brian.
Companies can enable personalized messages by sending customer data from the warehouse to Mailchimp, which allows them to do things like:
- Sending the Spanish newsletter to the Spanish-speaking people and the English newsletter to the English-speaking people.
- Identifying who has bought something and who hasn’t, so they get different emails.
- Detecting who purchased a specific item and, 30 days later, they get an automatic email asking, “do you want to repurchase this item?”.
In the past, SaaS companies were offering their products, but to get them, you needed to talk to a salesperson as there was no sign-up feature. Those companies had everything they knew about a potential customer in Salesforce, and the account executives' entries would show a progression like “Called prospect at 5:00 pm, they’re very interested”, “Prospect had a demo meeting, it went well, we are negotiating a contract” and then finally “Signed them up.” And that could be the end of the story.
As more and more companies are transitioning to product-led growth, which is self-serve, they’re interested in knowing more about the actual product usage. “They are essentially asking ‘how can we power Salesforce with data from the actual product’?” says Brian.
With Reverse ETL, companies can get product usage data into Salesforce and merge it to have a complete product usage view. For example, they can see when a person signs up, does a particular action, or spends a certain amount of money.
Another increasingly popular use case for Reverse ETL is customer support. By having a complete view of a customer, agents can provide more accurate help.
“Get phone number data into Zendesk or similar so that when a client calls for support, the information can automatically come up on the agent’s screen, and they'll know that Brian is calling and he recently bought a razor, or he's on the premium sales package and can automatically get routed to the best support agents,” says Brian to exemplify a typical Zendesk Reverse ETL use case.
Reverse ETL and the Data Hierarchy of Needs
The hierarchy of needs is a well-known psychological notion developed by Abraham Maslow. It states that individuals must first meet their fundamental requirements before focusing on more complex ambitions.
When comparing Maslow’s hierarchy of needs to a data stack, we realize that the most basic layers in the hierarchy have been addressed: data storage (with data warehouses), data integration (with ETL/ELT), data modeling, and reporting. Some mature platforms and tools fulfill the needs of the areas mentioned earlier. However, there’s an area that’s failing behind: data operationalization.
The rise of a new generation data stack reflects an essential trend: companies must transfer data capabilities out of centralized silos (data warehouses) and embed them inside teams across business units. “You get to a point where marketers say, ‘thanks for the report on Metabase, but I'd like to get this data into Customer.io”, says Brian.
In this sense, the future of the modern data stack includes Reverse ETL solutions that address data operationalization, or in other words, close the operational analytics loop.
In terms of tools, this is what the data stack we picture looks like:
- Data storage: A data warehouse that can store data in one place like BigQuery, Snowflake, or Redshift.
- Data integration: ELT tools like Airbyte to integrate your data sources into data warehouses.
- Data modeling: A transformation tool like dbt.
- Data reporting: BI and reporting tools like Looker or Metabase.
- Data operationalization: A Reverse ETL tool to pull data out of the warehouse, validate it, and load it into business applications.
Empowering Business users' to take action requires having fresh, actionable data available whenever they need it in frontline applications like Braze, Salesforce, or Mailchimp.
In this blog post, you learned what a Reverse ETL is, how it is technically different from ETL/ELT, and where it fits in a data stack. The most important takeaways as to why companies benefit from Reverse ETL are:
- Sending data to business apps may ensure that all systems have a consistent view of the customer.
- Reverse ETL enables business teams to access the data they require within their apps, from sales to marketing to product.
- Data operationalization is made possible through Reverse ETL.
If you enjoyed this blog post, you might want to check out Airbyte’s blog. You can also join the conversation on our community Slack Channel, participate in discussions on Airbyte’s discourse, or sign up for our newsletter. Furthermore, if you are interested in Airbyte as a fully managed service, you can try Airbyte Cloud for free!