Whether building data pipelines for AI, analytics, or data engineering, there are two very common design mistakes that almost everyone makes at least once.
The first common mistake comes from designing everything as a single “ETL” operation, running Extract, Transform, and Load as a single combined step. The second common mistake is to assume that the best place for processing data is also the best place to host it, omitting the important ‘publish’ step.
To address these challenges, I introduce the “ELTP” architecture: Extract, Load, Transform, and Publish. This simple and scalable model is crafted to prevent these common mistakes. Below, we'll delve into the nuances of ELTP and explore its broad applicability for AI, analytics, and big data.
From ETL to ELT to ELTP - A brief Introduction
In order to understand the benefits of the ELTP architecture, it’s important to first understand some basic context around ETL and ELT.
The original “ETL” approach to data movement combined business logic and data replication into a single operation. This was inherently fragile, because ETL processes would break whenever business logic changed, and any breakage would prevent source data from loading at all. Since many of the transforms required aggregating and joining across datasets, memory errors were frequent and new failure types could arise at any time, as data volumes grew.
To address these issues and others, modern data engineering best practices now advocate for an “ELT” or “EL+T” approach, where the replication (Extract and Load) is performed first, as a standalone process. Then, business logic and transformations are applied only after the raw data is safely landed in a database or data lake which we control. In this model, the Extract-Load Pair, or “EL Pair”, are executed as a single step in the process, with composable Extract and Load connectors working together to replicate data efficiently from source to destination.
While ELT architecture is more stable and more scalable than the ETL model, ELT is still lacking, in that it doesn’t specify a method for delivering data to downstream users and systems - what is sometimes called “Reverse ETL”. To address this gap, we introduce the 'Publish' step, leading to the “EL+T+P” or “ELTP” model. In full, the ELTP model encompasses the following steps:
- Extract data from raw data sources.
- Load data into durable storage, such as a data warehouse or data lake.
- Transform raw data into new datasets.
- Publish data to downstream users and business applications.
This design preserves the benefits of ELT while also giving a framework for ensuring that data is efficiently published to downstream consumers and applications.
Meet the “Publish” Step
The Publish framing is a simple way to describe how transformed data can be delivered to a variety of downstream destinations, including: external SaaS applications, file stores, CRM systems, AI vector stores, and even other databases. Publish steps implement the same protocols and paradigms of other Extract-Load pairs used for raw data collection, and therefore can benefit from the same advantages discussed above for EL pairs.
The biggest difference between the initial Extract-Load and the final Publish is that in the latter case, the EL process is extracting from our data platform and loading to an external system or external publish location.
Note that in this two-stage EL architecture, we can add any number of custom transformations between the initial EL step that loads the raw data and the second EL step that publishes transformed data to downstream applications.
There are several reasons for adding a Publish step to one’s data architecture, namely: sending data files to external systems, decentralizing analytic queries, and publishing to downstream applications and indexes.
Publish Data Files to External Systems
The oldest example of a “publish” operation is to send data files to partner companies, vendors, or regulatory agencies. For example, a Marketing Agency might request a regular upload of your customer data file, which they in turn would process and provide back a new file with added market segmentation data. These publish operations often have externally-defined file formats and upload location requirements which are controlled by the third party.
Here is a simple example, where we publish one file to a partner via S3 and another file to a vendor using SFTP:
Publish to Empower Analytics
While I was at Amazon Video, I built and maintained systems that processed terabytes of data daily. To achieve that kind of scale, my team developed ELT pipelines on AWS Redshift, carefully right-sizing our clusters to efficiently execute all our transformations.
But when it came time to consume the data, we ran into problems. Our system couldn’t handle heavy query traffic, and one runaway query could cause significant performance issues for every other user. Our clusters optimized for heavy-duty processing tasks weren’t necessarily equipped for ad-hoc analysis or concurrency.
The solution was to add a final “publish” operation, where we could deliver data to our users exactly how and where they needed it. We wouldn’t bar users from querying the main cluster, but downstream consumers were incentivized to use Tableau and S3 datasets because those provide better performance and the ability for consumers to right-size their own analytics infrastructure according to their specific requirements. Meanwhile, the central database was better insulated from runaway queries that might otherwise lead to costly outages.
Publish to SaaS Applications
A popular use case for Publish operations are the so-called “Reverse ETL” use cases, where transformed data is Published to one or more business applications or SaaS providers. Popular examples include Publishing data to Salesforce for business operations, Hubspot to keep customer CRM data up-to-date, and Active Directory to keep employee and organizational data up-to-date.
Publish to Vector Stores as AI Long-Term Memory (RAG)
A common method of building AI chatbots is to Publish text data to a specialized indexing system called a “vector store”, which can in turn be used by the chatbot in a process called “RAG” (or “retrieval-augmented generation”). With RAG, the chatbot dynamically retrieves relevant information from a vector store as if retrieving from its long-term “memory”. It uses this information to generate responses that are informed by the data it has retrieved, combining pre-existing knowledge with the context provided by the user's query.
To run this process using ELTP, we follow the same basic steps as before:
1. Extract text from our raw document sources,
2. Load the raw text data to long-term storage,
3. Transform the data by splitting large text into smaller chunks and calculating a numeric representation of each chunk (often called a “vector” or “embedding”)
4. Publish the text data to our vector store indexes, where the chatbot can then make use of this data when answering questions it wouldn’t otherwise be able to answer.
ELTP in Practice
The ELTP framing offers a practical and holistic model for what otherwise might be expressed as “ELT + Reverse ETL” - while addressing additional use cases that traditionally would not be in scope for Reverse ETL. Here we’ll discuss a few advanced use cases and real world design and deployment considerations.
Publish Destinations are more than Reverse ETL
All Reverse ETL destinations are Publish-type destinations, but not all Publish destinations are Reverse ETL. The term “Reverse ETL” can still be useful when communicating the unique challenges of publishing to a SaaS system like Salesforce and Hubspot, but the term falls down when describing a publish flow to a partner’s SFTP site or to an external database. Whether publishing to Salesforce, SFTP, S3, or SharePoint, the Publish framing offers a more robust and holistic architectural model that is not tied to the “pickiness” of API-type destinations.
Plan for ELTP, but Don’t Fear ETL
Sometimes all that’s needed is a good proof of concept. For example, in AI and LLM applications, tutorials will often walk users through this now-classic data processing flow for RAG applications: Extract text from documents like PDFs or web sites, Transform into embedded document chunks, and directly Load to the vector store. Tutorials covering this approach can be found here and here, for instance.
As a tutorial or a proof of concept (POC) data flow, there’s nothing wrong with this ETL-like flow:
As shown in the diagram above, the decision to skip the database Load is very convenient and improves accessibility. The author has less surface area to maintain and doesn’t need to worry about introducing durable long-term storage into the architecture. This reduces the likelihood that someone following along will be blocked by not having access to the same or similar options for the long-term storage layer.
However, if and when our POC is successful, we will very soon want to migrate to the ELTP architecture, loading our raw data into long-term storage and then publishing to the vector store as a separate step. Here is a diagram of the same solution using ELTP:
Adding a database for long-term storage decouples the Extract/Load step from the Publish operation, and gives us the ability to add any number of transformations in between the initial EL operation and the Publish operation - for instance dbt models or custom Python scripts to clean the data and to apply business logic.
Our speed of iteration is also greatly improved with this design. When we are ready to experiment with different embedding models and tune the chunking parameters, we already have the raw text extracted and ready to go. We can even run side-by-side experiments comparing different transformation and publication options, without re-extracting the raw data and without making any changes to the “production” data flow.
The ELTP model allows us to scale more effectively as well: from tens of documents to thousands or even millions of documents. Each step in the process can run on hardware specialized for that step, and all of the replication steps (EL and Publish) can operate incrementally: extracting only new or updated documents when sending data downstream. As business logic evolves and as bugs are identified, we can rerun or restate our transformations without needing to re-extract the text from the upstream data source.
The upshot of all these ELTP benefits is that ELTP allows more opportunities for iteration, while also providing scaling and performance benefits. While perhaps counter-intuitive, adding the durable storage layer actually reduces our total cost of ownership (TCO) versus running the ETL or “POC” model which performs Extract-Transform-Load all in one step.
ELTP for AI Use Cases
With the rapid growth in AI applications, many people now find themselves entering the world of data engineering for the first time, specifically for the purpose of building an AI application. The ELTP model provides these new data practitioners an opportunity to learn from data engineering mistakes of the past, while not having to reinvent the wheel.
Vector Store Publish Destinations
Airbyte is an open source data movement tool which can be used for both “EL” and “Publish” steps in the ELTP architecture. Airbyte currently supports several vector store destinations, including Pinecone, Milvus, Chroma, and Weaviate. These can be used to directly load data using an ETL or ELTP architecture - according to your own preferences and requirements. All of these vector store destinations provide access to inline text splitting and vector embeddings using a declarative, configuration-based approach.
Text Document Source Connectors (New!)
Whereas traditional data pipelines prioritized numeric and tabular data, AI and LLM data flows greatly benefit from text data in “unstructured” document formats like PDF, Word, Markdown, and Google Docs.
This week, Airbyte is also launching source connectors to extract “text-as-data” from those unstructured documents and others. Text documents are a powerful new class of data source for data pipelines, and the new text document connectors enable the same ELTP, ELT, and ETL architectures across all of your data, whether for AI or for analytics.
For more information, please see the companion blog post announcement.
Simplifying ELTP Best Practices with Open Source
Modern data movement tools such as Airbyte allow you to implement robust ELTP data pipelines easily and quickly. Airbyte offers a low-code UI, an API and a Terraform provider for those who prefer to build pipelines programmatically. Airbyte being an open source tool carries the additional benefit of letting users build and modify connectors if the source or destination they want to reach is not yet officially supported.
With the ability to create your own open source connectors when needed, the same tooling and design techniques can be applied to almost any imaginable data pipeline
Airbyte aims to provide data teams with an open-source standard in data movement. We’re currently investing heavily in Publish connectors for AI and LLM use cases. In 2024, we will expand to cover additional Reverse ETL and ELTP workflows and use cases, to simplify and commoditize data movement in any direction.
While the ELT model is a great improvement over ETL, it doesn’t solve for the last mile of data delivery. ELTP is a model that handles the heavy lifting while also providing a real solution for product delivery.
Whether you are building the next great AI application or a more traditional data analytics stack, you can set yourself up for success by understanding the ELTP model and designing your architecture with these best practices in mind.
Interested in more data content? Join us for the upcoming ELTP webinar for AI or check out our blog, where we publish articles in-house and from the community on data engineering, databases, AI infrastructure and more.