Airbyte supports several different data replication modes, and it may not always be immediately obvious which one to use. Once you have finished this article, you will be able to choose the best replication option for your use case!
This article will first guide you through background material that will provide a basis for understanding Airbyte replication. Then you will find an overview of the available replication modes, and a discussion of the benefits and trade-offs of each one. As you read this article, you will find answers to the following questions:
- What is Airbyte’s ELT approach to data integration?
- Why is ELT preferred over ETL?
- What is a cursor?
- What is a primary key used for?
- What is the difference between full refresh replication and incremental sync replication?
- What does it mean to append data rather than overwrite it in the destination?
- How is incremental sync with deduped history different from incremental sync with append?
- What are the advantages of log-based change data capture (CDC) replication versus standard replication?
- Which replication mode should you choose?
Before getting into a detailed comparison between the different replication modes, it is helpful to discuss relevant background information including: Airbyte’s ELT (extract, load, transform) approach to data replication; ELT versus ETL (extract, transform, load); how Airbyte periodically executes sync runs; what a cursor is; and why a primary key may be required.
An overview of Airbyte’s approach to data replication
ELT (extract, load, transform) and ETL (extract, transform, load) are the two most popular data integration techniques that can be used to move data into a destination such as a database, a data warehouse, a data lake, or a data lakehouse. In these data integration techniques, the transform (T) step performs operations such as cleansing, normalizing, structuring, sorting, joining, or enriching the data.
Airbyte is an advocate of the ELT data integration approach, which is a method that extracts (E) data from a source system, and loads (L) raw data into a destination system before it transforms (T) the data. In other words, in the ELT approach the transformation (T) of the data is done at the destination after data has been loaded. The raw data that contains the data from a source record is stored in the destination as a JSON blob.
Airbyte leverages dbt (data build tool) to manage and create SQL code that is used for transforming raw data in the destination. This step is sometimes referred to as normalization. A high level view of the data processing flow is given in the following illustration:
It is worth noting that the above illustration displays a core tenet of ELT philosophy, which is that data should be untouched as it moves through the extracting and loading stages so that the raw data is always accessible in the destination. Because an unmodified version of the data exists in the destination, it can be re-transformed in the future without the need for a resync of data from source systems.
Why ELT is preferred over ETL
In the traditional ETL (extract, transform, load) approach, data is transformed before being sent to the destination, as demonstrated in the following image:
ETL has several disadvantages compared to ELT, including the following:
- Generally, only transformed data is stored in the destination system, and so analysts must know beforehand every way they are going to use the data, and every report they are going to produce.
- Modifications to requirements can be costly, and often require re-ingesting data from source systems.
- Every transformation that is performed on the data may obscure some of the underlying information, and analysts only see what was kept during the transformation phase.
- Building an ETL-based data pipeline is often beyond the technical capabilities of analysts.
Historically ETL was once preferred over ELT for the following no-longer-valid reasons:
- ETL could achieve cost savings by removing unwanted data before sending it to the destination – however, with the plummeting cost of cloud-based computation and storage the value of this proposition is greatly reduced.
- Because ETL transforms data before it is stored, it avoids the complexity of transforming data after sending it to the destination – however, new tools such as dbt (data build tool) make it preferable and easy to transform data in the destination.
Airbyte advocates the ELT approach and this is the approach that is discussed in the remainder of this article.
What is a sync run
Airbyte replication can be thought of as a loop which periodically requests records from a data source and sends them to a destination. Each iteration of this loop is referred to as a sync run, which is discussed in more detail in How we scale workflow orchestration with Temporal.
What is a cursor
In the context of incremental replication, a cursor can be thought of as a pointer into a source data set which is used to keep track of the most recent record that has been sent from that source to a destination. It is used to ensure that already-sent records will not be sent again in the future.
In each sync run a query is executed on the source to select records to replicate, and this query is constructed to include the cursor such that only records that are newer than the cursor (i.e. not previously replicated) are returned.
In the case of CDC replication a cursor does not need to be specified because timestamps and/or sequence numbers in the transaction log, which is used as the source for CDC replication, serve the same purpose.
Cursors are discussed in more detail in the tutorial: Explore Airbyte’s incremental data synchronization.
What is a primary key used for
A primary key uniquely identifies each record in the source data table. In some replication modes this is required, and is used to ensure that multiple updates to a single record in the source system will be correctly applied to a single record in the destination system. Furthermore, if incremental change data capture (CDC) replication is used, then a primary key must be defined in the source. A more detailed discussion about how a primary key is used can be found in the tutorial: Explore Airbyte’s incremental data synchronization.
Full refresh sync vs incremental sync
The main difference between full refresh replication and incremental sync replication is which records are read from the source and sent to the destination in each sync run:
The image below compares full refresh replication versus incremental sync replication – this diagram represents records that are inserted at various times: t1, t2, and t3 with each insert followed by a sync run. Newly inserted records are annotated on the source with a red box, and records that are sent to the destination are annotated with a bright green box.
Full refresh replication modes
Full refresh replication supports two modes: full refresh - overwrite and full refresh - append. These two modes are the same in terms of selecting records for replication from the source. The difference between these two modes is in how the data is written into the destination:
The image below demonstrates the differences between full refresh - overwrite and full refresh - append replication. As with the previous illustration, this diagram represents records that are inserted at various times: t1, t2, and t3, with each insert followed by a sync run. New records are annotated on the source with with a red box, records that are sent to the destination in a given sync run are annotated with bright green box, and (in the case of full refresh - append) records that were sent in a previous sync are annotated in the destination with a gray box.
For a detailed analysis of full refresh replication, you may be interested in reading the tutorial: Explore Airbyte's full refresh data synchronization.
A comparison of incremental sync replication modes
Incremental sync replication supports two modes: incremental sync - append and incremental sync - deduped history. Extraction of records from the source is the same for these two modes – only records that have been inserted or updated since the previous sync run are transmitted to the destination. The difference between these two modes is how the data is written into the destination:
The image below demonstrates the differences between incremental sync - append and incremental sync - deduped history replication. This diagram represents records that are inserted at time t1 and t2, followed by an updated to an existing record (with id=2) at t3, and a deletion of a record (id=1) at t4. New, modified, and deleted records are annotated on the source with a red box, and records that are sent to the destination in a given sync run are annotated with bright green box.
The DELETE at time=t4 in the above image highlights a known limitation of incremental replication – delete operations are not correctly transmitted. This is because incremental replication periodically executes queries on a source system for new or updated records, and then transmits the results of these queries to the destination. However, because a query cannot return deleted records, standard incremental replication does not correctly transmit deletions from the source to the destination.This is one reason why Change Data Capture (CDC) synchronization is sometimes preferred. Alternatively soft deletes may be considered.
For a detailed analysis of incremental sync replication, you may be interested in reading the tutorial: Explore Airbyte’s incremental data synchronization.
How is CDC different from other replication modes?
CDC replication can be used in conjunction with incremental replication. The main difference for CDC incremental replication versus standard incremental replication is in how modifications on the source database are detected and transmitted to the destination database.
Non-CDC (i.e. standard) incremental replication periodically executes queries on a source system for new or updated records, and then transmits the results of these queries to the destination. However, because a query cannot return deleted records, standard incremental replication does not correctly transmit deletions from the source to the destination. Additionally, the records that are transmitted in each sync only represent the state of the source database at the moment the sync is executed – any intermediate changes (such as multiple updates to a single record) will not be correctly captured. Furthermore, because each incremental sync executes queries against the source to detect modifications, it requires the source data to have a suitable cursor field such as updated_at, which is used to keep track of which records have already been replicated.
On the other hand, CDC incremental replication reads a log of the changes that have been made to the source database and transmits these changes to the destination. Because changes are read from a transaction log (referred to as a write ahead log or WAL in Postgres) when using CDC, it is not necessary for the source data to have a suitable cursor field. Additionally, intermediate changes and deletions on the source are correctly transmitted to the destination because they are logged just like any other modification to the source data.
To support CDC, Airbyte uses Debezium internally. A high-level example of CDC replication from a Postgres source is shown in the image below.
The differences between CDC incremental replication and non-CDC incremental replication are summarized in the table below:
ℹ️ CDC replication can also be used in conjunction with full refresh replication. However, as mentioned in the CDC documentation, full refresh replication mode is done with the same process both with and without CDC, and so it does not merit additional discussion.
Change data capture (CDC) incremental replication
Change data capture (CDC) can be used in conjunction with incremental sync - append and incremental sync - deduped history. A log of changes (the transaction log) on the source is read to determine what changes to transmit to the destination, and for efficiency each change is only transmitted once. Reading records is the same for these two modes – the difference between them is how data is stored in the destination:
The image below demonstrates the differences between CDC + incremental sync - append and CDC + incremental sync - deduped history replication.This diagram represents records that are inserted at time t1 and t2, followed by an updated to an existing record (with id=2) at t3, and a deletion of a record (id=1) at t4.
ℹ️ When comparing to the standard (non-CDC) incremental sync that was previously discussed, pay special attention to the handling of the DELETE operation. In the CDC case, the DELETE is detected and transmitted to the destination.
In addition to correctly transmitting DELETEs as shown above, CDC replication also correctly transmits intermediate states (not demonstrated in this article), such as a single record being updated multiple times between sync runs. For a detailed analysis of CDC replication, you may be interested in reading the tutorial: Explore Airbyte's Change Data Capture (CDC) synchronization.
Which replication mode to choose
The replication mode that you choose to use will depend on your requirements. Below is a table that compares the various modes without CDC (left) and with CDC (right):
In this article you learned about Airbyte’s approach to ELT followed by the main differences between different replication modes. If you are interested in further exploration of how Airbyte-replicated data looks as well as an overview of the SQL that is used for transforming/normalizing data, you may be interested in consulting the tutorial series on Airbyte’s synchronization modes, which includes the following articles:
- Explore Airbyte's full refresh data synchronization
- Explore Airbyte’s incremental data synchronization
- Explore Airbyte's Change Data Capture (CDC) synchronization
You may also be interested in other articles on Airbyte’s blog, or in Airbyte tutorials. You can also join the conversation on our community Slack Channel, participate in discussions on Airbyte’s discourse, or sign up for our newsletter. Furthermore, if you are interested in Airbyte as a fully managed service, you can try Airbyte Cloud.