Article

An overview of Airbyte’s replication modes

•

October 7, 2022

•

10 minutes

Airbyte supports several different data replication modes, and it may not always be immediately obvious which one to use. Once you have finished this article, you will be able to choose the best replication option for your use case!

This article will first guide you through background material that will provide a basis for understanding Airbyte replication. Then you will find an overview of the available replication modes, and a discussion of the benefits and trade-offs of each one. As you read this article, you will find answers to the following questions:

What is Airbyte’s ELT approach to data integration?
Why is ELT preferred over ETL?
What is a cursor?
What is a primary key used for?
What is the difference between full refresh replication and incremental sync replication?
What does it mean to append data rather than overwrite it in the destination?
How is incremental sync with deduped history different from incremental sync with append?
What are the advantages of log-based change data capture (CDC) replication versus standard replication?
Which replication mode should you choose?

Background

Before getting into a detailed comparison between the different replication modes, it is helpful to discuss relevant background information including: Airbyte’s ELT (extract, load, transform) approach to data replication; ELT versus ETL (extract, transform, load); how Airbyte periodically executes sync runs; what a cursor is; and why a primary key may be required.

An overview of Airbyte’s approach to data replication

ELT (extract, load, transform) and ETL (extract, transform, load) are the two most popular data integration techniques that can be used to move data into a destination such as a database, a data warehouse, a data lake, or a data lakehouse. In these data integration techniques, the transform (T) step performs operations such as cleansing, normalizing, structuring, sorting, joining, or enriching the data.

Airbyte is an advocate of the ELT data integration approach, which is a method that extracts (E) data from a source system, and loads (L) raw data into a destination system before it transforms (T) the data. In other words, in the ELT approach the transformation (T) of the data is done at the destination after data has been loaded. The raw data that contains the data from a source record is stored in the destination as a JSON blob.

Airbyte leverages dbt (data build tool) to manage and create SQL code that is used for transforming raw data in the destination. This step is sometimes referred to as normalization. A high level view of the data processing flow is given in the following illustration:

It is worth noting that the above illustration displays a core tenet of ELT philosophy, which is that data should be untouched as it moves through the extracting and loading stages so that the raw data is always accessible in the destination. Because an unmodified version of the data exists in the destination, it can be re-transformed in the future without the need for a resync of data from source systems.

Why ELT is preferred over ETL

In the traditional ETL (extract, transform, load) approach, data is transformed before being sent to the destination, as demonstrated in the following image:

ETL has several disadvantages compared to ELT, including the following:

Generally, only transformed data is stored in the destination system, and so analysts must know beforehand every way they are going to use the data, and every report they are going to produce.
Modifications to requirements can be costly, and often require re-ingesting data from source systems.
Every transformation that is performed on the data may obscure some of the underlying information, and analysts only see what was kept during the transformation phase.
Building an ETL-based data pipeline is often beyond the technical capabilities of analysts.

Historically ETL was once preferred over ELT for the following no-longer-valid reasons:

ETL could achieve cost savings by removing unwanted data before sending it to the destination – however, with the plummeting cost of cloud-based computation and storage the value of this proposition is greatly reduced.
Because ETL transforms data before it is stored, it avoids the complexity of transforming data after sending it to the destination – however, new data replication tools such as dbt (data build tool) make it preferable and easy to transform data in the destination.

Airbyte advocates the ELT approach and this is the approach that is discussed in the remainder of this article.

What is a sync run

Airbyte replication can be thought of as a loop which periodically requests records from a data source and sends them to a destination. Each iteration of this loop is referred to as a sync run, which is discussed in more detail in How we scale workflow orchestration with Temporal.

What is a cursor

In the context of incremental replication, a cursor can be thought of as a pointer into a source data set which is used to keep track of the most recent record that has been sent from that source to a destination. It is used to ensure that already-sent records will not be sent again in the future.

In each sync run a query is executed on the source to select records to replicate, and this query is constructed to include the cursor such that only records that are newer than the cursor (i.e. not previously replicated) are returned.

In the case of CDC replication a cursor does not need to be specified because timestamps and/or sequence numbers in the transaction log, which is used as the source for CDC replication, serve the same purpose.

Cursors are discussed in more detail in the tutorial: Explore Airbyte’s incremental data synchronization.

What is a primary key used for

A primary key uniquely identifies each record in the source data table. In some replication modes this is required, and is used to ensure that multiple updates to a single record in the source system will be correctly applied to a single record in the destination system. Furthermore, if incremental change data capture (CDC) replication is used, then a primary key must be defined in the source. A more detailed discussion about how a primary key is used can be found in the tutorial: Explore Airbyte’s incremental data synchronization.

Full refresh sync vs incremental sync

The main difference between full refresh replication and incremental sync replication is which records are read from the source and sent to the destination in each sync run:

Full refresh replication	Incremental sync replication
The entire data set will be retrieved from the source and sent to the destination on each sync run.	Only records that have been inserted or updated in the source system since the previous sync run are sent to the destination.

The image below compares full refresh replication versus incremental sync replication – this diagram represents records that are inserted at various times: t1, t2, and t3 with each insert followed by a sync run. Newly inserted records are annotated on the source with a red box, and records that are sent to the destination are annotated with a bright green box.

Full refresh replication modes

Full refresh replication supports two modes: full refresh - overwrite and full refresh - append. These two modes are the same in terms of selecting records for replication from the source. The difference between these two modes is in how the data is written into the destination:

Full refresh - overwrite	Full refresh - append
Raw and final destination tables are overwritten with the full data set in each sync run.	The entire data set is appended to the raw and final destination tables in each sync run.

The image below demonstrates the differences between full refresh - overwrite and full refresh - append replication. As with the previous illustration, this diagram represents records that are inserted at various times: t1, t2, and t3, with each insert followed by a sync run. New records are annotated on the source with with a red box, records that are sent to the destination in a given sync run are annotated with bright green box, and (in the case of full refresh - append) records that were sent in a previous sync are annotated in the destination with a gray box.

For a detailed analysis of full refresh replication, you may be interested in reading the tutorial: Explore Airbyte's full refresh data synchronization.

A comparison of incremental sync replication modes

Incremental sync replication supports two modes: incremental sync - append and incremental sync - deduped history. Extraction of records from the source is the same for these two modes – only records that have been inserted or updated since the previous sync run are transmitted to the destination. The difference between these two modes is how the data is written into the destination:

Incremental sync - append

Incremental sync - deduped history

Data that has been retrieved from the source is appended to the raw and final destination tables in each sync run.

Because incremental sync retrieves both newly inserted as well as updated records on each sync run, if there is an update made to a record in the source between sync runs, there will be multiple copies of that record in the destination.*

* This is demonstrated in the image below at time=t3.

In addition to the raw and final destination tables, this mode creates an intermediate history (SCD) table.

The history table contains a copy of each record that has been sent to the destination by Airbyte. This table serves as the input source for computing the final (deduplicated) table.

As opposed to incremental sync - append replication, the final (deduplicated) table contains a single copy of each record that exists in the source. *

* This is demonstrated in the image below at time=t3.

The image below demonstrates the differences between incremental sync - append and incremental sync - deduped history replication. This diagram represents records that are inserted at time t1 and t2, followed by an updated to an existing record (with id=2) at t3, and a deletion of a record (id=1) at t4. New, modified, and deleted records are annotated on the source with a red box, and records that are sent to the destination in a given sync run are annotated with bright green box.

The DELETE at time=t4 in the above image highlights a known limitation of incremental replication – delete operations are not correctly transmitted. This is because incremental replication periodically executes queries on a source system for new or updated records, and then transmits the results of these queries to the destination. However, because a query cannot return deleted records, standard incremental replication does not correctly transmit deletions from the source to the destination.This is one reason why Change Data Capture (CDC) synchronization is sometimes preferred. Alternatively soft deletes may be considered.

For a detailed analysis of incremental sync replication, you may be interested in reading the tutorial: Explore Airbyte’s incremental data synchronization.

How is CDC different from other replication modes?

CDC replication can be used in conjunction with incremental replication. The main difference for CDC incremental replication versus standard incremental replication is in how modifications on the source database are detected and transmitted to the destination database.

Non-CDC (i.e. standard) incremental replication periodically executes queries on a source system for new or updated records, and then transmits the results of these queries to the destination. However, because a query cannot return deleted records, standard incremental replication does not correctly transmit deletions from the source to the destination. Additionally, the records that are transmitted in each sync only represent the state of the source database at the moment the sync is executed – any intermediate changes (such as multiple updates to a single record) will not be correctly captured. Furthermore, because each incremental sync executes queries against the source to detect modifications, it requires the source data to have a suitable cursor field such as updated_at, which is used to keep track of which records have already been replicated.

On the other hand, CDC incremental replication reads a log of the changes that have been made to the source database and transmits these changes to the destination. Because changes are read from a transaction log (referred to as a write ahead log or WAL in Postgres) when using CDC, it is not necessary for the source data to have a suitable cursor field. Additionally, intermediate changes and deletions on the source are correctly transmitted to the destination because they are logged just like any other modification to the source data.

To support CDC, Airbyte uses Debezium internally. A high-level example of CDC replication from a Postgres source is shown in the image below.

The differences between CDC incremental replication and non-CDC incremental replication are summarized in the table below:

	CDC incremental sync	Standard incremental sync
Document retrieval process	Transaction log / Write-ahead log	SELECT statements
Cursor field required	No	Yes
Supports deletion	Yes	No
Transmission of intermediate states	Yes	No
Primary key required	Yes	Yes - for incremental deduped history. No for incremental append.

‍

ℹ️ CDC replication can also be used in conjunction with full refresh replication. However, as mentioned in the CDC documentation, full refresh replication mode is done with the same process both with and without CDC, and so it does not merit additional discussion.

Change data capture (CDC) incremental replication

Change data capture (CDC) can be used in conjunction with incremental sync - append and incremental sync - deduped history. A log of changes (the transaction log) on the source is read to determine what changes to transmit to the destination, and for efficiency each change is only transmitted once. Reading records is the same for these two modes – the difference between them is how data is stored in the destination:

CDC - incremental sync - append

CDC - incremental sync - deduped history

New data that has been retrieved from the source is appended to the raw and final destination tables in each sync run.

Because every change on the source is written into the transaction log, each change will be reflected in the raw and final destination table. This includes intermediate states (such as multiple updates to a single document) and deletions.

In addition to the raw and final destination tables, this mode creates an intermediate history (SCD) table.

The history table contains a copy of all changes (i.e. insert, update, or delete) that have been sent to the destination by Airbyte. This table serves as the input source for computing the final (deduplicated) table.

As opposed to CDC - incremental sync - append replication, the final deduplicated (deduped) table contains a single copy of each record that exists in the source

Additionally, as shown in the image below where time=t4, a record that is deleted on the source is correctly removed from the final deduplicated table on the destination.

The image below demonstrates the differences between CDC + incremental sync - append and CDC + incremental sync - deduped history replication.This diagram represents records that are inserted at time t1 and t2, followed by an updated to an existing record (with id=2) at t3, and a deletion of a record (id=1) at t4.

ℹ️ When comparing to the standard (non-CDC) incremental sync that was previously discussed, pay special attention to the handling of the DELETE operation. In the CDC case, the DELETE is detected and transmitted to the destination.

In addition to correctly transmitting DELETEs as shown above, CDC replication also correctly transmits intermediate states (not demonstrated in this article), such as a single record being updated multiple times between sync runs. For a detailed analysis of CDC replication, you may be interested in reading the tutorial: Explore Airbyte's Change Data Capture (CDC) synchronization.

Which replication mode to choose

The replication mode that you choose to use will depend on your requirements. Below is a table that compares the various modes without CDC (left) and with CDC (right):

	Standard (Non-CDC)	CDC
Full refresh - overwrite	✅ Deletes are handled ✅ Good for small data volume ✅ Final data does not have duplicates	✅ Same as Standard (Non-CDC)
Full refresh - overwrite	⚠️ No history is stored ⚠️ Inefficient for large data ⚠️ Inefficient for high frequency syncs	⚠️ Same as Standard (Non-CDC)
Full refresh - append	✅ Stores history ✅ Good for small data volume	✅ Same as Standard (Non-CDC)
Full refresh - append	⚠️ Storage may explode ⚠️ Inefficient for large data ⚠️ Inefficient for high frequency syncs ⚠️ Final data will contain duplicates	⚠️ Same as Standard (Non-CDC)
incremental - append	✅ Efficient for high frequency syncs ✅ Stores history ✅ Primary key not required	✅ Efficient for high frequency syncs ✅ Stores history ✅ Deletes are transmitted and logged ✅ Intermediate states captured ✅ No cursor field is required
incremental - append	⚠️ A suitable cursor must be available ⚠️ Final data may contain duplicates ⚠️ Deletes are not detected ⚠️ Intermediate states not captured	⚠️ Primary key required ⚠️ Final data may contain duplicates ⚠️ Not available for all sources ⚠️ Requires access to the transaction log
Incremental - deduped history	✅ Efficient for high frequency syncs ✅ Stores history ✅ De-duplicates final data	✅ Efficient for high frequency syncs ✅ Stores history ✅ Deletes are handled ✅ Intermediate states captured ✅ De-duplicates final data ✅ No cursor field is required
Incremental - deduped history	⚠️ Primary key required ⚠️ A suitable cursor must be available ⚠️ Deletes are not detected ⚠️ Intermediate states not captured	⚠️ Primary key required ⚠️ Not available for all sources ⚠️ Requires access to the transaction log

Conclusion

In this article you learned about Airbyte’s approach to ELT followed by the main differences between different replication modes. If you are interested in further exploration of how Airbyte-replicated data looks as well as an overview of the SQL that is used for transforming/normalizing data, you may be interested in consulting the tutorial series on Airbyte’s synchronization modes, which includes the following articles:

You may also be interested in other articles on Airbyte’s blog, or in Airbyte tutorials. You can also join the conversation on our community Slack Channel, participate in discussions on Airbyte’s discourse, or sign up for our newsletter. Furthermore, if you are interested in Airbyte as a fully managed service, you can try Airbyte Cloud.

Limitless data movement with free Alpha and Beta connectors

Introducing: our Free Connector Program ->

The data movement infrastructure for the modern data teams.

Try a 14-day free trial

About the Author

Alex is a Data Engineer and Technical writer at Airbyte, and was formerly employed as a Product Manager and as a Principal Consulting Architect at Elastic. Alex has an MBA from the IESE Business School in Barcelona, and a Masters degree in Electrical Engineering from the University of Toronto. Alex has a personal blog at https://alexmarquardt.com.