Learn the inner workings of Airbyte’s full refresh overwrite and full refresh append synchronization modes.
Your organization likely collects large amounts of data in various systems, and you are probably already aware that accessing and analyzing data that is spread across multiple systems can be a challenge. To address this challenge, data integration can be used to create a unified view of your organization's data. In order for your data integration strategy to provide the maximum value, you need to have a consistent and up-to-date unified view of your data, which means that data from various systems needs to be in sync.
Data synchronization involves replicating data from a source system into a destination system so that the destination always has an up-to-date copy of the source data. Data consistency is achieved by continually synchronizing modifications in the source system to the destination system. Data replication tools copy and distribute data to different locations, providing redundancy and availability, and play an important role in maintaining data integrity and enabling efficient data management.
Data synchronization can be done in several ways, including full replication, incremental replication, and change data capture (CDC) replication.
This tutorial is the first in a three-part series exploring Airbyte synchronization modes. The purpose of this series is to provide you with step-by-step instructions so that you can see for yourself how Airbyte-replicated data looks for different synchronization modes, and to help you understand how each synchronization mode functions behind the scenes.
This tutorial focuses on Airbyte’s full refresh synchronization mode, and is followed by these related tutorials:
In this tutorial you will:
ℹ️ If you are not sure which replication mode is the most appropriate for your use case, you may be interested in reading An overview of Airbyte's replication modes before continuing with this tutorial.
This tutorial was written in July 2022, and due to the fast-changing nature of Airbyte it may become less relevant at some point in the future. The following tools/versions were used:
Airbyte is an ELT (extract, load, transform) tool as opposed to an ETL (extract, transform, load) tool. This means that raw data is first loaded into a destination and then after the data has been written, it goes through a transformation stage.
In Airbyte, data that is extracted from a single record on the source system is sent to the destination as a single JSON blob, where it remains unmodified. This conforms to a core tenet of the ELT philosophy, which is that data should be untouched as it moves through the extracting and loading stages so that raw data is always accessible – if an unmodified version of the data exists in the destination, it can be re-transformed in the future without the need for a resync of data from source systems.
As a more concrete example, if your source data is in a table that looks as follows:
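For example, consider a hypothetical two-row table with the same shape as the full_refresh_demo table used later in this tutorial (the values are illustrative):

```
 id |  name
----+--------
  1 | Mary X
  2 | John D
```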
Then Airbyte stores the following JSON objects in the destination:
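Assuming the sample rows above, the raw records stored in the destination would look something like the following, one JSON blob per source record:

```json
{ "id": 1, "name": "Mary X" }
{ "id": 2, "name": "John D" }
```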
For destinations that support normalization, SQL is then executed to transform the raw JSON back into a relational table. Internally Airbyte leverages dbt to create the SQL used for normalizing and/or transforming data. The data synchronization and normalization process is shown below:
Full refresh synchronization always retrieves all available data requested from the source and writes it in the destination, regardless of whether the data has previously been synced or not. This contrasts with incremental sync modes, which only replicate data that has been inserted or updated since the previous sync. Full refresh synchronization has two variants, overwrite and append, and it may alternatively be referred to as full load data synchronization.
In the full refresh | overwrite data synchronization mode, new syncs will overwrite all data in the destination table with a new copy of the source data. One of the big advantages of this approach is that data that has been deleted from the source will also be deleted in the destination. This is in contrast to incremental sync modes, for which this is not always the case.
In the full refresh | append variant, every new sync appends the entire source data set to the data from previous syncs. This may be useful for historical report generation or for regulatory compliance.
ℹ️ Because every full refresh sync operation copies the entire source data set, this approach may be infeasible for large amounts of data, and other replication modes (such as incremental sync) are likely to be more appropriate. For example, if the source data set contains 10MB of data, and the replication frequency is every 5 minutes, then in one year there will have been 105120 replication syncs. This means that there will be 105120 full loads of data from the source to the destination, which means transmitting 10MB * 105120 = 1051200 MB or approximately 1TB of data. Furthermore, if you are making use of full refresh | append synchronization, then you would also end up with 1TB of data stored in your destination, even if the source data only contains 10MB.
Now that I have discussed the theory, you are ready to see how full load ELT data synchronization looks in practice!
To keep the focus of this article on the exploration and discussion of replication modes rather than on setup, Postgres is used as both the source and destination. Other supported databases or application APIs would follow a similar methodology.
The setup of the source and destination that I present below is similar to the setup presented in Replicate Data Between PostgreSQL Databases, with minor differences corresponding to loading of simple data that is used for this particular tutorial.
Start a source Postgres container running at port 2000 on localhost as follows:
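A command along these lines does the job (the container name and password are illustrative, mirroring the setup in the linked Postgres replication tutorial):

```bash
docker run --rm --name airbyte-source -e POSTGRES_PASSWORD=password -p 2000:5432 -d postgres
```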
Start a destination Postgres container running at port 3000 on localhost as follows:
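And similarly for the destination (again, the container name is illustrative):

```bash
docker run --rm --name airbyte-destination -e POSTGRES_PASSWORD=password -p 3000:5432 -d postgres
```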
In this section I demonstrate the steps for logging into the source Postgres database and populating it with simple data that will be used for demonstrating the various replication modes.
ℹ️ Note that I create a table with just a few columns, which should be sufficient to understand the main points of this article. A real dataset would likely have many more columns.
You can open a shell on the source container and login to the source Postgres database with the following commands:
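Assuming the source container name used above, the following should drop you into a psql session:

```bash
docker exec -it airbyte-source psql --username=postgres
```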
You should now see a postgres=# prompt, which indicates that you are inside the Postgres terminal.
From the Postgres terminal, create a simple table called full_refresh_demo as follows:
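A minimal two-column table is sufficient for this tutorial (the column types are a reasonable choice, not a requirement):

```sql
CREATE TABLE full_refresh_demo(
  id integer PRIMARY KEY,
  name VARCHAR(200)
);
```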
You can populate the full_refresh_demo table with some data as follows:
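For example, insert two records (any small set of values will do; these match the hypothetical rows shown earlier):

```sql
INSERT INTO full_refresh_demo(id, name) VALUES(1, 'Mary X');
INSERT INTO full_refresh_demo(id, name) VALUES(2, 'John D');
```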
Have a look at the new table by executing:
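```sql
SELECT * FROM full_refresh_demo;
```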
Which should respond with a table that looks as follows:
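With the sample values inserted above, the output would be:

```
 id |  name
----+--------
  1 | Mary X
  2 | John D
(2 rows)
```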
Now that you have data in your source database, you are ready to set up a source connector in Airbyte!
An Airbyte source will be used in your Airbyte connection(s) to retrieve data from the source database that you have just populated. Create a Postgres source called Postgres-source, as shown below:

To drive data into the Postgres database that you created earlier, you will also need to define a destination. Configure an Airbyte destination called Postgres-destination as follows:
In this section you will create an Airbyte connection that reads data from the Postgres source and drives it into the Postgres destination using the full refresh | overwrite replication mode. In this mode, all data in the source database is sent to the destination database, regardless of whether it has been synced before or not. You can see this for yourself by following along the steps presented below.
Create a new connection that will be used for verifying the functionality of full refresh | overwrite synchronization. Select the Postgres-source source, which you previously defined, as the source for this connection.
And select Postgres-destination, which you previously defined, as the destination for this connection.
You will see a set up page as shown below. Set the name of the connection to full-refresh-overwrite, and the destination stream prefix as overwrite_ as shown below.
After you set up the connection, you should see that a first sync has automatically started. Once it completes, you should see a message indicating that two records have been emitted as follows:
Now that the first sync has completed you can take a look at the Postgres destination to see how the replicated data looks. Execute the following command to open a shell to the destination Postgres database:
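Assuming the destination container name used earlier:

```bash
docker exec -it airbyte-destination psql --username=postgres
```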
You can view the tables in the destination Postgres database by executing the following command from the Postgres shell that you have just opened:
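```
\dt
```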
Which should respond with the following:
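You should see the raw table and its normalized counterpart (the exact formatting may differ slightly):

```
                         List of relations
 Schema |                   Name                    | Type  |  Owner
--------+-------------------------------------------+-------+----------
 public | _airbyte_raw_overwrite_full_refresh_demo  | table | postgres
 public | overwrite_full_refresh_demo               | table | postgres
(2 rows)
```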
ℹ️ Notice that there are two tables. As discussed earlier, Airbyte converts each source record into a JSON blob that contains all of your data, and writes it into the _airbyte_raw_overwrite_full_refresh_demo table. This is then normalized into a separate table.
Have a look at the _airbyte_raw_overwrite_full_refresh_demo table by executing the following command:
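```sql
SELECT * FROM _airbyte_raw_overwrite_full_refresh_demo;
```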
Which should respond with a table similar to the following:
In addition to the _airbyte_data field containing the source data, there are two additional fields:
- _airbyte_ab_id: a UUID that Airbyte assigns to each record as it is sent to the destination.
- _airbyte_emitted_at: a timestamp indicating when the record was extracted from the source.
If you have Basic Normalization enabled, Airbyte will have created a normalized table by extracting data from the JSON object. You can view the normalized data by executing the following command:
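```sql
SELECT * FROM overwrite_full_refresh_demo;
```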
Which should respond with a table similar to the following:
You can see that this normalized table contains columns corresponding to each of the columns in the source table, as expected (id and name). In addition to the Airbyte-created fields that appeared in the raw data table, the normalized table also has two more Airbyte-generated columns:
- _airbyte_normalized_at: a timestamp indicating when normalization was executed on the record.
- _airbyte_overwrite_full_refresh_demo_hashid: an md5 hash calculated from the source fields, discussed further below.
Let's go into detail about how this data has been normalized in the next section!
Airbyte executes SQL commands on the destination database to normalize your data. This is covered conceptually in the following pages in Airbyte’s documentation.
To help you to understand the differences between different synchronization modes you can look at the actual SQL that is used for normalizing your data. In order to find the SQL code that has been executed, you need to find the job ID and attempt ID for the sync that you are interested in. You can get this by looking in the Sync History, as shown below.
In this case the job ID is 97 and the attempt ID is 0, as annotated by the red square in the image above.
Create a bash shell into the airbyte-server docker container as follows:
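Assuming a default docker-compose deployment:

```bash
docker exec -it airbyte-server bash
```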
Using the job ID and the attempt ID, you can find the SQL that has been used for executing the normalization. Based on the job ID of 97 and attempt ID of 0 (extracted above), I execute the following:
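The generated models live under the job’s workspace directory. A command along these lines should work, though the exact path can vary between Airbyte versions:

```bash
cat /tmp/workspace/97/0/build/run/airbyte_utils/models/generated/airbyte_tables/public/overwrite_full_refresh_demo.sql
```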
Which returns the SQL that has been used for converting the raw data into the normalized table. In order to explain what the SQL does to get your data normalized, I have broken it into four main sections below.
ℹ️ Airbyte leverages dbt to create the SQL for normalizing the data, and makes use of common table expressions (or CTEs) to break it into more manageable pieces.
The first section of SQL extracts the source fields from the JSON blob.
The next portion of the SQL is responsible for casting each of the fields that has been extracted from the JSON to the correct type.
The next section is responsible for adding in an md5 hash from the fields that have been passed in the source JSON. This article about md5 hashing on the dbt blog discusses the value of this.
Finally, the last part of the SQL selects all of the fields to be written into the overwrite_full_refresh_demo__dbt_tmp table.
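Putting these pieces together, the generated SQL has roughly the following shape. This is a condensed sketch rather than Airbyte’s verbatim output: the real dbt-generated code uses longer CTE names, version-specific quoting, and column types inferred from the source schema, but the four sections are the same.

```sql
create table "postgres".public."overwrite_full_refresh_demo__dbt_tmp" as (
  with ab1 as (
    -- section 1: extract each source field from the JSON blob
    -- (assumes _airbyte_data is stored as jsonb)
    select
      jsonb_extract_path_text(_airbyte_data, 'id') as id,
      jsonb_extract_path_text(_airbyte_data, 'name') as name,
      _airbyte_ab_id,
      _airbyte_emitted_at,
      now() as _airbyte_normalized_at
    from "postgres".public._airbyte_raw_overwrite_full_refresh_demo
  ),
  ab2 as (
    -- section 2: cast each extracted field to the correct type
    select
      cast(id as bigint) as id,
      cast(name as text) as name,
      _airbyte_ab_id,
      _airbyte_emitted_at,
      _airbyte_normalized_at
    from ab1
  ),
  ab3 as (
    -- section 3: add an md5 hash computed from the source fields
    select
      md5(cast(coalesce(cast(id as text), '') || '-' ||
               coalesce(cast(name as text), '') as text)
      ) as _airbyte_overwrite_full_refresh_demo_hashid,
      ab2.*
    from ab2
  )
  -- section 4: select all of the fields into the destination table
  select * from ab3
);
```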
ℹ️ New SQL is generated for each sync, but for the full refresh | overwrite mode the SQL is identical every time. If you want to see this for yourself, then in the subsequent synchronizations you can note the job ID and attempt ID and view the SQL as you have just done above.
The astute reader may wonder why this SQL creates a table called "postgres".public."overwrite_full_refresh_demo__dbt_tmp", while the table that you see in the Postgres destination is simply called overwrite_full_refresh_demo. To understand it better, take a look in the logs, which in my case (for this job ID and attempt ID) are located at /tmp/workspace/97/0/logs/dbt.log on the airbyte-server container. In the log file you should find an alter table/rename command that has renamed the table, similar to the following:
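The relevant log entry contains SQL similar to the following (timestamps and surrounding log lines omitted):

```sql
alter table "postgres".public."overwrite_full_refresh_demo__dbt_tmp"
  rename to "overwrite_full_refresh_demo"
```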
You will now insert a new record in the source data, and then see how this is replicated to the destination. Go back to the terminal where you opened a shell to the source Postgres database, and add a new record and then view the source table as follows:
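For example (the new record’s values are illustrative):

```sql
INSERT INTO full_refresh_demo(id, name) VALUES(3, 'Alex M');
SELECT * FROM full_refresh_demo;
```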
The source table should look as follows:
Execute a new sync by clicking on Sync Now in the connection page.
Once the sync has completed, you will see that 3 records have been emitted. This is because full refresh synchronization modes extract and emit all of the records in the source data set, regardless of whether they have previously been replicated to the destination or not.
Have a look at the _airbyte_raw_overwrite_full_refresh_demo table by running the following:
Which should look similar to the following:
Notice that the _airbyte_emitted_at field for every record has been updated – this is further confirmation that all of the records in this table have been overwritten – not just the new record that you added to the source table. This is expected with the full refresh synchronization mode.
The normalized data in the overwrite_full_refresh_demo table can be seen by executing:
Which should respond with a table that looks similar to the following:
Every record in the normalized table has been overwritten, as can be seen by comparing the value of the _airbyte_normalized_at field to the value that each row had for that field prior to the sync, as shown earlier in this tutorial.
Update a record on the full_refresh_demo table on the source Postgres database, and view the source table by running the following commands:
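For example (which record you update, and the new value, are up to you):

```sql
UPDATE full_refresh_demo SET name='Mary Y' WHERE id=1;
SELECT * FROM full_refresh_demo;
```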
And the source full_refresh_demo table should now look as follows:
As before, execute a new sync by clicking on Sync Now in the connection pane and wait for it to complete. The UI will indicate that 3 records have been emitted as follows:
Have a look at the _airbyte_raw_overwrite_full_refresh_demo table in the destination Postgres database with the following command:
Which should respond with a table similar to the following:
Once again, by comparing the value of _airbyte_emitted_at to the prior values (shown earlier in this tutorial), you can see that every record has been overwritten, not just the record that was updated.
Have a look at the normalized data in the overwrite_full_refresh_demo table with the following command:
Which should be similar to the following:
Once again the _airbyte_normalized_at field for every record has been updated – in other words all of the records in this table have been completely overwritten/re-normalized as expected.
See what happens if you delete a record on the source database by executing the following commands:
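For example, deleting the record that was added earlier:

```sql
DELETE FROM full_refresh_demo WHERE id=3;
SELECT * FROM full_refresh_demo;
```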
The source full_refresh_demo table should look similar to:
Click Sync Now in the connection page, and wait for the sync to complete. Once it is complete, you should see that two records have been emitted, as shown below.
Have a look at the raw data by executing:
Which should respond with something similar to:
And have a look at the normalized data by running:
Which should be similar to the following:
You can see that the _airbyte_emitted_at timestamp has been updated for all of the records in both the raw data and in the normalized data, which confirms that every record in both tables has been overwritten. Additionally, the _airbyte_normalized_at timestamp confirms that every record has been re-normalized.
What will happen if you execute a sync on data that hasn’t been modified since the previous sync? Find out by clicking on the Sync now button! Once the sync has completed, you will see that two records have been emitted as shown below.
Have a look at the raw data by executing:
Which should respond with something similar to:
And have a look at the normalized data by running:
Which should be similar to the following:
Looking at the _airbyte_emitted_at timestamp and the _airbyte_normalized_at timestamp confirms that every record has been overwritten and re-normalized. In other words, even if no data was modified in the Postgres source, the entire table is copied and overwritten in the destination.
In this section you will create an Airbyte connection that reads data from the Postgres source and drives it into the Postgres destination using the full refresh | append replication mode. In this mode, as with the full refresh | overwrite replication mode, all data in the source database is sent to the destination database, regardless of whether it has been synced before or not. However, in the append variant, new syncs will take all data from the most recent sync and append it to the raw destination table. If normalization is enabled, then records that have recently been appended to the raw table will be normalized and appended to the normalized table. This is easiest to see for yourself via a hands-on example, as presented below.
Set up a new connection that will demonstrate full refresh | append functionality, using the connectors that you created earlier in this tutorial.
First, select Postgres-source as the source for this connection.
Then select Postgres-destination as the destination for this connection.
Then create a new connection and name it full-refresh-append, set the prefix to append_, and select full refresh | append as the sync mode, as shown below.
A sync should automatically start after you create the connection. Wait for the sync to complete, and you should see a message like the following:
If you don’t already have a shell open to your Postgres destination, execute the following commands:
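As before, assuming the destination container name used earlier:

```bash
docker exec -it airbyte-destination psql --username=postgres
```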
You can view the tables in the destination Postgres database by executing the following command from the Postgres shell that you have just opened:
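```
\dt
```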
Which should respond with the following:
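Since both connections write to the same destination, you should now see the tables from the overwrite connection alongside two new tables (formatting approximate):

```
                         List of relations
 Schema |                   Name                    | Type  |  Owner
--------+-------------------------------------------+-------+----------
 public | _airbyte_raw_append_full_refresh_demo     | table | postgres
 public | _airbyte_raw_overwrite_full_refresh_demo  | table | postgres
 public | append_full_refresh_demo                  | table | postgres
 public | overwrite_full_refresh_demo               | table | postgres
(4 rows)
```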
Two new tables have been created, _airbyte_raw_append_full_refresh_demo and append_full_refresh_demo.
You can look at the raw data as follows:
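```sql
SELECT * FROM _airbyte_raw_append_full_refresh_demo;
```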
Which should respond with a table that looks very similar to the raw table that you saw created with full refresh | overwrite replication, as follows:
You can look at the normalized data with the following command:
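```sql
SELECT * FROM append_full_refresh_demo;
```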
Which should return a table that looks similar to the following:
So far, the replication functionality appears to be the same as what you saw in the section about full refresh | overwrite replication.
As in the previous section, you can see the job ID and attempt ID, which you will use for finding and viewing the SQL, as shown below:
If you don’t already have a bash shell into the airbyte-server docker container, create one as follows:
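```bash
docker exec -it airbyte-server bash
```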
In my case the job ID of 103 and the attempt ID of 0 can be used to view the generated SQL that has been executed on the Postgres destination to normalize the raw data. Have a look at the generated SQL by replacing your job ID and attempt ID in the path below.
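In my case the command looks as follows. Note that the path is an assumption based on my deployment: for the append mode, the generated model landed under an airbyte_incremental directory rather than airbyte_tables, and the exact layout can vary between Airbyte versions:

```bash
cat /tmp/workspace/103/0/build/run/airbyte_utils/models/generated/airbyte_incremental/public/append_full_refresh_demo.sql
```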
This shows you the SQL that has been used for converting the raw data into the normalized table. In this initial synchronization run, the SQL that is generated is basically identical to the SQL that was demonstrated earlier for the full refresh | overwrite sync. One small difference is that the SQL directly creates the append_full_refresh_demo table, rather than creating a temporary table and renaming it. The top of the SQL file looks as follows:
However, you will see that follow-up synchronizations have different SQL generated. I will discuss this in more detail after the next sync.
If you don’t have a terminal open on the Postgres source database, open one as follows:
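As before:

```bash
docker exec -it airbyte-source psql --username=postgres
```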
Add a new record to the source Postgres database as follows:
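For example (values illustrative; recall that the record with id 3 was deleted from the source at the end of the previous section, so the id can be reused):

```sql
INSERT INTO full_refresh_demo(id, name) VALUES(3, 'Alex M');
```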
And view the source table by executing:
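```sql
SELECT * FROM full_refresh_demo;
```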
The source table should look as follows:
Execute a new sync by clicking on Sync Now in the connection pane and wait for the sync to complete.
Once complete, you should see that three records have been emitted. Also take note of the job ID and attempt ID, which are 104 and 0 for this run.
Go back to the destination Postgres terminal, and have a look at _airbyte_raw_append_full_refresh_demo as follows:
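```sql
SELECT * FROM _airbyte_raw_append_full_refresh_demo;
```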
Which should look similar to the following:
This is quite different from the destination table that was generated after inserting an additional record into the source data table in the previous section on the full refresh | overwrite mode! It contains all of the data from the first sync, as well as all of the data from the second sync. You can see from the _airbyte_emitted_at timestamp which records were sent in the first sync versus the second sync.
The corresponding normalized data in the append_full_refresh_demo table can be seen by executing:
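```sql
SELECT * FROM append_full_refresh_demo;
```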
Which should respond with a table that looks similar to the following:
The normalized destination table has copies of all of the records that existed from the first sync along with all of the records that have just been inserted in the second sync.
If you look at the _airbyte_normalized_at field, it appears that every row in the normalized table has been created by running the normalization SQL on the rows in the raw data table. Let’s look at this in more detail.
You can view the normalization code by executing the following command on the airbyte-server container (replacing the job ID and attempt ID with values you extracted from the UI). In my case I execute the following:
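In my case (same path caveats as before):

```bash
cat /tmp/workspace/104/0/build/run/airbyte_utils/models/generated/airbyte_incremental/public/append_full_refresh_demo.sql
```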
Which responds with the following SQL:
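The generated code is no longer a single create table statement; instead it follows dbt’s delete+insert incremental pattern, roughly as follows. This is a sketch: the table and column names are reconstructed from this tutorial’s example, and the verbatim output differs by Airbyte version.

```sql
-- delete from the final table any records that also exist in the
-- temporary table produced by this sync's normalization run
delete from "postgres".public."append_full_refresh_demo"
where (_airbyte_ab_id) in (
  select (_airbyte_ab_id)
  from "append_full_refresh_demo__dbt_tmp"
);

-- then append everything in the temporary table to the final table
insert into "postgres".public."append_full_refresh_demo"
  ("id", "name", "_airbyte_ab_id", "_airbyte_emitted_at",
   "_airbyte_normalized_at", "_airbyte_append_full_refresh_demo_hashid")
select
  "id", "name", "_airbyte_ab_id", "_airbyte_emitted_at",
  "_airbyte_normalized_at", "_airbyte_append_full_refresh_demo_hashid"
from "append_full_refresh_demo__dbt_tmp";
```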
This is totally different from the SQL that was used in the first full refresh | append sync!
The first command finds the records that exist in a temporary table, and then deletes these records from the normalized table. The very next command inserts all of the records from that same temporary table into the normalized table.
To understand what has happened, you can see how that temporary table was created. Execute the following command on the airbyte-server container (replacing the job ID and attempt ID with values you extracted from the UI):
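The dbt log for the job contains the compiled SQL; assuming the same job and attempt IDs as above:

```bash
cat /tmp/workspace/104/0/logs/dbt.log
```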
You should be able to find the SQL that was used to create the temporary table. The SQL is very similar to SQL from the previous normalizations, and the beginning of the SQL code block looks as follows:
However, at the very end of this SQL block there is a SELECT statement that limits the records selected for normalization from the raw table to only those with an _airbyte_emitted_at timestamp greater than or equal to the highest _airbyte_emitted_at value of any record already in the normalized table.
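In simplified form, the filter looks something like this (the generated SQL embeds it at the end of the CTE pipeline shown earlier):

```sql
-- select only raw records that are at least as new as the newest
-- record that has already been normalized; the coalesce(..., true)
-- handles the first run, when the normalized table is still empty
select * from "postgres".public._airbyte_raw_append_full_refresh_demo
where coalesce(
  _airbyte_emitted_at >= (
    select max(_airbyte_emitted_at)
    from "postgres".public."append_full_refresh_demo"
  ),
  true);
```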
This reduces the number of records which will be selected for normalization to only the records that were sent in the most recent sync, along with records from the previous sync. This is much more efficient than re-normalizing all of the rows in the raw table.
ℹ️ Similar SQL is generated for each subsequent sync that is done during the full refresh | append mode. If you want to see this for yourself, then in the subsequent synchronizations you can note the job ID and attempt ID and view the SQL code and the dbt log file as you have just done.
Update a record on the source full_refresh_demo table and look at the resulting table with the following commands:
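For example (the updated value is illustrative):

```sql
UPDATE full_refresh_demo SET name='John Davis' WHERE id=2;
SELECT * FROM full_refresh_demo;
```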
And the source table should now look as follows:
As before, execute a new sync by clicking on Sync Now in the connection panel, and wait for it to complete.
Once complete, you should see that, again, three records have been emitted. Have a look at the _airbyte_raw_append_full_refresh_demo table as follows:
Which should look similar to the following:
The corresponding normalized data in the append_full_refresh_demo table can be seen by executing:
Which should respond with a table that looks similar to the following:
Every time a sync is performed, the entire source full_refresh_demo table is appended to the destination table. Additionally, if you look at the _airbyte_normalized_at values, you will notice that only the documents from the current and previous sync have been normalized, as expected based on the previous analysis of the SQL normalization code.
See what happens if you delete a record on the source database by executing the following commands:
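For example:

```sql
DELETE FROM full_refresh_demo WHERE id=3;
SELECT * FROM full_refresh_demo;
```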
The source full_refresh_demo table should look similar to:
Then perform a sync by clicking on the Sync Now button in the connection panel.
Once the sync has completed you will see that two records were emitted — and once again, you can look at the raw data in _airbyte_raw_append_full_refresh_demo with:
Which now looks as follows:
And then execute the following to see the normalized data:
The append_full_refresh_demo table should look similar to the following:
Again, the entire source table has been appended into the destination. Additionally, you can see that only the data from the current and the previous sync have been normalized, as expected based on the previous analysis of the normalization code.
It should be clear by now that regardless of the action that is taken on the source table, whenever a sync is performed, the entire table is appended to the destination table. Let’s do one last sync, without modifying any data on the source, just to be sure.
Once the sync has completed, look at the raw data in _airbyte_raw_append_full_refresh_demo again, which responds with the following:
And look at the normalized data with:
Which should respond with something similar to:
As expected, even if no source data is modified, a copy of the entire source table is appended onto the destination tables. Additionally, by looking at the value of _airbyte_normalized_at you can see that only the data from the current and previous syncs have been normalized.
In this tutorial I have guided you through a hands-on deep-dive into Airbyte’s full refresh synchronization modes. The main focus of this article has been to discover how replicated data will look, and to understand the SQL code that is executed to transform raw data into normalized data.
While it may be feasible to use Airbyte’s full refresh | overwrite or full refresh | append modes to repeatedly sync an entire source table that contains a small amount of data, this approach becomes impractical for large data sets (hundreds of MB to TB) or for data that is frequently synchronized. For these cases, other replication modes may be more appropriate.
This tutorial is followed by these related tutorials:
You may be interested in other Airbyte tutorials, and in Airbyte’s blog. You can also join the conversation on our community Slack Channel, participate in discussions on Airbyte’s discourse, or sign up for our newsletter. Furthermore, if you are interested in Airbyte as a fully managed service, you can try Airbyte Cloud for free!
Download our free guide and discover the best approach for your needs, whether it's building your ELT solution in-house or opting for Airbyte Open Source or Airbyte Cloud.