PostgreSQL database replication has several purposes. It can help with high availability (if a server fails, a backup one can take over) and load-balancing (distributing traffic across multiple servers that provide the same data).
Enabling analytics is another increasingly popular purpose of database replication. Imagine that you have a Django application that uses PostgreSQL as an operational database. If you’d like to enable reporting and analytics on top of that database, it’s best to replicate it to another server not to interfere with the production application.
No matter your replication purpose, this tutorial will show you how to easily replicate data from a PostgreSQL source to a PostgreSQL destination using Airbyte.
I also introduce PostgreSQL native replication and the advantages of using Airbyte. If you want to skip those sections, feel free to jump to the prerequisites.
If you prefer to follow along with a video instead, we have created one, too!
When it comes to PostgreSQL replication, there are various techniques to achieve it. Such techniques can be divided into disk-based, log-based, and SQL-based replication. We will primarily focus on log-based replication techniques.
At the same time, PostgreSQL provides several kinds of log-based replication: log-shipping, streaming, and logical replication.
Log-shipping replication handles synchronization by allowing a primary server to change the data and standby server(s) to track changes in the primary server. The primary server is in continuous archiving mode, generating Write-Ahead Logging (WAL) files that the standby reads to keep in sync. Therefore, the standby is in recovery mode (which is read-only).
Because no modifications to the database tables are necessary to activate this functionality, it has a lower administrative overhead than alternative replication methods.
Streaming replication is similar to log-shipping, with the difference that the standby connects to the primary, which transmits WAL records in real-time, rather than waiting for the WAL file to be full.
The fundamental disadvantage of log-shipping and streaming replication is that the standby server remains in recovery mode, and modifications are not permitted. Any changes made to the primary server are replicated on standby, making it impossible to replicate tables or databases selectively.
Logical replication is PostgreSQL’s technique for replicating data objects and modifications based on their identity, often a primary key. This type of replication also uses WAL files to build a stream of logical data changes, and it implements a publish-and-subscribe approach. Changes in the publisher are delivered to subscribers in real-time.
Unlike log-shipping and streaming, logical replication allows for per-table replication.
The most significant disadvantage of logical replication is that schema changes are not replicated and must be kept in sync manually. Tables lacking primary keys or unique identifiers will also have issues. You can read more about logical replication restrictions here.
If PostgreSQL supports replication natively, why would you use Airbyte? Built-in Postgres replication solutions may offer a minimal overhead but, as we saw before, they don’t come without drawbacks.
The main advantage of using Airbyte for PostgreSQL replication is flexibility: You can use different replication methods for each table within the same database.
It’s not uncommon to have different replication needs for different tables in your databases: some of your tables may be small, while others are huge. Some tables may have primary keys, while others do not. You may need to track deletes for some tables, while you may need to keep just a snapshot for others.
Suppose you want to use logical replication, but some of your tables don’t have primary keys. When using Airbyte, you can choose to replicate tables without a primary key using full or incremental replication; meanwhile, other tables in the same database can use logical replication.
Now that you have some background let’s get hands-on!
If you don’t have a readily available Postgres database to sync, here are some quick instructions. Run the following commands in a new terminal window to start backgrounded source and destination databases:
Add two tables with a few rows to the source database:
You now have a Postgres database ready to be replicated.
Alternatively, you can use a local Postgres database on your computer: use host.docker.internal (if you are on Mac) as the host instead of localhost when setting up the source and destination.
If you didn’t do it before when deploying Airbyte, go to http://localhost:8000 to launch the UI.
Then, click on sources and add a new source. As the connector type, select Postgres. If you used our instructions above to create a Postgres database, fill in the following configuration fields.
We will also use “No Tunnel” as the SSH option. If you want to use SSH, check out our documentation.
Click on "Set up source", and now you’re ready to configure a destination.
Go to destinations and add a new one. Select Postgres as the destination type and fill in with the following details.
That’s it. You have configured the source and destination.
Go to connections and create a new connection. Then, select the existing Postgres source you have just created and then do the same for the destination. Once you’re done, you can set up the connection as follows.
Then, it’s time to configure the streams, which in this case are the tables in our database. If we expand the tables, we can see the columns they have. We can also see they’re part of the public namespace or schema. The destination schema will be also public.
Now, you should select a sync mode. I chose Full refresh | Overwrite to sync the cities table and Incremental | Append for the users table since it has an id column (primary key) suitable for the incremental cursor field. The most important thing to note is that you can have different sync modes for each table! Learn more about sync modes in our documentation.
Once you’re ready, save the changes. Then, you can run your first sync by clicking on “Sync now.” You can check your run logs to verify everything is going well. Just wait for the sync to be completed, and that’s it! You’ve synchronized two Postgres databases.
Now, let's verify that this worked. Let's output the contents of the destination database:
You should see the rows from the source database inside the destination database. Note that Airbyte added extra columns which contain metadata used to support synchronization.
In this tutorial, you learned about PostgreSQL native replication and how to synchronize two Postgres databases with Airbyte. Apart from being straightforward, Airbyte is also highly flexible when it comes to selecting different synchronization methods for your replication.
If you enjoyed using Airbyte, you might want to check our fully managed solution: Airbyte Cloud.
We also invite you to join the conversation on our community Slack Channel to share your ideas with thousands of data engineers and help make everyone’s project a success. With Airbyte, the integration possibilities are endless, and we can't wait to see what you will build.
Learn how to create an Airflow DAG (directed acyclic graph) that triggers Airbyte synchronizations.
Learn how Airbyte’s incremental synchronization replication modes work.