Learn how to build an ELT pipeline to replicate data from Postgres to Redshift in AWS.
Data engineers are tasked to move data between storage systems like applications, databases, data warehouses, and data lakes. One of the most popular data replication use cases includes moving data from a relational database to a data warehouse to enable analytics. This article shows how to build an ELT pipeline to replicate data from Postgres running in Amazon RDS to Redshift using Airbyte, an open-source data integration platform.
Postgres and Redshift store data differently. To replicate data between the two systems, you will need at least to extract the Postgres data, transform it to a format that is compatible with Redshift and load it to Redshift.
Airbyte does not only provide connectors for various sources and destinations (including Postgres and Redshift), but it is also an ELT framework with monitoring, scheduling, logging, and support for multiple data replication modes.
One question that might come to your mind is, “Why to move data from Postgres to Redshift?”. To answer this question, it’s important to understand the differences between Online Transactional Processing (OLTP) and Online Analytical Processing (OLAP) databases.
OLTP databases capture and store transactional-oriented data in real-time. It is mostly used for the effective running of the day-to-day activities of an organization. When OLTP databases are used for complex analysis, it slows down the website's operations. Examples of OLTP databases are Postgres, MySQL, and MySQL.
OLAP databases are mostly used by data scientists, analysts, or business intelligence experts for analysis, research, insights generation, and reporting. These types of databases allow fast execution of complex analytical queries. These types of databases are commonly referred to as data warehouses and can store large volumes of data and run complex queries on them to generate insights. OLAP data warehouses are AWS Redshift, Google BigQuery, Azure Synapse, and Snowflake.
This tutorial aims to show how to replicate data from a transactional database (Postgres) to an analytical data warehouse (Redshift) in a fast and effective way, using Airbyte.
To set up a Postgres database as an Airbyte source, you must host your local Postgres database in the cloud. To do this, there are many options. You can use Heroku, ElephantSQL, HelioHost, Amazon RDS, CloudSQL, and many other database cloud hosting platforms or run your Postgres database on a machine you will secure. For this tutorial, you will use the Amazon RDS instance to host your Postgres database.
Before creating an Amazon RDS instance, the first thing you need to do is to create a security group. This is important because it will allow Airbyte’s Cloud IP and other IP addresses you specify, to connect to your Amazon RDS database.
To do this, you go to EC2 >> Security Groups >> Create Security Group. In the Inbound Rules section, you will allow two IP addresses from your Postgres instance.
In the above image, the security group was given the tag name of Airbyte-Rule. This tag name will come in handy when you want to allocate your Amazon RDS database instance to the security group. The first inbound rule allows Airbyte Cloud IP (34.106.109.131/32 at the time of writing) to connect to the remote Amazon RDS instance. The second inbound rule allows your local IP address to connect to the Amazon RDS instance.
The next thing you will do is create an Amazon RDS database instance. To do this, search your Amazon console for Amazon RDS.
You then go to Databases >> Create database.
After clicking on create database, you configure your database. You choose the standard to create method and the Postgres engine option.
You will also have to allocate the security group you just created to the Amazon RDS instance. To do this, you go to Connectivity >> Security group. You will then attach the Airbyte-Rule security group.
It is important to grant your database public access while creating it. This will enable your local Postgres database to communicate with the remote Postgres database hosted on AWS.
After granting the database public access, create the database. You may have to wait for a while because it takes some time to create the database and allocate compute resources. Once the database becomes available, you will be able to see endpoints you can use to connect to your remote Postgres database.
In the previous section, you created an Amazon RDS instance which will enable you to run a Postgres database on the cloud. In this section, you will connect to the Amazon RDS instance you just created. You will also create a table in your Postgres database and populate the table with data.
Locally you can install pgAdmin to connect to the Amazon RDS instance with the endpoints from the remote database instance.
In the connection tab, you specify your hostname, port number, username, database name, and password. All these parameters were configured by you when creating the Amazon RDS database instance. Once saved, you have successfully connected to your remote instance.
The dataset used in this tutorial can be downloaded here. The dataset contains the details of all consumer complaints from a store. The image below shows the SQL query for creating a schema and a table. This table will be populated with data from the CSV file in the link above.
In the code block above, you created a schema called rds_schema and a table called complaints. This table will be populated with the records from the CSV file. To populate the table with the CSV file, you will head over to psql terminal and run the command:
After executing this code block, the data will be successfully loaded into the Postgres database. You can preview the loaded data by executing the code:
You can also get information on the number of rows in the table.
In the image above, notice that the table contains about 65,499 records.
To create the Postgres source, head over to Airbyte Cloud and create a new source. Enter the source name, host address, database name, the schema you created (rds_schema), username and password. All these details were used when creating your Amazon RDS instance.
If you use the default SSL mode of disable, your test might fail. You will get an error message of Unsecured connection. To solve this error, you will have to set SSL Modes to require. SSL Modes are ways of ensuring a secure browser connection, and Amazon RDS supports SSL encryption by default. All data connections in Airbyte Cloud are encrypted.
To set up a Redshift destination, you need a Redshift cluster. You can create a Redshift cluster by following the steps in the link here. Once your cluster is created, head over to Properties >> Network and security settings tab.
In the VPC security group section, you will see the security group of your Redshift cluster. You want to allow Airbyte Cloud's remote IP address to connect to the Redshift cluster. To do this, modify the security group and allow inbound connections from Airbyte.
Another thing you would have to do is to enable public accessibility from Redshift. This will allow Airbyte Cloud to connect to your Amazon Redshift cluster easily. To enable public accessibility, go to Redshift cluster >> Actions >> Modify public accessible setting.
If you go to your Redshift cluster information, you will see some connection endpoints. These endpoints are used to connect Airbyte Cloud to the Amazon Redshift cluster.
To set up a Redshift destination, go to Airbyte Cloud >> Destinations. Enter your Amazon Redshift hostname. Note that your hostname is your endpoint listed at the top of the page. Note that your endpoint URL must end with amazon.com.
The remaining parameters such as database, username, and password were set up during your Amazon Redshift cluster configuration. To check your database configurations, go to the Properties tab in Redshift.
Redshift, by default, comes with a database called dev and an admin user called awsuser. You can configure this while setting up the cluster. For simplicity's sake, leave the settings as default.
Now, to set up an Amazon Redshift destination, go to Airbyte Cloud and select Redshift as your destination. You also have to enter your hostname, database name, username, and password.
Once the source and destination settings have been configured, the next thing to do is to go to Airbyte Cloud and set up a connection.
Once you set up your source and destination, you will automatically see the schema created and the complaints table. The Full Refresh sync mode is the easiest to get started. There are four types of sync modes in Airbyte.
After setting up a connection, the next thing you have to do is to run the sync.
From the image above, the sync was successful, and exactly 65,499 records were synced from Postgres to Redshift. The time it took for about sixty-five thousand records was about three minutes. Now let us check the Redshift query editor to see if the tables were fully loaded.
In your Redshift query editor, let’s preview the data and display the first five records.
When you run the query, the first five records of the source data in Postgres get displayed in Amazon Redshift.
Logically, if the process works correctly, the total number of records in your source database is meant to be the total number of records in your target database. Let us run some validity checks to see if this process works. Recall, the total number of records from your Postgres database was 65,499. Let us confirm if that is the total number of records in your Amazon Redshift data warehouse.
From this, you can validate that all the data was synced from your source database to your destination data warehouse.
So far, you have seen how easy it is to set up Airbyte and use it for your ETL tasks. All it takes is just for you to set up a source system and a destination system. After setting this up, you can grab a coffee and relax, and let Airbyte do the rest for you. In this article, you learned how to replicate data from Postgres to Amazon Redshift by:
Speed and precision are vital for your business to run smoothly and effectively. Data replication should not be a drawback to your organization. As you have seen, it’s easy to get started with Airbyte Cloud to replicate your Postgres database to Redshift for analytics. Join the conversation at Airbyte’s Community Slack to share your experience and challenges with other data engineers.
Learn how to modify the dbt code used by Airbyte to partition and cluster BigQuery tables.
Learn how to use Airbyte’s Python CDK to write a source connector that extracts data from the Webflow API.
Learn how Airbyte’s incremental synchronization replication modes work.