Warehouses and Lakes
Databases

How to load data from Postgres to Redshift

Learn how to use Airbyte to synchronize your Postgres data into Redshift within minutes.

TL;DR

This can be done by building a data pipeline manually, usually a Python script (you can leverage a tool as Apache Airflow for this). This process can take more than a full week of development. Or it can be done in minutes on Airbyte in three easy steps:

  1. set up Postgres as a source connector (using Auth, or usually an API key)
  2. set up Redshift as a destination connector
  3. define which data you want to transfer and how frequently

You can choose to self-host the pipeline using Airbyte Open Source or have it managed for you with Airbyte Cloud.

This tutorial’s purpose is to show you how.

What is Postgres

An object-relational database management system, PostgreSQL is able to handle a wide range of workloads, supports multiple standards, and is cross-platform, running on numerous operating systems including Microsoft Windows, Solaris, Linux, and FreeBSD. It is highly extensible, and supports more than 12 procedural languages, Spatial data support, Gin and GIST Indexes, and more. Many webs, mobile, and analytics applications use PostgreSQL as the primary data warehouse or data store.

What is Redshift

A fully managed data warehouse service in the Amazon Web Services (AWS) cloud, Amazon Redshift is designed for storage and analysis of large-scale datasets. Redshift allows businesses to scale from a few hundred gigabytes to more than a petabyte (a million gigabytes), and utilizes ML techniques to analyze queries, offering businesses new insights from their data. Users can query and combine exabytes of data using standard SQL, and easily save their query results to their S3 data lake.

Integrate Postgres with Redshift in minutes

Try for free now

Prerequisites

  1. A Postgres account to transfer your customer data automatically from.
  2. A Redshift account.
  3. An active Airbyte Cloud account, or you can also choose to use Airbyte Open Source locally. You can follow the instructions to set up Airbyte on your system using docker-compose.

Airbyte is an open-source data integration platform that consolidates and streamlines the process of extracting and loading data from multiple data sources to data warehouses. It offers pre-built connectors, including Postgres and Redshift, for seamless data migration.

When using Airbyte to move data from Postgres to Redshift, it extracts data from Postgres using the source connector, converts it into a format Redshift can ingest using the provided schema, and then loads it into Redshift via the destination connector. This allows businesses to leverage their Postgres data for advanced analytics and insights within Redshift, simplifying the ETL process and saving significant time and resources.

Step 1: Set up Postgres as a source connector

1. Open your PostgreSQL database and create a new user with the necessary permissions to access the data you want to replicate.

2. Obtain the hostname or IP address of your PostgreSQL server and the port number it is listening on.

3. Create a new database in PostgreSQL that will be used to store the replicated data.

4. Obtain the name of the database you just created.

5. In Airbyte, navigate to the PostgreSQL source connector and click on "Create Connection".

6. Enter a name for your connection and fill in the required fields, including the hostname or IP address, port number, database name, username, and password.

7. Test the connection to ensure that Airbyte can successfully connect to your PostgreSQL database.

8. Select the tables or views you want to replicate and configure any necessary settings, such as the replication frequency and the replication method.

9. Save your configuration and start the replication process.

10. Monitor the replication process to ensure that it is running smoothly and troubleshoot any issues that arise.

Step 2: Set up Redshift as a destination connector

1. First, log in to your Airbyte account and navigate to the "Destinations" tab on the left-hand side of the screen.
2. Click on the "Add Destination" button and select "Redshift" from the list of available connectors.
3. Enter your Redshift database credentials, including the host, port, database name, username, and password.
4. Choose the schema you want to use for your data in Redshift.
5. Select the tables you want to sync from your source connector to Redshift.
6. Map the fields from your source connector to the corresponding fields in Redshift.
7. Choose the sync mode you want to use, either "append" or "replace."
8. Set up any additional options or filters you want to use for your sync.
9. Test your connection to ensure that your data is syncing correctly.
10. Once you are satisfied with your settings, save your configuration and start your sync.

Step 3: Set up a connection to sync your Postgres data to Redshift

Once you've successfully connected Postgres as a data source and Redshift as a destination in Airbyte, you can set up a data pipeline between them with the following steps:

  1. Create a new connection: On the Airbyte dashboard, navigate to the 'Connections' tab and click the '+ New Connection' button.
  2. Choose your source: Select Postgres from the dropdown list of your configured sources.
  3. Select your destination: Choose Redshift from the dropdown list of your configured destinations.
  4. Configure your sync: Define the frequency of your data syncs based on your business needs. Airbyte allows both manual and automatic scheduling for your data refreshes.
  5. Select the data to sync: Choose the specific Postgres objects you want to import data from towards Redshift. You can sync all data or select specific tables and fields.
  6. Select the sync mode for your streams: Choose between full refreshes or incremental syncs (with deduplication if you want), and this for all streams or at the stream level. Incremental is only available for streams that have a primary cursor.
  7. Test your connection: Click the 'Test Connection' button to make sure that your setup works. If the connection test is successful, save your configuration.
  8. Start the sync: If the test passes, click 'Set Up Connection'. Airbyte will start moving data from Postgres to Redshift according to your settings.

Remember, Airbyte keeps your data in sync at the frequency you determine, ensuring your Redshift data warehouse is always up-to-date with your Postgres data.

Use Cases to transfer your Postgres data to Redshift

Integrating data from Postgres to Redshift provides several benefits. Here are a few use cases:

  1. Advanced Analytics: Redshift’s powerful data processing capabilities enable you to perform complex queries and data analysis on your Postgres data, extracting insights that wouldn't be possible within Postgres alone.
  2. Data Consolidation: If you're using multiple other sources along with Postgres, syncing to Redshift allows you to centralize your data for a holistic view of your operations, and to set up a change data capture process so you never have any discrepancies in your data again.
  3. Historical Data Analysis: Postgres has limits on historical data. Syncing data to Redshift allows for long-term data retention and analysis of historical trends over time.
  4. Data Security and Compliance: Redshift provides robust data security features. Syncing Postgres data to Redshift ensures your data is secured and allows for advanced data governance and compliance management.
  5. Scalability: Redshift can handle large volumes of data without affecting performance, providing an ideal solution for growing businesses with expanding Postgres data.
  6. Data Science and Machine Learning: By having Postgres data in Redshift, you can apply machine learning models to your data for predictive analytics, customer segmentation, and more.
  7. Reporting and Visualization: While Postgres provides reporting tools, data visualization tools like Tableau, PowerBI, Looker (Google Data Studio) can connect to Redshift, providing more advanced business intelligence options. If you have a Postgres table that needs to be converted to a Redshift table, Airbyte can do that automatically.

Wrapping Up

To summarize, this tutorial has shown you how to:

  1. Configure a Postgres account as an Airbyte data source connector.
  2. Configure Redshift as a data destination connector.
  3. Create an Airbyte data pipeline that will automatically be moving data directly from Postgres to Redshift after you set a schedule

With Airbyte, creating data pipelines take minutes, and the data integration possibilities are endless. Airbyte supports the largest catalog of API tools, databases, and files, among other sources. Airbyte's connectors are open-source, so you can add any custom objects to the connector, or even build a new connector from scratch without any local dev environment or any data engineer within 10 minutes with the no-code connector builder.

We look forward to seeing you make use of it! We invite you to join the conversation on our community Slack Channel, or sign up for our newsletter. You should also check out other Airbyte tutorials, and Airbyte’s content hub!

What should you do next?

Hope you enjoyed the reading. Here are the 3 ways we can help you in your data journey:

flag icon
Easily address your data movement needs with Airbyte Cloud
Take the first step towards extensible data movement infrastructure that will give a ton of time back to your data team. 
Get started with Airbyte for free
high five icon
Talk to a data infrastructure expert
Get a free consultation with an Airbyte expert to significantly improve your data movement infrastructure. 
Talk to sales
stars sparkling
Improve your data infrastructure knowledge
Subscribe to our monthly newsletter and get the community’s new enlightening content along with Airbyte’s progress in their mission to solve data integration once and for all.
Subscribe to newsletter

Connectors Used

Data engineers are tasked to move data between storage systems like applications, databases, data warehouses, and data lakes. One of the most popular data replication use cases includes moving data from a relational database to a data warehouse to enable analytics. This article shows how to build an ELT pipeline to replicate data from Postgres to Redshift using Airbyte, an open-source data integration platform.

Postgres and Redshift store data differently. To replicate data between the two systems, you will need at least to extract the Postgres data, transform it to a format that is compatible with Redshift and load it to Redshift.

Airbyte does not only provide connectors for various sources and destinations (including Postgres and Redshift), but it is also an ELT framework with monitoring, scheduling, logging, and support for multiple data replication modes.

Why replicate data from Postgres to Redshift

One question that might come to your mind is, “Why to move data from Postgres to Redshift?”. To answer this question, it’s important to understand the differences between Online Transactional Processing (OLTP) and Online Analytical Processing (OLAP) databases.

OLTP databases capture and store transactional-oriented data in real-time. It is mostly used for the effective running of the day-to-day activities of an organization. When OLTP databases are used for complex analysis, it slows down the website's operations. Examples of OLTP databases are Postgres, MySQL, and MySQL.

OLAP databases are mostly used by data scientists, analysts, or business intelligence experts for analysis, research, insights generation, and reporting. These types of databases allow fast execution of complex analytical queries. These types of databases are commonly referred to as data warehouses and can store large volumes of data and run complex queries on them to generate insights. OLAP data warehouses are AWS Redshift, Google BigQuery, Azure Synapse, and Snowflake.

This tutorial aims to show how to replicate data from a transactional database (Postgres) to an analytical data warehouse (Redshift) in a fast and effective way, using Airbyte.


{{COMPONENT_CTA}}

Methods to Move Data From Postgres to Redshift

  • Method 1: Connecting Postgres to Redshift using Airbyte.
  • Method 2: Connecting Postgres to Redshift manually.

Method 1: Connecting Postgres to Redshift using Airbyte

Prerequisites

  1. You will need to deploy Airbyte. Airbyte Cloud comes with free credits to get started. You can sign up for Airbyte Cloud or deploy Airbyte Open Source.
  2. You will need an AWS account.
  3. You will need an instance of the Postgres database, which you can remotely connect to. One of the very essential data engineering skills you will learn in this article is also how to host your local Postgres database on an Amazon RDS database instance.
  4. You will need an instance of the Amazon Redshift cluster that you can connect to remotely.

Set up an Amazon RDS database

To set up a Postgres database as an Airbyte source, you must host your local Postgres database in the cloud. To do this, there are many options. You can use Heroku, ElephantSQL, HelioHost, Amazon RDS, CloudSQL, and many other database cloud hosting platforms or run your Postgres database on a machine you will secure. For this tutorial, you will use the Amazon RDS instance to host your Postgres database.

Before creating an Amazon RDS instance, the first thing you need to do is to create a security group. This is important because it will allow Airbyte’s Cloud IP and other IP addresses you specify, to connect to your Amazon RDS database.

To do this, you go to EC2 >> Security Groups >> Create Security Group. In the Inbound Rules section, you will allow two IP addresses from your Postgres instance.

In the above image, the security group was given the tag name of Airbyte-Rule. This tag name will come in handy when you want to allocate your Amazon RDS database instance to the security group. The first inbound rule allows Airbyte Cloud IP (34.106.109.131/32 at the time of writing) to connect to the remote Amazon RDS instance. The second inbound rule allows your local IP address to connect to the Amazon RDS instance.

The next thing you will do is create an Amazon RDS database instance. To do this, search your Amazon console for Amazon RDS.

You then go to Databases >> Create database.

After clicking on create database, you configure your database. You choose the standard to create method and the Postgres engine option.

You will also have to allocate the security group you just created to the Amazon RDS instance. To do this, you go to Connectivity >> Security group. You will then attach the Airbyte-Rule security group.

It is important to grant your database public access while creating it. This will enable your local Postgres database to communicate with the remote Postgres database hosted on AWS.

After granting the database public access, create the database. You may have to wait for a while because it takes some time to create the database and allocate compute resources. Once the database becomes available, you will be able to see endpoints you can use to connect to your remote Postgres database.

Create a Postgres table and populate the table

In the previous section, you created an Amazon RDS instance which will enable you to run a Postgres database on the cloud. In this section, you will connect to the Amazon RDS instance you just created. You will also create a table in your Postgres database and populate the table with data.

Locally you can install pgAdmin to connect to the Amazon RDS instance with the endpoints from the remote database instance.

In the connection tab, you specify your hostname, port number, username, database name, and password. All these parameters were configured by you when creating the Amazon RDS database instance. Once saved, you have successfully connected to your remote instance.

The dataset used in this tutorial can be downloaded here. The dataset contains the details of all consumer complaints from a store. The image below shows the SQL query for creating a schema and a table. This table will be populated with data from the CSV file in the link above.


create schema rds_schema;
create table rds_schema.complaints("Date Received" VARCHAR(50),
   "Product Name" VARCHAR(100),
   "Sub Product" VARCHAR(100),
   "Issue" VARCHAR(100),
   "Sub Issue" VARCHAR(100),
   "Consumer Complaint Narrative" TEXT,
   "Company Public Response" VARCHAR(200),
   "Company" VARCHAR(100),
   "State Name" VARCHAR(100),
   "Zip Code" VARCHAR(100),
   "TAGS" VARCHAR(50),
   "Consumer Consent Provided" VARCHAR(50),
   "Submitted via" VARCHAR(50),
   "Date Sent to Company" VARCHAR(50),
   "Company Response to Consumer" VARCHAR(100),
   "Timely Response" VARCHAR(100),
   "Consumer Disputed" VARCHAR(50),
   "Complaint ID" INT);

In the code block above, you created a schema called rds_schema and a table called complaints. This table will be populated with the records from the CSV file. To populate the table with the CSV file, you will head over to psql terminal and run the command:


\copy rds_schema.complaints FROM '{path_to_the_csv_file}' with (format csv, header true, delimiter ',');

After executing this code block, the data will be successfully loaded into the Postgres database.  You can preview the loaded data by executing the code:


select * from rds_schema.complaints limit 10:

You can also get information on the number of rows in the table.

In the image above, notice that the table contains about 65,499 records.

Set up a Postgres source

To create the Postgres source, head over to Airbyte Cloud and create a new source. Enter the source name, host address, database name, the schema you created (rds_schema), username and password. All these details were used when creating your Amazon RDS instance.

If you use the default SSL mode of disable, your test might fail. You will get an error message of Unsecured connection. To solve this error, you will have to set SSL Modes to require. SSL Modes are ways of ensuring a secure browser connection, and Amazon RDS supports SSL encryption by default. All data connections in Airbyte Cloud are encrypted.



Method 2: Connecting Postgres to Redshift manually

Moving data from PostgreSQL to Amazon Redshift without using third-party connectors or integrations involves several steps, including exporting data from PostgreSQL, preparing Amazon S3 for data storage, uploading data to S3, and finally copying data into Redshift. Below is a detailed step-by-step guide to accomplish this task:

Step 1: Export Data from PostgreSQL

1. Connect to your PostgreSQL Database:

Use `psql` or any PostgreSQL client to connect to your database.

2. Choose the Data to Export:

Determine which tables or data you want to move to Redshift.

3. Export the Data:

Use the `COPY` command to export the data to a CSV file. For example:

```sql

COPY (SELECT * FROM your_table_name) TO '/path/to/your/output.csv' DELIMITER ',' CSV HEADER;

```

Replace `/path/to/your/output.csv` with the path where you want to save your file and `your_table_name` with the name of the table you're exporting.

4. Compress the Data (Optional):

To save space and upload time, you can compress the CSV file using gzip:

```bash

gzip /path/to/your/output.csv

```

This will create a file named `output.csv.gz`.

Step 2: Prepare Amazon S3

1. Create an S3 Bucket:

- Log in to your AWS Management Console.

- Navigate to the S3 service and create a new bucket.

- Set the name and region for the bucket.

2. Set Permissions:

- Ensure that the bucket has the necessary permissions so that Redshift can access the data.

- You may need to attach an IAM policy to your Redshift cluster's role for access to the S3 bucket.

Step 3: Upload Data to Amazon S3

1. Install AWS CLI:

If you haven't already, install the AWS Command Line Interface (CLI) on your machine.

2. Configure AWS CLI:

Run `aws configure` to set up your AWS credentials and default region.

3. Upload the File:

Use the `aws s3 cp` command to upload your data file to the S3 bucket:

```bash

aws s3 cp /path/to/your/output.csv.gz s3://your-bucket-name/

```

Replace `/path/to/your/output.csv.gz` with the path to your compressed file and `your-bucket-name` with the name of your S3 bucket.

Step 4: Copy Data into Amazon Redshift

1. Connect to Your Redshift Cluster:

Use a SQL client that supports Redshift to connect to your cluster.

2. Create a Table:

Create a table in Redshift that matches the schema of the PostgreSQL data you're importing. For example:

```sql

CREATE TABLE your_redshift_table (

column1 datatype,

column2 datatype,

...

);

```

3. Copy Data from S3:

Use the `COPY` command in Redshift to load data from the S3 bucket into your Redshift table:

```sql

COPY your_redshift_table

FROM 's3://your-bucket-name/output.csv.gz'

CREDENTIALS 'aws_iam_role=your-iam-role-arn'

DELIMITER ','

IGNOREHEADER 1

GZIP

REGION 'your-region';

```

Replace `your_redshift_table` with the name of your table in Redshift, `your-bucket-name` with the name of your S3 bucket, `your-iam-role-arn` with the ARN of the IAM role that has access to S3, and `your-region` with the region of your S3 bucket.

4. Verify the Data:

After the `COPY` command has completed, run some queries to verify that the data was imported correctly.

Step 5: Clean Up

1. Remove Temporary Files:

Once you've confirmed the data is in Redshift, you can delete the CSV files from your local machine and the S3 bucket to avoid incurring storage costs.

2. Monitor Your Redshift Cluster:

Check the performance and storage metrics of your Redshift cluster to ensure it handles the new data well.

By following these steps, you can move data from PostgreSQL to Amazon Redshift without the need for third-party connectors or integrations. Always remember to handle credentials and access permissions with care to maintain the security of your data.

Set up a Redshift destination

To set up a Redshift destination, you need a Redshift cluster. You can create a Redshift cluster by following the steps in the link here. Once your cluster is created, head over to Properties >> Network and security settings tab.

In the VPC security group section, you will see the security group of your Redshift cluster. You want to allow Airbyte Cloud's remote IP address to connect to the Redshift cluster. To do this, modify the security group and allow inbound connections from Airbyte.

Another thing you would have to do is to enable public accessibility from Redshift. This will allow Airbyte Cloud to connect to your Amazon Redshift cluster easily. To enable public accessibility, go to Redshift cluster >> Actions >> Modify public accessible setting.

If you go to your Redshift cluster information, you will see some connection endpoints. These endpoints are used to connect Airbyte Cloud to the Amazon Redshift cluster.

To set up a Redshift destination, go to Airbyte Cloud >> Destinations. Enter your Amazon Redshift hostname. Note that your hostname is your endpoint listed at the top of the page. Note that your endpoint URL must end with amazon.com.

The remaining parameters such as database, username, and password were set up during your Amazon Redshift cluster configuration. To check your database configurations, go to the Properties tab in Redshift.

Redshift, by default, comes with a database called dev and an admin user called awsuser. You can configure this while setting up the cluster. For simplicity's sake, leave the settings as default.

Now, to set up an Amazon Redshift destination, go to Airbyte Cloud and select Redshift as your destination. You also have to enter your hostname, database name, username, and password.

Set up a Postgres to Redshift connection

Once the source and destination settings have been configured, the next thing to do is to go to Airbyte Cloud and set up a connection.

Once you set up your source and destination, you will automatically see the schema created and the complaints table. The Full Refresh sync mode is the easiest to get started. There are four types of sync modes in Airbyte.

  • Full refresh - append: Full refresh receives all the available data from the source, regardless of whether it has been previously synced. The full refresh - append takes all the available data from a particular sync and appends it to the previously synced data. This makes each sync create duplicated records of the already existing data.
  • Full refresh - overwrite: This is another type of sync in Airbyte. In full refresh - overwrite mode, new syncs will overwrite all the data in the destination table and push the new data.
  • Incremental sync - append:. In incremental sync, only new or modified data will be synced to its destination. In incremental append mode, refetching data you have already synced is not allowed. This prevents duplicated data, as in the case of full refresh append. Here, each update creates a new row with the updated data.
  • Incremental sync - deduped history: This is similar to incremental sync append. The only difference is that here, each change updates the rows that have been modified instead of creating a new row containing the modified data.

Run the sync

After setting up a connection, the next thing you have to do is to run the sync.

From the image above, the sync was successful, and exactly 65,499 records were synced from Postgres to Redshift. The time it took for about sixty-five thousand records was about three minutes. Now let us check the Redshift query editor to see if the tables were fully loaded.

Verify the results

In your Redshift query editor, let’s preview the data and display the first five records.

When you run the query, the first five records of the source data in Postgres get displayed in Amazon Redshift.

Logically, if the process works correctly,  the total number of records in your source database is meant to be the total number of records in your target database. Let us run some validity checks to see if this process works. Recall, the total number of records from your Postgres database was 65,499. Let us confirm if that is the total number of records in your Amazon Redshift data warehouse.

From this, you can validate that all the data was synced from your source database to your destination data warehouse.

Conclusion

So far, you have seen how easy it is to set up Airbyte and use it for your ETL tasks. All it takes is just for you to set up a source system and a destination system. After setting this up, you can grab a coffee and relax, and let Airbyte do the rest for you. In this article, you learned how to replicate data from Postgres to Amazon Redshift by:

  1. Configuring a Postgres source
  2. Configuring a Redshift destination
  3. Setting up a connection
  4. Running the sync

Speed and precision are vital for your business to run smoothly and effectively. Data replication should not be a drawback to your organization. As you have seen, it’s easy to get started with Airbyte Cloud to replicate your Postgres database to Redshift for analytics. Join the conversation at Airbyte’s Community Slack to share your experience and challenges with other data engineers.

What should you do next?

Hope you enjoyed the reading. Here are the 3 ways we can help you in your data journey:

flag icon
Easily address your data movement needs with Airbyte Cloud
Take the first step towards extensible data movement infrastructure that will give a ton of time back to your data team. 
Get started with Airbyte for free
high five icon
Talk to a data infrastructure expert
Get a free consultation with an Airbyte expert to significantly improve your data movement infrastructure. 
Talk to sales
stars sparkling
Improve your data infrastructure knowledge
Subscribe to our monthly newsletter and get the community’s new enlightening content along with Airbyte’s progress in their mission to solve data integration once and for all.
Subscribe to newsletter

Connectors Used

Frequently Asked Questions

What data can you extract from Postgres?

PostgreSQL gives access to a wide range of data types, including:  

1. Numeric data types: This includes integers, floating-point numbers, and decimal numbers.  

2. Character data types: This includes strings, text, and character arrays.  

3. Date and time data types: This includes dates, times, and timestamps.  

4. Boolean data types: This includes true/false values.  

5. Network address data types: This includes IP addresses and MAC addresses.  

6. Geometric data types: This includes points, lines, and polygons.  

7. Array data types: This includes arrays of any of the above data types.  

8. JSON and JSONB data types: This includes JSON objects and arrays.  

9. XML data types: This includes XML documents.  

10. Composite data types: This includes user-defined data types that can contain multiple fields of different data types.  

Overall, PostgreSQL's API provides access to a wide range of data types, making it a versatile and powerful tool for data management and analysis.

What data can you transfer to Redshift?

You can transfer a wide variety of data to Redshift. This usually includes structured, semi-structured, and unstructured data like transaction records, log files, JSON data, CSV files, and more, allowing robust, scalable data integration and analysis.

What are top ETL tools to transfer data from Postgres to Redshift?

The most prominent ETL tools to transfer data from Postgres to Redshift include:

  • Airbyte
  • Fivetran
  • Stitch
  • Matillion
  • Talend Data Integration

These tools help in extracting data from Postgres and various sources (APIs, databases, and more), transforming it efficiently, and loading it into Redshift and other databases, data warehouses and data lakes, enhancing data management capabilities.