Building your pipeline or Using Airbyte
Airbyte is the only open solution empowering data teams to meet all their growing custom business demands in the new AI era.
- Inconsistent and inaccurate data
- Laborious and expensive
- Brittle and inflexible
- Reliable and accurate
- Extensible and scalable for all your needs
- Deployed and governed your way
Start syncing with Airbyte in 3 easy steps within 10 minutes
Take a virtual tour
Demo video of Airbyte Cloud
Demo video of AI Connector Builder
What sets Airbyte Apart
Modern GenAI Workflows
Move Large Volumes, Fast
An Extensible Open-Source Standard
Full Control & Security
Fully Featured & Integrated
Enterprise Support with SLAs
What our users say
"The intake layer of Datadog’s self-serve analytics platform is largely built on Airbyte.Airbyte’s ease of use and extensibility allowed any team in the company to push their data into the platform - without assistance from the data team!"
“Airbyte helped us accelerate our progress by years, compared to our competitors. We don’t need to worry about connectors and focus on creating value for our users instead of building infrastructure. That’s priceless. The time and energy saved allows us to disrupt and grow faster.”
“We chose Airbyte for its ease of use, its pricing scalability and its absence of vendor lock-in. Having a lean team makes them our top criteria. The value of being able to scale and execute at a high level by maximizing resources is immense”
1. Connect to your PostgreSQL Database:
Use `psql` or any PostgreSQL client to connect to your database.
2. Choose the Data to Export:
Determine which tables or data you want to move to Redshift.
3. Export the Data:
Use the `COPY` command to export the data to a CSV file. For example:
```sql
COPY (SELECT * FROM your_table_name) TO '/path/to/your/output.csv' DELIMITER ',' CSV HEADER;
```
Replace `/path/to/your/output.csv` with the path where you want to save your file and `your_table_name` with the name of the table you're exporting.
4. Compress the Data (Optional):
To save space and upload time, you can compress the CSV file using gzip:
```bash
gzip /path/to/your/output.csv
```
This will create a file named `output.csv.gz`.
1. Create an S3 Bucket:
- Log in to your AWS Management Console.
- Navigate to the S3 service and create a new bucket.
- Set the name and region for the bucket.
2. Set Permissions:
- Ensure that the bucket has the necessary permissions so that Redshift can access the data.
- You may need to attach an IAM policy to your Redshift cluster's role for access to the S3 bucket.
1. Install AWS CLI:
If you haven't already, install the AWS Command Line Interface (CLI) on your machine.
2. Configure AWS CLI:
Run `aws configure` to set up your AWS credentials and default region.
3. Upload the File:
Use the `aws s3 cp` command to upload your data file to the S3 bucket:
```bash
aws s3 cp /path/to/your/output.csv.gz s3://your-bucket-name/
```
Replace `/path/to/your/output.csv.gz` with the path to your compressed file and `your-bucket-name` with the name of your S3 bucket.
1. Connect to Your Redshift Cluster:
Use a SQL client that supports Redshift to connect to your cluster.
2. Create a Table:
Create a table in Redshift that matches the schema of the PostgreSQL data you're importing. For example:
```sql
CREATE TABLE your_redshift_table (
column1 datatype,
column2 datatype,
...
);
```
3. Copy Data from S3:
Use the `COPY` command in Redshift to load data from the S3 bucket into your Redshift table:
```sql
COPY your_redshift_table
FROM 's3://your-bucket-name/output.csv.gz'
CREDENTIALS 'aws_iam_role=your-iam-role-arn'
DELIMITER ','
IGNOREHEADER 1
GZIP
REGION 'your-region';
```
Replace `your_redshift_table` with the name of your table in Redshift, `your-bucket-name` with the name of your S3 bucket, `your-iam-role-arn` with the ARN of the IAM role that has access to S3, and `your-region` with the region of your S3 bucket.
4. Verify the Data:
After the `COPY` command has completed, run some queries to verify that the data was imported correctly.
1. Remove Temporary Files:
Once you've confirmed the data is in Redshift, you can delete the CSV files from your local machine and the S3 bucket to avoid incurring storage costs.
2. Monitor Your Redshift Cluster:
Check the performance and storage metrics of your Redshift cluster to ensure it handles the new data well.
By following these steps, you can move data from PostgreSQL to Amazon Redshift without the need for third-party connectors or integrations. Always remember to handle credentials and access permissions with care to maintain the security of your data.
FAQs
What is ETL?
ETL, an acronym for Extract, Transform, Load, is a vital data integration process. It involves extracting data from diverse sources, transforming it into a usable format, and loading it into a database, data warehouse or data lake. This process enables meaningful data analysis, enhancing business intelligence.
An object-relational database management system, PostgreSQL is able to handle a wide range of workloads, supports multiple standards, and is cross-platform, running on numerous operating systems including Microsoft Windows, Solaris, Linux, and FreeBSD. It is highly extensible, and supports more than 12 procedural languages, Spatial data support, Gin and GIST Indexes, and more. Many webs, mobile, and analytics applications use PostgreSQL as the primary data warehouse or data store.
PostgreSQL gives access to a wide range of data types, including:
1. Numeric data types: This includes integers, floating-point numbers, and decimal numbers.
2. Character data types: This includes strings, text, and character arrays.
3. Date and time data types: This includes dates, times, and timestamps.
4. Boolean data types: This includes true/false values.
5. Network address data types: This includes IP addresses and MAC addresses.
6. Geometric data types: This includes points, lines, and polygons.
7. Array data types: This includes arrays of any of the above data types.
8. JSON and JSONB data types: This includes JSON objects and arrays.
9. XML data types: This includes XML documents.
10. Composite data types: This includes user-defined data types that can contain multiple fields of different data types.
Overall, PostgreSQL's API provides access to a wide range of data types, making it a versatile and powerful tool for data management and analysis.
What is ELT?
ELT, standing for Extract, Load, Transform, is a modern take on the traditional ETL data integration process. In ELT, data is first extracted from various sources, loaded directly into a data warehouse, and then transformed. This approach enhances data processing speed, analytical flexibility and autonomy.
Difference between ETL and ELT?
ETL and ELT are critical data integration strategies with key differences. ETL (Extract, Transform, Load) transforms data before loading, ideal for structured data. In contrast, ELT (Extract, Load, Transform) loads data before transformation, perfect for processing large, diverse data sets in modern data warehouses. ELT is becoming the new standard as it offers a lot more flexibility and autonomy to data analysts.
An object-relational database management system, PostgreSQL is able to handle a wide range of workloads, supports multiple standards, and is cross-platform, running on numerous operating systems including Microsoft Windows, Solaris, Linux, and FreeBSD. It is highly extensible, and supports more than 12 procedural languages, Spatial data support, Gin and GIST Indexes, and more. Many webs, mobile, and analytics applications use PostgreSQL as the primary data warehouse or data store.
A fully managed data warehouse service in the Amazon Web Services (AWS) cloud, Amazon Redshift is designed for storage and analysis of large-scale datasets. Redshift allows businesses to scale from a few hundred gigabytes to more than a petabyte (a million gigabytes), and utilizes ML techniques to analyze queries, offering businesses new insights from their data. Users can query and combine exabytes of data using standard SQL, and easily save their query results to their S3 data lake.
1. Open your PostgreSQL database and create a new user with the necessary permissions to access the data you want to replicate.
2. Obtain the hostname or IP address of your PostgreSQL server and the port number it is listening on.
3. Create a new database in PostgreSQL that will be used to store the replicated data.
4. Obtain the name of the database you just created.
5. In Airbyte, navigate to the PostgreSQL source connector and click on "Create Connection".
6. Enter a name for your connection and fill in the required fields, including the hostname or IP address, port number, database name, username, and password.
7. Test the connection to ensure that Airbyte can successfully connect to your PostgreSQL database.
8. Select the tables or views you want to replicate and configure any necessary settings, such as the replication frequency and the replication method.
9. Save your configuration and start the replication process.
10. Monitor the replication process to ensure that it is running smoothly and troubleshoot any issues that arise.
1. First, log in to your Airbyte account and navigate to the "Destinations" tab on the left-hand side of the screen.
2. Click on the "Add Destination" button and select "Redshift" from the list of available connectors.
3. Enter your Redshift database credentials, including the host, port, database name, username, and password.
4. Choose the schema you want to use for your data in Redshift.
5. Select the tables you want to sync from your source connector to Redshift.
6. Map the fields from your source connector to the corresponding fields in Redshift.
7. Choose the sync mode you want to use, either "append" or "replace."
8. Set up any additional options or filters you want to use for your sync.
9. Test your connection to ensure that your data is syncing correctly.
10. Once you are satisfied with your settings, save your configuration and start your sync.
With Airbyte, creating data pipelines take minutes, and the data integration possibilities are endless. Airbyte supports the largest catalog of API tools, databases, and files, among other sources. Airbyte's connectors are open-source, so you can add any custom objects to the connector, or even build a new connector from scratch without any local dev environment or any data engineer within 10 minutes with the no-code connector builder.
We look forward to seeing you make use of it! We invite you to join the conversation on our community Slack Channel, or sign up for our newsletter. You should also check out other Airbyte tutorials, and Airbyte’s content hub!
What should you do next?
Hope you enjoyed the reading. Here are the 3 ways we can help you in your data journey:
Data engineers are tasked to move data between storage systems like applications, databases, data warehouses, and data lakes. One of the most popular data replication use cases includes moving data from a relational database to a data warehouse to enable analytics. This article shows how to build an ELT pipeline to replicate data from Postgres to Redshift using Airbyte, an open-source data integration platform.
Postgres and Redshift store data differently. To replicate data between the two systems, you will need at least to extract the Postgres data, transform it to a format that is compatible with Redshift and load it to Redshift.
Airbyte does not only provide connectors for various sources and destinations (including Postgres and Redshift), but it is also an ELT framework with monitoring, scheduling, logging, and support for multiple data replication modes.
Why replicate data from Postgres to Redshift
One question that might come to your mind is, “Why to move data from Postgres to Redshift?”. To answer this question, it’s important to understand the differences between Online Transactional Processing (OLTP) and Online Analytical Processing (OLAP) databases.
OLTP databases capture and store transactional-oriented data in real-time. It is mostly used for the effective running of the day-to-day activities of an organization. When OLTP databases are used for complex analysis, it slows down the website's operations. Examples of OLTP databases are Postgres, MySQL, and MySQL.
OLAP databases are mostly used by data scientists, analysts, or business intelligence experts for analysis, research, insights generation, and reporting. These types of databases allow fast execution of complex analytical queries. These types of databases are commonly referred to as data warehouses and can store large volumes of data and run complex queries on them to generate insights. OLAP data warehouses are AWS Redshift, Google BigQuery, Azure Synapse, and Snowflake.
This tutorial aims to show how to replicate data from a transactional database (Postgres) to an analytical data warehouse (Redshift) in a fast and effective way, using Airbyte.
{{COMPONENT_CTA}}
Methods to Move Data From Postgres to Redshift
- Method 1: Connecting Postgres to Redshift using Airbyte.
- Method 2: Connecting Postgres to Redshift manually.
Method 1: Connecting Postgres to Redshift using Airbyte
Prerequisites
- You will need to deploy Airbyte. Airbyte Cloud comes with free credits to get started. You can sign up for Airbyte Cloud or deploy Airbyte Open Source.
- You will need an AWS account.
- You will need an instance of the Postgres database, which you can remotely connect to. One of the very essential data engineering skills you will learn in this article is also how to host your local Postgres database on an Amazon RDS database instance.
- You will need an instance of the Amazon Redshift cluster that you can connect to remotely.
Set up an Amazon RDS database
To set up a Postgres database as an Airbyte source, you must host your local Postgres database in the cloud. To do this, there are many options. You can use Heroku, ElephantSQL, HelioHost, Amazon RDS, CloudSQL, and many other database cloud hosting platforms or run your Postgres database on a machine you will secure. For this tutorial, you will use the Amazon RDS instance to host your Postgres database.
Before creating an Amazon RDS instance, the first thing you need to do is to create a security group. This is important because it will allow Airbyte’s Cloud IP and other IP addresses you specify, to connect to your Amazon RDS database.
To do this, you go to EC2 >> Security Groups >> Create Security Group. In the Inbound Rules section, you will allow two IP addresses from your Postgres instance.
In the above image, the security group was given the tag name of Airbyte-Rule. This tag name will come in handy when you want to allocate your Amazon RDS database instance to the security group. The first inbound rule allows Airbyte Cloud IP (34.106.109.131/32 at the time of writing) to connect to the remote Amazon RDS instance. The second inbound rule allows your local IP address to connect to the Amazon RDS instance.
The next thing you will do is create an Amazon RDS database instance. To do this, search your Amazon console for Amazon RDS.
You then go to Databases >> Create database.
After clicking on create database, you configure your database. You choose the standard to create method and the Postgres engine option.
You will also have to allocate the security group you just created to the Amazon RDS instance. To do this, you go to Connectivity >> Security group. You will then attach the Airbyte-Rule security group.
It is important to grant your database public access while creating it. This will enable your local Postgres database to communicate with the remote Postgres database hosted on AWS.
After granting the database public access, create the database. You may have to wait for a while because it takes some time to create the database and allocate compute resources. Once the database becomes available, you will be able to see endpoints you can use to connect to your remote Postgres database.
Create a Postgres table and populate the table
In the previous section, you created an Amazon RDS instance which will enable you to run a Postgres database on the cloud. In this section, you will connect to the Amazon RDS instance you just created. You will also create a table in your Postgres database and populate the table with data.
Locally you can install pgAdmin to connect to the Amazon RDS instance with the endpoints from the remote database instance.
In the connection tab, you specify your hostname, port number, username, database name, and password. All these parameters were configured by you when creating the Amazon RDS database instance. Once saved, you have successfully connected to your remote instance.
The dataset used in this tutorial can be downloaded here. The dataset contains the details of all consumer complaints from a store. The image below shows the SQL query for creating a schema and a table. This table will be populated with data from the CSV file in the link above.
create schema rds_schema;
create table rds_schema.complaints("Date Received" VARCHAR(50),
"Product Name" VARCHAR(100),
"Sub Product" VARCHAR(100),
"Issue" VARCHAR(100),
"Sub Issue" VARCHAR(100),
"Consumer Complaint Narrative" TEXT,
"Company Public Response" VARCHAR(200),
"Company" VARCHAR(100),
"State Name" VARCHAR(100),
"Zip Code" VARCHAR(100),
"TAGS" VARCHAR(50),
"Consumer Consent Provided" VARCHAR(50),
"Submitted via" VARCHAR(50),
"Date Sent to Company" VARCHAR(50),
"Company Response to Consumer" VARCHAR(100),
"Timely Response" VARCHAR(100),
"Consumer Disputed" VARCHAR(50),
"Complaint ID" INT);
In the code block above, you created a schema called rds_schema and a table called complaints. This table will be populated with the records from the CSV file. To populate the table with the CSV file, you will head over to psql terminal and run the command:
\copy rds_schema.complaints FROM '{path_to_the_csv_file}' with (format csv, header true, delimiter ',');
After executing this code block, the data will be successfully loaded into the Postgres database. You can preview the loaded data by executing the code:
select * from rds_schema.complaints limit 10:
You can also get information on the number of rows in the table.
In the image above, notice that the table contains about 65,499 records.
Set up a Postgres source
To create the Postgres source, head over to Airbyte Cloud and create a new source. Enter the source name, host address, database name, the schema you created (rds_schema), username and password. All these details were used when creating your Amazon RDS instance.
If you use the default SSL mode of disable, your test might fail. You will get an error message of Unsecured connection. To solve this error, you will have to set SSL Modes to require. SSL Modes are ways of ensuring a secure browser connection, and Amazon RDS supports SSL encryption by default. All data connections in Airbyte Cloud are encrypted.
Method 2: Connecting Postgres to Redshift manually
Moving data from PostgreSQL to Amazon Redshift without using third-party connectors or integrations involves several steps, including exporting data from PostgreSQL, preparing Amazon S3 for data storage, uploading data to S3, and finally copying data into Redshift. Below is a detailed step-by-step guide to accomplish this task:
Step 1: Export Data from PostgreSQL
1. Connect to your PostgreSQL Database:
Use `psql` or any PostgreSQL client to connect to your database.
2. Choose the Data to Export:
Determine which tables or data you want to move to Redshift.
3. Export the Data:
Use the `COPY` command to export the data to a CSV file. For example:
```sql
COPY (SELECT * FROM your_table_name) TO '/path/to/your/output.csv' DELIMITER ',' CSV HEADER;
```
Replace `/path/to/your/output.csv` with the path where you want to save your file and `your_table_name` with the name of the table you're exporting.
4. Compress the Data (Optional):
To save space and upload time, you can compress the CSV file using gzip:
```bash
gzip /path/to/your/output.csv
```
This will create a file named `output.csv.gz`.
Step 2: Prepare Amazon S3
1. Create an S3 Bucket:
- Log in to your AWS Management Console.
- Navigate to the S3 service and create a new bucket.
- Set the name and region for the bucket.
2. Set Permissions:
- Ensure that the bucket has the necessary permissions so that Redshift can access the data.
- You may need to attach an IAM policy to your Redshift cluster's role for access to the S3 bucket.
Step 3: Upload Data to Amazon S3
1. Install AWS CLI:
If you haven't already, install the AWS Command Line Interface (CLI) on your machine.
2. Configure AWS CLI:
Run `aws configure` to set up your AWS credentials and default region.
3. Upload the File:
Use the `aws s3 cp` command to upload your data file to the S3 bucket:
```bash
aws s3 cp /path/to/your/output.csv.gz s3://your-bucket-name/
```
Replace `/path/to/your/output.csv.gz` with the path to your compressed file and `your-bucket-name` with the name of your S3 bucket.
Step 4: Copy Data into Amazon Redshift
1. Connect to Your Redshift Cluster:
Use a SQL client that supports Redshift to connect to your cluster.
2. Create a Table:
Create a table in Redshift that matches the schema of the PostgreSQL data you're importing. For example:
```sql
CREATE TABLE your_redshift_table (
column1 datatype,
column2 datatype,
...
);
```
3. Copy Data from S3:
Use the `COPY` command in Redshift to load data from the S3 bucket into your Redshift table:
```sql
COPY your_redshift_table
FROM 's3://your-bucket-name/output.csv.gz'
CREDENTIALS 'aws_iam_role=your-iam-role-arn'
DELIMITER ','
IGNOREHEADER 1
GZIP
REGION 'your-region';
```
Replace `your_redshift_table` with the name of your table in Redshift, `your-bucket-name` with the name of your S3 bucket, `your-iam-role-arn` with the ARN of the IAM role that has access to S3, and `your-region` with the region of your S3 bucket.
4. Verify the Data:
After the `COPY` command has completed, run some queries to verify that the data was imported correctly.
Step 5: Clean Up
1. Remove Temporary Files:
Once you've confirmed the data is in Redshift, you can delete the CSV files from your local machine and the S3 bucket to avoid incurring storage costs.
2. Monitor Your Redshift Cluster:
Check the performance and storage metrics of your Redshift cluster to ensure it handles the new data well.
By following these steps, you can move data from PostgreSQL to Amazon Redshift without the need for third-party connectors or integrations. Always remember to handle credentials and access permissions with care to maintain the security of your data.
Set up a Redshift destination
To set up a Redshift destination, you need a Redshift cluster. You can create a Redshift cluster by following the steps in the link here. Once your cluster is created, head over to Properties >> Network and security settings tab.
In the VPC security group section, you will see the security group of your Redshift cluster. You want to allow Airbyte Cloud's remote IP address to connect to the Redshift cluster. To do this, modify the security group and allow inbound connections from Airbyte.
Another thing you would have to do is to enable public accessibility from Redshift. This will allow Airbyte Cloud to connect to your Amazon Redshift cluster easily. To enable public accessibility, go to Redshift cluster >> Actions >> Modify public accessible setting.
If you go to your Redshift cluster information, you will see some connection endpoints. These endpoints are used to connect Airbyte Cloud to the Amazon Redshift cluster.
To set up a Redshift destination, go to Airbyte Cloud >> Destinations. Enter your Amazon Redshift hostname. Note that your hostname is your endpoint listed at the top of the page. Note that your endpoint URL must end with amazon.com.
The remaining parameters such as database, username, and password were set up during your Amazon Redshift cluster configuration. To check your database configurations, go to the Properties tab in Redshift.
Redshift, by default, comes with a database called dev and an admin user called awsuser. You can configure this while setting up the cluster. For simplicity's sake, leave the settings as default.
Now, to set up an Amazon Redshift destination, go to Airbyte Cloud and select Redshift as your destination. You also have to enter your hostname, database name, username, and password.
Set up a Postgres to Redshift connection
Once the source and destination settings have been configured, the next thing to do is to go to Airbyte Cloud and set up a connection.
Once you set up your source and destination, you will automatically see the schema created and the complaints table. The Full Refresh sync mode is the easiest to get started. There are four types of sync modes in Airbyte.
- Full refresh - append: Full refresh receives all the available data from the source, regardless of whether it has been previously synced. The full refresh - append takes all the available data from a particular sync and appends it to the previously synced data. This makes each sync create duplicated records of the already existing data.
- Full refresh - overwrite: This is another type of sync in Airbyte. In full refresh - overwrite mode, new syncs will overwrite all the data in the destination table and push the new data.
- Incremental sync - append:. In incremental sync, only new or modified data will be synced to its destination. In incremental append mode, refetching data you have already synced is not allowed. This prevents duplicated data, as in the case of full refresh append. Here, each update creates a new row with the updated data.
- Incremental sync - deduped history: This is similar to incremental sync append. The only difference is that here, each change updates the rows that have been modified instead of creating a new row containing the modified data.
Run the sync
After setting up a connection, the next thing you have to do is to run the sync.
From the image above, the sync was successful, and exactly 65,499 records were synced from Postgres to Redshift. The time it took for about sixty-five thousand records was about three minutes. Now let us check the Redshift query editor to see if the tables were fully loaded.
Verify the results
In your Redshift query editor, let’s preview the data and display the first five records.
When you run the query, the first five records of the source data in Postgres get displayed in Amazon Redshift.
Logically, if the process works correctly, the total number of records in your source database is meant to be the total number of records in your target database. Let us run some validity checks to see if this process works. Recall, the total number of records from your Postgres database was 65,499. Let us confirm if that is the total number of records in your Amazon Redshift data warehouse.
From this, you can validate that all the data was synced from your source database to your destination data warehouse.
Conclusion
So far, you have seen how easy it is to set up Airbyte and use it for your ETL tasks. All it takes is just for you to set up a source system and a destination system. After setting this up, you can grab a coffee and relax, and let Airbyte do the rest for you. In this article, you learned how to replicate data from Postgres to Amazon Redshift by:
- Configuring a Postgres source
- Configuring a Redshift destination
- Setting up a connection
- Running the sync
Speed and precision are vital for your business to run smoothly and effectively. Data replication should not be a drawback to your organization. As you have seen, it’s easy to get started with Airbyte Cloud to replicate your Postgres database to Redshift for analytics. Join the conversation at Airbyte’s Community Slack to share your experience and challenges with other data engineers.
What should you do next?
Hope you enjoyed the reading. Here are the 3 ways we can help you in your data journey:
Ready to get started?
Frequently Asked Questions
PostgreSQL gives access to a wide range of data types, including:
1. Numeric data types: This includes integers, floating-point numbers, and decimal numbers.
2. Character data types: This includes strings, text, and character arrays.
3. Date and time data types: This includes dates, times, and timestamps.
4. Boolean data types: This includes true/false values.
5. Network address data types: This includes IP addresses and MAC addresses.
6. Geometric data types: This includes points, lines, and polygons.
7. Array data types: This includes arrays of any of the above data types.
8. JSON and JSONB data types: This includes JSON objects and arrays.
9. XML data types: This includes XML documents.
10. Composite data types: This includes user-defined data types that can contain multiple fields of different data types.
Overall, PostgreSQL's API provides access to a wide range of data types, making it a versatile and powerful tool for data management and analysis.
What should you do next?
Hope you enjoyed the reading. Here are the 3 ways we can help you in your data journey: