All tutorials
No items found.

Export Postgres data to CSV, JSON, Parquet and Avro files in S3

Learn how to easily export Postgres data to CSV, JSON, Parquet, and Avro file formats stored in AWS S3.

Transactional databases like Postgres are an excellent choice for OLTP workloads. Still, sometimes you need to serialize your business data into a different format, like an apache Parquet column-oriented format that is better suited for analytical workloads. 

Then you can easily run analytics jobs on your S3 data lake. Exporting your Postgres data to file formats like Parquet, Avro, JSON, CSV, etc., can empower Data Analysts and Data Scientists to work on your operational data without impacting production workloads. It’s common to export data from Postgres to S3 to efficiently store big data, convert data to “self-describing” file formats and run analytical workloads.

Store big data efficiently
Big data file formats can reduce file sizes significantly using data compression techniques, making it possible to store more data using less space. In some cases, you may want to offload historical data that is not queries anymore from your production database to a data lake built on S3.

Run fast analytics workloads
You can run analytical workloads faster on column-based file formats like Parquet.

Airbyte is a data integration platform with open-source connectors for the most popular databases, data warehouses and data lake storages. In this tutorial, you will learn easily set up a data pipeline from Postgres to AWS S3 using different file formats like CSV, JSON, Avro, and Parquet with Airbyte.

Prerequisites

  • An active AWS account to set up S3 Bucket.
  • A Postgres database. In this tutorial, we use Heroku to host our PostgreSQL database. In that case, you need Heroku CLI installed locally on your system.
  • An Airbyte Cloud account. You can also deploy Airbyte yourself with docker-compose.

Step 1: Set up a Postgres database (optional)

You can skip this step if you already have a PostgreSQL. For this tutorial, you will use Heroku's managed PostgreSQL add-on to create a database host. 

You first need to create a new Heroku app to host a new database on Heroku. Login to your Heroku dashboard and choose to create a new app.  Also, you can change the Region to have the app's servers closer to your location.

To add a PostgreSQL database to your app, install Heroku’s Postgres Add-on. Go to Resource and look for the Heroku PostgreSQL add-on.

The next step asks you to pick up a pricing plan for this resource. To find the credentials and the connection URL for the PostgreSQL database, you need to navigate to the Resources tab in your app's dashboard again and select the Heroku Postgres resource.

This will bring you to the configuration screen for your database. Click on the Settings tab and View Credentials to find the database credentials. Take a note of the Host, Database, User, Post, and Password. You will need these details while setting up the Airbyte source.

To test out if you can set up a connection to the newly created database. Copy the URI provided by Heroku inside Credentials and run the following command.


heroku pg:psql --app app-name

Step 2: Set up an S3 bucket

You will need to create an S3 bucket to store the files exported from Postgres. Here are some simple steps to set up an S3 bucket if you don’t already have one.

  1. Sign in to the AWS Management Console and open the Amazon S3 console.
  2. Choose Create bucket.
  3. In Bucket name, enter a DNS-compliant name for your bucket.
  4. In Region, choose the AWS Region; choose a Region close to you to minimize latency and costs and address regulatory requirements.
  5. Choose the recommended option for Object Ownership.
  6. Block all public access.
  7. Optionally, you can choose to add any tags.
  8. Choose Create bucket.

Once you create the bucket, take a note of the region, you will need this later on to set up the S3 destination on Airbyte.

Create a new IAM policy with write access to S3

Head over to IAM > Policies > Create Policy. Give the group a name and attach a policy with write access to S3.

Set up S3 users

Before you can start using your bucket, you need to have an appropriate set up for who can access it. For this, you need to create a new user group.

  1. In IAM, pick User groups, then Create a group.
  2. Give this group a rememberable name like “airbyte-group.”
  3. Make sure to attach the S3 Policy you just created.
  4. Choose to “Add a user”.
  5. For the AWS Access type, choose the Access key - Programmatic access to obtain the access key ID and secret access key; you will need both these keys to set up your Airbyte destination.

Step 3: Set up a Postgres Airbyte source

To set up a new Postgres source, go to Sources in your Airbyte cloud account and click on New Source. Enter the Host, Port, DB Name, Username, and Password obtained from Heroku like this.

For the database replication method, go with the Standard option; you also choose to go with Postgres logical replication (CDC) if you want to capture all changes on the source database, but that requires more configuration.

Once satisfied with your settings, click on Set up source and wait while Airbyte tests the connection to the database.

Step 4: Set up an S3 Airbyte destination

Next, create a new S3 destination to copy your Postgres data.

  1. Enter the S3 Key Id and, S3 Access key & Bucket Name obtained while setting up the bucket in Step 2 above.
  2. To have consistency choose the S3 bucket path & Output Format to be the same. E.g., if you choose the output format to be CSV, then enter the bucket path as CSV.
  3. For the S3 Bucket region, refer to Step 2 above.
  4. Choose Set up destination after verifying all the data you just entered.

Step 5: Set up a connection from Postgres to S3

Next, you will set up a connection from your Postgres database to the S3 bucket. Head over to Connections and pick New Connection. Select the Postgres source from the existing sources and the S3 destination you just created.

Then you need to set up the connection settings. Airbyte has correctly found the sample table while setting up Postgres. Choose the Full refresh, overwrite sync mode. This will delete any existing file before importing again.

Once satisfied with the configuration, choose Set up connection and wait for Airbyte to import data from your Postgres table to the S3 bucket. Once the import is done, go back to your AWS console and select the S3 bucket where this data is exported. The path of the imported CSV should have the following format:

bucket_name/path/schema/table_name/YYYY_MM_DD_Millseconds_since_epoch.csv.

If you download the CSV file, you will see that the data was correctly imported.


"_airbyte_ab_id","_airbyte_emitted_at","_airbyte_data" "29e44ed4-1ae3-4f98-b138-7e82c573bb48",1660315412694,"{""id"":1,""first_name"":""Andrew"",""last_name"":""Adams"",""age"":23,""email"":""andrew@chinookcorp.com"",""address"":""11120 Jasper Ave NW "",""updated_at"":null}" "cc5e3768-5e5c-4188-865f-aebc0112dcde",1660315412694,"{""id"":2,""first_name"":""Nancy"",""last_name"":""Edwards"",""age"":31,""email"":""ednan@nan.com"",""address"":""825 8 Ave SW "",""updated_at"":null}"
…

Export Postgres data to Avro

Apache Avro is a row-based file format widely used for data serialization. Avro stores data and schema on the same file or message, so you can use it to stream your data. If you want to export your whole data from Postgres, and read it from elwsewhere, Avro is a very efficient data serialization format. To export Postgres data to Avro, update the Output Format in the Airbyte S3 destination settings or create a completely new connection to convert the data in Avro format.

Here is a sample output of Airbyte Avro output. You can read more about how Airbyte serializes data to Avro in the S3 destination docs.

Export Postgres data to Parquet

Apache Parquet is an open source big data file format built to handle flat columnar storage data formats. It is built to support very efficient compression and encoding schemes, making it an ideal choice for building analytical pipelines. Airbyte gives out-of-the-box support for converting data from a transactional database like Postgres to Parquet file format.

Here is the sample Aibyte Parquet conversion for our data in the Employee table. You can read more about how Airbyte serializes data to Parquet in the S3 destination docs.

Export Postgres data to JSON

Airbyte supports converting data to JSONL while the destination is S3. You can also export the JSON file in the GZIP compression format for faster uploads/downloads. The output file will have the .jsonl.gz extension if you enable compression.

Airbyte exports the JSONL data in the following format.


{
 "_airbyte_ab_id": "",
 "_airbyte_emitted_at": "",
 "_airbyte_data": ""
}

Here is how the Employees data looks like after the JSONL conversion. Notice the actual data is inside the _airbyte_data key.


{"_airbyte_ab_id":"48613626-765a-4cfb-aad6-e9714455fe9d","_airbyte_emitted_at":1660370568080,"_airbyte_data":{"id":1,"first_name":"Andrew","last_name":"Adams","age":23,"email":"andrew@chinookcorp.com","address":"11120 Jasper Ave NW ","updated_at":null}} {"_airbyte_ab_id":"a5758765-5098-4c7e-a0f6-ece39ef51f98","_airbyte_emitted_at":1660370568080,"_airbyte_data":{"id":2,"first_name":"Nancy","last_name":"Edwards","age":31,"email":"ednan@nan.com","address":"825 8 Ave SW ","updated_at":null}} {"_airbyte_ab_id":"b1f1072e-5c78-46cc-9cf4-eaae255ce7b6","_airbyte_emitted_at":1660370568080,"_airbyte_data":{"id":3,"first_name":"Jane","last_name":"Peacock","age":43,"email":"peaj@pj.com","address":"1111 6 Ave SW 
","updated_at":null}}
…

Conclusion

To summarize,  this tutorial has shown you how to:

  1. Configure a Postgres database using Heroku Postgres.
  2. Configure an AWS S3 Bucket.
  3. Create an Airbyte connection that automatically exports your Postgres table data to S3 using popular file formats: Parquet, Avro and JSON.

With Airbyte, the data integration possibilities are endless! We invite you to join the conversation with on our community Slack with thousands of data engineers or sign up for our data engineering newsletter to stay in touch.

Open-source data integration

Get all your ELT data pipelines running in minutes with Airbyte.

Similar use cases

Validate data replication pipelines with data-diff

Learn to replicate data from Postgres to Snowflake with Airbyte, and compare replicated data with data-diff.

Version control Airbyte configurations with Octavia CLI

Use Octavia CLI to import, edit, and apply Airbyte application configurations to replicate data from Postgres to BigQuery.

Load Data into Delta Lake on Databricks Lakehouse

Learn how to load data to a Databricks Lakehouse and run simple analytics.