No items found.

Replicate data from a PostgreSQL database to Snowflake

Learn how to replicate data from an OnLine Transactional Processing (OLTP) database like PostgreSQL, to an OnLine Analytical Processing (OLAP) data warehouse like Snowflake.

Imagine you’re replicating a database containing millions of records for a fast-growing online store to a data warehouse for analysis. Your data replication solution has to be fast, efficient, and cost-effective. 

In this tutorial, you’ll learn how to use Airbyte to transfer data from an OnLine Transactional Processing (OLTP) database, PostgreSQL, to an OnLine Analytical Processing (OLAP) data warehouse, Snowflake. You’ll also learn how to deal with edge cases, like partitioned tables and custom data types.

OLTP vs OLAP

Before you start replicating data, it's important to understand what OLTP and OLAP mean.

OLTP is a form of data storage and processing that executes a number of concurrent transactions. OLTP workloads are optimized for database inserts. Example application areas of OLTP include online banking, online shopping, order entry, and sending text messages. OLTP follows ACID principles. In basic terms, this means a transaction to the database either completes or fails.

OLAP is a form of data storage and processing that focuses on storing denormalized data to make querying easier when carrying out data analysis. OLAP systems are optimized for data analysis. This form of storage is used in data warehouses and data lakess. Data analysts can make queries on historical data from an OLAP environment to make better analytical decisions.

Snowflake is an OLAP is a cloud-based data warehouse. There is no hardware to configure, no software to install, and no maintenance required. Snowflake ships with an integrated data warehouse, as well as analytics tools for running OLAP workloads. These features make Snowflake a great choice for OLAP workloads.

There are two main sync modes to replicate data from OLTP to OLAP systems. 

  • A Full refresh replicates all of the data from the OLTP source to the OLAP destination.
  • An Incremental sync replicates data at intervals from a OLTP source to a OLAP destination as new data is added.

Airbyte is a data integration tool that allows you to replicate data to data warehouses, lakes, and databases. Airbyte supports both full refresh and incremental syncs. In this tutorial, you will use full refresh mode to replicate data from PostgreSQL to Snowflake.

Pre-requisites

Sowflake supports a rich set of data types, including semi-structured data types such as JSON and XML. This means that most tables won’t require extra data type conversion. For tables with data types that Snowflake doesn’t support, Airbyte converts that data to an Airbyte-specific string type which ultimately ends up as a VARCHAR type on Snowflake.

Airbyte streams data from a source, builds it into temp files, and delivers the data to the destination. You can choose to use your local machine for storing temp files as part of the replication process, or use an external staging area like Amazon S3 or Google Cloud Storage (GCS). An external staging area offers faster replication speeds than your local machine. 

Before you can start replicating data, you’ll need to ensure the following conditions are met:

  1. A staging area (Amazon S3 or Google Cloud Storage) for fast and efficient loading of data from PostgreSQL to Snowflake.
  2. An instance of PostgreSQL. Follow these instructions to install PostgreSQL if you don't have it locally.
  3. Docker and Docker Compose installed.

Now that you understand the prerequisites, you’ll go over the steps of replicating a simple PostgreSQL database to Snowflake with Airbyte.

Set up PostgreSQL

The first step is to create and populate a local PostgreSQL instance. You’ll need to clone the below GitHub repo. The repo contains all the code and files you’ll use throughout the tutorial.


git clone https://github.com/vicradon/postgres-airbyte-snowflake-tut.git

After cloning the repo, navigate to the clone repo. Then create a new database.


createdb airbyte_tut -U postgres

After creating the database, run the command below to create a table, movies and populate it with data.


psql -d airbyte_tut -f movies.sql -U postgres

Now, enter the psql shell as user postgres, psql -U postgres, then connect to the airbyte_tut database, \c airbyte_tut. Next, create a dedicated read-only user that will have access to the relevant tables in the airbyte_tut database.


CREATE USER airbyte PASSWORD 'your_password_here';

Then, give the Airbyte user access to the airbtye_tut database.


GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO airbyte;


Set Up Snowflake

When creating a Snowflake account you’ll need to pick a Snowflake edition and a cloud provider as part of the account creation process. Next, you’ll receive an email containing your login URL similar to https://id.region.cloud-provider.snowflakecomputing.com/console/login. Make sure to bookmark this URL for future logins.

After you create your username and password you will be redirected to the Snowflake dashboard after your account has been successfully created. The worksheet area will be the primary place you’ll run scripts for creating and modifying resources.

You need to create the destination database, user, role, and schema on Snowflake where the sync will occur. Airbyte already provides a script for running this automation. Paste the snippet below into your worksheet area on Snowflake. Make sure to change the airbyte_password variable to your preferred password before running the script.


-- set variables (these need to be uppercase)
set airbyte_role = 'AIRBYTE_ROLE';
set airbyte_username = 'AIRBYTE_USER';
set airbyte_warehouse = 'AIRBYTE_WAREHOUSE';
set airbyte_database = 'AIRBYTE_DATABASE';
set airbyte_schema = 'AIRBYTE_SCHEMA';

-- set user password
set airbyte_password ='YOUR_AIRBYTE_PASSWORD';

begin;

-- create Airbyte role
use role securityadmin;
create role if not exists identifier($airbyte_role);
grant role identifier($airbyte_role) to role SYSADMIN;

-- create Airbyte user
create user if not exists identifier($airbyte_username)
password = $airbyte_password
default_role = $airbyte_role
default_warehouse = $airbyte_warehouse;

grant role identifier($airbyte_role) to user identifier($airbyte_username);

-- change role to sysadmin for warehouse/database steps
use role sysadmin;

-- create Airbyte warehouse
create warehouse if not exists identifier($airbyte_warehouse)
warehouse_size = xsmall
warehouse_type = standard
auto_suspend = 60
auto_resume = true
initially_suspended = true;

-- create Airbyte database
create database if not exists identifier($airbyte_database);

-- grant Airbyte warehouse access
grant USAGE
on warehouse identifier($airbyte_warehouse)
to role identifier($airbyte_role);

-- grant Airbyte database access
grant OWNERSHIP
on database identifier($airbyte_database)
to role identifier($airbyte_role);

commit;

begin;

USE DATABASE identifier($airbyte_database);

-- create schema for Airbyte data
CREATE SCHEMA IF NOT EXISTS identifier($airbyte_schema);

commit;

begin;

-- grant Airbyte schema access
grant OWNERSHIP
on schema identifier($airbyte_schema)
to role identifier($airbyte_role);

commit;

Select All queries and run the script.

Set Up a PostgreSQL Source

Ensure you have Docker Compose on the environment where you want to run Airbyte. Now, run the commands below to initiate Airbyte setup:


git clone https://github.com/airbytehq/airbyte.git
cd airbyte
docker-compose up

You can then open your app on http://localhost:8000. After inputting your email, you’ll be presented with an onboarding experience to create a first Airbyte connection, starting with the source.

Select Postgres as your source and set up the connection parameters as explained in the Postgres source documentation.

Set Up a Snowflake Destination

After inputting the name and destination type, you’ll be presented with several required fields that Airbyte will use to connect to the Snowflake destination. Fill in the form with the default details.


Host =YOUR_SNOWFLAKE_HOST (https://id.region.cloud-provider.snowflakecomputing.com)
Role = AIRBYTE_ROLE
Warehouse = AIRBYTE_WAREHOUSE
Database = AIRBYTE_DATABASE
Schema = AIRBYTE_SCHEMA
User = AIRBYTE_USER
Password = password


When you get to the loading method section, select your preferred loading method and input the required details. The default loading method, uses your local machine to generate the temp files. An example using AWS S3 is shown below.

Setup a Connection from PostgreSQL to Snowflake

After verifying your connection, you’ll be redirected to set it up. Choose a manual sync frequency and set the namespace configuration to Destination Connector Settings.

Ensure the `movies` table shows up. If it doesn’t, check that the table is present in the `airbyte_tut` database and refresh the schema. Complete the process by clicking Set up connection.

Then you can On your Airbyte dashboard, navigate to the Connections page and click on the first connection.


This will redirect you to a page with more details about that connection. Click Sync now to start synchronizing data.

This action will trigger a sync to Snowflake. When the sync completes, 18 records will be pushed to Snowflake.


Handling Special Cases

Next we will cover three special cases to replicate partitioned tables, custom data types and working with geospatial data.

Replicating PostgreSQL Partitioned Tables to Snowflake

Replicating tables with thousands of rows can be a slow process, so tables are often broken down into logical units called partitions. Using Airbyte, it’s possible to partition the tables by user-signup date, then push each of these partitions to Snowflake.

To begin, navigate to the tutorial folder and execute the table-partitioning.sql file. This loads data into a users table and creates two partitions. To replicate this partitioned table to Snowflake, you have two options:

1. Export the table as a single view.

2. Export each partition as separate files.

The first option involves selecting the users table only. The second involves selecting each partition. 

It’s important to know that privileges applied to a user don’t affect partitioned tables after the privileges have been applied, you have to reassign privileges. Simply rerun the original command you used to assign privileges.


GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO airbyte;

The advantage of exporting each partition is the speed of parallel execution is that each table will be replicated in its own stream. After the replication process is complete, the tables can then be combined into a single table on Snowflake.

To get each partitioned table into Snowflake, navigate to the Settings tab on the Airbyte-Snowflake connection. Click Update latest source schema to load the new tables. 

Finally, save the changes to restart the migration.

Replicating Custom Data Types from PostgreSQL to Snowflake 

Custom data types (also called user-defined types) in PostgreSQL are abstractions of the basic data types. A custom data type could be a basic data type like VARCHAR, with the addition of a check to ensure that imputed data has no spaces. An example is a countries table defined by the following query.


CREATE TABLE countries (
    id serial primary key,
    code VARCHAR NOT NULL UNIQUE,
    name VARCHAR,
    CHECK (
        code !~ 's'
    )
)

You could define a custom data type for the code field in the table using the CREATE DOMAIN statement.


CREATE DOMAIN COUNTRY_CODE AS 
   VARCHAR NOT NULL UNIQUE CHECK (value !~ 's');

You can then use this data type in the normal way.


CREATE TABLE countries (
    id serial primary key,
    code COUNTRY_CODE,
    name VARCHAR
)

Snowflake does not support user-defined types, so Airbyte converts all user-defined types to VARCHAR before replicating the data to Snowflake to handle this. To demonstrate the process of replicating custom data types to Snowflake, navigate to the tutorial folder and load a table with user-defined types.


i user-defined-types.sql

After the table is created and populated with data, update your schema with the latest table.

Save the configuration to restart the sync process. When the sync succeeds, check Snowflake for the new data. 


You’ll notice that the data type of the code column is VARCHAR. This is due to the transformation that occurred on the data.

PostgreSQL has support for unstructured data through: JSON, JSONB, HStore and XML. Fortunately, Snowflake supports JSON and XML data types. The other two data types, JSONB and HStore, are converted to the VARCHAR data type. 

Replicating Geospatial Data from PostgreSQL to Snowflake

Most modern applications and systems utilize one form of spatial data or another. Spatial data is usually represented as using a coordinate system like the cartesian or spherical systems. For PostgreSQL databases, PostGIS is used to enable spatial data storage. It enables PostgreSQL databases to store and query location related data such as longitude and latitude through its custom data types.

PostGIS offers the following data types: Box2d, Box3d, Geometry, Geometry_dump, Geography.

The Geography data type is the most important data type offered by PostGIS, it provides a way to store longitude and latitude data on PostgreSQL. This is done using the Point type, which takes in longitude and latitude as inputs.

Snowflake supports the Geography data type, meaning no typecasting will occur during replication. The replication process is the same as before.

To replicate a table containing airports data to Snowflake, install the PostGIS extension for Windows or Ubuntu. Then, activate the extension.


CREATE EXTENSION postgis

Create a table containing airports and their coordinates by running the airports.sql file from the tutorial folder:


i airports.sql

Update your schema on Airbyte. Notice the geography_columns and geometry_columns tables that were added. These tables originated from the PostGIS extension.


Sync the data using Airbyte, then view the replicated data on Snowflake.

Geography data is converted to VARCHAR by default, but Snowflake has a SQL function, TO_GEOGRAPHY, for converting this VARCHAR into the Geography data type.


Conclusion

Data warehouses like Snowflake are optimized for analytical workloads. For this reason, it's common to replicate data from transactional databases like PostgreSQL.

Airbyte makes it easy to replicate data from PostgreSQL to Snowflake. However, it's important to know what to expect from some special cases, like replicating partitioned tables and how custom data types are mapped.

If you enjoyed using Airbyte, you might want to check our fully managed solution: Airbyte Cloud.

About the author

Osinachi Chukwujama is a web developer and technical writer. He enjoys building backend applications and utilizing cloud computing. He plays the organ and creates casual animations when he isn't coding.

Similar use cases

Export Postgres data to CSV, JSON, Parquet and Avro files in S3

Learn how to easily export Postgres data to CSV, JSON, Parquet, and Avro file formats stored in AWS S3.

Sync MySQL CDC to Kafka using Change Data Capture

Learn how to stream changes from a MySQL database to Kafka using Change Data Capture (CDC).

Migrate your data from Redshift to Snowflake

Get your data out of Redshift and into Snowflake easily with open source data integration.