Learn how we created an ELT pipeline to sync data from Postgres to BigQuery using Airbyte Cloud. You can follow these steps to create your own.
The need to sync data from an operational database like Postgres to a data warehouse like BigQuery is a prevalent challenge that data engineers need to solve. The data team at Airbyte is not the exception.
Airbyte OSS and Airbyte Cloud use Postgres databases to store the state of data replication jobs. Hence, our internal databases contain information about job executions, connectors, workspaces, permissions, credit consumption, and configurations used by our Cloud customers. Since the data in Postgres is precious to the business, we move it to BigQuery to make it available to all teams in the company who are interested in analyzing and visualizing such data.
ℹ️ If you’re using Airbyte OSS, similar data is also available to you. Airbyte stores internal state and metadata in two internal Postgres databases that you can access to gain insights. These databases contain information about: Execution of Airbyte jobs and various runtime metadata. The internal orchestrator used by Airbyte (tasks, workflow data, events, and visibility data). Connectors, sync connections, and various Airbyte configuration objects.
Once the raw data is in BigQuery, we use dbt to transform it and structure it into different layers. From there, the various teams can create dashboards in Metabase. Furthermore, the data from BigQuery is propagated to other systems such as Salesforce and Customer.io using reverse ETL to enrich customer and prospects’ data.
But why do we need to move the data to BigQuery? In short, because using operational databases for analytical purposes is far from ideal!
Postgres is a robust, open-source relational database system with a good reputation for reliability and performance. It implements ACID transactions and is designed to manage a wide range of workloads, from single computers to web applications with many concurrent users.
But, because Postgres is also an online transaction processing (OLTP) database, it’s not optimized for data warehouse use. Data warehouses serve analytical workloads and are typically built on columnar databases – also called online analytical processing (OLAP) databases.
Analytical queries often select a few columns and several rows (sometimes the whole table) and then aggregate those. A columnar database can be 1000x faster for an analytical query than an OLTP database.
BigQuery – Google’s fully managed serverless data warehouse – is one of the most widely used columnar databases. It’s NoOps, meaning there is no infrastructure to operate and can handle data analysis over petabytes.
Now, it’s easy to understand why analyzing data in BigQuery is preferred over Postgres, but how’s that achieved? Replicating a Postgres database to another database or data warehouse is generally tricky. It is not sufficient to export the database using the pg_dump command. You must also handle loading new or updated data by continually checking the tables for fresh updates.
A possible solution is to create an ETL or ELT pipeline and orchestrate it using a platform like Airflow.
The term ELT pipeline, which stands for “extract, load, transform,” refers to a collection of procedures that transport data from one or more sources into a database, such as a data warehouse.
Airflow has built-in transfer and database operators that can be used to sync data from Postgres to BigQuery, such as the PostgresToGCSOperator and the GCSToBigQueryOperator. Building your data pipelines with Airflow is a good starting point; however, it doesn’t come without challenges. That’s why at Airbyte, we use our product, Airbyte Cloud, to sync data from Postgres to BigQuery! With a few configurations, we implemented a Postgres to Bigquery ELT pipeline that incrementally ingests data and provides us with deduplicated and historical data in the destination.
In this tutorial, you will learn how we created an ELT pipeline to sync data from Postgres to BigQuery using Airbyte Cloud. You can follow these steps to create your own.
Before setting up your Postgres source in Airbyte, check our getting started documentation.
To set up a new Postgres Airbyte source, go to Airbyte's UI, click on sources and add a new source. As the connector type, select Postgres. The subsequent illustrations show how we configured one of the sources at Airbyte. Fill in the following configuration fields with the details of your Postgres database.
Then click on Set up source, and Airbyte will test the connection. If everything goes well, you should see a successful message.
Go to destinations and add a new one. As demonstrated in the following diagram, select BigQuery as the destination type and fill in with the following details.
If you’re in doubt regarding the meaning of any of the configuration fields, you can read our BigQuery setup guide. At the same time, you will find hints about each field and links with additional information in the UI.
Then, click on Set up source and let Airbyte test the destination.
Go to connections and create a new connection. Then, select the existing Postgres source you have just created, and then do the same for the BigQuery destination. The following diagram shows how we set up the connection at Airbyte.
As you can see, we set the replication frequency to every hour. You can change the replication frequency depending on your needs.
At Airbyte, we load data from multiple sources to a single BigQuery project. Because we don’t want data to be held in the default dataset specified in the destination configuration (airbyte_raw), we used the “Custom format” option so the data will be stored in the airbyte_prod_configapi dataset. If you want to know more about namespaces, check our documentation.
Then, it's time to configure the streams. Each stream represents a table that will be created in BigQuery. You can expand each stream to see the fields they contain, their data type, how they will be named in BigQuery, etc. You can select which streams you want to sync using the “Sync” toggle.
Now, you should select a sync mode for each stream. As shown in the image below, we extract data from Postgres in an Incremental way and load it using Deduped + history mode.
We use the Deduped + history mode because we want to model our data as Slowly Changing Dimensions of type 2 (SCD). For example, in our dashboards, we want to know when a connection stopped being active (when someone disabled it or deleted it), but we still need it to count for periods when it was active. It means a connection could be counted for June but not in July if it was disabled for July.
You can learn more about sync modes in our documentation.
On top of replication, Airbyte provides the option to enable or disable an additional transformation step at the end of the sync, called normalization. In this case, the basic normalization maps the JSON object from the Postgres ingestion to a format compatible with BigQuery. Hence, we suggest using the “Normalized tabular data” option.
Once you're ready, save the changes. Then, you can run your first sync by clicking on Sync now. You can check your run logs to verify everything is going well. Just wait for the sync to be completed, and that's it! You've created an ELT pipeline from Postgres to BigQuery where the EL steps are the responsibility of Airbyte. In this case, the T step is a simple data normalization step done by dbt and orchestrated by Airbyte. You can then execute other dbt transformations as we do at Airbyte to consolidate your data before creating dashboards.
After a successful sync, you should be able to see the data in BigQuery. If you go to your dataset, you’ll see several sets of tables were created.
The tables named <span class="text-style-code-dark">_airbyte_raw_*</span> contain the raw JSON data fetched from the source. They only have three columns: _airbyte_ab_id, _airbyte_emitted_at, _airbyte_data. These tables exist because a core tenet of ELT philosophy is that data should be untouched as it moves through the “E” and “L” stages so that the raw data is always accessible.
If you enabled basic normalization, the tables whose name matches the stream name are your normalized tables. The raw JSON data should have been mapped into columns.
Because we selected the Deduped + history sync mode, there should be no duplicates in these tables. Deduped implies data in the final table is unique per primary key, which is achieved by sorting the data using the cursor field and preserving only the latest de-duplicated data row. This is known as an SCD table of type 1 in dimensional data warehouse lingo.
You will also see some metadata columns added to these tables.
The tables named <span class="text-style-code-dark">*_scd</span> are the historical tables and are only created when you use the Deduped + history sync mode. History denotes creating an extra intermediate table to which data is constantly appended (with duplicates). This is known as an SCD table of type 2.
You will also see some metadata columns added to these tables.
Taking your operational data to the next level by loading it into a data warehouse can give you significant advantages. At Airbyte, we use our product, Airbyte Cloud, to achieve that, and you can get started for free. This tutorial taught you how we quickly implemented an ELT pipeline from Postgres to BigQuery that incrementally ingests data and provides us with deduplicated and SCD tables.
The next step for us is to enable log-based Change Data Capture (CDC) in the source, which Airbyte supports. Using CDC will allow us to have a more granular data change tracking and is one of the most efficient ways to ingest data.
We invite you to join the conversation at Airbyte’s community Slack Channel to share your ideas with thousands of data practitioners and help make everyone’s project successful. With Airbyte, the integration possibilities are endless, and we can't wait to see what you're going to build!
Get all your ELT data pipelines running in minutes with Airbyte.
Learn to replicate data from Postgres to Snowflake with Airbyte, and compare replicated data with data-diff.
Use Octavia CLI to import, edit, and apply Airbyte application configurations to replicate data from Postgres to BigQuery.