How to Handle Schema Changes During Migration from Postgres to BigQuery?
When organizations migrate from PostgreSQL to BigQuery, they often discover that schema changes can silently corrupt data pipelines, causing critical business applications to fail unexpectedly. The challenge grows in high-transaction environments where schemas keep evolving throughout the migration period. The underlying mismatch between PostgreSQL's strict relational structure and BigQuery's flexible analytical model compounds the problem, creating technical debt that can persist for months after the migration is complete.
Successfully handling schema changes during PostgreSQL to BigQuery migrations requires understanding the architectural differences between the two platforms and implementing automated detection mechanisms that can adapt to structural modifications in near real time. Organizations that master this process avoid much of the traditional trade-off between migration speed and data integrity, enabling transitions that maintain business continuity while unlocking BigQuery's advanced analytical capabilities.
In this comprehensive guide, you will learn how to implement robust schema management strategies using Airbyte's automated change detection capabilities, ensuring your migration maintains data consistency while adapting to evolving business requirements throughout the transition process.
How Does Airbyte Streamline PostgreSQL Migrations?
Airbyte transforms PostgreSQL to BigQuery migrations through its AI-powered data movement platform that automates the complex orchestration required for enterprise-scale database transitions. The platform provides access to over 600 pre-built connectors that eliminate custom development overhead while offering sophisticated change data capture capabilities specifically designed for handling schema evolution scenarios. This comprehensive approach addresses the fundamental challenge of maintaining data consistency across architecturally different systems during active migration periods.
The platform's batch processing optimization enhances large data transfers by intelligently grouping records to minimize network overhead and improve overall throughput performance. The Kubernetes-native architecture ensures scalable deployments that can adapt to increasing workloads without manual intervention, providing the reliability essential for production migration scenarios. This infrastructure foundation supports enterprise requirements for high availability and disaster recovery throughout the migration lifecycle.
Airbyte's approach to schema management extends beyond simple data transfer to include intelligent mapping between PostgreSQL's relational structures and BigQuery's columnar storage model. The platform automatically handles complex data type conversions while preserving data integrity and optimizing for BigQuery's analytical processing capabilities. This automated mapping reduces the manual effort traditionally required for complex schema transformations while ensuring optimal performance in the target environment.
Core Airbyte Capabilities for Database Migration
Custom connector development capabilities enable organizations to address specialized integration requirements without extensive development overhead. The Connector Development Kit allows teams to create tailored solutions in approximately 30 minutes, with AI-assist functionality that automatically populates configuration fields from API documentation. This flexibility ensures that unique business requirements can be accommodated without compromising migration timelines or data integrity standards.
Change Data Capture implementation provides real-time synchronization capabilities that incrementally capture inserts, updates, and deletes from PostgreSQL sources. These modifications are automatically reflected in BigQuery destinations with minimal latency, significantly reducing data inconsistencies that can compromise analytical accuracy. The CDC approach eliminates the need for full table refreshes, dramatically reducing resource consumption and improving migration efficiency.
Automatic schema detection capabilities continuously monitor source systems for structural changes, propagating modifications downstream every 15 minutes in cloud deployments or every 24 hours in self-hosted environments. This proactive approach ensures that schema evolution in source systems does not disrupt ongoing replication processes or create data quality issues in target systems. The automation reduces operational overhead while maintaining synchronization accuracy across distributed environments.
Generative AI workflow integration enables sophisticated handling of unstructured data types commonly found in PostgreSQL environments. The platform supports loading diverse data types into vector stores such as Pinecone, Milvus, and Weaviate, with seamless integration into frameworks like LangChain and LlamaIndex. This capability enables organizations to leverage advanced analytics capabilities while maintaining compatibility with existing PostgreSQL data structures.
Developer-friendly pipeline management supports multiple interaction methods including graphical interfaces, APIs, PyAirbyte integration, and Terraform automation. This flexibility enables teams to choose deployment approaches that align with existing operational practices while maintaining consistency across development and production environments. The comprehensive tooling ecosystem reduces learning curves and accelerates implementation timelines.
Checkpointing capabilities provide resilience for long-running migration processes by enabling failed synchronizations to resume from interruption points rather than restarting from the beginning. This feature proves particularly valuable for large PostgreSQL databases where complete resynchronization would be prohibitively expensive and time-consuming. The automated recovery mechanisms reduce operational risk and improve migration reliability.
Record Change History functionality automatically rewrites problematic rows during migration processes, ensuring that data quality issues do not prevent successful completion of synchronization operations. This capability addresses common challenges with data type conversions and formatting inconsistencies that can occur during cross-platform migrations. The automated remediation reduces manual intervention requirements while maintaining data accuracy standards.
Detection of dropped records provides comprehensive monitoring capabilities that alert administrators to discrepancies in record counts across migration stages. This monitoring ensures that data loss issues are identified quickly and can be addressed before they impact downstream analytical processes. The automated alerting reduces the risk of silent data corruption that can compromise business intelligence operations.
Data orchestration integration supports seamless connectivity with leading workflow management platforms including Airflow, Dagster, Prefect, and Kestra. This integration enables migration processes to be incorporated into existing data pipeline orchestration frameworks, maintaining operational consistency while adding migration capabilities. The standardized interfaces reduce complexity and improve maintainability of automated migration workflows.
What Are the Essential Steps for Setting Up PostgreSQL Migration?
The initial setup process for PostgreSQL migration requires careful preparation of both source and destination systems to ensure optimal performance and compatibility throughout the migration lifecycle. Proper configuration of PostgreSQL instances involves enabling logical replication capabilities, creating appropriate user accounts with necessary permissions, and establishing replication slots that support change data capture operations. This foundational work ensures that the migration process can access all required data while maintaining source system performance and security standards.
Database preparation extends beyond basic connectivity to include schema analysis and optimization for the target BigQuery environment. Organizations must evaluate existing table structures, identify potential compatibility issues, and plan appropriate mapping strategies for complex data types that do not have direct equivalents in BigQuery. This planning phase prevents migration failures and ensures optimal performance in the target analytical environment.
The configuration process requires coordination between database administrators, data engineers, and cloud platform specialists to ensure all security, networking, and performance requirements are properly addressed. Proper planning reduces implementation risks and ensures successful migration outcomes that meet both technical and business requirements.
PostgreSQL Docker Container Configuration
Container-based PostgreSQL deployment provides consistent environments for migration testing and development scenarios. The containerized approach eliminates environmental differences that can affect migration behavior while providing reproducible configurations that support reliable testing and validation processes.
docker run --name airbyte-postgres \
  -e POSTGRES_PASSWORD=password \
  -p 5163:5432 \
  -d debezium/postgres:13
This command starts a PostgreSQL instance from an image that ships preconfigured for logical replication, providing the features necessary for change data capture without additional installation or configuration steps. The port mapping publishes the container's default PostgreSQL port (5432) on host port 5163, which is the port referenced when configuring the Airbyte source later in this guide.
PostgreSQL Database Preparation and Security Setup
Database schema configuration requires systematic preparation of namespaces, user accounts, and security permissions that support migration operations while maintaining appropriate access controls. The setup process establishes dedicated schemas for migration data while ensuring that security boundaries are maintained throughout the process.
docker exec -it airbyte-postgres /bin/bash
psql -U postgres
Schema creation and path configuration establish dedicated namespaces for migration operations, preventing conflicts with existing database objects while providing clear organizational boundaries for migrated data structures.
CREATE SCHEMA postgresql;
SET search_path TO postgresql;
User account creation and privilege assignment ensure that migration processes have appropriate access to source data without compromising database security. The configuration provides read-only access for standard operations while enabling replication permissions necessary for CDC functionality.
CREATE USER airbyte PASSWORD 'pass';
GRANT USAGE ON SCHEMA postgresql TO airbyte;
Read-only access configuration limits migration user permissions to data access operations, preventing inadvertent modifications to source systems during migration processes. The configuration includes default privileges for future objects to ensure consistent access patterns.
GRANT SELECT ON ALL TABLES IN SCHEMA postgresql TO airbyte;
ALTER DEFAULT PRIVILEGES IN SCHEMA postgresql GRANT SELECT ON TABLES TO airbyte;
ALTER USER airbyte REPLICATION;
Sample data preparation provides realistic datasets for testing migration configurations and validating schema mapping approaches. The test data includes representative structures that demonstrate typical migration scenarios and potential compatibility challenges.
CREATE TABLE subjects (
  id INTEGER PRIMARY KEY,
  name VARCHAR(200)
);
INSERT INTO subjects VALUES (0, 'java'), (1, 'python');
Replication slot configuration establishes the infrastructure necessary for change data capture operations, enabling real-time synchronization capabilities that maintain data consistency throughout migration periods.
SELECT pg_create_logical_replication_slot('airbyte_slot', 'pgoutput');
Publication creation defines the scope of data replication operations, specifying which tables and schema changes should be monitored and propagated to target systems during ongoing synchronization processes.
CREATE PUBLICATION pub1 FOR TABLE subjects;
How Do You Configure PostgreSQL as an Airbyte Source?
PostgreSQL source configuration in Airbyte requires comprehensive setup of connection parameters, authentication credentials, and replication methods that optimize for both performance and data consistency requirements. The configuration process involves selecting appropriate synchronization strategies that balance resource utilization against data freshness requirements while ensuring security standards are maintained throughout the connection lifecycle.
The source configuration interface provides options for various replication methodologies including change data capture, incremental updates based on cursor fields, and full refresh operations. Each approach offers different trade-offs in terms of resource consumption, latency, and complexity that must be evaluated against specific business requirements and technical constraints.
Source setup procedures require coordination with database administrators to ensure that connection parameters align with existing security policies and network configurations. Proper configuration prevents connection failures while optimizing performance for the specific characteristics of the source PostgreSQL environment.
Authentication and connection management setup involves configuring secure credentials and network access parameters that enable reliable connectivity while maintaining appropriate security boundaries. The configuration must account for network topology, firewall requirements, and authentication mechanisms used in the production environment.
Replication method selection requires understanding the characteristics of the source data and the requirements for synchronization frequency and data consistency. Change data capture provides the most sophisticated synchronization capabilities but requires additional PostgreSQL configuration and may impact source system performance under high-transaction loads.
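As a concrete illustration, the sketch below configures the PostgreSQL source through PyAirbyte, one of the interaction methods mentioned earlier, using the host, port, and credentials from the Docker setup above. The configuration keys mirror the source-postgres connector specification at the time of writing, so verify them against the connector version you deploy.

import airbyte as ab

# Configure the PostgreSQL source created in the Docker setup above.
# CDC replication reuses the replication slot and publication defined earlier.
source = ab.get_source(
    "source-postgres",
    config={
        "host": "localhost",
        "port": 5163,                      # host port mapped to the container's 5432
        "database": "postgres",
        "schemas": ["postgresql"],
        "username": "airbyte",
        "password": "pass",
        "replication_method": {
            "method": "CDC",
            "replication_slot": "airbyte_slot",
            "publication": "pub1",
        },
    },
    install_if_missing=True,
)

# Validate connectivity and credentials before wiring up the destination.
source.check()
print(source.get_available_streams())

The same values map directly onto the source setup form in the Airbyte UI if you prefer to configure the connection there.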
What Is Required for BigQuery Destination Configuration?
BigQuery destination configuration requires comprehensive setup of Google Cloud credentials, dataset organization, and loading methodologies that optimize for both performance and cost efficiency in analytical workloads. The configuration process involves selecting appropriate staging mechanisms and data loading strategies that balance throughput requirements against resource consumption and operational costs.
Dataset configuration includes establishing appropriate project organization, geographic data location requirements, and schema naming conventions that support long-term maintainability and governance requirements. Proper planning ensures that migrated data integrates seamlessly with existing BigQuery analytics infrastructure while maintaining appropriate access controls and cost management practices.
Security configuration involves setting up service account credentials with appropriate IAM permissions that enable data loading operations while maintaining least-privilege access. The credential management must align with organizational security policies while providing the capabilities necessary for automated migration operations.
Project identification and dataset location configuration establish the organizational framework for migrated data within Google Cloud infrastructure. Proper configuration ensures compliance with data residency requirements while optimizing for query performance and storage costs.
Service account credential management involves creating and configuring JSON key files that provide secure authentication for automated migration processes. The credentials must include appropriate BigQuery permissions for data loading while restricting access to unnecessary services or resources.
Loading method selection requires evaluation of different data ingestion approaches based on data volume, frequency, and performance requirements. Google Cloud Storage staging provides optimal performance for large dataset migrations, while standard inserts may be appropriate for smaller datasets or real-time synchronization scenarios.
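To make the destination settings concrete, the sketch below expresses them as a PyAirbyte BigQuery cache, which is a quick way to validate project, dataset, and credential choices before configuring the destination-bigquery connector for production use. The project, dataset, and key-file path are placeholders, and the parameter names follow PyAirbyte's documented BigQueryCache interface, so confirm them against your installed version.

from airbyte.caches import BigQueryCache

# Placeholder values -- substitute your own project, dataset, and key file.
# The service account behind the key needs BigQuery data-editing and job
# permissions, but nothing beyond that (least privilege).
cache = BigQueryCache(
    project_name="my-analytics-project",
    dataset_name="postgres_migration",
    credentials_path="/secrets/bigquery-service-account.json",
)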
How Do You Establish Airbyte Connections for Migration?
Connection establishment in Airbyte requires systematic configuration of synchronization parameters, stream selection, and operational settings that optimize migration performance while ensuring data consistency throughout the process. The connection setup involves mapping source tables to destination datasets, configuring synchronization modes appropriate for different data types and business requirements, and establishing monitoring and alerting mechanisms that provide visibility into migration progress and data quality.
Stream configuration involves selecting which tables and views from the PostgreSQL source should be synchronized to BigQuery, along with appropriate synchronization modes that balance performance against data consistency requirements. Different tables may require different synchronization strategies based on their size, update frequency, and business criticality.
Synchronization mode selection requires understanding the characteristics of each data stream and the requirements for historical data preservation, real-time updates, and storage optimization. Full refresh modes provide complete data replacement, while incremental modes optimize for performance by transferring only changed data.
Connection naming and scheduling configuration establish operational parameters that support ongoing migration management and monitoring. Appropriate scheduling ensures that synchronization operations occur during optimal time windows while maintaining data freshness requirements.
Stream selection and sync mode assignment require careful evaluation of each table's characteristics and business requirements. Tables with high update frequency may benefit from incremental synchronization, while static reference data may be appropriate for periodic full refresh operations.
Replication frequency configuration determines how often synchronization operations occur, balancing data freshness requirements against resource consumption and potential impact on source system performance. High-frequency replication provides near real-time data availability but requires more resources and careful monitoring.
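Continuing the PyAirbyte sketch from the previous two sections, stream selection and a first synchronization look roughly like the following. The subjects stream comes from the sample table created earlier; production connections would typically be defined in the Airbyte UI, API, or Terraform instead.

# Select only the streams that should be migrated; each stream can later be
# given its own sync mode (full refresh, incremental, or CDC-backed).
source.select_streams(["subjects"])

# Run the sync into the BigQuery cache configured above. Airbyte's
# checkpointing means an interrupted run can resume instead of restarting.
result = source.read(cache=cache)

# Simple post-sync verification: print the record count for each stream.
for stream_name, dataset in result.streams.items():
    print(stream_name, len(list(dataset)))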
Advanced Schema Evolution Strategies
Modern PostgreSQL to BigQuery migrations require sophisticated approaches to handle the continuous schema evolution that occurs throughout extended migration periods. Organizations operating in dynamic environments where database structures change frequently need automated systems that can detect, evaluate, and implement schema modifications without disrupting ongoing data operations. Increasingly, these strategies incorporate machine learning to anticipate schema changes from historical patterns and to generate migration scripts that maintain data integrity across system boundaries.
The implementation of event-driven schema management enables responsive adaptation to structural changes as they occur in source systems. Rather than relying on periodic schema scanning, advanced implementations utilize PostgreSQL's logical replication streams to detect DDL changes in real-time and trigger immediate evaluation of downstream impacts. This approach minimizes the time between schema changes and their reflection in BigQuery, reducing the risk of data inconsistencies that can compromise analytical accuracy.
Automated schema validation frameworks provide comprehensive testing capabilities that evaluate proposed schema changes against existing data patterns, query performance requirements, and business rule compatibility. These systems can simulate schema modifications in isolated environments, predict their impact on query performance and storage costs, and provide detailed recommendations for optimization strategies that maintain or improve analytical capabilities.
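As a simplified illustration of the detection step, the sketch below snapshots column metadata from information_schema and reports columns that were added, dropped, or retyped since a previous snapshot. It is a polling approach rather than the event-driven variant described above; it assumes psycopg2 and the connection details from the earlier setup, and schema_snapshot.json is a hypothetical file written by a previous run.

import json
import psycopg2

def snapshot_schema(conn, schema="postgresql"):
    """Return {table: {column: data_type}} for one schema."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT table_name, column_name, data_type
            FROM information_schema.columns
            WHERE table_schema = %s
            ORDER BY table_name, ordinal_position
            """,
            (schema,),
        )
        snap = {}
        for table, column, data_type in cur.fetchall():
            snap.setdefault(table, {})[column] = data_type
    return snap

def diff_schemas(old, new):
    """Yield descriptions of added, dropped, or retyped columns."""
    for table in sorted(set(old) | set(new)):
        old_cols, new_cols = old.get(table, {}), new.get(table, {})
        for col in sorted(set(old_cols) | set(new_cols)):
            if col not in old_cols:
                yield f"{table}.{col} added ({new_cols[col]})"
            elif col not in new_cols:
                yield f"{table}.{col} dropped"
            elif old_cols[col] != new_cols[col]:
                yield f"{table}.{col} changed {old_cols[col]} -> {new_cols[col]}"

conn = psycopg2.connect(
    host="localhost", port=5163, dbname="postgres",
    user="airbyte", password="pass",
)
current = snapshot_schema(conn)
with open("schema_snapshot.json") as f:   # hypothetical snapshot from a previous run
    previous = json.load(f)
for change in diff_schemas(previous, current):
    print(change)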
Intelligent Schema Mapping and Transformation
Advanced mapping algorithms utilize machine learning techniques to analyze data patterns and automatically generate optimal schema transformations that leverage BigQuery's native capabilities while preserving PostgreSQL data semantics. These algorithms can identify opportunities for denormalization, nested structure optimization, and partitioning strategies that improve analytical performance beyond what manual mapping approaches typically achieve.
The transformation process incorporates understanding of query patterns and access frequencies to optimize schema designs for actual usage rather than theoretical requirements. By analyzing query logs and performance metrics, the system can recommend schema modifications that improve the most frequently executed analytical operations while maintaining compatibility with less common but critical business processes.
Compatibility assessment capabilities evaluate proposed schema changes against existing BigQuery limitations and optimization opportunities. The system automatically identifies data type conversions that may result in precision loss, constraint modifications that require application logic changes, and structural changes that could benefit from BigQuery-specific features like nested fields or array processing.
Automated rollback and versioning capabilities ensure that schema evolution processes can be reversed quickly if unexpected issues arise during implementation. The system maintains comprehensive metadata about schema versions, migration paths, and dependency relationships that enable precise restoration of previous configurations without data loss or extended downtime.
Real-Time Schema Synchronization
Streaming schema management capabilities enable continuous synchronization of structural changes between PostgreSQL and BigQuery environments without the traditional batch processing limitations that can create synchronization gaps. This approach utilizes change data capture not only for data modifications but also for DDL changes that affect table structures, constraints, and indexing strategies.
The implementation requires careful coordination between PostgreSQL logical replication streams and BigQuery's schema modification APIs so that structural changes are applied consistently across both systems. This coordination prevents intermediate states where data structures diverge between source and destination, reducing the risk of data corruption during schema transition periods.
Performance optimization for streaming schema changes involves intelligent batching and prioritization algorithms that group related modifications while ensuring that critical changes are applied immediately. The system can differentiate between schema changes that require immediate synchronization and those that can be batched for efficiency without affecting data consistency or analytical capabilities.
Conflict resolution mechanisms handle scenarios where concurrent schema changes occur in source systems or where automated optimizations conflict with manual modifications. The system provides configurable policies for handling these conflicts, ranging from conservative approaches that require manual intervention to aggressive strategies that automatically resolve conflicts based on predefined business rules.
Troubleshooting Common Schema Compatibility Issues
Schema compatibility challenges between PostgreSQL and BigQuery often manifest as subtle data quality issues that can persist undetected until they impact critical business processes. Understanding these common patterns and implementing systematic troubleshooting approaches enables data engineering teams to identify and resolve compatibility issues before they affect downstream analytical operations. The most frequent issues involve data type precision loss, constraint mapping failures, and performance degradation due to suboptimal schema designs that do not leverage BigQuery's analytical optimization capabilities.
Complex troubleshooting scenarios require systematic diagnostic approaches that can isolate the root causes of compatibility issues while providing actionable remediation strategies. These approaches must account for the distributed nature of cloud-based analytical systems where issues may stem from network connectivity, authentication failures, resource limitations, or configuration mismatches across multiple system components.
Effective troubleshooting frameworks incorporate automated monitoring and alerting systems that can detect compatibility issues early in the migration process when remediation is less costly and disruptive. These systems provide detailed diagnostic information that enables rapid identification of specific failure modes while suggesting appropriate correction strategies based on historical issue patterns and resolution outcomes.
Data Type Conversion and Precision Issues
Numeric precision handling represents one of the most complex compatibility challenges when migrating from PostgreSQL to BigQuery. PostgreSQL supports arbitrary precision numeric types that exceed BigQuery's precision limitations, potentially resulting in data truncation or conversion to string types that affect analytical processing capabilities. Systematic validation of numeric data ranges and precision requirements helps identify tables that require special handling during migration.
The troubleshooting process involves analyzing data distributions to identify values that exceed BigQuery's numeric type limitations while evaluating the business impact of potential precision loss. Organizations must balance the benefits of maintaining numeric processing capabilities against the storage and performance implications of string conversion for high-precision data.
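A lightweight first check is to compare each numeric column's declared precision and scale against BigQuery's NUMERIC limits (38 digits of precision, 9 of scale) and flag anything larger, or anything declared without bounds, for BIGNUMERIC, FLOAT64, or string handling. The sketch below reads that metadata from information_schema with psycopg2, reusing the connection details from the earlier setup; the thresholds are BigQuery's documented limits, not Airbyte-specific behavior.

import psycopg2

# BigQuery NUMERIC supports 38 digits of precision and 9 digits of scale;
# anything beyond that needs BIGNUMERIC, FLOAT64, or a string column.
BQ_NUMERIC_PRECISION, BQ_NUMERIC_SCALE = 38, 9

conn = psycopg2.connect(
    host="localhost", port=5163, dbname="postgres",
    user="airbyte", password="pass",
)
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT table_name, column_name, numeric_precision, numeric_scale
        FROM information_schema.columns
        WHERE table_schema = %s AND data_type = 'numeric'
        """,
        ("postgresql",),
    )
    for table, column, precision, scale in cur.fetchall():
        # NUMERIC columns declared without precision report NULL here and can
        # hold arbitrary-precision values, so treat them as at-risk too.
        if precision is None or precision > BQ_NUMERIC_PRECISION or (scale or 0) > BQ_NUMERIC_SCALE:
            print(f"review {table}.{column}: precision={precision}, scale={scale}")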
String length and encoding compatibility issues frequently arise when PostgreSQL varchar types with specific length constraints are mapped to BigQuery's variable-length string types. While the mapping generally preserves data content, the loss of length validation may affect application logic that depends on database-level constraint enforcement.
Date and timestamp conversion challenges involve timezone handling differences between PostgreSQL's explicit timezone support and BigQuery's UTC-based timestamp storage. Applications that rely on local timezone processing may require modification to handle timezone conversions explicitly rather than depending on database-level timezone handling.
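Where queries or application code previously relied on PostgreSQL's AT TIME ZONE handling, the conversion has to become explicit after migration. A minimal sketch of that change using Python's standard-library zoneinfo module, with illustrative values:

from datetime import datetime
from zoneinfo import ZoneInfo

# A timestamp read back from BigQuery is effectively UTC.
utc_value = datetime(2024, 3, 1, 17, 30, tzinfo=ZoneInfo("UTC"))

# Localization that PostgreSQL used to perform now happens in application code.
local_value = utc_value.astimezone(ZoneInfo("America/New_York"))
print(local_value.isoformat())   # 2024-03-01T12:30:00-05:00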
JSON and array type mapping requires careful validation to ensure that complex data structures maintain their queryability and performance characteristics in BigQuery's nested field architecture. The conversion process may require restructuring of query logic to accommodate BigQuery's syntax for accessing nested data elements.
Performance Optimization and Query Translation
Query performance troubleshooting requires understanding how PostgreSQL query optimization strategies translate to BigQuery's columnar storage and distributed processing model. Queries optimized for PostgreSQL's B-tree indexes and join algorithms may perform poorly in BigQuery without appropriate partitioning and clustering strategies.
Schema design evaluation involves analyzing access patterns to identify opportunities for BigQuery-specific optimizations that can improve query performance beyond PostgreSQL capabilities. This analysis includes evaluation of denormalization opportunities, partitioning strategies, and clustering configurations that align with actual analytical workload patterns.
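For example, a large table that PostgreSQL served through B-tree index scans often benefits from date partitioning and clustering once it lands in BigQuery. The sketch below issues that DDL through the google-cloud-bigquery client; the project, dataset, table, and column names are placeholders carried over from the earlier examples, and the statement assumes the migrated table already exists.

from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")

# Rebuild a migrated table with partitioning and clustering that match the
# dominant analytical access pattern (filter by day, group by customer).
ddl = """
CREATE OR REPLACE TABLE `my-analytics-project.postgres_migration.orders_optimized`
PARTITION BY DATE(created_at)
CLUSTER BY customer_id
AS
SELECT * FROM `my-analytics-project.postgres_migration.orders`
"""
client.query(ddl).result()   # .result() blocks until the DDL job finishes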
Resource allocation and scalability issues require monitoring of BigQuery slot utilization and query queue times that may indicate suboptimal resource configuration or query design problems. The troubleshooting process involves analyzing query execution patterns to identify optimization opportunities that improve both performance and cost efficiency.
Cost optimization troubleshooting focuses on identifying queries that process excessive amounts of data due to missing partition pruning or inefficient join strategies. The analysis includes evaluation of storage costs for different schema design approaches and their impact on query processing requirements.
Frequently Asked Questions
How long does a typical PostgreSQL to BigQuery migration take with Airbyte?
Migration timelines vary significantly based on database size, schema complexity, and business requirements. Small databases under 100GB typically complete initial migration within hours, while enterprise databases with terabytes of data may require several days or weeks for complete migration. Airbyte's incremental synchronization capabilities enable ongoing operations during migration, reducing business impact regardless of migration duration.
What are the most common schema compatibility issues when migrating from PostgreSQL to BigQuery?
The most frequent issues include numeric precision loss for high-precision data types, UUID conversion to string format, JSON structure mapping to BigQuery nested fields, and constraint enforcement differences. PostgreSQL's extensible type system may include custom types that require special handling during migration to BigQuery's standardized type system.
Can Airbyte handle schema changes that occur during active migration processes?
Yes, Airbyte's automatic schema detection continuously monitors source systems for structural changes and propagates them to BigQuery destinations with minimal delay. The platform handles additions of new columns, data type modifications, and table structure changes automatically, ensuring migration processes adapt to evolving schema requirements.
How does Airbyte ensure data integrity during PostgreSQL to BigQuery migration?
Airbyte employs multiple data integrity mechanisms including checksum validation, record count verification, and automated error detection. The platform's Change Data Capture capabilities ensure that all modifications are captured and applied consistently, while checkpointing enables recovery from interruptions without data loss.
What are the cost implications of using Airbyte for PostgreSQL to BigQuery migration?
Airbyte's open-source foundation eliminates licensing costs while providing enterprise-grade capabilities through flexible deployment options. Cost optimization features include efficient batch processing, automated resource scaling, and intelligent data compression that minimize BigQuery processing and storage costs during migration.
Migration from PostgreSQL to BigQuery presents complex technical challenges that require sophisticated approaches to schema management, data integrity preservation, and performance optimization. Airbyte's comprehensive platform addresses these challenges through automated schema detection, intelligent data mapping, and enterprise-grade reliability features that ensure successful migration outcomes while minimizing operational disruption.
The systematic approach outlined in this guide enables organizations to implement robust migration strategies that handle schema evolution gracefully while leveraging BigQuery's advanced analytical capabilities. By following these best practices and utilizing Airbyte's automated capabilities, teams can achieve successful migrations that deliver immediate business value while establishing a foundation for long-term analytical excellence.