Reading Very Large Postgres tables - Top Lessons We Learned
Note: this article is part of a series of articles about Airbyte's Postgres connector: Postgres Replication Performance Benchmark: Airbyte vs. Fivetran and Why you can replicate Postgres datasets of any size.
Over the past 12 months, we have worked to make our Postgres Source connector enterprise-grade. We started with a simple connector that could read entire tables for full refreshes or a subset of rows for incremental syncs. We then added support for more replication methods (CDC and xmin). We made sure the connector handles most data types and formats, including PostgreSQL edge cases (think of ‘infinity’ or BC dates, for example). Finally, we removed performance bottlenecks and invested in tools to measure performance so that we can continuously improve.
Our throughput has gone from 4 MB/s to 11 MB/s (compared to Fivetran at 5 MB/s), enabling us to reliably read tables in the terabyte range. Whether you’re using CDC, cursor-based, or cursorless (xmin) incremental syncs, the initial data load can now handle virtually any amount of data.
We learned a few lessons along the way, and we share them with you in this article!
Don’t rearrange data
To read large tables, data should be read in the database's natural order rather than rearranged to fit our needs.
Early on, we were doing the obvious thing: sorting the table by a cursor column and reading it from top to bottom.
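A minimal sketch of that kind of query (`users` and `updated_at` are placeholder names, not the actual tables we sync):

```sql
-- Illustrative only: read the whole table sorted by a cursor column,
-- so the last value we emitted can serve as a high-water mark.
SELECT *
FROM users
ORDER BY updated_at;
```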
This makes a ton of sense because then you can checkpoint (keep a high-water mark) easily. But as the scale grows, that sorting can take a long time in itself. We found that when dealing with larger tables, any attempt to arrange the data can add hours to the sync time.
We now read the table rows in the order in which they are saved on disk by PostgreSQL. This leads to predictable query times because we are not fighting the storage engine.
How data files are structured
Each table's data is written to a separate file or set of files. These files are typically 1 GB each and are divided into 8 kB blocks (the size is exposed by the `block_size` server parameter). Each of these blocks is referred to as a Page. Each Page is divided into Tuples, which hold the row data entries alongside a few headers. The size of a Tuple depends on the table's definition: how many columns it has and what data types they use.
Postgres refers to a physical address on a data file in the form of (Page, Tuple).

Postgres data file structure
Additionally, every row in every Postgres table has a few hidden system columns. One of these is known as the Current Tuple ID (CTID). It contains this (Page, Tuple) combination, which is the row's physical address on disk.
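You can see what these addresses look like by selecting the hidden column explicitly (a quick sketch; `users` is a placeholder table name):

```sql
-- The hidden ctid column exposes each row's (Page, Tuple) location on disk.
SELECT ctid, *
FROM users
LIMIT 3;
-- Returns ctid values like (0,1), (0,2), (0,3): page 0, tuples 1 through 3.
```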
By querying along this disk layout, we are able to sync the data more efficiently in its default on-disk order.
So instead of sorting the table by a cursor column, as in the query sketched earlier, we can query by physical location.
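Here is a sketch of that kind of ctid range query (the table name and range bounds are illustrative, not Airbyte's exact query):

```sql
-- Read rows by their physical location instead of sorting them.
-- On recent Postgres versions (14+) this filter is executed as a TID Range Scan.
SELECT ctid, *
FROM users
WHERE ctid > '(0,0)' AND ctid <= '(100000,0)';
```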
Running this query, which translates into a Postgres Tuple ID (TID) scan, has a number of benefits:
- Even though we didn’t explicitly ORDER BY the data, the result rows are given to us in a defined order.
- Because data is read in sequential order, there is no need to access pages randomly on disk, which can degrade the speed at which data is returned. Each random page access has a cost, reflected in the random_page_cost server parameter.
- The CTID of a row can be used as a marker if we need to go back and restart a read from the middle. We’ll talk more about this next.
Read it in chunks
Our naive approach took advantage of the default JDBC behavior and transactions to stream the whole data set in one loop. On larger databases, this can take many hours. There are many things that can go wrong during that time, so the sync becomes unreliable due to strain on the server or network issues.
A better approach is to break the read of an entire table into a number of smaller sub-queries, which can run successively or even concurrently. We call these chunks.
For example, we can query the first pages of data like this:
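(a sketch; the 1,000-page chunk size and table name are illustrative)

```sql
-- First chunk: roughly the first 1,000 pages of the table
SELECT ctid, *
FROM users
WHERE ctid > '(0,0)' AND ctid <= '(1000,0)';
```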
We would then increment it and get the next set:
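(continuing the same illustrative sketch)

```sql
-- Next chunk: advance the window by the same number of pages
SELECT ctid, *
FROM users
WHERE ctid > '(1000,0)' AND ctid <= '(2000,0)';
```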
This approach greatly increased our reliability because each of the resulting queries is very fast.
Save your place
Even after breaking the table read into smaller sub-queries, there may still be failures caused by external factors. For example, there could be a network error, or the server could go down for maintenance. And the more data there is, the longer it takes to read, increasing the chance of an error.
With smaller tables, you might be able to get away with this: it may be fine to simply start the sync over if something fails. With larger tables, however, we have to expect that an error will occur and be able to recover from it.
We call this approach checkpointing. When a source is checkpointing, Airbyte stores the state of a sync such that we can restart from a known point.
In the case of a Postgres initial sync, we save CTID markers along the way. Then, if one of these sub-queries produces an error, we can restart our read from the last saved checkpoint.
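As an illustration (the saved position, chunk boundary, and table name below are made up), resuming after a failure mid-chunk amounts to re-issuing a query that picks up right after the last checkpointed ctid:

```sql
-- Resume from the last checkpointed position, e.g. page 41237, tuple 5,
-- instead of restarting the whole table (values are illustrative).
SELECT ctid, *
FROM users
WHERE ctid > '(41237,5)' AND ctid <= '(42000,0)';
```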
Transition to incremental
The discussion so far has gotten us to a point where we can reliably sync large Postgres tables. This initial sync snapshots the data at a certain point in time and copies it to your Destination (e.g. a data warehouse). It can still take many hours for a large table. To keep your data up to date, it was critical that we transition to an incremental approach so that subsequent syncs only operate on changed data.
Unfortunately, the CTID approach that got us here cannot take us forward. Because the CTID points to a physical address on disk, it is prone to change and is not a permanent marker we can rely on. For example, a later `VACUUM FULL` or row updates could change the arrangement on disk. What we need instead is something that reliably increases over time.
Once the initial load is done, Airbyte can use CDC, xmin, or a user-defined column. Each of these relies on a reliable cursor to find the data that has changed.
- CDC saves the Log Sequence Number (LSN), a position in the Write-Ahead Log (WAL). This is nice because it also captures deleted rows and handles very large tables efficiently.
- xmin saves the database's current `pg_snapshot_xmin()` value and uses it to query for newer rows on the next sync (see the sketch after this list).
- A user column is typically filled in by the application, often as a timestamp. Airbyte can run a query like `SELECT * FROM some_table WHERE updated_at > {last_time}`.
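For the xmin case, here is a rough sketch of the pattern (not necessarily Airbyte's exact queries; `users` is a placeholder, and the text/bigint cast is just one common way to make xmin values comparable):

```sql
-- At the end of a sync, record the current transaction horizon (Postgres 13+).
SELECT pg_snapshot_xmin(pg_current_snapshot());

-- On the next sync, fetch only rows written by transactions at or after that horizon.
-- A production implementation also has to account for transaction ID wraparound.
SELECT *
FROM users
WHERE xmin::text::bigint >= 12345678;
```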
Measure, Measure, Measure
This may come as no surprise, but an important lesson we learned was to continuously test how our assumptions about database behavior hold up against real-life scenarios. It was also important to run those tests on various table sizes; in some cases, the actual behavior changes once a table crosses a size threshold.
The first step was to profile our running Java code to find bottlenecks using async-profiler and other tools. This led to improvements in speed and memory management.
We also created a harness that can sync any given database and test its throughput and reliability. We built a dashboard to track our progress (up and to the right!) and monitored it for changes. We also run this harness in continuous integration (CI) on every code change to verify that things keep moving in the right direction.
Conclusion
Airbyte has released an approach to reading PostgreSQL data that allows us to handle unlimited amounts of data without putting the server under stress. This method is based on the following principles:
- Read the data in its natural order to increase performance
- Don’t attempt to read everything all at once
- Use checkpoints to allow for continuous replication in case of an error
- Switch to an incremental approach after the initial snapshot
- Continuously measure performance and make adjustments as needed
This method allows Airbyte to handle tables of any size in linear fashion: the more rows of data there are, the longer it takes to read them, but there is no need to increase database resources such as memory or disk, and the sync completes reliably every time. We are excited to apply this approach and these learnings to other databases such as MySQL and MongoDB.