Note: this article is part of a series about Airbyte's Postgres connector: Why you can replicate datasets of any size with Postgres and Top lessons learned when upgrading Postgres to read from very large tables.
Over the last six months, the Airbyte team has made significant progress on the performance of its platform and its connectors, increasing the speed of some of its most popular ones, such as the Postgres and MySQL source connectors, by up to 10x.
In this article, we want to show you how performant the Postgres connector has become by comparing it with Fivetran's, and explain how we achieved this progress.
Spoiler alert: while Airbyte was 5x slower at the beginning of the year, it is now much faster than Fivetran for Postgres, and we have plans to go further.
Measuring Throughput with a Uniform Dataset
To gauge performance improvements accurately, it's crucial to establish a baseline. We achieve this by conducting a daily replication of a standardized dataset from Postgres to Snowflake. This approach provides continuous insights into the behavior of connectors and allows for rapid response in case of any performance degradation.
The dataset used for benchmarking reproduces the variety found in typical large Postgres databases. It consists of a randomized table of roughly 1 TB with multiple number, text, and date columns, excluding any indexes or other metadata.
Replicating the standardized dataset from Postgres to Snowflake with each tool resulted in the following performance metrics:

Serialized data moved: 1.4 TB

Airbyte:
- Time: 43 hours
- Effective replication rate: 9 MB/s

Fivetran:
- Time: 55-86 hours (multiple attempts)
- Effective replication rate: 6-7 MB/s
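As a sanity check, the effective replication rates above follow directly from the serialized volume and elapsed time (a quick sketch; it uses the fastest Fivetran attempt of 55 hours):

```python
# Derive effective replication rate (MB/s) from the reported
# serialized volume and elapsed time.

def effective_rate_mb_s(total_bytes: float, hours: float) -> float:
    """Return throughput in MB/s (1 MB = 10**6 bytes)."""
    return total_bytes / (hours * 3600) / 1e6

airbyte = effective_rate_mb_s(1.4e12, 43)    # 1.4 TB in 43 hours
fivetran = effective_rate_mb_s(1.4e12, 55)   # fastest Fivetran attempt

print(round(airbyte, 1))   # ~9.0 MB/s
print(round(fivetran, 1))  # ~7.1 MB/s
```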
These results highlight Airbyte's ability to outperform Fivetran in terms of effective replication rate, showcasing the great stride we’ve made with performance.
Visibility to keep improving
To maintain performance oversight, we built a dashboard that displays speed results. This dashboard allows for trend analysis, distinguishing consistent trends from occasional anomalies in speed. It also helps us understand the performance trajectory and facilitates decision-making based on reliable data.
1st Drill Down - Segmentation and Optimization
Identifying bottlenecks is crucial for enhancing performance. By segmenting the data flow through our components (database > source connector > orchestrator > destination connector > data warehouse), Airbyte can pinpoint the areas that require optimization. Being able to get an accurate read of just one segment of the data's journey means we can focus on a single component and improve it while neutralizing inter-component side effects such as back-pressure or fluctuating data warehouse performance. This lets us measure precisely how much we gain from each incremental improvement to our code.
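The idea of measuring each segment in isolation can be pictured with a simple sketch (the stage names and bodies below are illustrative stand-ins, not Airbyte's internal instrumentation):

```python
import time
from contextlib import contextmanager

# Time each pipeline segment independently so the slowest stage can be
# identified without interference from the others.
timings: dict = {}

@contextmanager
def measure(stage: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

# Hypothetical stage bodies; real stages would read from Postgres,
# serialize records, and write them to the warehouse.
with measure("source_read"):
    rows = list(range(100_000))
with measure("serialize"):
    payload = [str(r).encode() for r in rows]
with measure("destination_write"):
    total = sum(len(p) for p in payload)

bottleneck = max(timings, key=timings.get)
print(bottleneck, round(timings[bottleneck], 4))
```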
We focused on our connectors, source and destination, and built a performance harness that mimics our real-life conditions as closely as possible, including the environment in which our containers run in the cloud and how data flows between them. Getting accurate performance numbers, free of any external noise, helped us be deliberate in the decisions we made in our code.
The performance harness allows us to test with a variety of datasets to understand how different lengths translate into performance (say, 1 million rows compared to 10 compared to 200), or how different widths (the number of columns and their data types) affect us.
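A miniature version of such a harness could look like this (a hedged sketch of the concept only; the data generation and CSV-style serialization here are assumptions, not Airbyte's harness):

```python
import random
import time

# Hypothetical mini-harness: measure serialization throughput for
# synthetic tables of varying row counts (length) and column counts (width).
def make_rows(n_rows: int, n_cols: int) -> list:
    rng = random.Random(42)  # fixed seed for repeatable runs
    return [tuple(rng.random() for _ in range(n_cols)) for _ in range(n_rows)]

def measure_throughput(n_rows: int, n_cols: int) -> float:
    rows = make_rows(n_rows, n_cols)
    start = time.perf_counter()
    data = "\n".join(",".join(map(str, row)) for row in rows).encode()
    elapsed = time.perf_counter() - start
    return len(data) / elapsed / 1e6  # MB/s

for shape in [(10_000, 5), (10_000, 50), (100_000, 5)]:
    print(shape, round(measure_throughput(*shape), 1), "MB/s")
```

Running the same measurement across shapes makes it easy to see whether a change helps wide tables, long tables, or both.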
In addition, we integrated this performance harness into our CI/CD process and made it available to all developers, who can get an accurate, repeatable performance measurement by running a single command.
2nd Drill Down - Identifying Low Hanging Fruits
To find cost-effective performance gains in our code, and to take out a lot of the guesswork, we used performance profiling tools. Tools such as async-profiler allowed us to find where our CPU spent its time, which routines allocated memory, where our routines triggered a blocking wait, and so on.
In the Postgres source connector, for instance, we identified our handling of Date types as a bottleneck.
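To give a feel for why per-row date handling can dominate a profile, here is a hedged micro-benchmark (illustrative only, and in Python rather than the connector's actual code): parsing the same ISO dates with a general-purpose format string versus a specialized parser yields identical results at very different cost.

```python
import time
from datetime import datetime, date

# Illustrative micro-benchmark (not Airbyte's actual fix): when a
# conversion runs once per row, even small per-row overhead adds up.
values = ["2023-06-%02d" % (i % 28 + 1) for i in range(200_000)]

start = time.perf_counter()
slow = [datetime.strptime(v, "%Y-%m-%d").date() for v in values]
slow_s = time.perf_counter() - start

start = time.perf_counter()
fast = [date.fromisoformat(v) for v in values]
fast_s = time.perf_counter() - start

assert slow == fast  # same results, very different cost
print(f"strptime: {slow_s:.2f}s  fromisoformat: {fast_s:.2f}s")
```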
In the Snowflake destination connector, we made it possible to upload blocks of data to the warehouse asynchronously while still being able to accept and process new records from the Airbyte source.
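The asynchronous-upload idea can be sketched as follows (a simplified illustration with hypothetical names; the real connector's batching and upload logic are more involved):

```python
import queue
import threading
import time

# Buffer incoming records into blocks, and flush full blocks to the
# warehouse on a background thread so ingestion never waits on an upload.
BLOCK_SIZE = 1000
uploads_done = []

def uploader(blocks: queue.Queue) -> None:
    while True:
        block = blocks.get()
        if block is None:          # sentinel: no more blocks coming
            return
        time.sleep(0.001)          # stand-in for a network upload
        uploads_done.append(len(block))

blocks: queue.Queue = queue.Queue()
worker = threading.Thread(target=uploader, args=(blocks,))
worker.start()

buffer = []
for record in range(5000):         # stand-in for the incoming stream
    buffer.append(record)
    if len(buffer) >= BLOCK_SIZE:  # block is full: hand off, keep reading
        blocks.put(buffer)
        buffer = []
if buffer:                         # flush any partial final block
    blocks.put(buffer)
blocks.put(None)
worker.join()
print(sum(uploads_done))  # all 5000 records uploaded
```

The key property is that the ingest loop never blocks on an upload; it only hands off full blocks and moves on.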
Next Steps: Parallelization
We recognize the potential of parallelization to achieve significant performance gains. Replicating multiple data streams simultaneously, optimizing large table reads with parallel chunk processing, and enhancing concurrency are all areas of huge potential gains. Stay tuned for more updates on that topic, as this is how we plan to outperform competing connectors several fold.
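Parallel chunk processing for a large table read can be sketched like this (an illustration under assumptions: the key-range chunking scheme and worker count here are hypothetical, not Airbyte internals):

```python
from concurrent.futures import ThreadPoolExecutor

# Split a table's key range into chunks and process them concurrently.
TOTAL_ROWS = 1_000_000
CHUNK = 250_000

def read_chunk(start: int, end: int) -> int:
    # Stand-in for a range query such as
    # "SELECT ... WHERE id >= start AND id < end";
    # here we just report how many rows the range covers.
    return end - start

ranges = [(lo, min(lo + CHUNK, TOTAL_ROWS))
          for lo in range(0, TOTAL_ROWS, CHUNK)]

with ThreadPoolExecutor(max_workers=4) as pool:
    counts = list(pool.map(lambda r: read_chunk(*r), ranges))

assert sum(counts) == TOTAL_ROWS  # every row covered exactly once
print(len(ranges), "chunks processed in parallel")
```

Because the chunks partition the key range with no overlap, the results can be merged without deduplication.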
The journey to performance excellence is ongoing, and we are committed to continuous improvement.
By investing in tools for precise performance measurement and making them integral to daily processes, we are achieving remarkable gains in data replication throughput and will continue to do so. As performance remains a central focus, Airbyte users can look forward to even more enhancements, including similar work on additional source connectors like MySQL, MongoDB, and MS SQL in the near future.
What can you do now? Well, if you're using another ETL vendor for your Postgres replication, you should consider testing Airbyte's Postgres connector. In addition to offering the higher-performing Postgres connector, Airbyte prices database replication by GBs moved, which is a lot cheaper than row-based pricing for databases.