In our last article, we mentioned the significant progress Airbyte made on the performance of its platform and its connectors, multiplying the speed by up to 10 for some of its most popular connectors, notably our Postgres source connector..
This article is now about delving into the performance progress we’ve recently made on our MySQL source connector.
3x Increase in Performance in 6 Months
You can note the performance improvement from 4 MB per second in March to ~11 MBps today. We’ve put more details on the uniform dataset and connector versions we used to benchmark our improvements at the end of the article.
How did we achieve such improvements?
The improvements we made to power this new performance
We’ve detailed the lessons we learned when improving our Postgres source connector 10x, which now enables us to sync Postgres databases of any size. There were many learnings from Postgres that were applied to MySQL:
- Support for Initial Syncs of Any Size
- Reliable Handoff to Incremental Syncs
- Compressing State in Logical Replication (CDC) Mode
Support for Initial Syncs of Any Size
As of the most recent versions of MySQL, we can now support syncing datasets of any size. To do this, we’ve built new features that ensure initial MySQL snapshots are always resumable:
a. Checkpointing for all initial snapshots in any MySQL database
If a sync is ever to be interrupted, Airbyte can resume extracting data where it left off and not have to go back to the beginning. This is accomplished in different ways between Postgres and MySQL.
In Postgres, we identified CTID as a field that represented the order in which table rows are stored on disk. However, in MySQL the primary key (or index) represents the natural order of table rows on disk. An index in a database is a separate data structure that contains a copy of part of the data stored in your table and points back to the original row. In MySQL, this is represented as a B+ tree. However, the primary index representation in MySQL’s default storage engine (InnoDB) contains the contents of the entire row. This increases speed while accessing data by primary keys as it skips the additional seek from the leaf node in the B+ tree. As a result, ordering & checkpointing by primary key is very efficient.
b. Chunking MySQL database reads
As with Postgres, we leverage a smaller number of sub-queries to read the table during the initial sync. These sub-queries are reading 1GB of data but this value can be made configurable. Previously, we were reading all of the data in a single transaction. The new approach has many benefits including lower memory usage and transaction durability.
By retrieving data in smaller chunks, you can limit the memory footprint and avoid overwhelming the system. Then, by also splitting the data into smaller chunks and using multiple transactions, you further reduce the risk of losing progress and make it easier than ever to resume replication from where it left off in case of any issues.
Reliable Handoff to Incremental Syncs
Once the initial load is done, Airbyte uses a CDC or cursor-based mode to perform incremental syncs. Change Data Capture (CDC) is the default sync setting when creating new MySQL connectors. CDC offers numerous advantages, including the ability to capture changes made to data and database schema, all while imposing minimal additional load on the database itself. This provides users with a robust, high-performance default configuration for MySQL data synchronization.
As for Postgres, we built logic in the MySQL connector to automatically switch from checkpointing by primary key during the initial snapshot to the incremental mode of choice, typically CDC. The primary key load is useful only for the initial sync. After that point, we need to transition to another method to only sync data which has changed. But modifications & inserts can happen to a primary key continuously which requires us to switch to CDC mode or user-defined cursor mode.
With these improvements to both initial and incremental syncs, we also had to address a certain number of edge cases, including:
- No primary keys: Some tables do not have primary keys defined. In such rare cases, we have to default to the legacy method of loading data which can be inefficient.
- Storage Engines: MySQL allows users to specify their own storage engines. We’ve built this flow primarily for the default storage engine (InnoDB), which handles the most common scenario. Future work will include understanding different storage engines and optimizing for those use cases as well.
- Composite Indexes: MySQL tables can have composite indexes. In these cases, the first field associated with the composite primary key (e.g. a user_id field) is the field that represents the order in which table rows are stored on disk and the one that should be used for checkpointing an initial sync.
One area of future improvement is to read changes from the binlog during the initial sync. This will prevent a data re-sync scenario where an initial sync for a very large MySQL dataset could take longer than the binlog retention time configured in your MySQL instance.
Compressing State in Logical Replication (CDC) Mode
State messages play a pivotal role in tracking sync progress within the Airbyte platform. In CDC mode, these messages are passed to Debezium, which reads binlogs to determine where to resume data replication. However, for MySQL connectors, storing the schema of all accessible streams within a database in the state message led to bloated state messages, occasionally exceeding the 4MB limit supported by the platform (Temporal more specifically). This resulted in sync failures for databases with extensive schemas.
Airbyte resolved this issue by implementing state message compression, reducing the size of the state message by a remarkable 5x. This change enables successful synchronization of databases with larger schemas.
The uniform dataset to measure our performance progress
The benchmarking was conducted with the latest versions of the MySQL to Snowflake connectors, which you can find in our changelogs in our documentation. At the time of writing, the newest MySQL connector version was 3.1.1. You can also connect MySQL to Elasticsearch and MySQL to Redshift, among hundreds of other different types of connections thanks to Airbyte.
The benchmarking was conducted using a 'users' table within a 100GB schema. This dataset consists of 116 GB on disk, which serializes to 202 GB (as measured by the Airbyte platform).
Here’s how you can create on your side:
Looking Forward: Faster Connectors & Parallelization
Our improved MySQL connector now provides you with a higher performing, more flexible and reliable experience replicating data from MySQL databases. In parallel, the team has iterated on many other table stakes features in the past few months:
- Automatically propagating new schemas without data resets
- Selecting columns to exclude prior to replicating
- Typing & deduping to more reliably load data into your destinations
Looking forward, we recognize the potential of parallelization to achieve significant performance gains. Replicating multiple data streams simultaneously is another huge opportunity for further performance improvements.
To stay tuned for more updates in the future on this topic, as always, you can consult our public roadmap for more detail on what’s coming next. Our ambition is to make Airbyte’s MySQL source connector the best in the industry.