Article

Replicating MySQL: A Look at the Binlog and GTIDs

•

March 15, 2024

•

6 min

Whether for analytics or scaling, the data stored in your MySQL database will eventually need to be replicated, either between MySQL servers, or to an entirely different system.

In this article, we will examine the two major methods for achieving CDC replication in MySQL. Then, we’ll provide guidance on how to replicate data across MySQL instances and to other data stores.

What is Database Replication?

Database replication is the process of copying and maintaining database objects (in our case today, tables), in multiple locations. This can happen synchronously or asynchronously, at the byte, block, or logical level, and is crucial for high availability, load balancing, and data activation.

Change Data Capture (CDC) refers to the tracking of all changes in a data source to be captured in a destination. For databases, this usually means the replication of a log of events.

MySQL Replication Mechanisms

MySQL replication starts with enabling the binary log (binlog). The binlog contains a record of events that represent changes to the database. Enabling the binlog will introduce some overhead that may affect performance, but the recoverability and replication capabilities it enables far outweighs any performance loss.

There are many event types written to the binlog, but the most common include:

gtid_log_event: contains per-transaction fields, including the global transaction identifier and logical timestamps
Query events: contains the actual SQL statements that were executed on the primary database server
Table Map events: contains information about the structural changes to the tables
Write / Update / Delete row events: represent row insertions / updates / deletions (only written with row-based logging)
Xid events: represent the end of a series of events that make up a transaction

There are a couple of formats to choose from when setting up binary logging in MySQL. In row-based logging, the source database writes row change events to the binlog. A row change event contains two images - one image is the “before” state of the row, and the other is the “after” state. This is considered the safest and most reliable method for replicating data between servers, but it can lead to large storage requirements for the binlog.

In statement-based logging, the source writes SQL statements to the binlog. Replicas then execute those SQL commands to synchronize with the source. Using this method results in significantly less data being written to the binlog. There are some limitations, however: any non deterministic behavior will be difficult or impossible to replicate. For example, if you issued a DELETE or UPDATE SQL command that uses a LIMIT clause, but without an ORDER BY, this would be nondeterministic, and is considered an “unsafe” command when replicating a statement-based log.

Note: When statement-based replication is enabled, query events will be written for both DML and DDL changes, but in the case of row-based replication, they are written only for DDL changes.

There is a third logging format which combines the two above approaches known as the Mixed Binary Logging Format. MySQL will determine which logging format to use based on a set of rules you can find here. Using the mixed logging format is often an ideal choice for replication, as you get the compactness of statement-based logging and the safety of row-based logging.

MySQL 8.0 introduced global transaction identifiers (GTIDs). A GTID is a unique identifier that is associated with each transaction committed on the source server, and is written to the binlog in the form as the gtid_log_event. The gtid_log_event consists of the GTID of the transaction, the timestamp when the transaction was committed, the isolation level of the transaction, and metadata for multithreaded replication. This removes the need to track the file-position when enabling replication between servers.

In our guides below, we’ll first take a look at how to set up GTID-enabled CDC replication between MySQL servers. This is the recommended approach for replicating data between a source and replica MySQL server in the newest versions of MySQL. Then, we’ll look at how to achieve CDC replication across data stores using the row-based logging and Airbyte.

Replicating Data Between MySQL Servers with GTIDs

In this guide, we’ll set up one source and one replica server. This simple setup involves 6 steps:

Synchronize both servers and set them to read-only
Stop both servers
Restart both servers with GTIDs enabled and options configured
Set the replica to use the source as its data source, and use auto-positioning
Take a new backup
Start the replica and disable read-only mode

Step 0: Create a Replication User

If you are starting a new server, you will need to set up a user with replication privileges. Use these commands to set the new user up:


mysql> CREATE USER 'repl'@'%.example.com' IDENTIFIED BY 'password';
mysql> GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%.example.com';

Step 1: Synchronize

This is only necessary when working with servers that are already replicating without using GTIDs. If you’re starting up a new server, you can skip to step 3.

Wait for all ongoing transactions to commit or roll back, and allow the replica to catch up with the source. This is important, as it ensures the replica has processed all updates before continuing.

Next, make the servers read-only by setting the read_only system variable to ON with the following command:

mysql> SET @@GLOBAL.read_only = ON;

‍

Step 2: Stop both servers

Use mysqladmin to stop each server. Make sure the username you supply has shut down privileges.

$> mysqladmin -uusername -p shutdown

Make MySQL Replication and Migration Easy with Airbyte.

Try FREE for 14 Days

Step 3: Restart both servers with GTIDs enabled and options configured

Each server must be started with GTID mode enabled - this can be done by setting the gtid_mode variable to ON. It’s recommended that you also set the enforce_gtid_consistency variable to ON, to ensure that only statements which are safe for GTID-based replication are logged.

Step 4: Set the replica to use the source as its data source, and use auto-positioning

In this step, you will instruct the replica to use the source with GTID based transactions as the replication data source, and to use GTID-based auto-positioning rather than file-based positioning. This can be done by issuing a CHANGE_REPLICATION_SOURCE_TO statement or CHANGE_MASTER_TO statement on the replica. Make sure to include the SOURCE_AUTO_POSITION | MASTER_AUTO_POSITION option in the statement.

Then, if not already set, you will need to issue the following commands to connect the replica to the source.


mysql> CHANGE MASTER TO
     >     MASTER_HOST = host,
     >     MASTER_PORT = port,
     >     MASTER_USER = user,
     >     MASTER_PASSWORD = password,
     >     MASTER_AUTO_POSITION = 1;
Or from MySQL 8.0.23:
mysql> CHANGE REPLICATION SOURCE TO
     >     SOURCE_HOST = host,
     >     SOURCE_PORT = port,
     >     SOURCE_USER = user,
     >     SOURCE_PASSWORD = password,
     >     SOURCE_AUTO_POSITION = 1;

Step 5: Take a new backup

Existing backups are no longer viable once you’ve switched over to GTID-based replication. For that reason, it’s recommended to create a new backup at this stage.

Step 6: Start the replica and disable read-only mode

Use the following commands to start your replica:


mysql> START SLAVE;
Or from MySQL 8.0.22:
mysql> START REPLICA;

If you set your global variable to read_only, you will want to change it back:

mysql> SET @@GLOBAL.read_only = OFF;

Replicating Data Between MySQL and External Data Stores with Airbyte

Airbyte features a catalog of source and destination connectors that you can use to quickly build reliable EL (Extract & Load) pipelines. Airbyte works by providing two connectors - a source connector and a destination connector. These connectors can then create a connection, which is your EL pipeline. Many connectors in the Airbyte catalog are community built and maintained. The MySQL source connector is certified, meaning the Airbyte team maintains the connector, and provides a production-readiness guarantee.

Airbyte’s MySQL source connector offers multiple methods of data replication, including Change Data Capture (CDC) replication via the binlog.

When replicating data from your MySQL database to an external source, it is often better to replicate your data from a replica MySQL server, rather than the source. This reduces the load on the primary server, increases fault tolerance, and brings flexibility to maintenance operations. Consider setting up a source and replica with MySQL’s native replication mechanisms. Note: The `log_replica_updates` setting must be enabled for this kind of chaining to work. Log_replica_updates ensures the replica writes transactions to its own binlog.

Let’s take a look at how to set up a MySQL source connector.

Step 1: Create a dedicated read-only MySQL user with permissions to replicate data

First, use the following commands to create a new user:


CREATE USER user_name IDENTIFIED BY 'your_password_here';

Then, give the user read-only access to relevant schemas and tables:


GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO username;

Step 2: Enable binary logging on your MySQL server

As mentioned previously, you must enable binary logging for MySQL to use it for replication purposes. Many cloud providers offer a one-click option for enabling the binlog on your MySQL database.

When managing your own server, you can set up the binlog with the following configurations:


server-id                  = 223344
log_bin                    = mysql-bin
binlog_format              = ROW
binlog_row_image           = FULL
binlog_expire_logs_seconds  = 864000

Let’s break down each config:

server-id: A unique, non-zero value between 1 and 4294967295 is required for each server and replication client.
log_bin: The base name of the sequence of binlog files.
binlog_format: Airbyte relies on row-based logging, so set this option to ‘ROW’.
binlog_row_image: Airbyte also expects the full row image to be written to the binlog.
binlog_expire_logs_seconds: This is the number of seconds before MySQL will automatically remove the binlog file. We recommend setting this to 864000 seconds (10 days), and setting your sync frequency to short intervals. This should ensure that if there is a failure or pause, it can be re-started without losing the binlog file.

Step 3: Create a new MySQL source in Airbyte UI

From the Airbyte UI, select Sources from the left nav bar, and search for MySQL. Then, create a new MySQL source. You will be asked to fill out:

The hostname, port number, and name for your MySQL database
The username and password you created in Step 1
An SSL mode - we recommend require or verify-ca. Both of these require encryption, and verify-ca also requires certificates from your MySQL database. Airbyte also supports other SSL modes, as well as SSH tunneling.
Select Read Changes using Binary Log (CDC)
(On Airbyte Cloud) Allow inbound traffic from Airbyte IPs. You can find a list of all IPs that must be allowed listed in our Security docs.

To finish, click Set up source, and you’ve set up a MySQL source connector! Now, all that’s left is to set up a destination like BigQuery or Snowflake, and you can start replicating your MySQL data.

Final Thoughts

Today, we’ve examined how MySQL implements logical replication using the binlog and global transaction identifiers, as well as how Airbyte can be used for CDC replication. We hope you found this guide informative. If you liked this content, share it with a friend, or reach out to us on LinkedIn. We’d love to hear from you.

Until next time!

‍

Limitless data movement with free Alpha and Beta connectors

Introducing: our Free Connector Program ->

The data movement infrastructure for the modern data teams.

Try a 14-day free trial

About the Author

Jacob is a Senior Developer Advocate at Airbyte. Before joining Airbyte, Jacob was a software engineer and led technical writing teams. He enjoys teaching, writing, and connecting with other enthusiasts in the data space.