Data & AI
Article

Understanding Change Data Capture (CDC): Definition, Methods and Benefits

Thalia Barrera
May 12, 2022
10 minutes
Limitless data movement with free Alpha and Beta connectors
Limitless data movement with free Alpha and Beta connectors
Introducing: our Free Connector Program ->

Database triggers (Trigger-based CDC)

This method requires the creation of database triggers with logic to manage the metadata within the same table or in a separate book-keeping table, often called a shadow table

Most databases allow the creation of triggers; you can see how to create a trigger for PostgreSQL.

Key challenges:

  • If a transaction fails, roll-back logic may need to be implemented to remove the operation from the shadow table.
  • The trigger needs to be modified in case of table schema changes.
  • Triggers cannot be reused for other databases, given the differences in SQL language.
  • The use of triggers can slow down the transactional workload.

Database transaction log (Log-based CDC)

Log-based CDC uses the transaction logs that some databases – such as Postgres, MySQL, SQL Server,  and Oracle – implement natively as part of their core functionality.

Log-based and trigger-based CDC are very similar – both keep a log of changes every time a database operation happens – so the shadow table and the transaction log contain the same information. The difference between log-based and trigger-based CDC is that the first one uses a core functionality of the database (transaction log); meanwhile, the triggers are created and defined by the user.

Since database logs are updated in every transaction, the experience is transparent, which means log-based CDC does not require any logical changes in database objects or the application running on top of the database. A system reads data directly from the database Change Data Capture logs to identify changes in a database, minimizing the impact of the capture process.

Key challenges:

  • Some database operations are not captured in the CDC logs, such as ALTER or TRUNCATE. In that case, additional logic needs to be configured to force the logging of those operations.
  • If the destination datastore is down, transaction logs must be kept intact until the subsequent replication happens.

As you can see, there are several approaches to implementing Change Data Capture. Before databases included native CDC, data engineers used techniques such as table differences, table metadata, and database triggers to capture changes made to a database. These methods, however, can be inefficient and intrusive, and they tend to impose overhead on source databases.

As a result, many modern and real-time data architectures employ log-based CDC, which uses a background process to scan database transaction logs for changed data. Transactions are unaffected, and the impact on source databases is minimal. It’s also easy to implement, provided you have the right Change Data Capture tool.

Log-based CDC replication using Airbyte

Until now, we have discussed what Change Data Capture is, different implementation methods, and why log-based CDC may be the best solution for production-level data replication. But, how can you efficiently use it in your data stack?

Even if the source database automatically creates a transaction log that you can readily use to replicate data, you still need to parse that log and load the changes to the target data warehouse or data lake. In other words, you need a consumer that understands the specific log syntax and propagates changes to the destination.

In addition, you may have more than one source database that you want to replicate using CDC, which means you need to create and maintain different consumers.

Does the above start to sound too complicated? Then, you may need to consider using a Change Data Capture tool and that’s where Airbyte can help. As the open-source data integration platform for modern data teams, Airbyte supports log-based CDC from Postgres, MySQL, and Microsoft SQL Server to any destination, such as BigQuery or Snowflake.

To support log-based CDC, Airbyte uses Debezium, an open-source platform that constantly monitors your databases, allowing your applications to stream database row-level changes. Airbyte is engineered to use Debezium as an embedded library, so you don’t need to worry about knowing its specifics.

Conclusion

In this post, we defined what Change Data Capture is, how it works, and the benefits it can offer to a production data stack. 

CDC is a data replication method with several advantages, especially if using log-based CDC. It’s one of the best approaches that data teams can use to enable near real-time analytics without compromising performance.

But even if implementing log-based CDC in your data stack comes with many benefits, it doesn’t mean that it’s easy. It requires data engineering knowledge and expertise. Fortunately, modern data integration platforms like Airbyte can help with the heavy lifting.

With Airbyte, you can create a log-based CDC pipeline without the need to parse logs or create and maintain logic for different consumers. This tutorial shows how straightforward it is to configure CDC for Postgres. If you’re ready to try it, you can deploy Airbyte open-source locally or on the cloud for free.

To explore more about data replication tools and their capabilities, check out our article on the top data replication tools available in the market.

The data movement infrastructure for the modern data teams.
Try a 14-day free trial