MongoDB CDC: How to Sync in Near Real-Time

January 29, 2024
9 min read

Organizations often run more than one database to support different applications as well as reporting and analytics use cases. As a result, data from one database frequently needs to be relayed to another database or system. This is where Change Data Capture (CDC) comes in: it lets you sync data across systems in real time. 

In this article, we will look at CDC from the standpoint of the MongoDB database. You will learn how you can perform MongoDB CDC, how it works, and multiple ways to implement it.

What is Change Data Capture? 

Change Data Capture (CDC) is the process of identifying and tracking changes to data in a database. It enables real-time or near real-time movement of data by processing and propagating changes continuously as they occur. In ETL, CDC is used for data replication: data is extracted from a source, transformed into a standard format, and loaded into a destination such as a data warehouse or analytics platform. CDC is not limited to replicating data across platforms, though; it also helps maintain data quality and integrity. 

What is MongoDB?

MongoDB is an open-source, document-oriented database designed to store large amounts of data. It is categorized as a NoSQL database because it stores and retrieves data as documents rather than tables. MongoDB stores data in BSON (Binary JSON), a binary representation of JSON-like documents. This format is convenient for developers working with large, varied data sets because each document carries its own flexible structure. 

Additionally, MongoDB stands out with features like ad-hoc queries, load balancing, and indexing, which make it ideal for organizations seeking efficient data management. Well-known organizations that use MongoDB in their tech stack include Uber, Forbes, Coinbase, and Accenture. 

What are the CDC Methods in MongoDB?

MongoDB supports CDC and provides several mechanisms for capturing changes. To understand how they work, let's look at the different CDC methods in MongoDB: 

  • Tailing MongoDB Oplog
  • Change Streams
  • Data Tracking Using Timestamps

Tailing MongoDB Oplog

In MongoDB, CDC is built on the oplog (short for operation log), a special capped collection that underpins replication and records all operations that modify the data stored in your database. Whenever a change event happens in your MongoDB instance, such as an insert, delete, or update, it is recorded in the oplog. You can use connectors or applications to tail the oplog and track changes in the database as they are written. Once the change data is received by the application, it can be processed further or streamed to other downstream systems according to your requirements.
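
As an illustration, here is a minimal sketch of tailing the oplog with PyMongo. The connection string, replica set name, and starting position are assumptions you would adapt to your deployment; the loop simply prints each operation it sees:

import time
import pymongo

# Connection string and replica set name are placeholders; adjust for your deployment.
client = pymongo.MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
oplog = client.local["oplog.rs"]

# Start from the newest oplog entry so only new changes are streamed.
last = oplog.find().sort("$natural", pymongo.DESCENDING).limit(1).next()
ts = last["ts"]

while True:
    # A tailable, awaitable cursor blocks briefly while waiting for new entries.
    cursor = oplog.find(
        {"ts": {"$gt": ts}},
        cursor_type=pymongo.CursorType.TAILABLE_AWAIT,
    )
    for entry in cursor:
        ts = entry["ts"]
        # entry["op"] is the operation type: "i" insert, "u" update, "d" delete.
        print(entry["op"], entry.get("ns"), entry.get("o"))
    time.sleep(1)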

Change Streams

Change streams are an application programming interface (API) that lets you subscribe to changes in a MongoDB collection, database, or entire deployment. They are built on top of MongoDB's oplog and act as a layer between the oplog and your application, listening for changes. The change stream method works as follows:

  • To use change streams, you can include the .watch() method in your application code on a specific MongoDB collection.
  • Then, you can specify filters to tailor change streams according to your specific requirements.
  • Once the change stream is established, MongoDB notifies the application of change events as they occur and continuously monitors specified collections. 

Change streams can be a good alternative to tailing the oplog, as they simplify MongoDB CDC with a straightforward API; a minimal sketch follows below. 
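
For example, here is a minimal change stream listener written with PyMongo. The database and collection names are placeholders, the pipeline filter is optional, and change streams require a replica set or Atlas cluster:

import pymongo

# Cluster URI, database, and collection names are placeholders.
client = pymongo.MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
orders = client["inventory"]["orders"]

# Optional pipeline: only watch inserts and updates on this collection.
pipeline = [{"$match": {"operationType": {"$in": ["insert", "update"]}}}]

with orders.watch(pipeline, full_document="updateLookup") as stream:
    for change in stream:
        # Each event carries the operation type, the document key, and
        # (with updateLookup) the full post-image of the changed document.
        print(change["operationType"], change["documentKey"], change.get("fullDocument"))

In a production pipeline, you would also persist the stream's resume token so the listener can pick up where it left off after a restart.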

Data Tracking Using Timestamps

This method involves associating each record or document with a timestamp to track changes over time. Every time a document is inserted or updated, it is stamped with the time of the change, and a sync job compares those timestamps against the time of its last run to find what has changed. Deletes need special handling (for example, soft-delete flags), since a removed document can no longer carry a timestamp. 

This is not a dedicated CDC mechanism of MongoDB. Instead, it relies on developers to manually maintain timestamps inside the documents. While this approach is very flexible, it must be implemented carefully to ensure efficient querying and accurate timestamping, which makes it a resource-intensive method requiring more time and technical effort.
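
To make this concrete, here is a minimal sketch of timestamp-based tracking with PyMongo. The collection and the updated_at field are illustrative conventions, not MongoDB features, and the application is responsible for setting the timestamp on every write:

from datetime import datetime, timezone
import pymongo

# Connection string, collection, and field names are placeholders.
client = pymongo.MongoClient("mongodb://localhost:27017")
orders = client["inventory"]["orders"]

# Application write path: stamp the document with the modification time.
orders.update_one(
    {"_id": 42},
    {"$set": {"status": "shipped", "updated_at": datetime.now(timezone.utc)}},
    upsert=True,
)

# Periodic sync job: pull everything changed since the last successful run.
last_sync = datetime(2024, 1, 1, tzinfo=timezone.utc)
for doc in orders.find({"updated_at": {"$gt": last_sync}}):
    print(doc)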

How to Implement MongoDB CDC?

There are a number of ways to implement CDC in MongoDB. Here are two approaches, each with a detailed setup guide: 

  • MongoDB CDC With Airbyte
  • Using Change Streams With Confluent Cloud  

MongoDB CDC With Airbyte

Airbyte is a popular data integration tool that allows you to streamline the MongoDB CDC process with its easy-to-use interface and extensive library of connectors. Here’s a step-by-step guide on performing CDC with the MongoDB connector of Airbyte:

Prerequisites

  • A MongoDB Atlas cluster. 
  • An Airbyte Cloud account. 

Step 1: Setting Up MongoDB Atlas 

Step 2: Setting MongoDB As a Source

  • Sign up or log in to Airbyte Cloud. After navigating to the main dashboard, click on the Sources option in the left navigation bar. 
  • In the Sources section, use the search bar at the top and type in MongoDB. Click on the connector when it appears. 
  • You'll be redirected to the New Source page. Select the 'MongoDB Atlas Replica Set' option at the top and fill in details such as Connection String, Database name, Username, and Password from your MongoDB cluster. Expand the Advanced section to optionally set Initial Waiting Time in Seconds, Size of the queue, and Document discovery sample size.
  • Click on Set up source.

That concludes the source setup. If you followed the above steps carefully, you have successfully configured MongoDB as a CDC source in Airbyte. If you also want to sync data to another location, continue with the next steps. 

Step 3: Setting A Destination (Optional)

  • Click on Destinations on the left side of the navigation bar. 
  • On the Destinations page, use the search bar and search for the destination of your choice (let’s take PostgreSQL). 
  • Click on the PostgreSQL connector card. 
  • Fill in the details on the Create a destination page, including Host, Port, DB Name, and User.
  • Click on Set up destination.

Step 4: Create a Connection Between Source and Destination

  • Go to the home page. Click on Connections > Create a new Connection.
  • Select MongoDB as a source and Postgres as a destination to establish a connection between them. 
  • Enter the Connection Name and configure the Replication frequency according to your requirements. You can optionally tweak additional configuration options, including Schedule type, Destination namespace, Detect and propagate schema changes, and Destination Stream Prefix.
  • In the Activate the streams you want to sync section on the connection page, select which streams to sync; they will be loaded into the destination. Refer to Airbyte's documentation on sync modes to learn more.
  • Click on the Set up connection button. Once setup is complete, run the sync by clicking the Sync now button. This establishes the connection between MongoDB and Postgres. 

For a smooth transition, explore our article on migrating from MongoDB to Postgres.

Benefits of Using Airbyte

Below are some of the benefits of using Airbyte for performing CDC with MongoDB: 

  • User Interface: The intuitive and user-friendly interface of Airbyte streamlines the CDC processes for users with varying levels of technical expertise. 
  • Monitoring and Orchestration: Airbyte's built-in features, such as workflow orchestration and logging, help you monitor MongoDB CDC pipelines efficiently. 
  • Scalability: Airbyte provides horizontal scalability, which enables you to scale your data integration efforts as MongoDB data grows. 

Using Change Streams With Confluent Cloud  

In this implementation, we will perform MongoDB CDC using Confluent and MongoDB Atlas. This involves leveraging the Change Streams of MongoDB Atlas to capture changes in data and move these changes to a Kafka topic. Then, you can connect downstream applications or other databases from the Kafka topic to stream MongoDB data in real time. Below are detailed steps:

Prerequisites

  • A MongoDB Atlas cluster. 
  • A Confluent Cloud account. 

Step 1: Set Up MongoDB Atlas

Step 2: Install Confluent 

Step 3: Set Up Confluent 

  • Launch the Confluent Cloud Cluster. 
  • In the left menu, click Connectors. Search for MongoDB Atlas Source and select the connector card. 
  • Fill in the required details on the connector screen, such as Kafka Credentials, Authentication, Sizing, etc.

Step 4: Create the Connector Configuration File in Confluent CLI

  • Launch the Confluent CLI on your system and log in with the required credentials. 
  • Now, create a file in JSON format that contains configuration properties. The configuration file looks like this: 
 {"connector.class": "MongoDbAtlasSource",     "name": "my-connector-name",     "kafka.auth.mode": "KAFKA_API_KEY",     "kafka.api.key": "my-kafka-api-key",     "kafka.api.secret": "my-kafka-api-secret",     "topic.prefix": "topic-prefix",     "connection.host": "database-host-address",     "connection.user": "database-username",     "connection.password": "database-password",     "database": "database-name",     "collection": "database-collection-name",     "poll.await.time.ms": "5000",     "poll.max.batch.size": "1000",     "startup.mode": "copy_existing",     "output.data.format": "JSON"     "tasks.max": "1"}

Step 5: Load the Configuration File and Create a Connector

Run the command below in the command line:

confluent connect cluster create --config-file configuration-file.json

Here, --config-file points to the configuration file created in the previous step (configuration-file.json in this example).

Step 6: Check Kafka Topic

By now, MongoDB documents should be populating the Kafka topic. Run the following command in Confluent CLI to check the topics: 

confluent kafka topic list --bootstrap-server your-kafka-bootstrap-server

Replace your-kafka-bootstrap-server with the address of your Kafka cluster's bootstrap server. If the connector restarts for any reason, you may see duplicate records in the topic. 
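
Once the topic is populated, a downstream application can consume the change events. Below is a minimal consumer sketch using the confluent-kafka Python client; the bootstrap server, API key and secret, and topic name (which follows the connector's topic-prefix.database.collection convention) are placeholders matching the configuration above:

from confluent_kafka import Consumer

# Bootstrap server, credentials, and topic name are placeholders.
conf = {
    "bootstrap.servers": "your-kafka-bootstrap-server:9092",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "my-kafka-api-key",
    "sasl.password": "my-kafka-api-secret",
    "group.id": "mongodb-cdc-consumer",
    "auto.offset.reset": "earliest",
}

consumer = Consumer(conf)
consumer.subscribe(["topic-prefix.database-name.database-collection-name"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print("Consumer error:", msg.error())
            continue
        # Each message value is a JSON-encoded MongoDB change event.
        print(msg.value().decode("utf-8"))
finally:
    consumer.close()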

If you have carefully followed the steps mentioned above, you should have a MongoDB CDC setup using Atlas and Confluent. 

Conclusion 

MongoDB is an ideal database if you want to build a real-time notification system, conduct historical analyses, or ensure data consistency. It provides a flexible set of tools for performing CDC that can meet your development requirements. Techniques like MongoDB's oplog and change streams not only ensure data synchronization across platforms but also enable dynamic capabilities like event-driven architectures and automated notifications. By applying the methods and implementations described above, you can harness the full potential of MongoDB CDC. 

To streamline the process of CDC with MongoDB, you can use Airbyte. Its rich features, like an easy-to-use interface and an extensive connector library, elevate MongoDB's CDC capabilities and simplify data movement. Sign up or log in to Airbyte now. 
