Learn how to replicate Postgres data to Kafka in a few minutes with Airbyte.
Data engineers are tasked to move data from a source system to a destination system. For data to move between different systems, there should be a data integration tool that can effectively communicate with both the source and destination systems and ensure data is transferred between these systems.
For small-scale projects where large volumes of data are not being processed, the communication between these systems can be through a network protocol like TCP or IP. For large-scale projects where the processing of huge volumes of data and scalability is paramount, communication can be through a message broker.
Airbyte is an open-source data integration tool that enables you to move data between various systems effortlessly, with low latency and high throughput. This tutorial will teach you how to easily migrate data from a Postgres database to Apache Kafka using Airbyte open-source.
Message broker systems came about as a result of the need for systems to communicate with each other effectively. Without message brokers, communication between multiple source and destination systems would be difficult. The diagram below shows six systems trying to communicate with each other.
In the architecture shown above, this system can be very challenging to manage. If one system is unavailable at a point, data is lost for that system. Also, it can be very difficult to ensure that each of these systems is using the same data format or communicate via the same language.
Another problem arising from this architecture is that it can not scale. When the number of source and destination systems increases rapidly, each system must maintain a separate connection with other systems which can be difficult to manage.
This is one of the problems a message broker solves. Message brokers act as an intermediate between source systems and destination systems. Using message brokers, messages are transported from the source system to the message broker and the message broker routes the appropriate messages to the appropriate destination.
This architecture is easier to manage and scale, compared to the one shown previously. If a system is down, the message broker stores the messages in a queue. When the system comes back up, the message broker pushes the messages to the system. In this way, data is not lost. Examples of message brokers are Apache Kafka, RabbitMQ, Apache Qpid, amongst others.
Apache Kafka is a distributed publish-subscribe (pub-sub) messaging system. This pub-sub system decouples senders and receivers of messages. Here, producers send messages to a Kafka topic. Consumers can then subscribe to one or more Kafka topics and consume messages persisted in the topic. One of the advantages of Apache Kafka is the decoupled architecture. Producers only need to know the topic they will send messages. Similarly, consumers only need to know the topic they can subscribe to. The producers and consumers send and receive messages via the Apache Kafka topic.
Now, you understand what message brokers are and why we need them. One question you might have in mind is “Why do I need to replicate data from Postgres to Kafka?”. There are numerous reasons why this is important. One of the reasons is fast and distributed data processing. Sometimes, data needs to be processed as soon as they are generated. One use case of this is fraud detection. Banking systems or e-commerce applications need to detect fraudulent transactions as soon as they are happening. For this, data needs to be migrated from their transactional database system to Kafka.
Another reason data can be replicated from Postgres to Kafka is for data backup. Apache Kafka is a highly scalable, distributed, and fault-tolerant event streaming tool. Most times, large organizations need a backup mechanism to recover data in case of failure, downtime, or disaster. Apache Kafka ensures that the data is available even if your system is unavailable at a particular point in time.
Another reason to replicate data to Kafka is for increased scalability. The decoupled architecture of Apache Kafka makes it easy to distribute data across multiple consumers. This allows your data to scale horizontally across multiple nodes.
Here are the tools you need to replicate data from a Postgres database to Apache Kafka.
You can install Airbyte locally by using just three simple commands. These commands can be run on your terminal.
When the command executes, you would see an Airbyte banner as in the image below.
When this banner shows, you can navigate to the URL http://localhost:8000/ . At first login, you will be asked for a username and password. By default, the username is airbyte and the password is password. You can change this in the .env file in your local Airbyte directory.
After installing Airbyte using Docker, the next step is to install Postgres. You can install Postgres for your operating system by using the link. After installing Postgres and configuring it, you will have to create a database. You can create a database by running the command shown below in your Postgres query editor or the psql terminal.
In this tutorial, the database that was created has the name postgres_kafka.
In this section, you will create a table in Postgres and populate the table with data. The data used in this tutorial is an open-source dataset of credit card transactions. This dataset can be downloaded here. The dataset used in this tutorial contains 19,963 records and 15 attributes.
After downloading the dataset, you can create a schema and table for it by executing the command below.
The code block above creates a schema called postgres_kafka_schema and a table of name postgres_kafka_table. This table will be populated with the data in the CSV you just downloaded. To populate the table with the CSV file, you will head over to psql terminal and run the command:
After executing this code block, the data will be successfully loaded into the Postgres database. You can preview the loaded data by executing the code:
You can also get information on the count of rows in the table by executing the command
In the previous section, you loaded the dataset into Postgres database. In this section, you will learn how to spin up a Kafka broker locally.
To spin up a Kafka broker locally without using docker, you will have to go to the directory of your installed Kafka and run the command
Apache Kafka uses Zookeeper to coordinate various components in the Kafka cluster. Zookeeper stores the configuration information for Apache Kafka such as the location of partitions, topics configuration, and so on. You can spin up a Zookeeper server by executing the command
The commands shown above spins up both Kafka broker and Zookeeper cluster. Now, you have to create a Kafka topic. A Kafka topic receives messages from the producer(in this case Postgres) and sends the messages to any consumer subscribed to it.
To create a Kafka topic, you can run the command
An alternative to this is using Docker to spin up the Zookeeper and Kafka clusters. You can spin up the clusters by executing the command docker compose up on this Github repository.
You can then create your Kafka topic by executing the command
In this case, the topic created has the name postgres_kafka_topic.
To create the Postgres source, head over to Airbyte on localhost:8000 and create a new source. Enter the source name, host address, database name, the schema you created (postgres_kafka_schema), username and password. All these details were used when creating your Postgres database.
In the image above, you created a Postgres source connection. If you use the default SSL mode of disable, your test might fail. You will get an error message of Unsecured connection. To solve this error, you will have to set SSL Modes to require. SSL Modes are ways of ensuring a secure browser connection. All data connections in Airbyte are encrypted.
In the previous section, you set up a Postgres source connection. In this section, you will set up Kafka destination in Airbyte to stream records from Postgres. In the Airbyte UI, select Destinations > new destination, and Kafka as the type.
Whilst selecting the Kafka destination, you put in your topic name, bootstrap server and for simplicity, leave the remaining values as the default.
Once the source and destination settings have been configured, the next thing to do is to go to Airbyte UI and set up a connection.
In this use case, incremental | append sync mode was used. There are various sync modes available in Airbyte. You can read more about sync modes by clicking here.
After creating the Postgres to Kafka connection, the next thing you do is to run the sync.
From the image above, the sync was successful and exactly 19,963 records were synced from the Postgres database to Apache Kafka. Recall previously, 19,963 records was the total number of records in the dataset. The time it took for this sync to run was exactly twenty-one seconds.
You see that Airbyte migrates data from a source to a destination in a very fast and optimized manner. To verify the results in Kafka, you can consume the messages in the Kafka topic by executing the command
The image below is a snapshot of the data streamed to the Kafka topic.
In this tutorial, you have learned about message brokers and how easy it is to replicate data from a Postgres database to a Kafka topic using Airbyte.
Speed is vital for any organization. By transferring data to a distributed system like Kafka, organizations can improve their data processing speed and scalability. This can aid organizations in making better data-driven decisions. By making use of the right tools, organizations can unlock a realm of fast and reliable data processing.
Using the Airbyte GitHub connector and Metabase, we can create insightful dashboards for GitHub projects.
Learn how to use Airbyte to easily synchronize your Klaviyo data into BigQuery.
Learn how to move your data to a data warehouse with Airbyte, model it, and build a self-service layer with Whaly’s BI platform.