Learn how to use Airbyte, Kafka, and Pinot to build a data pipeline for a user-facing analytics dashboard.
Typically, a real-time analytics pipeline consists of several components, including ELT pipelines, an event streaming platform, and an analytical database that can answer queries at scale. However, building and maintaining such a platform is expensive and demands significant engineering effort and time.
Having open-source data engineering and analytics tools at your disposal helps reduce the cost of running a real-time analytics pipeline. Having access to the source code avoids vendor-locking, giving you the flexibility to customize the solution to tailor your organizational needs. This article explores three popular open-source products in the data space, Airbyte, Apache Kafka, and Apache Pinot, to build a user-facing e-commerce dashboard that updates in real-time.
Airbyte is an open-source data integration platform capable of moving data from OLTP databases such as MySQL to destinations such as Apache Kafka using change data capture (CDC) with low latency. Apache Pinot is an open-source OLAP database capable of ingesting streaming data from Kafka and making it available for querying within seconds.
This section discusses the key components of the solution, along with the rationale for picking them up for the project.
We derive the seller dashboard analytics by analyzing e-commerce orders. Currently, orders reside in a MySQL database, making it challenging to run analytical queries. The OLAP queries require aggregating and filtering a large batch of records to generate metrics, resulting in performance degradation in MySQL, which is not designed to handle such queries.
So we need to extract the orders from MySQL and move them to Apache Pinot for further analysis. We use Airbyte for that.
A relational database like MySQL maintains a transaction log to record every state-changing operation such as inserts, updates, and deletes. CDC mechanism tails this transaction log to detect the database entities that have been changed and streams them as change events. That way, we can obtain incrementally updated orders to avoid costly full table extractions.
Airbyte runs this extraction at a scheduled interval, for example, every hour, day, week, or so. Let’s schedule it for every five minutes to get more fresh data. Extracted orders are written to a Kafka topic (orders) as JSON formatted events, streamed to Pinot from there.
The Airbyte ELT pipeline running every five minutes generates a massive amount of raw data to be analyzed. Also, the analyzed data will be exposed to all the sellers in the platform, forcing us to deal with a high query throughput and a latency range of milliseconds. Hence, the analytics engine must be capable of running analytical queries and returning results in real-time to ensure a good user experience. Therefore, we will use Apache Pinot as the analytics engine to satisfy these needs.
I hope now you have a solid understanding of what we will build next. You can either follow along with the article or have a quick look at the finished solution in this GitHub repo.
The article assumes you have Docker Compose installed on your machine. For better performance, it is recommended to have at least 8GB of RAM and adequate disk space. Clone the following GitHub repository to your local machine and navigate to the project folder.
Next, we will create a single node Kafka cluster and a multi-node Apache Pinot cluster with Docker Compose. The cloned project contains a docker-compose.yml file. Launch the Docker stack by running:
We will revisit this setup in the coming sections.
Once the Docker stack runs, let’s create a MySQL database to load some mock e-commerce orders. The location of the database doesn’t matter; it could be either a hosted MySQL instance or a local installation. For this article, let’s use a local installation.
Connect to your MySQL instance via MySQL CLI or using a GUI tool. Execute the following script to create the ecommerce database orders table, and insert some mock orders. You can find this script inside the mysql directory of the accompanying GitHub repository.
Once the script is completed, you can check the content inside the orders table by running:
It is always recommended to grant scoped permissions to Airbyte for accessing MySQL. We can do this by creating a dedicated MySQL user with the necessary privileges.To create a dedicated database user, run the following commands against your database.
The required permissions depend on the replication method. While the STANDARD replication method only requires SELECT permissions, CDC replication requires SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, and REPLICATION CLIENT permissions.
Now, our ecommerce database is ready to be used with Airbyte.
We will run Airbyte on your local machine as a separate Docker Compose project. You can follow these instructions to get it up and running.
Once Airbyte is running, we need to instruct Airbyte on where to read data (source), move data (destination), and create a connection. That can be done using the Airbyte UI, which runs on localhost:8000.
Log into Airbyte UI, choose Sources > new source, and select MySQL as the type. Make sure to select CDC as the replication method. We are not going to use SSH for this example. But, an SSH tunnel is recommended when using a public internet network.
Provide the following values in the UI. I’m connecting to my local MySQL installation, which runs in the default port. Feel free to adjust the values based on your environment.
Next, we will set up a Kafka destination in Airbyte to stream MySQL orders data in near real-time (minutes instead of seconds). Here, we are connecting to the single node Kafka instance we started earlier. First, we will create a Kafka topic named ‘orders’ which will be used to write the CDC data. Navigate to the cloned project, then run the following command from the root level.
In the Airbyte UI, select Destinations > new destination, and Kafka as the type. Most of the values can be set to defaults, except for the Topic Pattern and the Bootstrap Servers.
Once the source and destination are set up, you can create a connection from MySQL to Kafka in Airbyte to create a data pipeline between the two. In the “select the data you want to sync” section, choose the orders table and select Incremental under Sync mode.
Currently, you can set the sync frequency as low as five minutes. If you need a lower frequency you can trigger the Airbyte syncs from the API or integrate it to a workflow management tool like Airflow, Prefect or Dagster. Once configured, you can see the connection in the Connections tab.
Once the sync job runs for the first time, it syncs all the orders in MySQL to Kafka. You can run the following command to examine the content inside the orders topic.
You will see JSON formatted records coming out of Kafka like this.
Starting from the first job, Airbyte repeatedly runs a sync job every five minutes to sync new or updated orders from MySQL to Kafka.
Now that we have our orders showing up in Kafka. Next, we will ingest them into Apache Pinot so that the dashboard can run analytical queries.
Pinot is a distributed system made of different components responsible for data ingestion, data storage, and query brokering. Pinot also depends on Zookeeper for metadata storage and cluster coordination.
If you remember, we started Kafka, Zookeeper, and the rest of the Pinot components as Docker containers in the prerequisites. That simplifies many things for us. However, you can follow this guide if you want to set up a Pinot cluster manually.
Before ingesting the incoming stream of orders, Pinot requires you to define a structure for the stream beforehand. That enables Pinot to optimize its storage and indexing strategies to provide faster data analytics.
We achieve this by creating a schema and a table for the orders data. A schema provides a logical abstraction for the underlying data, declaring attributes, data types, and other constraints. A table implements a schema, specifying concrete information on data ingestion, indexing, and storage requirements. You can think of a schema as a template for a table.
Below is the schema for the orders table, which you can find inside the config folder of the GitHub repo. Schema attributes are partitioned into three sections, dimensions, metrics, and datetime fields. Aggregations are performed on metric fields such as counts, totals, and averages while temporal filtering and sorting are performed on datetime fields. The rest of the attributes fall under dimensions.
The following is the table definition, which is a REALTIME table. The streamConfigs configuration block specifies the Kafka and Zookeeper settings required for real-time data ingestion.
The sync records written to Kafka are formatted in JSON according to Airbyte specification. Therefore, during the ingestion, each Kafka event is normalized and mapped to the corresponding schema attribute. That happens inside the transformConfigs block.
In reality, an order can transition across many life cycle stages such as OPEN, PROCESSING, IN_TRANSIT, and CANCELLED. We have defined the status field to capture that in MySQL. Order status changes are captured by Airbyte and published to Kafka, ultimately ending up in Pinot as duplicated order entries. But, we know that they belong to the same order.
To avoid that, we can enable upserts in the orders table, a feature that allows merging together the data records bearing the same primary key. We have already defined id as the primary in the orders schema. The following configuration block in the above table definition defines a FULL upsert on the orders table, overwriting the old order entirely with the latest arriving order record.
Finally, run the following command to create the orders schema and table inside Pinot.
Pinot starts ingesting from the orders topic right after the command completion and populates the orders table with incoming orders.
We can also verify the upserts feature by updating the status of an order in MySQL.
Running the following query in Pinot will result in returning an order record with the status field updated as CANCELLED.
Now that we have the orders ingested into Pinot. We can do some ad-hoc SQL querying to find answers to the questions we had in the beginning. Let’s use the integrated Pinot query console for that.
In our example, each order has a store_id field to represent the store it belongs to. When a seller logs into the dashboard, orders can be filtered by his store_id. For now, let’s use 100 as a sample store_id.
This query returns the total sales for the past week.
This query returns the contribution of sales channels based on their revenue.
And this returns the average order value for the past week.
Now that we have done the hardest part of the solution, moving orders from MySQL to Pinot. Once we have data ingested into Pinot, a user-facing analytics dashboard can be built with any front-end technology such as React, Node.JS, or even Python. The dashboard then pulls data from Pinot through REST APIs or appropriate driver interfaces.
Pinot REST API allows you to post any of the SQL queries we executed above as an HTTP POST request. In return, you will get JSON formatted responses. For example, to query total sales:
Building the dashboard goes beyond the scope of this article. Therefore, we will look at it in a future article.
A user-facing analytics dashboard requires running complex OLAP queries on the underlying data set. It is not recommended to run them on operational OLTP databases as it can degrade the performance. Hence, the operational data must be moved to an OLAP database.
In this article, we learned about building a real-time analytics pipeline with Airbyte to move e-commerce orders from MySQL to Kafka. Ingested orders are streamed to Apache Pinot for answering low-latency OLAP queries, coming from a user-facing dashboard. As a result, sellers are presented with real-time store analytics, to be used for informed decision making.
Dunith Dhanushka is a seasoned IT professional with over ten years of experience in architecting, building, and consulting large-scale event-driven systems. He's particularly interested in exploring innovations in the enterprise data management space including streaming platforms, real-time analytics, and distributed databases. Dunith currently works at StarTree as a developer advocate. In his spare time, he materializes his thoughts on his blog on Medium.
Get all your ELT data pipelines running in minutes with Airbyte.
Learn how to use Airbyte’s Python CDK to write a source connector that extracts data from the Webflow API.
Learn how to load data to a Databricks Lakehouse and run simple analytics.