What is Kafka Streams: Example & Architecture

July 4, 2024
20 min read

Apache Kafka, an open-source streaming platform, has become a popular choice in modern data architectures. Engineered to manage real-time data feeds and stream processing at scale, it is an efficient tool for capturing, storing, and processing massive data streams in a fault-tolerant and highly available form.

One of Apache Kafka's key components is Kafka Streams, a powerful stream-processing library seamlessly integrated into the Kafka ecosystem. With Kafka Streams, you can unlock your data's full potential by gaining actionable insights, detecting patterns, and triggering automated responses as data flows through your system.

Read along to learn more about Kafka Streams and how to work with it.

What is a Kafka Stream?

Kafka Streams is a client library used to build stream-processing applications and microservices on top of Apache Kafka. It enables you to consume data from Kafka topics, perform analytical or transformation tasks on the data, and send the processed data to another Kafka topic or external system.

Kafka Streams allows you to build an application quickly on a single-node machine. As your needs grow, you can easily scale the application by adding more instances. It also handles common streaming challenges for you, such as parallel processing, scalability, and fault tolerance.

Key Features of Kafka Streams

Here are some of the key features of Kafka Streams:

  • Kafka Streams is a simple and lightweight client library that can be embedded in any Java application and integrated with any deployment tool to get going with stream processing.
  • It has no external dependencies on any system other than Apache Kafka itself.
  • Kafka Streams processes one record at a time and supports exactly-once processing semantics, so each record is processed only once even if a failure occurs.
  • It can handle all time-domain complexities, such as event time, processing time, late-arriving records, and high watermarks.
  • Kafka Streams supports stateful operations, such as aggregations, joins, and windowing, through distributed, fault-tolerant state stores.
  • It offers the necessary stream-processing primitives, including a low-level Processor API and a high-level Streams DSL (Domain-Specific Language); a short DSL sketch follows this list.
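
To give a feel for the high-level Streams DSL, here is a minimal sketch that filters one topic into another. It is an illustration only: the application id, broker address, and the topic names "sentences" and "long-sentences" are placeholders, not part of the word-count example developed later in this article.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class DslSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "dsl-sketch");         // placeholder app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> sentences = builder.stream("sentences");      // read from an input topic
        sentences.filter((key, value) -> value.length() > 20)                 // keep long sentences only
                 .to("long-sentences");                                       // write to an output topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}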

Kafka Streams Processing Topology

A processor topology defines the stream-processing logic of your application: how input streams are transformed into output streams. A topology is a graph of stream processors (nodes) connected by streams (edges), possibly together with shared state stores. A Kafka Streams topology includes two special types of processors.

Processor Topology

Source Processor

A source processor is a specialized processor in the topology that has no upstream processors. It produces an input stream for the topology by consuming records from one or more Kafka topics and forwarding them to the downstream processors.

Sink Processor

A sink processor is the exit point in a topology where the processed data is sent out to Kafka topics. It receives records from upstream processors and writes them to one or more Kafka topics.

Note: Upstream processors supply data to other processors; they are the sources or initial points in a topology. Downstream processors receive data from upstream processors for further processing.
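
To make the source and sink roles concrete, here is a minimal sketch built with the low-level Processor API. It assumes a recent Kafka Streams release (2.7 or later) that provides the typed org.apache.kafka.streams.processor.api interfaces; the topic names, processor names, and the upper-casing logic are placeholders for illustration.

import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;

public class TopologySketch {
    public static void main(String[] args) {
        Topology topology = new Topology();
        // Source processor: consumes records from "input-topic" and feeds the topology
        topology.addSource("Source", "input-topic");
        // Intermediate (downstream) processor: transforms each record and forwards it
        topology.addProcessor("Uppercase", UppercaseProcessor::new, "Source");
        // Sink processor: writes the processed records back out to "output-topic"
        topology.addSink("Sink", "output-topic", "Uppercase");
        System.out.println(topology.describe()); // print the topology structure
    }

    static class UppercaseProcessor implements Processor<String, String, String, String> {
        private ProcessorContext<String, String> context;

        @Override
        public void init(ProcessorContext<String, String> context) {
            this.context = context;
        }

        @Override
        public void process(Record<String, String> record) {
            // Upper-case the value and forward the record downstream (to the sink)
            context.forward(record.withValue(record.value().toUpperCase()));
        }
    }
}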

Kafka Streams Architecture

The diagram below shows the structure of an application that uses the Streams API: a Kafka Streams application contains one or more threads, and each thread executes one or more stream tasks.

Kafka Streams Architecture

Key architectural concepts of Kafka Streams include:

Stream Partitions and Tasks

Kafka Streams uses the concepts of partitions and tasks to parallelize processing and enhance performance.

Partitioning the data improves the efficiency and effectiveness of data processing in distributed environments like Kafka Streams. It enables data locality, elasticity, scalability, high performance, and fault tolerance.

Each stream partition in Kafka Streams is an ordered sequence of data records and maps to a partition of a Kafka topic. A data record corresponds to a Kafka message from that topic. The keys of data records determine how data is partitioned in both Kafka and Kafka Streams, i.e., which partition within a topic a record is routed to.

Kafka Streams creates a fixed number of tasks based on the number of input stream partitions, and each task is assigned a set of those partitions. Each task instantiates its own processor topology for the partitions assigned to it. Tasks are processed independently and in parallel, without any manual intervention.

Threading Model

Kafka Streams lets you configure the number of threads used for parallel processing within an application instance. Each thread independently executes one or more tasks along with their processor topologies.

Increasing the number of stream threads simply gives each thread its own copy of the topology and its own subset of the Kafka partitions to work on, thereby enabling parallel processing. There is no shared state among the threads, which eliminates inter-thread coordination and makes it possible to run topologies in parallel across application threads and instances. Kafka Streams distributes the Kafka topic partitions among the stream threads transparently, leveraging Kafka’s coordination functionality.
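
The thread count is controlled by the num.stream.threads setting, which defaults to 1. In a configuration file like the streams.properties used later in this article, it would look like this (the value 4 is just an example):

# Run four stream threads in this application instance
num.stream.threads=4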

Local State Stores

Kafka Streams provides local state stores, which are primarily used for storing and querying data within stream-processing applications. In a Kafka Streams application, each stream task may incorporate one or more local state stores, which the processing logic can access through APIs to store and retrieve the data it needs.

Furthermore, Kafka Streams ensures the robustness of local state stores by maintaining replicated changelog topics in Kafka, which track state store updates. In case of failure, Kafka Streams restores the state stores by replaying the changelog topics, making failure handling transparent to the user.
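
As an illustration, once an application whose topology materialized a store named "CountStore" (as in the word-count example later in this article) is running, the store can be queried from within the same process. This is a sketch assuming Kafka Streams 2.5 or later, where StoreQueryParameters is available:

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class StateStoreQuery {
    // Given a running KafkaStreams instance whose topology materialized "CountStore",
    // look up the current count for a single word from the local state store.
    static Long countFor(KafkaStreams streams, String word) {
        ReadOnlyKeyValueStore<String, Long> store = streams.store(
                StoreQueryParameters.fromNameAndType("CountStore",
                        QueryableStoreTypes.keyValueStore()));
        return store.get(word);
    }
}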

Fault Tolerance

Tasks in Kafka Streams utilize fault-tolerance capabilities to handle failures.

In the event of a task failure, Kafka Streams automatically restarts the task on one of the remaining running application instances. In addition, Kafka Streams ensures that the local state stores are resilient to failures: each state store is accompanied by a replicated changelog Kafka topic, which tracks any updates to the state.

In the event of a failure, Kafka Streams guarantees the restoration of the associated state stores to their state prior to the failure. This is accomplished by replaying the changelog topics before resuming processing on the newly initialized tasks.

Implementing a Kafka Streams Example

Here is an example of using the Java Kafka Streams API to count the number of times each word appears in a Kafka topic.

Step 1: Start by adding the Kafka Streams library to your application as a dependency.


<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-streams</artifactId>
    <version>1.1.0</version>
</dependency>

Step 2: Configure Kafka Streams.

Before using the Streams API, set some basic configuration parameters. Create a streams.properties file with the following contents, replacing the bootstrap.servers list with your own Kafka cluster’s addresses.


# Kafka broker IP addresses to connect to
bootstrap.servers=54.236.208.78:9092,54.88.137.23:9092,34.233.86.118:9092

# Name of Streams application
application.id=wordcount

# Values and Keys will be Strings
default.value.serde=org.apache.kafka.common.serialization.Serdes$StringSerde
default.key.serde=org.apache.kafka.common.serialization.Serdes$StringSerde

# Commit at least every second instead of the default 30 seconds
commit.interval.ms=1000

Also, include the following settings in the streams.properties file, with your own truststore location and password.


ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1
ssl.truststore.location=truststore.jks
ssl.truststore.password=instaclustr
ssl.protocol=TLS
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-256
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
    username="ickafka" \
    password="64500f38930ddcabf1ca5b99930f9e25461e57ddcc422611cb54883b7b997edf"

Step 3: Create the Streams Application.

To create a Kafka Streams application, start by loading the properties:


Properties props = new Properties();
 
try {
  props.load(new FileReader("streams.properties"));
} catch (IOException e) {
  e.printStackTrace();
}

Create a new input KStream object on the wordcount-input topic:


StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> source = builder.stream("wordcount-input");

Now, to count how many times each word occurs, build the word-count KStream:


// Split each message into lowercase words, re-key each word by itself,
// group by word, count occurrences, and convert the counts to strings.
final Pattern pattern = Pattern.compile("\\W+");

KStream<String, String> counts = source
        .flatMapValues(value -> Arrays.asList(pattern.split(value.toLowerCase())))
        .map((key, value) -> new KeyValue<>(value, value))
        .groupByKey()
        .count(Materialized.as("CountStore"))
        .mapValues(value -> Long.toString(value))
        .toStream();

Next, the following code will direct the word count KStream output to a topic called wordcount-output.


counts.to("wordcount-output");

Then, create a KafkaStreams object and start it:


KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();
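
A common addition, though not part of the original snippet, is a shutdown hook so the application closes cleanly when the JVM is stopped:

// Close the Streams client cleanly on Ctrl+C / JVM shutdown
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));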

Step 4: Create the Input Topic.

Create a Kafka topic named wordcount-input, from which the application can read the input.
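
One way to create the topic is with the kafka-topics tool that ships with Kafka. The broker address, partition count, and replication factor below are placeholders to adjust for your cluster; on brokers older than Kafka 2.2, use --zookeeper instead of --bootstrap-server.

bin/kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic wordcount-input \
  --partitions 3 \
  --replication-factor 1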

Step 5: Set up a Kafka console producer.

Use the Kafka console producer to produce messages to the wordcount-input topic, as shown below.


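A typical way to do this is with the kafka-console-producer tool; each line you type becomes one message on the topic. The broker address is a placeholder, and older Kafka versions use --broker-list instead of --bootstrap-server.

bin/kafka-console-producer.sh \
  --bootstrap-server localhost:9092 \
  --topic wordcount-input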

Step 6: Run the Streams application.

Launch the Kafka Streams application. It will run continuously until stopped.

The Streams application output is updated as new messages arrive.

For example, consider an input topic:

Message 1: “All Streams lead to Kafka”

Message 2: “Hello Kafka Streams”

Message 3: “Join Kafka Summit”

The output will include the word counts that the Streams application has produced.


all      1
streams  1
lead     1
to       1
kafka    1
hello    1
kafka    2
streams  2
join     1
kafka    3
summit   1
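
To observe these counts, one option is the kafka-console-consumer tool, printing keys alongside values. The broker address is a placeholder; the values can be read as strings because the example converts the counts to strings before writing them out.

bin/kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic wordcount-output \
  --from-beginning \
  --property print.key=true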

Introducing Airbyte for Streamlining Data Integration Journey

While Kafka Streams is an efficient solution for processing data within the Kafka ecosystem, Airbyte focuses on a broader range of data movement requirements. It is a robust and reliable data integration platform that helps you move data from various sources, including databases, APIs, or cloud applications, to the desired warehouse or destination of your choice.

Airbyte

Unique features of Airbyte:

  • Ease of Use: Airbyte offers multiple development options for building and managing data pipelines, catering to users with different levels of technical expertise. These options include a user-friendly interface (UI), an API, a Terraform Provider, and PyAirbyte.
  • Pre-built Connectors: It has a comprehensive library of 350+ pre-built connectors for establishing seamless data integration between diverse sources and destinations.
  • Connector Development Kit (CDK): Airbyte’s CDK empowers you to develop custom connectors tailored to meet your specific data integration needs.
  • Change Data Capture (CDC): By leveraging Airbyte's Change Data Capture (CDC) technique, you can seamlessly capture and synchronize data modifications from source systems. This ensures that the target system is constantly updated with the latest changes.
  • Deployment Flexibility: Airbyte is an open-source data movement platform known for its flexibility and extensibility. Depending on specific requirements, you can deploy Airbyte in several environments, including a virtual machine (VM), a Kubernetes cluster, or locally using Docker.
  • Customized Transformations: It enables you to integrate with popular data transformation tools like dbt (data build tool), allowing you to perform advanced and complex transformations.
  • Robust Security: Airbyte emphasizes data security by implementing industry-standard measures. It incorporates robust access controls, encryption protocols, and authentication methods, ensuring that data access is restricted to authorized individuals.

How Can Airbyte and Kafka Work Together?

You can easily consume data from required sources and publish it to Kafka using Airbyte. First, set up a source connector in Airbyte by choosing from over 350 sources like APIs, databases, data lakes, or cloud data warehouses. If needed, you can even create custom connectors with Airbyte’s no-code builder.

Next, set Kafka as the destination connector by connecting it through simple account authentication. Configure the connection by setting the sync frequency and selecting the Kafka topics where the data should be loaded. This allows you to build a scalable and reliable data pipeline with ease.

Conclusion

You now have a clear understanding of how Kafka Streams empowers you to unlock the full potential of your data. This blog has explained Kafka Streams, its topology, and architecture, along with examples, in detail.

FAQs

1. What are the benefits of Kafka Stream?

Kafka Streams' benefits include scalability, fault tolerance, and tight integration with Apache Kafka. It allows developers to build real-time stream-processing applications without needing any external dependency beyond Kafka itself.

2. What are the examples of Kafka Streams?

One example of using Kafka Streams is to perform real-time aggregation of events, such as counting the number of occurrences of different types of events within a specified time window. For instance, you could use Kafka Streams to count the number of page views per minute from a stream of weblog events.
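
For instance, a minimal sketch of such a per-minute count is shown below. It assumes Kafka Streams 3.x and a hypothetical "pageviews" topic keyed by page URL; the output topic name is also a placeholder.

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class PageViewsPerMinute {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> pageViews =
                builder.stream("pageviews", Consumed.with(Serdes.String(), Serdes.String()));

        pageViews
            .groupByKey()                                                     // group events by page URL
            .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1))) // one-minute tumbling windows
            .count()                                                          // views per page per window
            .toStream((windowedKey, count) ->                                 // flatten the windowed key
                    windowedKey.key() + "@" + windowedKey.window().start())
            .mapValues(count -> Long.toString(count))
            .to("pageviews-per-minute", Produced.with(Serdes.String(), Serdes.String()));
        // Build a KafkaStreams instance with this topology and start it as usual.
    }
}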
