Getting Started with Apache Kafka

Developing scalable applications these days involves a lot of moving parts, especially when your company adopts microservices. One of the problems you need to handle is background processing. There are many ways to do background processing; one of them is via a message broker. So in this article we will take a look at how to process background tasks with Kafka.

What is Apache Kafka?

From the official website, Apache Kafka is:

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

In simple terms, think of Kafka as a database for processing background tasks. The main job of Kafka is to receive data from one service and distribute that data to other services. This way we can use Kafka to handle queues, process analytics data, run ETL pipelines, and much more.

Apache Kafka is written in Scala and Java and can be used to develop applications that process, store, and analyze streaming data. Kafka can connect to external systems (for data import/export) via Kafka Connect and provides Kafka Streams, a Java stream processing library.

Kafka Components

There is a lot of terminology around Kafka, so before you begin, here are the terms you need to know.

  • Publisher: The application that sends data to the Kafka cluster.
  • Consumer: The application that receives data from the Kafka cluster.
  • Cluster: A collection of Kafka servers.
  • Broker: A single Kafka server that stores and distributes the messages in topics.
  • ZooKeeper: The application that manages and coordinates the Kafka cluster.
  • Topic: A name for a stream of data stored in the Kafka cluster, just like a table in a traditional database.
  • Partition: The storage unit that holds a topic's messages. One topic can have multiple partitions.
  • Replication: A backup copy of a topic's messages; each replica is stored on a different Kafka server.
  • Consumer Group: Prevents consumers from processing the same message from a topic multiple times when a service runs on more than one node. This is useful to prevent duplicate data on the consumer side. Note that the number of consumer nodes should match the number of partitions of the topic, because each partition is consumed by exactly one member of the group.
  • Retention Policy: Messages in Kafka are not deleted after they are consumed; without a policy, you can only delete data by deleting the whole topic. With a retention policy we can configure when data on the Kafka server is deleted automatically, in one of two ways. The first is retention by size: for example, if we set the retention size to 1 GB, Kafka deletes the oldest data once the topic exceeds 1 GB, not all of it. The second is retention by time, just like an expiration time (see the example after this list).
  • Offsets: Pointers to the messages in a Kafka topic, used to keep track of which messages have been consumed.
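To make the retention policy concrete, here is a sketch of how you could set both time-based and size-based retention on a topic with the kafka-configs.sh tool that ships with Kafka. The topic name my-topic is just a placeholder for illustration; we'll create a real topic later in this tutorial.

# Keep data for at most 7 days (retention.ms) or 1 GB (retention.bytes),
# whichever limit is hit first. "my-topic" is a hypothetical topic name.
./bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter --entity-type topics --entity-name my-topic --add-config retention.ms=604800000,retention.bytes=1073741824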

Install Java

Kafka is written in Scala and Java, so in order to run it on your machine you need to install Java; download it here.
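To double-check that Java is installed correctly, print the version from your terminal; recent Kafka 3.x releases run on Java 8, 11, or 17.

java -version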

Install Kafka

Now download Kafka here and choose the package for your platform; it works on macOS, Linux, and Windows.

After downloading, extract the file using this command.

tar -xzf kafka-3.6.0-src.tgz
cd kafka-3.6.0-src/

Now let's update the ZooKeeper config in config/zookeeper.properties.

dataDir=./data/zookeeper
clientPort=2181
maxClientCnxns=0
admin.enableServer=false

Also update the server config in config/server.properties.

...
log.dirs=./data/kafka-logs
...

All this configuration moves the data directories out of /tmp (the default location), which is typically cleared on reboot, so we don't lose our data after the computer is restarted.

Run the server

First, let's run the ZooKeeper server.

./bin/zookeeper-server-start.sh config/zookeeper.properties
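If you want to verify that ZooKeeper is up before moving on, one quick check (assuming netcat is installed on your machine) is to send it the built-in srvr status command on its client port.

# Ask ZooKeeper for its status; a reply with version and mode means it's running.
echo srvr | nc localhost 2181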

If you hit this error: Classpath is empty. Please build the project first e.g. by running './gradlew jar -PscalaVersion=2.13.11', build the project from source first.

./gradlew jar -PscalaVersion=2.13.11

Then run the Kafka server.

./bin/kafka-server-start.sh config/server.properties
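Once it starts, you can confirm the broker is reachable by asking it for the API versions it supports; any response means Kafka is up.

# List the API versions supported by the broker on localhost:9092.
./bin/kafka-broker-api-versions.sh --bootstrap-server localhost:9092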

Create topics

Topics are an important part of Kafka. You can think of a topic like a table in a relational database. Before sending data to be distributed across the services, you need to create a topic first.

Let's imagine building a notification service using Kafka. If you think about notifications, there can be many kinds: for example, most web applications that implement authentication will send notifications for user registration, user sign in, or sign out.

Now, let's create a topic for sending a notification via email on user registration.

./bin/kafka-topics.sh --create --topic user-register-events --bootstrap-server localhost:9092
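This uses the broker defaults of one partition and a replication factor of 1. If you want several consumers in a group to read in parallel, you can set the partition count explicitly when creating a topic. Here is a sketch using a hypothetical user-signin-events topic; note that on our single-broker setup the replication factor cannot be more than 1.

# Create a hypothetical topic with 3 partitions so up to 3 consumers
# in the same group can read in parallel.
./bin/kafka-topics.sh --create --topic user-signin-events --partitions 3 --replication-factor 1 --bootstrap-server localhost:9092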

You can also see the details of a topic using this command.

./bin/kafka-topics.sh --describe --topic user-register-events --bootstrap-server localhost:9092

It should give you information like this, including the partition count, the replication factor, the leader broker, and the in-sync replicas (Isr).

$ ./bin/kafka-topics.sh --describe --topic user-register-events --bootstrap-server localhost:9092

...
Topic: user-register-events	TopicId: vtahK7PMRF2dXIJmWhpVtA	PartitionCount: 1	ReplicationFactor: 1	Configs: 
	Topic: user-register-events	Partition: 0	Leader: 0	Replicas: 0	Isr: 0

If you want to see all the topics, you can use this CLI command:

./bin/kafka-topics.sh --list --bootstrap-server localhost:9092

Result:

$ ./bin/kafka-topics.sh --list --bootstrap-server localhost:9092

...
user-register-events

Publish and subscribe to events

Well, we can send events from almost any programming language, but in this first getting-started tutorial we will use the CLI commands to interact with topics in Kafka.

Publish messages to the topic:

./bin/kafka-console-producer.sh --topic user-register-events --bootstrap-server localhost:9092

And here's how to listen:

./bin/kafka-console-consumer.sh --topic user-register-events --from-beginning --bootstrap-server localhost:9092
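The console consumer above joins a throwaway consumer group. To see the consumer groups from the terminology section in action, start the consumer with an explicit --group flag instead (email-notifier is a hypothetical group name); if you run two consumers with the same group, each message is processed by only one of them. Since our topic has a single partition, one consumer does all the work while the other stays idle.

# Consume as part of a named group.
./bin/kafka-console-consumer.sh --topic user-register-events --group email-notifier --bootstrap-server localhost:9092

You can then inspect the group's committed offsets and lag:

./bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group email-notifier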

Example messages:

[kafka-pubsub: screenshot of the producer and consumer terminals]
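In text form, every line you type into the producer terminal (at the > prompt) is sent to the topic, and the consumer terminal prints it. Here is a hypothetical session with illustrative JSON payloads:

Producer terminal:

> {"email": "jane@example.com", "event": "user-registered"}
> {"email": "john@example.com", "event": "user-registered"}

Consumer terminal:

{"email": "jane@example.com", "event": "user-registered"}
{"email": "john@example.com", "event": "user-registered"}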

So that's a very simple introduction to Apache Kafka. We'll take a look at how to implement producer and consumer apps using a programming language of your choice in the next article.