Learning Apache Kafka

If you are working on microservices, there will come a time when you need a message broker to distribute data across your services. One of the most popular options for this is Apache Kafka.

What is Apache Kafka?

Apache Kafka is an open-source stream-processing software platform, first developed by LinkedIn and now maintained by the Apache Software Foundation. It is used for building real-time data pipelines and streaming applications. It is a distributed, high-throughput, and fault-tolerant message broker.

Apache Kafka is written in Scala and Java and can be used to develop applications that process, store, and analyze streaming data. Kafka can connect to external systems (for data import/export) via Kafka Connect and provides Kafka Streams, a Java stream processing library.

Kafka Components

Kafka comes with a lot of terminology, so before you begin, here are the terms you need to know.

  • Producer (publisher): The application that sends data to the Kafka cluster.
  • Consumer: The application that receives data from the Kafka cluster.
  • Cluster: A collection of Kafka servers.
  • Broker: A single Kafka server, which stores and distributes the messages in topics.
  • ZooKeeper: The service that coordinates and manages the Kafka cluster.
  • Topic: A named stream of data stored in the Kafka cluster, comparable to a table in a traditional database.
  • Partition: The storage unit that holds a topic's messages. One topic can have multiple partitions (see the example commands after this list).
  • Replication: A backup copy of a topic's partitions, stored on different Kafka servers so the data survives a broker failure.
  • Consumer Group: A set of consumers that share the work of reading a topic, so that each message is handled by only one consumer in the group. This prevents duplicate data on the consumer side when a service runs more than one instance. Note that a group can use at most one active consumer per partition, so keep the number of instances no larger than the topic's partition count.
  • Retention Policy: Kafka does not let you delete individual messages; without a policy, you would have to delete a whole topic to remove its data. A retention policy tells Kafka when to delete data automatically, in one of two ways. The first is retention by size: for example, if we set the retention size to 1GB, Kafka deletes the oldest data once a partition grows past 1GB, not all of it. The second is retention by time, which works like an expiry time.
  • Offsets: An offset is a pointer to a message's position within a topic partition. Offsets are used to keep track of the messages that have been consumed.
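
To make these terms concrete, here is a sketch of how they map to the command-line tools that ship with Kafka, assuming a broker is already running on localhost:9092 (we will set one up below); the topic name orders and the group name order-service are just examples.

# Create a topic with 3 partitions and a replication factor of 1
# (the replication factor cannot exceed the number of brokers in the cluster).
./bin/kafka-topics.sh --create --topic orders --partitions 3 --replication-factor 1 --bootstrap-server localhost:9092

# Set a retention policy on the topic: delete data older than 7 days
# or beyond 1GB per partition, whichever limit is hit first.
./bin/kafka-configs.sh --alter --entity-type topics --entity-name orders --add-config retention.ms=604800000,retention.bytes=1073741824 --bootstrap-server localhost:9092

# Inspect a consumer group: shows, per partition, the last committed offset
# and how far the group lags behind the newest message.
./bin/kafka-consumer-groups.sh --describe --group order-service --bootstrap-server localhost:9092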

Install Java

Kafka is written in Scala and Java, so to run it on your machine you need to install Java; download it here.
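
Once installed, you can verify that Java is available on your PATH (Kafka 3.x requires Java 8 or newer):

# Print the installed Java version.
java -version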

Install Kafka

Now download the Kafka binary here. Choose the package for your platform; it works on macOS, Linux, and Windows.

After downloading, extract the archive and move into it using these commands.

tar -xzf kafka_2.13-3.3.1.tgz
cd kafka_2.13-3.3.1

Now let's update the ZooKeeper config in config/zookeeper.properties.

dataDir=data/zookeeper
clientPort=2181
maxClientCnxns=0
admin.enableServer=false

Also update the broker config in config/server.properties.

...
log.dirs=data/kafka
...

Both of these changes move the data directories away from the defaults under /tmp, making sure that we do not lose our data after the computer is restarted.

Run the servers

First, let's run the ZooKeeper server.

./bin/zookeeper-server-start.sh config/zookeeper.properties

Then, in a separate terminal (the ZooKeeper process keeps running in the first one), run the Kafka server.

./bin/kafka-server-start.sh config/server.properties
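
If both servers start without errors, you can do a quick smoke test from a third terminal. This is just a sketch: the topic name test-topic is arbitrary, and it assumes the broker is listening on the default localhost:9092.

# Create a topic to play with.
./bin/kafka-topics.sh --create --topic test-topic --bootstrap-server localhost:9092

# Start a console producer and type a few messages, one per line (Ctrl-C to exit).
./bin/kafka-console-producer.sh --topic test-topic --bootstrap-server localhost:9092

# In another terminal, read the messages back from the beginning.
./bin/kafka-console-consumer.sh --topic test-topic --from-beginning --bootstrap-server localhost:9092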