Getting Started With Kafka Streams - 1

Aditya Mohanty
3 min read · Jun 30, 2020
(Photo by Juken Tatesi on Unsplash)

Apache Kafka is a distributed streaming platform. Developed by a group of LinkedIn data engineers and open-sourced in 2011, it is written in Scala and Java and can handle trillions of records every day. Now that we need data and logs to be analysed, processed and handled in real time, Kafka has turned out to be a great choice for the job.

Kafka As A Distributed Streaming Platform:

As a streaming platform, Kafka is expected to publish, subscribe to, store and process streams of records. Let us first understand what exactly a record is. A record is essentially a byte array that can hold data in any format. It has four attributes: key, value, timestamp and headers. The value is the mandatory one, while the key, timestamp and headers are optional (a record's shape is sketched in code right after the list below). Before moving further, let us look at the key terminologies of Kafka. These are

  • Broker
  • Consumer
  • Zookeeper
  • Producer
  • Topics
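
Before going through them, here is a minimal sketch of a record and its four attributes, using the ProducerRecord class from Kafka's Java client. The topic name, partition, key, value and header here are made-up placeholders.

```java
import java.util.List;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;
import org.apache.kafka.common.header.internals.RecordHeader;

public class RecordSketch {
    public static void main(String[] args) {
        // Headers are optional key/byte-array pairs attached to a record.
        List<Header> headers = List.of(
                new RecordHeader("source", "web-app".getBytes()));

        // A record carries a mandatory value plus optional key,
        // timestamp and headers; topic and partition say where it goes.
        ProducerRecord<String, String> record = new ProducerRecord<>(
                "page-views",               // topic (placeholder name)
                0,                          // partition (optional)
                System.currentTimeMillis(), // timestamp (optional)
                "user-42",                  // key (optional)
                "{\"page\":\"/home\"}",     // value (mandatory)
                headers);                   // headers (optional)

        System.out.println(record);
    }
}
```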

Let us go through each of them one by one.

Broker:

Kafka mainly acts as a message broker. Just as a broker in daily life brings together two parties that do not necessarily know each other, Kafka does the same for producers and consumers. In short, Kafka is a message broker where producers send messages to it, and those messages are stored and made available to consumers through subscriptions to topics.
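
To make the decoupling concrete, here is a minimal sketch, assuming a broker reachable at localhost:9092, that lists the brokers in a cluster with Kafka's Java AdminClient. Producers and consumers never address each other; they only ever talk to these broker endpoints.

```java
import java.util.Collection;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.Node;

public class ListBrokers {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Hypothetical address of one broker, used only to bootstrap.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Each Node is one broker in the cluster.
            Collection<Node> brokers = admin.describeCluster().nodes().get();
            for (Node b : brokers) {
                System.out.println("Broker " + b.id()
                        + " at " + b.host() + ":" + b.port());
            }
        }
    }
}
```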

Topics:

A topic is a category to which records are published. Producers write data to topics, whereas consumers read from them. Before going further into topics, let us understand what a log is. A log can be defined as an append-only sequence of records, totally ordered by time; the figure below shows what a log looks like. So basically a log is a file where incoming records are appended sequentially. Topics in Kafka are logs segregated by topic name, or we can think of topics as labelled logs.

Log file with incoming records
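
Since a topic is just a named log, creating one amounts to naming a log and deciding how it is split and replicated. Here is a minimal sketch with the Java AdminClient; the topic name, partition count and replication factor are made-up values.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // One labelled log named "page-views", split into 3
            // partitions, each copied to 2 brokers.
            NewTopic topic = new NewTopic("page-views", 3, (short) 2);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```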

Producer:

A producer is an entity that sends records to topics. It can assign a record to the partition of its choice, as the sketch below shows. We shall discuss partitions in another blog.
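
A minimal producer sketch, assuming a broker at localhost:9092 and the page-views topic from the earlier sketch. The first send lets Kafka pick the partition; the second pins the record to a partition of the producer's choosing.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Let Kafka choose the partition (by default it hashes the key).
            producer.send(new ProducerRecord<>("page-views", "user-42", "/home"));

            // Or assign the record to a partition of our choice (here, 0).
            producer.send(new ProducerRecord<>("page-views", 0, "user-42", "/cart"));
        }
    }
}
```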

Consumer:

Consumers can read messages starting from a specific offset, which lets a consumer join the cluster at any point in time. Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing group, which is how consumption scales across instances. Just as producers do not know who will read their messages, consumers do not know who produced the messages they read.
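
A minimal consumer sketch under the same localhost assumption; the group name and offset are made-up values. It attaches to partition 0 of the page-views topic, seeks to a specific offset, and reads from there.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "page-view-readers"); // group label
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("page-views", 0);
            consumer.assign(Collections.singleton(partition));
            consumer.seek(partition, 42L); // start reading from a specific offset

            for (ConsumerRecord<String, String> record :
                    consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}
```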

Zookeeper:

Zookeeper is a centralized service for maintaining configuration information, and a distributed application like Kafka depends on it. Kafka has the notion of leader and follower brokers: for each topic partition, one broker is chosen as the leader, and the follower brokers replicate that partition from it. Zookeeper lets Kafka track which broker leads each partition and how topic replication is progressing, and Kafka also uses Zookeeper to elect the controller broker.
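
Clients rarely talk to Zookeeper directly, but we can observe the outcome of the election it enables. This sketch, again assuming a broker at localhost:9092, asks the cluster which broker is currently the controller.

```java
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.Node;

public class FindController {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // The controller broker was elected through Zookeeper.
            Node controller = admin.describeCluster().controller().get();
            System.out.println("Controller is broker " + controller.id()
                    + " at " + controller.host() + ":" + controller.port());
        }
    }
}
```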

This was an overview of the various Kafka entities. In the next blog we shall go through each of them in detail.
