Data streaming with Apache Kafka

Data streaming is becoming more and more important; hardly any new project gets started without at least considering data streaming as the solution or part of the solution.

Apache Kafka is a messaging system that has become extremely popular and is the top choice for data streaming nowadays.

This post goes into what Apache Kafka is and why it is such a good match for data streaming.

The origins

Apache Kafka was created by Jay Kreps, Neha Narkhede and Jun Rao while they were working at LinkedIn. Kafka was designed to be the nervous system of complex systems.

The idea was to develop a system where events could flow between components without the components knowing about each other.

The internals

Kafka is modeled around independent but collaborative units called brokers. The brokers are independent of each other, but they collaborate to ensure high performance and reliability.

In the current version of Kafka there are also some additional nodes running a different piece of software called ZooKeeper. ZooKeeper is a distributed key-value store used to keep the configuration of the cluster. In the next version of Kafka, ZooKeeper will be replaced by ordinary Kafka brokers.

Data in Kafka is stored in what are called topics. A topic is nothing more than a log of immutable events: a sequence of events in which each event cannot be changed once it has been appended to the log.

A topic can be partitioned, which means that the load of the topic can be shared by several brokers.

A topic can also be replicated, which means that each partition of the topic will be maintained by more than one broker.
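As a rough illustration, such a topic could be created with the Java AdminClient; the topic name, partition count, replication factor and broker address below are placeholder assumptions, not anything prescribed by Kafka itself.

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    import java.util.List;
    import java.util.Properties;

    public class CreateTopicExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Assumed broker address; replace with your cluster's bootstrap servers.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // A topic with 3 partitions, each kept on 2 brokers (replication factor 2).
                NewTopic topic = new NewTopic("log-entries", 3, (short) 2);
                admin.createTopics(List.of(topic)).all().get();
            }
        }
    }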

Producing data

To produce data, a producer needs to find out the topology of the cluster and then decide where to send the data.

The topology of the cluster is the distribution of topic-partitions across brokers. Each topic-partition has a leader: the broker that accepts writes from producers and then tells the followers that a new record was added.

The producer uses the key of the record to decide which partition of the topic the record should be written to. The default partitioning algorithm computes a hash of the key and takes it modulo the number of partitions. If a record has no key, round robin over the partitions is used instead.
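A minimal sketch of that idea, not Kafka's actual implementation (the real default partitioner hashes the serialized key with murmur2), could look like this:

    // Simplified illustration of key-based partition selection.
    static int partitionFor(byte[] keyBytes, int numPartitions) {
        int hash = java.util.Arrays.hashCode(keyBytes);
        return (hash & 0x7fffffff) % numPartitions; // same key -> same partition
    }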

The producer can choose between three levels of consistency, controlled by the acks setting (a producer sketch follows the list):

  1. Write a record and do not wait for any acknowledgement (acks=0).
  2. Write a record and wait only for the partition leader to acknowledge it (acks=1).
  3. Write a record and wait until the leader and all in-sync replicas have acknowledged it (acks=all).
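Here is a minimal producer sketch showing both the keyed record and the acks setting; the broker address, topic name and key are assumptions made for the example.

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    import java.util.Properties;

    public class ProducerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            // Consistency level: "0" (no ack), "1" (leader only) or "all" (all in-sync replicas).
            props.put(ProducerConfig.ACKS_CONFIG, "all");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // The key ("user-42") decides the partition; records with the same key
                // always end up in the same partition.
                producer.send(new ProducerRecord<>("log-entries", "user-42", "logged in"));
                producer.flush();
            }
        }
    }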

Consuming data

What makes Kafka special is that data is not consumed the way it would be in a normal queue. A consumer reads data, but reading does not remove the data from the log, unlike dequeuing from a queue; each consumer simply keeps track of its own position (offset) in the log.

Consumers form groups, called consumer groups, and divide the partitions of a topic among themselves. If there are fewer consumers than partitions, some consumers will get more than one partition assigned. If there are more consumers than partitions, some consumers will not be assigned a partition at all.
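A minimal consumer sketch, assuming the same placeholder topic and string values as in the producer sketch; every consumer started with the same group id shares the topic's partitions.

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class ConsumerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            // All consumers with the same group id share the topic's partitions.
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "log-analyzers");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("log-entries"));
                while (true) {
                    // Reading only advances this consumer's offset; the records stay in the log.
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                record.partition(), record.offset(), record.value());
                    }
                }
            }
        }
    }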

Events in Kafka have an expiration time, and they can be read by consumers as long as they have not expired. Once a record expires, it is no longer possible to read it.
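The expiration is a topic-level retention setting (retention.ms). Extending the topic-creation sketch from earlier, it could be set like this (the 7-day value is just an example):

    // Keep records for roughly 7 days; after that they expire and are deleted.
    NewTopic topic = new NewTopic("log-entries", 3, (short) 2)
            .configs(java.util.Map.of("retention.ms",
                    String.valueOf(7L * 24 * 60 * 60 * 1000)));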

A typical system

Typically Kafka is used to decouple producer applications from consumer applications. Producers can focus on producing data as fast as they can, while consumer applications process the information at their own pace.

For example, a system that receives log entries via a REST API can produce events directly to a topic or several topics in Kafka. The analysis of those log entries can be performed by consumer applications without affecting the producers.
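A hypothetical adapter for that example could be as small as the sketch below; the class name, the topic and the choice of keying by source host are assumptions, and the web framework calling it is left out.

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // Sits between a REST endpoint and Kafka: the web layer calls handleLogEntry
    // for each incoming request, and the analysis happens elsewhere, in consumers.
    public class LogIngestService {
        private final KafkaProducer<String, String> producer;

        public LogIngestService(KafkaProducer<String, String> producer) {
            this.producer = producer;
        }

        public void handleLogEntry(String sourceHost, String logLine) {
            // Keying by source host keeps each host's entries ordered within one partition.
            producer.send(new ProducerRecord<>("log-entries", sourceHost, logLine));
        }
    }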

Cool, so what’s the catch?

As with any technology, there are things that can be problematic or at least not optimal depending on the use case.

The first problem is finding a given record. Kafka is designed for reading records sequentially, not for looking up a record by key. The usual solution to this problem is Kafka Streams, a processing library that is also part of the Kafka project and offers higher-level abstractions, such as tables, that provide a key-value interface.
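A minimal Kafka Streams sketch of that approach, assuming a topic with string keys and values and reusing the placeholder names from the earlier examples:

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Materialized;

    import java.util.Properties;

    public class TableExample {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();
            // Materialize the topic as a key-value table: the latest value per key,
            // queryable through the "log-entries-store" state store.
            KTable<String, String> table =
                    builder.table("log-entries", Materialized.as("log-entries-store"));

            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "log-entries-table-app");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
        }
    }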

Another typical problem is how to consume and produce data between several Kafka clusters, or even between Kafka and other systems. There are projects, such as MirrorMaker, that let you replicate all or only a few selected topics from one cluster to another. The main issue with this kind of setup is that Kafka is meant to run in a controlled environment, not out on the public internet. It is possible to do, it just requires more planning.

Long-term storage is complicated but not impossible. Kafka is designed for relatively short-lived records, not for records that have to last weeks, months or even years. The usual solution is to offload the records to another storage solution, such as cloud storage, and consume them from there again if needed.

What’s next?

In my next post I will describe a typical Kafka system including producers, consumers and processors.
