Kafka’s popularity and growth are at an all-time high. It has become so popular that it has started to overshadow its namesake, the novelist Franz Kafka. Its popularity is evident from the fact that many Fortune 500 companies use Kafka.
These companies include the top seven banks, nine of the top ten telecom companies, the top ten travel companies, and eight of the top ten insurance companies. Netflix, LinkedIn, and Microsoft are a few of the names that process “four-comma” message volumes (1,000,000,000,000) per day with Kafka.
So what makes Kafka so popular? If you’re asking this question, you’re not alone, and you’ve come to the right place. Here we discuss Kafka’s origin story, how it works, its key differentiators, its use cases, and more.
What is Apache Kafka?
Apache Kafka is an open-source streaming platform maintained by the Apache Software Foundation. It was originally developed as a messaging queue at LinkedIn; over the years, however, Kafka has grown into much more than a messaging queue. It is now a robust tool for working with data streams and has many diverse use cases.
One of the major advantages of Kafka is that it can be scaled up whenever needed. To scale up, all you need to do is add some new nodes (servers) to the Kafka cluster.
Kafka is also known for managing a high amount of data per unit time. It also allows the processing of data in real-time mode due to its low latency. Kafka is written in Java and Scala. However, it’s compatible with other programming languages as well.
Kafka can also connect to external systems for import and export through Kafka Connect. It also provides Kafka Streams, a Java stream processing library. Kafka uses a binary TCP-based protocol that relies on a “message set” abstraction, which groups messages together to cut the overhead of network round trips.
This results in larger network packets, larger sequential disk operations, and contiguous memory blocks, which enable Kafka to turn a bursty stream of random message writes into linear writes.
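The batching idea behind the “message set” can be illustrated with a small sketch. This is not Kafka’s wire protocol; the function name `batch_messages` and the `max_batch_bytes` limit are assumptions made purely for illustration.

```python
from typing import Iterable, Iterator, List


def batch_messages(messages: Iterable[bytes], max_batch_bytes: int) -> Iterator[List[bytes]]:
    """Group messages into batches whose total payload stays within max_batch_bytes.

    Hypothetical helper: it only illustrates why batching reduces round trips.
    """
    batch: List[bytes] = []
    size = 0
    for msg in messages:
        # Flush the current batch if adding this message would exceed the limit.
        if batch and size + len(msg) > max_batch_bytes:
            yield batch
            batch, size = [], 0
        batch.append(msg)
        size += len(msg)
    if batch:
        # Emit whatever is left over as the final batch.
        yield batch


messages = [b"x" * 100 for _ in range(10)]
batches = list(batch_messages(messages, max_batch_bytes=400))
print(len(messages), "messages sent in", len(batches), "round trips")
# prints: 10 messages sent in 3 round trips
```

Sending one message per request would have needed ten round trips here; grouping them needs three.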
Several factors make Kafka different from traditional counterparts like RabbitMQ. First, Kafka retains a message for a configurable period of time (the default is 7 days) even after it has been consumed, whereas RabbitMQ removes a message as soon as it receives the consumer’s confirmation.
RabbitMQ also pushes messages to consumers while keeping track of their load. It determines how many messages each consumer should have under processing at a time.
Kafka, on the other hand, lets consumers fetch messages themselves. This is also known as pulling. Kafka is also designed to scale horizontally by adding nodes. This is quite different from traditional messaging queues, which expect to scale vertically by adding more power to a single machine.
This is one of the major factors which differentiate Kafka from other traditional messaging systems.
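The combination of pulling and retention can be sketched in a few lines. The `RetainedLog` class below is a hypothetical in-memory stand-in, not a Kafka client; it only shows that a consumer asks for records at its own pace and that reading does not remove anything.

```python
from typing import List


class RetainedLog:
    """Toy append-only log (an illustrative assumption, not Kafka's implementation)."""

    def __init__(self) -> None:
        self.records: List[str] = []

    def append(self, value: str) -> None:
        self.records.append(value)

    def fetch(self, offset: int, max_records: int) -> List[str]:
        # The consumer decides where to read from and how much to take (pull model).
        return self.records[offset:offset + max_records]


log = RetainedLog()
for i in range(5):
    log.append(f"event-{i}")

# A slow consumer pulls two records at a time and tracks its own position.
position = 0
first = log.fetch(position, max_records=2)
position += len(first)

# Reading did not delete anything, so another consumer can still read from the start.
replay = log.fetch(0, max_records=5)
```

In a push-based queue the broker would decide when each consumer receives data; here the consumer can never be handed more than it asked for.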
The origin story at LinkedIn
Kafka was built around 2010 at LinkedIn by Jun Rao, Jay Kreps, and Neha Narkhede. The main problem Kafka was intended to solve was the low-latency ingestion of large amounts of event data from the LinkedIn website into a lambda architecture that combined real-time event processing systems with Hadoop.
“Real-time” processing was the key at that time since there was no solution for this kind of ingress of real-time applications.
There were good solutions that were used for ingesting data into offline batch systems. However, they used to leak implementation details to the downstream users. Furthermore, they also used a push model that was enough to overwhelm any consumer. Most importantly, they were not designed for the real-time use case.
Traditional messaging queues offer strong delivery guarantees and support features like protocol mediation, transactions, and message consumption tracking. However, they were overkill for the use case LinkedIn was working on.
At the time, everyone, including LinkedIn, was looking to build learning algorithms. But algorithms are nothing without data. Getting data out of the source systems and moving it around reliably was a tough ask, and the existing enterprise messaging and batch-based solutions didn’t resolve the issue.
Kafka was built to become the ingestion backbone for this kind of use case. In 2011, Kafka was ingesting over 1 billion events per day. Currently, the ingestion rates reported by LinkedIn are around 1 trillion messages per day.
Terminologies associated with Kafka
To understand the working of Kafka, you must know how streaming applications work. And for that you need to understand various concepts and terminologies such as:
An event is the first concept to understand when learning how streaming applications work. An event is simply an atomic piece of data. For instance, when a user registers in the system, that action creates an event. An event can also be a message with data.
The registration event is a message that includes information such as the user’s name, email, and location. Kafka is a platform that works on streams of such events.
Producers continuously write events to Kafka. This is exactly the reason why they are called producers. Producers are of several types such as entire applications, components of an application, web servers, monitoring agents, IoT devices, etc.
A weather sensor can create weather events every hour which will consist of information regarding humidity, temperature, wind speed, and many more. Similarly, the component of a website which is responsible for user registrations can create an event “new user registered”. In simple words, a producer is anything that creates data.
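To make the weather sensor example concrete, here is one way such an event could be represented and turned into bytes before being handed to a producer. The `WeatherEvent` fields, the sensor ID, and the choice of JSON are all assumptions for illustration; Kafka itself does not prescribe a message format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class WeatherEvent:
    """Hypothetical event shape for an hourly weather reading."""
    sensor_id: str
    temperature_c: float
    humidity_pct: float
    wind_speed_kmh: float


def serialize(event: WeatherEvent) -> bytes:
    # Encode the event as UTF-8 JSON; a producer would send these bytes as the message value.
    return json.dumps(asdict(event), sort_keys=True).encode("utf-8")


event = WeatherEvent(sensor_id="sensor-42", temperature_c=21.5,
                     humidity_pct=60.0, wind_speed_kmh=12.0)
payload = serialize(event)
```

Any consumer that knows the format can decode `payload` back into the same fields.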
Consumers are those entities that use data. In simple words, they receive and use the data that are written by producers. It’s also important to note that the entities like whole applications, components of applications, monitoring systems, etc. can act as producers as well as consumers.
Whether an entity is a producer or a consumer depends on the architecture of the system. Generally, however, entities such as data analytics applications, databases, and data lakes act as consumers, as they often need to store the created data somewhere.
Kafka acts as a middleman between producers and consumers. The Kafka system is also referred to as a Kafka cluster since it consists of multiple elements. These elements are known as nodes.
The software components which run on a node are called brokers. Due to brokers, Kafka is also categorized as a distributed system.
The data inside the Kafka cluster is distributed among several brokers. The cluster also keeps several copies of the same data. These copies are called replicas.
The presence of replicas makes Kafka more reliable, stable, and fault-tolerant. Even if something bad happens to a broker, the information is not lost, as it remains safe in the other replicas. Another broker then takes over the functions of the defective broker.
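The takeover described above can be sketched as a simple failover rule. This is a deliberate simplification with hypothetical broker names: real Kafka chooses a new leader through its own coordination logic, whereas this sketch just picks the first replica that is still alive.

```python
from typing import List, Optional, Set


def elect_leader(replicas: List[str], alive: Set[str]) -> Optional[str]:
    """Return the first replica, in assignment order, whose broker is still alive."""
    for broker in replicas:
        if broker in alive:
            return broker
    return None  # No replica is available, so the data cannot be served.


replicas = ["broker-1", "broker-2", "broker-3"]   # brokers holding a copy of the data
alive = {"broker-1", "broker-2", "broker-3"}

leader = elect_leader(replicas, alive)

# broker-1 fails; another broker holding a replica takes over.
alive.discard("broker-1")
new_leader = elect_leader(replicas, alive)
```

As long as at least one replica survives, clients still have a broker to talk to.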
Producers are responsible for publishing events to Kafka topics. Consumers get access to the data by subscribing to those topics. A Kafka topic is an immutable log of events, and each topic can serve its data to numerous consumers. This is why producers are also known as publishers and consumers are called subscribers.
Every Kafka topic is divided into partitions, and each partition can be placed on a different node. The main objective of partitions is to spread a topic’s data, and the work of reading and writing it, across brokers. Each partition can in turn be replicated to other brokers for fault tolerance.
A unit or record within Kafka is called a message. Every message has a value, key, and optionally headers.
Every message within a partition is assigned an offset. An offset is an integer that increases monotonically, and it serves as a unique identifier for the message within that partition.
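Partitions, messages, and offsets fit together as in the sketch below. The `Topic` and `Message` classes are hypothetical, and `zlib.crc32` is used only as a stand-in hash; Kafka’s own partitioner uses a different hash function.

```python
import zlib
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Message:
    key: bytes
    value: bytes
    headers: Dict[str, bytes] = field(default_factory=dict)  # optional metadata


class Topic:
    """Toy topic: a list of partitions, each an append-only list of messages."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions: List[List[Message]] = [[] for _ in range(num_partitions)]

    def append(self, msg: Message) -> Tuple[int, int]:
        # Messages with the same key always land in the same partition.
        # crc32 is an illustrative stand-in for the real partitioner hash.
        partition = zlib.crc32(msg.key) % len(self.partitions)
        # The offset is the message's position within its partition.
        offset = len(self.partitions[partition])
        self.partitions[partition].append(msg)
        return partition, offset


topic = Topic(num_partitions=3)
p1, o1 = topic.append(Message(key=b"user-1", value=b"registered"))
p2, o2 = topic.append(Message(key=b"user-1", value=b"logged-in"))
```

Both messages for `user-1` end up in the same partition, with offsets 0 and 1, so their relative order is preserved for whoever reads that partition.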
A consumer is said to lag when it reads from a partition at a slower rate than messages are being produced. Lag is expressed as the number of offsets by which the consumer is behind the head of the partition. The time needed to recover from the lag depends on how many messages per second the consumer can consume relative to the rate at which new messages arrive.
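The arithmetic behind lag and recovery time is simple enough to write down. The function names and the numbers below are made up for illustration.

```python
from typing import Optional


def consumer_lag(log_end_offset: int, consumer_offset: int) -> int:
    """Number of offsets the consumer is behind the head of the partition."""
    return max(log_end_offset - consumer_offset, 0)


def catch_up_seconds(lag: int, consume_rate: float, produce_rate: float) -> Optional[float]:
    """Seconds needed to drain the lag, or None if the consumer never catches up.

    Rates are in messages per second. The backlog only shrinks by the
    difference between what is consumed and what keeps arriving.
    """
    net_rate = consume_rate - produce_rate
    if net_rate <= 0:
        return None
    return lag / net_rate


lag = consumer_lag(log_end_offset=10_000, consumer_offset=4_000)        # 6000 offsets behind
seconds = catch_up_seconds(lag, consume_rate=500.0, produce_rate=200.0)  # 6000 / 300 = 20.0
```

If the consumer is no faster than the producers, the lag never shrinks, which is why the function returns `None` in that case.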
How does it work?
Now that we’ve had a look at the terminology related to Kafka, let’s see how it actually works. Kafka receives information from a large number of data sources and organizes it into “topics”. A data source can be something as simple as the transactional log in which a grocery chain records the sales of each store.
The topics could be “number of oranges sold” or “no. of sales between 10 AM to 1 PM”. These topics can be analysed by anyone who needs insight into the data.
You might think that this sounds very similar to how a conventional database works. However, compared to a conventional database, Kafka is better suited to something as big as a national chain of grocery stores that processes thousands of apple sales each minute.
Kafka achieves this feat with the help of a Producer which acts as an interface between applications and the topics. Kafka’s own database of segmented and ordered data is called Kafka Topic Log.
This data stream is generally used to feed real-time processing pipelines like Storm or Spark. Moreover, it’s also used to fill data lakes like Hadoop’s distributed databases.
Like the Producer, the Consumer is another interface, one that allows topic logs to be read. It also enables the information stored in them to be passed on to other applications that might require it.
Once you put all these components together with the other common elements of a Big Data analytics framework, Kafka begins to form the central nervous system. Through this system, data passes between input and capture applications, data processing engines, and storage lakes.
Why use Kafka?
There is a plethora of choices when it comes to publish/subscribe messaging systems. This raises the question of what makes Kafka a standout choice for developers. Let’s find out.
Kafka comes with the capability to manage multiple producers seamlessly. It can handle multiple producers whether those clients are using the same topics or multiple different topics. This makes the system consistent and ideal for aggregating data from multiple frontend systems.
For instance, a site that serves content to users through a number of microservices can have a single topic for page views, which every service writes to using a common format. As a result, consumer applications receive a single stream of page views for all applications on the site, without needing to coordinate consumption from multiple topics.
Apart from multiple producers, Kafka is also designed for multiple consumers to read a single stream of messages without interfering with each other. This is in contrast to many queuing systems, where a message, once consumed by one client, becomes unavailable to the rest.
Multiple Kafka consumers can also choose to operate as a group and share a stream, which ensures that the group as a whole processes a given message only once.
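One way to picture how a group shares a stream is to divide the topic’s partitions among the group’s members, so that each partition is read by exactly one of them. The round-robin rule and the consumer names below are assumptions for illustration; Kafka’s own assignment strategies differ in detail.

```python
from typing import Dict, List


def assign_partitions(partitions: List[int], members: List[str]) -> Dict[str, List[int]]:
    """Round-robin sketch: hand each partition to exactly one group member."""
    ordered = sorted(members)
    assignment: Dict[str, List[int]] = {member: [] for member in ordered}
    for i, partition in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment


assignment = assign_partitions([0, 1, 2, 3], ["consumer-a", "consumer-b"])
print(assignment)
# prints: {'consumer-a': [0, 2], 'consumer-b': [1, 3]}
```

Because no partition appears under two members, each message is processed once within the group, while a second group with its own assignment can still read the full stream independently.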
Managing multiple consumers is not the only thing in Kafka’s arsenal. With durable message retention, Kafka frees its consumers from having to work in real time. Messages are first committed to disk and then stored according to configurable retention rules. This allows different streams of messages to have different amounts of retention depending on the needs of the consumer.
Here, it’s important to understand what durable retention means. It means that if a consumer falls behind, due to a burst in traffic or slow processing, there is no danger of losing data. It also means that maintenance can be performed on consumers, taking applications offline for a short period.
During this period, there is no concern about messages being lost or backing up on the producer. The messages are retained in Kafka while the consumers are stopped. This enables the consumers to restart processing from where they left off, without any data loss.
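Restarting from where a consumer left off comes down to remembering an offset. The sketch below uses a plain list and dictionary as stand-ins for a partition and its committed offsets; the group name `billing-group` and the helper functions are hypothetical.

```python
from typing import Dict, List

log: List[str] = [f"msg-{i}" for i in range(5)]   # retained messages in one partition
committed: Dict[str, int] = {"billing-group": 0}  # next offset each group will read


def poll(group: str, max_records: int) -> List[str]:
    # Read from the group's last committed position; nothing is removed from the log.
    start = committed[group]
    return log[start:start + max_records]


def commit(group: str, count: int) -> None:
    # Record that `count` more messages have been fully processed.
    committed[group] += count


batch = poll("billing-group", max_records=3)
commit("billing-group", len(batch))

# The consumer goes offline for maintenance while producers keep writing.
log.extend(["msg-5", "msg-6"])

# After the restart, the consumer resumes exactly where it stopped.
resumed = poll("billing-group", max_records=10)
```

Nothing produced during the downtime is lost, and nothing processed before it is read again.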
These features make Kafka an excellent publish/subscribe messaging system that performs well under high load. Its components, such as producers, brokers, and consumers, can each be scaled out to manage a high volume of message streams.
Kafka’s scalability makes it easy to handle any amount of data. Users can begin with a single broker, expand to a small development cluster of three or four brokers, and then move on to a larger cluster of tens or hundreds of brokers that grows over time as the data scales up.
Expansions can be performed while the cluster is online, with no impact on the availability of the system. This also means that a cluster of multiple brokers can handle the failure of an individual broker and continue to service clients.
In this article, we have provided a complete guide to Kafka. First, we discussed what Apache Kafka is. Then we discussed its origin story and journey at LinkedIn. We also had a look at the terminology associated with Kafka. Then we discussed how it works and the top reasons to use it. We also saw Kafka’s best use cases.
If you’re an entrepreneur looking to leverage Kafka and its use cases to grow your business, you can consult the experts at Peerbits, who will guide you towards success with their technical expertise and experience.