
Apache Kafka Cheat Sheet (DRAFT) by

Kafka is a distributed publish-subscribe messaging system for collecting and delivering high volumes of data with low latency.

This is a draft cheat sheet. It is a work in progress and is not finished yet.


What is Kafka Used For?
1. Building real-time streaming pipelines that move data between different applications.
2. Building real-time streaming applications that are capable of processing streams of data.
3. Building a fault-tolerant storage system that stores streams of records.
A Kafka topic is a category or feed name under which messages are stored. A Kafka producer publishes messages to a topic, and zero or more consumers may subscribe to that topic.
A topic partition is an ordered commit log to which records are continually appended. Every topic has at least one partition. Each record in a partition is assigned a sequential id called the offset, which uniquely identifies it within that partition. Partitions let a topic scale beyond a single server and act as the unit of parallelism.
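The topic, partition, and offset ideas above can be sketched with a small in-memory model. This is an illustration only, not Kafka's implementation or client API: the `Topic` and `Partition` classes, the `publish` method, and the topic name are all hypothetical, and `zlib.crc32` stands in for whatever key-based partitioner a real producer would use.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Toy append-only log; a record's list index is its offset."""
    records: list = field(default_factory=list)

    def append(self, value):
        self.records.append(value)
        return len(self.records) - 1  # offset assigned to the new record

    def read_from(self, offset):
        # A consumer reads sequentially starting at a chosen offset.
        return self.records[offset:]


class Topic:
    """Toy topic: a named group of one or more partitions."""

    def __init__(self, name, num_partitions=1):
        if num_partitions < 1:
            raise ValueError("a topic needs at least one partition")
        self.name = name
        self.partitions = [Partition() for _ in range(num_partitions)]

    def publish(self, key, value):
        # Assumption: hash the key so the same key always lands in the
        # same partition (crc32 is a stand-in, not Kafka's hash).
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        offset = self.partitions[index].append(value)
        return index, offset


topic = Topic("page-views", num_partitions=3)
p1, o1 = topic.publish("user-42", "viewed /home")
p2, o2 = topic.publish("user-42", "viewed /cart")
print(p1 == p2, o1, o2)  # True 0 1
```

Offsets are only meaningful inside one partition: two records in different partitions can both have offset 0, which is why ordering is guaranteed per partition rather than per topic.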

Benefits of Kafka

Kafka's distributed design, topic partitioning, and data replication across servers make it reliable.
A Kafka system runs as a cluster of brokers. The number of brokers can grow over time as data volume increases. The cluster handles the failure of any individual broker and continues to provide uninterrupted service.
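As a rough illustration of how a replicated partition can survive a broker failure, the sketch below promotes a surviving replica when the current leader goes down. It is a deliberately simplified model: the `PartitionReplicas` class and the broker names are hypothetical, and it ignores details of real leader election such as tracking which replicas are in sync.

```python
class PartitionReplicas:
    """Toy model: one partition replicated on several brokers."""

    def __init__(self, broker_ids):
        self.live = list(broker_ids)  # brokers holding a live replica
        self.leader = self.live[0]    # reads and writes go through the leader

    def fail(self, broker_id):
        self.live.remove(broker_id)
        if not self.live:
            raise RuntimeError("no replica left; partition is offline")
        if broker_id == self.leader:
            # Simplification: promote the first surviving replica.
            self.leader = self.live[0]


replicas = PartitionReplicas(["broker-1", "broker-2", "broker-3"])
replicas.fail("broker-1")
print(replicas.leader)  # broker-2
```

The partition stays available as long as at least one replica survives, which is the property the replication described above is meant to provide.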
Disk-based data retention makes Kafka durable. Messages remain on disk according to the retention rule configured for each topic. Even if a consumer falls behind for any reason, the data stays on the broker until the retention period expires and is not lost.
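A minimal sketch of time-based retention, assuming a single retention window in milliseconds (Kafka exposes a topic-level setting named `retention.ms` for this). The `RetainedLog` class and its methods are hypothetical, and the model expires records one at a time, whereas a real broker manages retention in larger units on disk.

```python
class RetainedLog:
    """Toy log that drops records older than retention_ms."""

    def __init__(self, retention_ms):
        self.retention_ms = retention_ms
        self.records = []     # (timestamp_ms, value) in append order
        self.base_offset = 0  # offset of the oldest record still retained

    def append(self, timestamp_ms, value):
        self.records.append((timestamp_ms, value))

    def expire(self, now_ms):
        # Drop records from the head of the log once they exceed the window.
        while self.records and now_ms - self.records[0][0] > self.retention_ms:
            self.records.pop(0)
            self.base_offset += 1

    def read_from(self, offset):
        # A consumer asking for an expired offset resumes at the oldest
        # record that is still retained.
        start = max(offset, self.base_offset) - self.base_offset
        return [value for _, value in self.records[start:]]


log = RetainedLog(retention_ms=1000)
log.append(0, "a")
log.append(600, "b")
log.append(1200, "c")
log.expire(now_ms=1300)
print(log.read_from(0))  # ['b', 'c']
```

A slow consumer loses nothing as long as it catches up within the retention window; only record "a", which is older than the window at the time of expiry, is gone.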
Together, these features make Kafka a high-performance messaging system.