Kafka is a distributed publish-subscribe messaging system used for collecting and delivering high volumes of data with low latency.
What is Kafka Used For?
1. Building real-time streaming pipelines that move data between different applications.
2. Building real-time streaming applications that process streams of data.
3. Building a fault-tolerant storage system that stores streams of records.
A Kafka topic is a category or feed name under which messages are stored. A Kafka producer publishes messages to a topic, and zero or more consumers may subscribe to that topic.
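The publish-subscribe relationship described above can be sketched with a toy in-memory model. This is not the Kafka client API: the `Broker` and `Consumer` classes, the `orders` topic, and the message values are all hypothetical names chosen for illustration. The sketch only shows that a topic stores messages whether or not anyone is subscribed, and that each consumer reads independently of the others.

```python
from collections import defaultdict


class Broker:
    """Toy stand-in for a Kafka cluster: stores messages per topic name."""

    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, message):
        self.topics[topic].append(message)


class Consumer:
    """Reads a topic from its own position, independent of other consumers."""

    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.position = 0  # index of the next message this consumer will read

    def poll(self):
        messages = self.broker.topics[self.topic][self.position:]
        self.position += len(messages)
        return messages


broker = Broker()
broker.publish("orders", "order-1")  # zero consumers so far: still stored

billing = Consumer(broker, "orders")
shipping = Consumer(broker, "orders")
broker.publish("orders", "order-2")

billing_batch = billing.poll()    # billing sees both messages
shipping_batch = shipping.poll()  # shipping sees the same messages again
```

Because each consumer keeps its own position, reading a message does not remove it from the topic for other subscribers.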
A topic partition is a structured commit log to which records are continually appended. Kafka keeps at least one partition for each topic. Each record in a partition is assigned a sequential id called the offset, which uniquely identifies the record within that partition. Partitions let a topic scale beyond a single server and act as the unit of parallelism.
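A minimal sketch of partitions and offsets, again as a toy model rather than Kafka's implementation. The `Partition` and `Topic` classes are hypothetical, and routing by a CRC32 hash of the key is an assumption made here for illustration only. Real Kafka clients use their own partitioning strategy.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Toy append-only log; a record's offset is its position in the log."""
    records: list = field(default_factory=list)

    def append(self, value):
        self.records.append(value)
        return len(self.records) - 1  # offset assigned to the new record


@dataclass
class Topic:
    name: str
    num_partitions: int = 1
    partitions: list = field(init=False)

    def __post_init__(self):
        if self.num_partitions < 1:
            raise ValueError("a topic needs at least one partition")
        self.partitions = [Partition() for _ in range(self.num_partitions)]

    def publish(self, key, value):
        # Assumed routing rule: hash the key to choose a partition.
        index = zlib.crc32(key.encode("utf-8")) % self.num_partitions
        offset = self.partitions[index].append(value)
        return index, offset


topic = Topic("page-views", num_partitions=3)
first = topic.publish("user-1", "viewed /home")
second = topic.publish("user-1", "viewed /cart")
# Same key, same partition, and the offset grows by one per appended record.
```

Offsets are only meaningful inside one partition: two records in different partitions can both have offset 0.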
Benefits of Kafka
Kafka's distributed design, topic partitioning, and data replication across servers make it reliable.
A Kafka system runs as a cluster of brokers. The number of brokers can grow over time as data volume increases. The cluster handles the failure of an individual broker, so service continues uninterrupted.
Disk-based data retention makes Kafka durable. Messages remain on disk according to the retention rule configured on a per-topic basis. Even if a consumer falls behind for any reason, the data continues to reside on the broker until the retention period expires and is not lost.
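The retention behavior above can be illustrated with a toy time-based log. This is a sketch under simplifying assumptions: the `RetainedLog` class is hypothetical, timestamps are passed in by hand so the example is deterministic, and records are expired one at a time, whereas real Kafka applies retention to log segments on disk.

```python
from collections import deque


class RetainedLog:
    """Toy log that deletes records older than retention_seconds."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.records = deque()  # entries are (offset, timestamp, value)
        self.next_offset = 0

    def append(self, value, now):
        self.records.append((self.next_offset, now, value))
        self.next_offset += 1

    def enforce_retention(self, now):
        # Drop records from the head of the log once they exceed the window.
        while self.records and now - self.records[0][1] > self.retention_seconds:
            self.records.popleft()

    def read_from(self, offset):
        return [(o, v) for o, _, v in self.records if o >= offset]


log = RetainedLog(retention_seconds=60)
log.append("a", now=0)
log.append("b", now=30)
log.append("c", now=50)

log.enforce_retention(now=70)    # "a" is 70 seconds old and is removed
lagging_read = log.read_from(0)  # a slow consumer still finds "b" and "c"
```

A consumer that falls behind loses nothing as long as it catches up within the retention window. Records are removed by age, not because they have been read.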
Together, these features make Kafka a high-performance messaging system.