Apache Kafka is an open-source distributed streaming platform, known for its ability to handle very high message rates with low latency. While Kafka is wildly popular, we recommend caution due to its operational complexity, and consider other tools to be better default choices.
As a streaming platform, Kafka has two key features:
- It lets components publish and subscribe to streams of messages, similar to a message queue or enterprise messaging system.
- It stores streams of messages in a fault-tolerant, durable way.
Typical use cases include messaging, log transport, stream processing, user activity tracking, event sourcing, and serving as a commit log.
Kafka is 'distributed' in that it is designed to run as a cluster of broker nodes with no single master: each topic partition has its own leader broker, and partition leadership is spread across the cluster.
Kafka is generally used for two broad classes of applications:
- Building real-time streaming data pipelines and applications that reliably process events. Consumers may output transformed events to other streams, persist data to a durable data store, or send some form of notification or message to an external system.
- Acting as a buffer between real-time event sources and processes which cannot process messages in real time, either because message rate is spiky and at times exceeds the processor's throughput, or because processing occurs in batches.
Kafka is often used in conjunction with event sourcing systems that are designed to be able to rebuild their state from an immutable event log. Consumers are able to request replay of events from specific offsets in topics, and other consumers can continue to process from the tail of the topic without interruption.
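The replay idea can be sketched in a few lines. This is a toy simulation, assuming a topic is just an append-only list and an offset is an index into it; the function names are illustrative and not any real Kafka client API, where a consumer would instead seek to an offset and poll.

```python
# Toy model: a topic partition as an append-only log, an offset as an index.
def replay(log, from_offset=0):
    """Yield (offset, event) pairs from a given offset onwards."""
    for offset in range(from_offset, len(log)):
        yield offset, log[offset]

def rebuild_balance(log, from_offset=0, starting_balance=0):
    """Fold deposit/withdraw events into an account balance."""
    balance = starting_balance
    for _, event in replay(log, from_offset):
        if event["type"] == "deposit":
            balance += event["amount"]
        elif event["type"] == "withdraw":
            balance -= event["amount"]
    return balance

events = [
    {"type": "deposit", "amount": 100},
    {"type": "withdraw", "amount": 30},
    {"type": "deposit", "amount": 50},
]

# Full rebuild from offset 0, or resume from a snapshot plus a later offset:
full = rebuild_balance(events)
resumed = rebuild_balance(events, from_offset=2, starting_balance=70)
```

Because the log is immutable, a full replay and a snapshot-plus-partial-replay arrive at the same state, which is what makes offset-based recovery safe.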
While Kafka's design is conceptually elegant and it theoretically lends itself to a wide variety of use cases, its operational complexity means that for many use cases other tools are a better choice. Where available, we consider fully managed solutions such as Azure Event Hubs and AWS SQS/SNS/Kinesis to be the default choices. If a self-hosted system is preferred then we find RabbitMQ is often a better fit due to its operational simplicity and rich feature set, with the case for Kafka only being compelling if its distinct capabilities are truly needed.
Kafka is designed for longer message retention/playback periods than other message bus tools, so the deployment patterns can cope with a broader range of replay/retry scenarios. The original invention at LinkedIn had an 'infinite' retention period, and was used as a long-term message backup and replay tool. This allowed for the replay and re-processing of messages using slightly different logic, even months after the original event.
Kafka is optimised for low latency writes and high read performance, and scales horizontally by partitioning messages within each topic based on a user-definable key. The concept of Consumer Groups allows messages to be distributed between multiple instances of a particular consumer to allow processing to be horizontally scaled.
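The two scaling mechanisms can be sketched as follows. This is a simplified illustration, not Kafka's actual implementation: the real default partitioner uses a murmur2 hash of the key (crc32 stands in for it here), and real consumer-group assignment is negotiated by the group coordinator rather than a simple round-robin.

```python
import zlib

NUM_PARTITIONS = 6

def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    # Kafka hashes the message key and takes it modulo the partition count,
    # so all messages with the same key land on the same partition and
    # retain their relative order.
    return zlib.crc32(key) % num_partitions

def assign_partitions(partitions, consumers):
    """Share partitions round-robin across consumer-group members."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# Same key always maps to the same partition:
assert partition_for(b"customer-42") == partition_for(b"customer-42")

# Six partitions shared between three consumers in one group:
groups = assign_partitions(list(range(NUM_PARTITIONS)), ["c1", "c2", "c3"])
```

Note that a partition is only ever read by one consumer in a group at a time, which is why the partition count caps the parallelism of a consumer group.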
Each message can be stored by multiple nodes within a cluster for greater fault tolerance. The number of nodes is configurable, allowing you to balance performance, storage requirements, and resilience. Writes are only considered 'committed' when the configured number of copies have been durably written.
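The durability trade-off is controlled by a handful of real Kafka settings; the values below are illustrative, and the comments indicate which side of the system each setting belongs to.

```properties
# Topic-level: replicate each partition to three brokers.
replication.factor=3
# Broker/topic-level: a write only commits once at least two replicas
# (including the leader) have it; fewer in-sync replicas rejects writes.
min.insync.replicas=2
# Producer-level: wait for acknowledgement from all in-sync replicas
# before treating a send as successful.
acks=all
```

With this combination a committed message survives the loss of any single broker, at the cost of higher write latency than `acks=1`.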
Streams and queues
Key differences between stream-based systems like Kafka and Kinesis and queue-based systems like RabbitMQ and AWS SQS are where the responsibility lies for tracking message delivery, and how message retention is handled. With stream systems, each message consumer is 'smart', meaning that consumer is responsible for maintaining a record of which message to read next (known as the stream 'offset'). Consumers of queue-based systems are 'dumb', and rely on the queue system to maintain that information.
Once a consumer of a queue-based system has acknowledged a message to indicate that it has been processed and to trigger the next message to be delivered, the queue system deletes that message. If the consumer stops consuming messages then the queue system retains all unprocessed messages (subject to some cap), so the storage demands can fluctuate significantly. Once a message is acknowledged and deleted, it is not available to be replayed to recover from failure of a consuming system, or to allow messages to be reprocessed. The approach taken by stream systems is somewhat different — all messages are retained for the configured period (measured either in time or data volume). These messages are available to be read as many times as needed, by any consumer with access, but once the messages drop off the end of the buffer they are gone.
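The two retention models can be contrasted with a toy simulation. The classes below are illustrative only and mirror no real client library; they exist just to show where the delivery state lives in each model.

```python
class Queue:
    """Queue-based model: the broker tracks delivery, and an
    acknowledgement permanently deletes the message."""
    def __init__(self):
        self._messages = []
    def publish(self, msg):
        self._messages.append(msg)
    def receive(self):
        return self._messages[0] if self._messages else None
    def ack(self):
        self._messages.pop(0)  # acknowledged messages are gone for good

class Stream:
    """Stream-based model: the broker retains everything up to the
    retention limit; each consumer tracks its own offset."""
    def __init__(self, retention=1000):
        self._log = []
        self._retention = retention
    def publish(self, msg):
        self._log.append(msg)
        self._log = self._log[-self._retention:]  # old messages age out
    def read(self, offset):
        return self._log[offset:]  # any consumer can re-read from any offset

q = Queue()
q.publish("a"); q.publish("b")
q.receive(); q.ack()        # "a" is deleted; it cannot be replayed

s = Stream()
s.publish("a"); s.publish("b")
first_pass = s.read(0)      # ["a", "b"]
second_pass = s.read(0)     # still ["a", "b"]; replay costs nothing
```

The practical consequence is that adding a second, independent consumer to a stream is free, whereas with a queue you need a separate queue (or fan-out mechanism) per consumer.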
AMQP-based tools such as RabbitMQ also have advanced message routing capabilities, which make them more suitable for some applications. Exchanges allow messages to be routed based on routing key, topic name, or message headers, and they have built-in support for message retry and dead letter handling. Kafka, on the other hand, has no built-in routing, retry, or dead-letter support: every subscribed consumer group receives every message on a topic, and any filtering happens in the consumer.
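To make the routing difference concrete, here is a rough sketch of AMQP topic-exchange matching, where `*` matches exactly one dot-separated word and `#` matches zero or more. This follows RabbitMQ's documented wildcard semantics but is an approximation written for illustration, not code from any RabbitMQ library.

```python
import re

def binding_matches(binding: str, routing_key: str) -> bool:
    """Approximate AMQP topic-exchange matching:
    '*' matches one word, '#' matches any run of words."""
    parts = []
    for word in binding.split("."):
        if word == "#":
            parts.append(".*")      # any sequence of words
        elif word == "*":
            parts.append("[^.]+")   # exactly one word
        else:
            parts.append(re.escape(word))
    pattern = r"\.".join(parts)
    return re.fullmatch(pattern, routing_key) is not None

# A binding of "orders.*" receives single-word suffixes only,
# while "orders.#" receives the whole subtree.
```

A Kafka consumer gets no equivalent: it subscribes to whole topics, and this kind of per-message selection has to be implemented in application code.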
Kafka is generally considered to be complicated to deploy, configure, and operate. The list of configuration options is daunting, and the need to run a ZooKeeper cluster alongside the Kafka cluster complicates matters, especially as ZooKeeper has a tendency to be somewhat fragile and high maintenance. (Newer Kafka releases can replace ZooKeeper with the built-in KRaft consensus mode, but many existing deployments still depend on it.)
An additional complexity intrinsic to Kafka's design is the importance of choosing a good partitioning strategy. A poor choice for number of partitions and partition keys can significantly impact the performance and efficiency of your cluster, and these parameters are not easy to change once set.
Alongside several companies which specialise in managed Kafka services, the major cloud providers offer varying flavours of 'managed Kafka' to ease the operational burden. Amazon MSK (Amazon Managed Streaming for Apache Kafka) is a managed service which runs open source Kafka under the hood, while Azure Event Hubs for Apache Kafka presents a Kafka-compatible API mapped onto the AMQP-based traditional Azure Event Hubs internally, rather than being based on the Kafka engine.