Explain Kafka Like I am 5:

Kafka is like a big post office where people can send messages to different rooms (called “topics”) and other people can come and read the messages.

The messages are saved in a big notebook (called a “log”), so even if a room gets very busy, the messages don’t get lost. The log keeps track of all the messages that have been received, in the order that they were received. And if more people want to read the messages, we can split a room into more sections (called “partitions”) so that several readers can share the work.

Definition:

Kafka is a distributed streaming platform that is used for building real-time data pipelines and streaming applications. It is designed to handle high volumes of data and provide a fault-tolerant way of storing and processing streams of records in real-time.

Kafka is based on a publish-subscribe model, where producers write data to topics, and consumers read data from those topics. Topics are partitioned and replicated across a cluster of servers, which allows for high scalability and fault-tolerance.

Kafka Internals:

At a high level, Kafka is composed of four main components:

Producers: Producers are the systems or applications that generate and send data to Kafka topics.
Brokers: Brokers are the servers that make up the Kafka cluster. They are responsible for receiving data from producers, storing it in topics, and serving it to consumers, which pull data from the brokers rather than having it pushed to them. Brokers also handle replication and partitioning of data across the cluster, to ensure high availability and fault tolerance.
Topics: Topics are the logical container for data in Kafka. Topics are the categories or feeds to which records are written by a producer and read by a consumer. Topics are partitioned, meaning that each topic is split into a number of partitions, which are spread across the brokers in the cluster.
Consumers: Consumers are the systems or applications that read data from Kafka topics. Consumers read data from topics and can process or forward it to other systems.

Other Kafka concepts:

Partitions: A partition is a portion of a topic. Each partition is an ordered, immutable sequence of records. Each partition is replicated across a configurable number of brokers for fault tolerance; one replica acts as the leader that handles writes, and the other replicas follow it.
Consumer Groups: A consumer group is a set of consumers that work together to read data from a topic. Each partition is read by exactly one consumer in the group, so each consumer is responsible for a distinct subset of the partitions.
Offsets: An offset is the position of a record within a partition. A consumer group tracks a committed offset for each partition it reads, which records how far the group has progressed so that consumption can resume from that point after a restart or a rebalance.
Messages: The core abstraction in Kafka is a message (also called a record), whose key and value are simple byte arrays that can store any type of data, such as text, binary, or Avro. Each message is assigned an offset that is unique within its partition and identifies its position there. Producers write messages to a topic, where each message lands in one of the topic's partitions, and consumers read messages from the partitions assigned to them.
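
To make the consumer group idea concrete, here is a minimal sketch in plain Java (no Kafka client involved) of how a topic's partitions might be divided among the members of a group. The class name GroupAssignmentSketch and its simple round-robin strategy are assumptions made for illustration; the assignors that ship with Kafka are more involved and run as part of the group rebalance process.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: this class is hypothetical and not part of the Kafka client library.
final class GroupAssignmentSketch {

    // Deals partitions 0..numPartitions-1 out to the consumers one at a time,
    // so every partition is owned by exactly one consumer in the group.
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        if (consumers.isEmpty()) {
            throw new IllegalArgumentException("a group needs at least one consumer");
        }
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        for (int partition = 0; partition < numPartitions; partition++) {
            String owner = consumers.get(partition % consumers.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Two consumers share a topic with four partitions.
        System.out.println(assign(List.of("c1", "c2"), 4));
        // With more consumers than partitions, the extra consumer gets nothing to read.
        System.out.println(assign(List.of("c1", "c2", "c3"), 2));
    }
}
```

The second call illustrates why the partition count caps the useful size of a group: a consumer with no partition assigned sits idle.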

Kafka uses a log-based storage system: incoming data is appended to a log, and each partition is one such log. A partition is a self-contained, append-only log that stores a subset of the messages for a given topic; together, a topic's partitions hold all of its messages. Each partition is also replicated across multiple servers in the cluster to ensure high availability and fault tolerance.
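
The relationship between a log, offsets, and consumer groups can be sketched with a small in-memory model. This is an illustration under stated assumptions, not how Kafka is implemented: the class PartitionLogSketch and its methods are hypothetical, the log lives in a Java list instead of on-disk segment files, and there is no replication.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory model of a single partition (hypothetical, for illustration only).
final class PartitionLogSketch {
    private final List<String> records = new ArrayList<>();
    // One stored position per consumer group, as in the Offsets description.
    private final Map<String, Long> committedOffsets = new HashMap<>();

    // Appends a record to the end of the log and returns the offset it was given.
    long append(String value) {
        records.add(value);
        return records.size() - 1L;
    }

    // Returns up to maxRecords records starting at the group's stored offset,
    // then advances that offset past the records returned.
    List<String> poll(String groupId, int maxRecords) {
        if (maxRecords < 0) {
            throw new IllegalArgumentException("maxRecords must not be negative");
        }
        long start = committedOffsets.getOrDefault(groupId, 0L);
        int end = (int) Math.min((long) records.size(), start + maxRecords);
        List<String> batch = new ArrayList<>(records.subList((int) start, end));
        committedOffsets.put(groupId, (long) end);
        return batch;
    }

    // Moves a group's position, for example back to 0 to read everything again.
    void seek(String groupId, long offset) {
        if (offset < 0 || offset > records.size()) {
            throw new IllegalArgumentException("offset out of range: " + offset);
        }
        committedOffsets.put(groupId, offset);
    }

    public static void main(String[] args) {
        PartitionLogSketch log = new PartitionLogSketch();
        log.append("a");
        log.append("b");
        log.append("c");
        System.out.println(log.poll("g1", 2));  // first two records for group g1
        System.out.println(log.poll("g2", 10)); // group g2 reads independently from offset 0
        log.seek("g1", 0);
        System.out.println(log.poll("g1", 10)); // g1 reads the log again from the start
    }
}
```

Reading never removes anything from the log, which is why two groups can consume the same records independently and why a group can rewind.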

Kafka Usage Example: Here we are using the Kafka Java API to create a simple producer and consumer:

Producer Code:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class KafkaProducerExample {

    public static void main(String[] args) {

        // Connection and serialization settings for the producer.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        String topic = "test";
        String key = "key1";
        String value = "value1";

        // try-with-resources closes the producer, which waits for pending sends to complete.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record = new ProducerRecord<>(topic, key, value);
            // send() is asynchronous: the record is buffered and delivered in the background.
            producer.send(record);
        }
    }
}

This code creates a simple Kafka producer that connects to a Kafka cluster running on “localhost” at port 9092. The producer sends a single message to a topic called “test”, with a key “key1” and a value “value1”.

Consumer Code:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;

public class KafkaConsumerExample {
    public static void main(String[] args) {

        // Connection, deserialization, and consumer group settings.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("group.id", "test-group");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList("test"));

        while (true) {
            // Wait up to 100 ms for new records; the batch is empty if none arrive in time.
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                System.out.println("Received message: (" + record.key() + ", " + record.value() + ") at offset " + record.offset());
            }
        }
    }
}

This code creates a simple Kafka consumer that connects to the same Kafka cluster as the producer and subscribes to the “test” topic. The consumer calls poll in a loop; each call waits up to 100 milliseconds for new records, and the consumer prints out any messages it receives.

Definitions:

Log-based storage system: A log-based storage system records changes by appending them to a sequential log, rather than updating data in place. Because the log preserves the order of writes, it supports efficient sequential I/O and reliable recovery, as well as features such as point-in-time recovery and data replication. Kafka uses the log as its primary storage. Databases like MySQL and PostgreSQL use logs in a supporting role (the redo log and binary log in MySQL, the write-ahead log in PostgreSQL), and Apache Cassandra combines a commit log with log-structured storage.

Kafka: Capabilities and Real world Applications

Kafka Capabilities:

Can process millions of events per second.

High throughput and low latency: Kafka is able to handle a large volume of data at high speed. This is achieved through a distributed architecture, where multiple brokers share the load. Producers write messages to a specific topic, and a partitioner determines which partition each message is written to. The partitioner can use a variety of strategies, such as round-robin, random, or key-based, to spread messages across the partitions. Key-based partitioning also sends all messages with the same key to the same partition, which preserves their relative order.
High scalability: As the number of messages increases, more brokers can be added to the cluster to handle the load. The number of partitions in a topic can also be increased (but not decreased; partitions cannot be merged), and partitions can be reassigned across brokers so that the data stays evenly distributed.
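
Key-based partitioning comes down to hashing the key and taking the result modulo the number of partitions. The sketch below shows that idea in plain Java. It is an approximation under stated assumptions: the class KeyPartitionerSketch is hypothetical, and it uses Arrays.hashCode as a stand-in, whereas the built-in partitioner in Kafka's Java client hashes the serialized key with murmur2, so the partition numbers produced here will not match a real cluster.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: a hypothetical stand-in for a key-based partitioner.
final class KeyPartitionerSketch {

    // Maps a key to a partition number in the range [0, numPartitions).
    static int partitionFor(String key, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        // Hash the key bytes (assumption: a simple hash, not Kafka's murmur2).
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        // floorMod keeps the result non-negative even when the hash is negative.
        return Math.floorMod(hash, numPartitions);
    }

    public static void main(String[] args) {
        for (String key : new String[] {"user-1", "user-2", "user-1"}) {
            // The same key always maps to the same partition for a fixed partition count.
            System.out.println(key + " -> partition " + partitionFor(key, 6));
        }
    }
}
```

Because the partition count is part of the calculation, increasing the number of partitions on an existing topic may send a given key to a different partition than before, which is worth keeping in mind when per-key ordering matters.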

Data is not lost, even in case of failures

Durability: Kafka also provides robust durability, with all messages being written to disk and replicated to multiple brokers to ensure that they are not lost in case of broker failure. It also allows you to set a retention period for messages, after which they will be automatically deleted.
Replay Data: Since messages stay on disk for the retention period even after they have been read, consumers can reset their offsets and replay the data if needed, for example to reprocess it after fixing a bug.
High data availability and fault tolerance: Kafka replicates data across multiple machines in the cluster, which ensures that data is always available even if one or more machines fail.
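
On the producer side, durability is largely a matter of configuration. The following is a minimal sketch of producer settings that favor durability, built with java.util.Properties only so that it runs without the Kafka client on the classpath. The class and method names are hypothetical; the property keys (acks, enable.idempotence) are standard producer settings. How many replicas must acknowledge a write also depends on the topic's replication factor and its min.insync.replicas setting, which are configured on the cluster and not shown here.

```java
import java.util.Properties;

// Hypothetical helper that assembles durability-oriented producer settings.
final class DurableProducerConfigSketch {

    static Properties durableProducerProps(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Wait for all in-sync replicas to acknowledge a write before treating it as successful.
        props.setProperty("acks", "all");
        // Let the broker discard duplicates caused by producer retries.
        props.setProperty("enable.idempotence", "true");
        return props;
    }

    public static void main(String[] args) {
        // "localhost:9092" matches the address used in the producer example.
        durableProducerProps("localhost:9092")
                .forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

The resulting Properties object could be passed to the KafkaProducer constructor in place of the props built in the producer example. The trade-off is latency: waiting for every in-sync replica takes longer than waiting for the leader alone.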

Powerful set of APIs and Integrations

Powerful set of APIs: Kafka provides a powerful set of APIs for integrating with other systems, such as Kafka Connect for moving data between Kafka and external systems, and Kafka Streams for building real-time streaming applications.
Supports a wide range of data formats: Kafka stores messages as bytes, so any format can be used with a matching serializer, including text, JSON, Avro and Protobuf, which makes it easy to integrate with a wide range of systems.
Security features: Kafka supports authentication (SASL or mutual TLS), authorization through ACLs, and TLS encryption for data in transit. Encryption at rest is not built into Kafka and is typically handled with disk or filesystem encryption.
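
Client-side authentication and encryption in transit are enabled through configuration properties. Here is a hedged sketch of what that might look like for a client connecting over TLS with SASL/SCRAM. It assumes the cluster has already been set up with a SASL_SSL listener and SCRAM credentials; the class name, method name, and all argument values are placeholders, while the property keys are standard Kafka client settings.

```java
import java.util.Properties;

// Hypothetical helper; every value passed in is a placeholder for your own environment.
final class SecureClientConfigSketch {

    static Properties secureClientProps(String bootstrapServers,
                                        String username,
                                        String password,
                                        String truststorePath,
                                        String truststorePassword) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Authenticate with SASL and encrypt the connection with TLS.
        props.setProperty("security.protocol", "SASL_SSL");
        props.setProperty("sasl.mechanism", "SCRAM-SHA-512");
        props.setProperty("sasl.jaas.config",
                "org.apache.kafka.common.security.scram.ScramLoginModule required "
                        + "username=\"" + username + "\" password=\"" + password + "\";");
        // Truststore holding the certificate authority that signed the brokers' certificates.
        props.setProperty("ssl.truststore.location", truststorePath);
        props.setProperty("ssl.truststore.password", truststorePassword);
        return props;
    }

    public static void main(String[] args) {
        Properties props = secureClientProps(
                "broker.example.com:9093", "app-user", "app-secret",
                "/path/to/truststore.jks", "truststore-secret");
        // Print only the keys so that credentials do not end up in logs.
        System.out.println(props.stringPropertyNames());
    }
}
```

These properties would be merged with the serializer or deserializer settings before constructing a producer or consumer. In practice, credentials would come from a secret store or environment variables, not from literals in code.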

Kafka Real world use cases:

Since Kafka is able to handle high volumes of data and process them in real-time, it is well-suited for use cases such as:

Event-Driven Architecture: Kafka can be used to build event-driven architectures, where data is processed as soon as it is generated, rather than in batches. This enables real-time data processing and analytics.
Adtech: Kafka can be used to handle real-time data generated by adtech platforms, such as ad impressions, clicks, and conversions.
Log Aggregation: One of the most common use cases for Kafka is log aggregation. Kafka can be used to collect log data and metrics from multiple sources, such as servers, applications, and devices, and store them in a central location for real-time analysis and reporting.
Online Gaming: Kafka can be used to handle real-time data generated by online games, such as player actions, game states, and leaderboards.
IoT: Kafka can be used to handle real-time data generated by IoT devices, such as sensor data, device status, and control commands.
Financial Services: Kafka can be used to handle real-time financial data, such as stock prices, trades, and financial transactions.

Drawbacks of Kafka

Resource Intensive: Kafka requires a significant amount of resources, including memory, CPU, and disk space. This can make it challenging to run Kafka on resource-constrained environments, such as small or low-powered machines.
Lacks Complex Data Querying capabilities: Kafka is primarily designed for data ingestion and stream processing. It has no built-in query language for ad hoc queries over stored data. Tools built on top of it, such as Kafka Streams and ksqlDB, add some querying ability, but Kafka may not be the best choice for use cases that require complex data querying and analysis.
Limited Security Features: While Kafka does support security features, such as authentication and encryption, it does not provide a comprehensive security solution.
Complex Setup: Setting up and configuring a Kafka cluster can be complex and time-consuming.
Basic Monitoring: Kafka exposes metrics through JMX but provides no built-in dashboards or alerting, so external monitoring tools are usually needed to diagnose and troubleshoot issues that arise in the cluster.

Subrat's Technical Blog

Saturday, January 28, 2023

Kafka: An Overview
