Saturday, August 6, 2022

Starbucks Twitter Sentiment Analysis

 

[Kafka] Running Kafka with Docker (python)

Kafka with Docker

In this post, we go over how to run Kafka with Docker and Python. Before starting, make sure Docker (Docker Hub) is installed on your computer.

Okay, first, let’s create a directory to store the docker-compose.yml file. The docker-compose file does not run your code itself; it only describes the containers to start.

$ mkdir ~/docker-kafka && cd ~/docker-kafka

You can pull the Kafka and ZooKeeper images with the docker pull command. A more detailed explanation can be found on Docker Hub: kafka and zookeeper.

# Kafka
$ docker pull wurstmeister/kafka

# Zookeeper
$ docker pull wurstmeister/zookeeper

Instead of pulling the images separately, you can write a docker-compose.yml file that pulls both at once. What is a docker-compose.yml file? It is basically a config file for Docker Compose that lets you configure, combine, and deploy multiple Docker containers at the same time. Is there a difference between a Dockerfile and Docker Compose? Yes! “A Dockerfile is a simple text file that contains the commands a user could call to assemble an image whereas Docker Compose is a tool for defining and running multi-container Docker applications” (dockerlab)

# This docker-compose file starts and runs:
# * A 1-node kafka cluster
# * A 1-zookeeper ensemble

version: '2'
services: 
  zookeeper:
    image: wurstmeister/zookeeper
    container_name: zookeeper
    ports:
      - "2181:2181"
  kafka:
    image: wurstmeister/kafka
    container_name: kafka
    ports:
      - "9092:9092"
    environment:
      KAFKA_ADVERTISED_HOST_NAME: 127.0.0.1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

Make sure you run the following commands from the directory where the docker-compose.yml file is located.

$ docker-compose up -d

$ docker container exec -it kafka bash
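
Before moving on, it can help to confirm that the ports published in docker-compose.yml are reachable from the host. The following is a minimal sketch using only the Python standard library; the helper names `port_is_open` and `wait_for_port` are made up for this example and are not part of Docker or Kafka.

```python
import socket
import time


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def wait_for_port(host: str, port: int, attempts: int = 10, delay: float = 1.0) -> bool:
    """Poll host:port until it accepts connections or the attempts run out."""
    for _ in range(attempts):
        if port_is_open(host, port):
            return True
        time.sleep(delay)
    return False
```

Calling `wait_for_port("localhost", 9092)` after `docker-compose up -d` should return True once the Kafka port is published. An open port only shows that something is listening; it does not prove the broker has finished starting.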

A bash prompt inside the kafka container will appear!

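# the version in the directory name depends on the image you pulled
cd /opt/kafka_2.13-2.8.1/bin

# topic list
kafka-topics.sh --list --zookeeper zookeeper:2181

# produce events (the topic is created on first write if auto-creation is enabled)
kafka-console-producer.sh --bootstrap-server kafka:9092 --topic <topic-name>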

# read events
kafka-console-consumer.sh --bootstrap-server kafka:9092 --topic <topic-name> --from-beginning

# delete topic 
kafka-topics.sh --zookeeper zookeeper:2181 --delete --topic <topic-name>

Alternatively, you can run the same commands from the host without opening a shell inside the container:

# Check topic list
$ docker exec -it kafka kafka-topics.sh --bootstrap-server kafka:9092 --list

# Produce events (the topic is created on first write if auto-creation is enabled)
$ docker exec -it kafka kafka-console-producer.sh --bootstrap-server kafka:9092 --topic <topic-name>

# Read events
$ docker exec -it kafka kafka-console-consumer.sh --bootstrap-server kafka:9092 --topic <topic-name> --from-beginning
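
The same topic management can also be done from Python. The sketch below assumes the kafka-python package is installed and that the broker is reachable at localhost:9092; the helper names `is_valid_topic_name` and `ensure_topic` and the single-partition settings are choices made for this example, not something the original setup prescribes.

```python
import re

# Kafka topic names may contain ASCII letters, digits, '.', '_' and '-',
# with a maximum length of 249 characters; "." and ".." are not allowed.
_TOPIC_PATTERN = re.compile(r"[A-Za-z0-9._-]{1,249}")


def is_valid_topic_name(name: str) -> bool:
    """Check a topic name against Kafka's naming rules before using it."""
    return name not in (".", "..") and _TOPIC_PATTERN.fullmatch(name) is not None


def ensure_topic(name: str, bootstrap: str = "localhost:9092") -> list:
    """Create the topic if it is missing and return the sorted topic list."""
    if not is_valid_topic_name(name):
        raise ValueError(f"invalid topic name: {name!r}")

    # Imported lazily so the validator works without kafka-python installed.
    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers=bootstrap)
    try:
        if name not in admin.list_topics():
            # One partition and one replica match the 1-node cluster above.
            admin.create_topics(
                [NewTopic(name=name, num_partitions=1, replication_factor=1)]
            )
        return sorted(admin.list_topics())
    finally:
        admin.close()
```

With the containers running, `ensure_topic("starbucks-tweets")` (a hypothetical topic name) would create the topic explicitly instead of relying on auto-creation by the console producer.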

You can check the status of the containers at any time from a separate terminal:

$ docker-compose ps

When you are ready to stop Docker Compose, run the following command:

$ docker-compose stop

And if you’d like to clean up the containers to reclaim disk space, as well as the volumes containing your data, run the following command:

$ docker-compose rm -v
Are you sure? [yN] y

So, once Kafka is running in Docker, it’s time to actually get real-time tweets from Twitter through Kafka.

Imagine you own a small company that provides a service to users through its own online platform. There would be a source system, such as a clickstream, and a target system, such as the online platform itself. Data integration between one source system and one target system wouldn’t be that complicated. But as the company grows, it gains more source systems and more target systems, all with different data sources, and connecting every pair directly becomes a struggle. That’s when Kafka comes in. Kafka is a platform that receives the data produced by the source systems, and the target systems read that data from Kafka as a stream.

Kafka Diagram

The image is originally from a post explaining Kafka. I recommend the post!
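
To make the decoupling between source and target systems concrete, here is a toy in-memory model of a single topic. It is only an illustration of the idea (an append-only log plus one read position per consumer group) and not how Kafka is implemented; the class `ToyTopic`, the group names, and the sample events are all invented for this sketch.

```python
from collections import defaultdict


class ToyTopic:
    """In-memory stand-in for one Kafka topic with a single partition."""

    def __init__(self):
        self._log = []                     # append-only list of events
        self._offsets = defaultdict(int)   # next offset to read, per consumer group

    def produce(self, event):
        """Append an event and return its offset in the log."""
        self._log.append(event)
        return len(self._log) - 1

    def consume(self, group, max_records=100):
        """Return the unread events for a group and advance its offset."""
        start = self._offsets[group]
        batch = self._log[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch


topic = ToyTopic()

# Two source systems write to the same topic.
topic.produce({"source": "clickstream", "page": "/menu"})
topic.produce({"source": "orders", "item": "latte"})

# Two target systems read the same events independently, at their own pace.
analytics = topic.consume("analytics")
billing_first = topic.consume("billing", max_records=1)
billing_rest = topic.consume("billing")
```

Neither source system needs to know which target systems exist, and adding a third reader does not change the producers at all.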

In this post, we will create three files under the src folder. The consumer, src/consumer.py, is shown below.

import json

from kafka import KafkaConsumer

topic_name = "<your_topic_name>"

consumer = KafkaConsumer(
    topic_name,
    bootstrap_servers=['localhost:9092'],
    auto_offset_reset='latest',      # start from new messages when no offset is committed
    enable_auto_commit=True,         # commit offsets in the background
    auto_commit_interval_ms=5000,    # commit every 5 seconds
    fetch_max_bytes=128,
    max_poll_records=100,
    # each message value is UTF-8 encoded JSON
    value_deserializer=lambda x: json.loads(x.decode('utf-8')))

for message in consumer:
    # message.value has already been decoded by value_deserializer
    tweets = message.value
    print(tweets)
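
The producer side is not shown in this post, so here is a minimal sketch of what a matching producer could look like. It is not the author's producer.py: it assumes the kafka-python package, leaves out the Twitter API entirely, and the function names `serialize`, `deserialize`, and `send_records` are made up for illustration. The serializer is simply the inverse of the consumer's value_deserializer.

```python
import json


def serialize(record: dict) -> bytes:
    """Encode a record as UTF-8 JSON, the inverse of the consumer's deserializer."""
    return json.dumps(record).encode("utf-8")


def deserialize(raw: bytes) -> dict:
    """Decode a message value the same way the consumer does."""
    return json.loads(raw.decode("utf-8"))


def send_records(records, topic_name, bootstrap="localhost:9092"):
    """Send an iterable of dicts to topic_name and wait until they are delivered."""
    # Imported lazily: this part needs kafka-python and a running broker.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=[bootstrap],
        value_serializer=serialize,
    )
    try:
        for record in records:
            producer.send(topic_name, value=record)
        producer.flush()  # block until all buffered messages have been sent
    finally:
        producer.close()
```

For a quick test without Twitter, something like `send_records([{"id": 1, "text": "hello"}], "<your_topic_name>")` should make the record appear in the consumer's output, provided the consumer is already running (it reads from the latest offset).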

Open two different terminals. Run the consumer in one and the producer in the other:

python src/consumer.py

python src/producer.py

The source code is available on GitHub.
