Covid19 Tweet Data Collection

Overview

Stream real time Tweets of current affairs like covid-19 using Kafka 2.0.0 high throughput producer & consumer into Elasticsearch using safe, idempotent and compression configurations. In this example, keywords like coronavirus, covid-19, pandemic are being used, Change them according to your requirements. Elasticsearch stores the tweets as JSON. Aggregate this data and use the data collected for further analytics.

Breakdown

To ensure a high throughput, idempotence, safe and compression configurations are enabled with respect to Kafka 2.0.0 version. In the TwitterProducer class, the variable terms can be updated to stream tweets about Current affairs. The prerequisite to running this project is to procure Twitter API credentials. To do this, sign up for twitter Developer account here. After creating the app and getting the OAuth credentials,consumerKey,consumerSecret,token,tokenSecret are to be used to set the variables in config.java. Set these Strings to be static. Follow the below template for config.java:

public class config {
    static String consumerKey= "";
    static String consumerSecret = "";
    static String token = "";
    static String tokenSecret = "";
}

To integrate Twitter streaming with Kafka, we use the Horsebird Client which is a Java HTTP client for consuming Twitter's standard streaming API.

In the ElasticSearchConsumer class, We create a consumer which is subscribed to the TwitterTopic which is being used in the TwitterProducer class and the consumer group is Twitter-Consumer-Group. Google gson library is being used to parse the tweets since they are consumed in the JSON format. We extract id_str and use it to make our consumer idempotent and BulkRequest is used to batch before committing the offsets into Elasticsearch. After setting up ElasticSearch either on premise or on the cloud. Set hostname,username,password as static variables in config.java under the kafka-consumer-elasticsearch folder. Follow the below template for config.java:

public class config {
    static String hostname = ""; // Use localhost:9200 or Elasticsearch cloud URL
    static String username = ""; // Required only for Elasticsearch cloud
    static String password = ""; // Required only for Elasticsearch cloud
}

Environment

Java JDK 1.8
Twitter Developer Account
Kafka 2.0.0
Zookeeper
Windows/Linux
Maven
Elasticsearch 7.2.0 in the cloud (bonsai)

Prerequisites

Add the bin folder location of kafka to the $PATH

Ensure that Zookeeper & Kafka servers are up and running (Open in separate terminal windows if necessary).

Command to start Zookeeper

zookeeper-server-start.sh config/zookeeper.properties

Command to start Kafka server

kafka-server-start.sh config/server.properties

In this example, we are implementing Elasticsearch on the cloud. bonsai provides a very good sandbox environment with a 3 node cluster for implementing this.

Personally recommend running Elasticsearch on local machine if and only if free & available RAM >= 5 GB so as to not turn your machine into a room heater

For the Kafka consumer to commit data into Elasticsearch, Create an index called twitter using the PUT command.

Installation

After cloning this repo, Do this for both Kafka Twitter Producer and ElasticSearch consumer

Run the maven clean install command

mvn clean install

Maven will now generate a target directory with the jar Kakfa-Streaming-1.0-shaded.jar or kafka-consumer-elasticsearch-1.0-shaded.jar respectively

Move into the target directory

cd target

Execution

Executing Twitter Producer

Create a topic called TwitterTopic. Always have a replication factor equal or lesser than the number of available brokers. (One Time activity)

kafka-topics --bootstrap-server localhost:9092 --topic TwitterTopic --create --partitions 3 --replication-factor 1

Run the kafka console consumer in another terminal window with the following topic and group parameters

kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic TwitterTopic --group Twitter-Consumer-Group

Execute the TwitterProducer class from the shaded jar

java -cp Kakfa-Streaming-1.0-shaded.jar com.github.thomas.kafka.TwitterProducer

You should now be able to see the output in your Kafka console consumer terminal.

Executing Elasticsearch Kafka Consumer

Create an index in Elasticsearch called twitter in console or localhost using an PUT command

PUT /twitter

or

curl -X PUT "localhost:9200/twitter?pretty"

Execute the ElasticSearchConsumer class from the shaded jar

java -cp kafka-consumer-elasticsearch-1.0-shaded.jar com.github.thomas.kafka.elasticsearchconsumer.ElasticSearchConsumer

View all the tweets in your Elasticsearch console using GET command.

GET /twitter/_search?pretty

or

curl -X GET "localhost:9200/twitter/_search?pretty"

Name		Name	Last commit message	Last commit date
Latest commit History 38 Commits
kafka-consumer-elasticsearch		kafka-consumer-elasticsearch
src/main/java/com/github/thomas/kafka		src/main/java/com/github/thomas/kafka
.gitignore		.gitignore
LICENSE.md		LICENSE.md
README.md		README.md
elastisearch.JPG		elastisearch.JPG
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Covid19 Tweet Data Collection

Overview

Breakdown

Environment

Prerequisites

Installation

Execution

Executing Twitter Producer

Executing Elasticsearch Kafka Consumer

Elasticsearch Cloud Result

License

Resources

About

Contributors 2

Languages

License

Thomas-George-T/Kafka-Tweet-Stream

Folders and files

Latest commit

History

Repository files navigation

Covid19 Tweet Data Collection

Overview

Breakdown

Environment

Prerequisites

Installation

Execution

Executing Twitter Producer

Executing Elasticsearch Kafka Consumer

Elasticsearch Cloud Result

License

Resources

About

Topics

Resources

License

Stars

Watchers

Forks

Contributors 2

Languages