Skip to content

jdutton/spark-playground

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

25 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Spark Playground

Build Status

This is a playground for experiments with Spark! Who's got two thumbs and is excited...!?

Getting Started

On a Mac, run brew update and brew install apache-spark sbt to install the latest version of Spark and the Scala sbt build system.

To get started playing, run spark-shell and follow the Spark Quick Start guide for some examples of simple interactive processing.

The Scala experiments code is located in the standard directory layout at src/main/scala.

To build the various experiments into a jar for submitting to spark, run sbt assembly. For example:

sbt assembly
spark-submit --class playground.HardFeelings --master local[*] target/scala-2.11/spark-playground-assembly-*.jar

Or, to build and run with Scala 2.10, which most distributions of Spark are currently built against:

sbt ++2.10.5 assembly
spark-submit --class playground.HardFeelings --master local[*] target/scala-2.10/spark-playground-assembly-*.jar

The rest of the examples assume this project is built against Scala 2.10.

Live Tweets

To process live tweets, copy src/main/resources/twitter.conf.example to src/main/resources/twitter.conf and modify the config to specify your Twitter API credentials. If you don't have Twitter API credentials, see http://ampcamp.berkeley.edu/3/exercises/realtime-processing-with-spark-streaming.html for directions how to obtain them.

To see live tweets printed out:

spark-submit --class playground.connector.Twitter --master local[*] target/scala-2.10/spark-playground-assembly-*.jar

Elasticsearch

To enable Elasticsearch in the playground, make sure Elasticsearch is running locally (Elasticsearch REST API should be available at http://localhost:9200), then you can enable Elasticsearch support by adding --conf spark.playground.es.enabled=true to the spark-submit command arguments. See examples below.

Kafka

To enable Kafka in the playground, make sure Kafka is running locally. Kafka broker expected to be on localhost:9092 and zookeeper is expected to be on localhost:2181.

First, make sure the Kafka topic MericaTweets exists. To list topics, run:

kafka-topics.sh --zookeeper localhost:2181 --list

If the MericaTweets topic does not exist, create it:

kafka-topics.sh --zookeeper localhost:2181 --create --replication-factor 1 --partitions 1 --topic MericaTweets

To listen in externally to the MericaTweets feed from MericaStreaming, optionally from the beginning of the queue:

kafka-console-consumer.sh --zookeeper localhost:2181 --topic MericaTweets [--from-beginning]

HDFS

To output to HDFS (or simulated HDFS via local files), add --conf spark.playground.es.enabled=true to the spark-submit command arguments.

Note that spark-submit will use the HADOOP_CONF_DIR environment variable to find HDFS. When HDFS is not properly setup, make sure this environment variable is not set. Then Spark will write to local files rather than HDFS.

Streaming Tweet Processing

To run streaming tweet processing, you can enable or disable various external datastores, for example:

spark-submit --class playground.MericaStreaming --master local[*] target/scala-2.10/spark-playground-assembly-*.jar
spark-submit --conf spark.playground.kafka.enabled=true --class playground.MericaStreaming --master local[*] target/scala-2.10/spark-playground-assembly-*.jar
spark-submit --conf spark.playground.kafka.enabled=true --conf spark.playground.es.enabled=true --class playground.MericaStreaming --master local[*] target/scala-2.10/spark-playground-assembly-*.jar

To save the tweets for offline batch processing (see below), run with --conf spark.playground.hdfs.enabled=true, e.g:

spark-submit --conf spark.playground.hdfs.enabled=true --class playground.MericaStreaming --master local[*] target/scala-2.10/spark-playground-assembly-*.jar

Note that any combination of the above --conf external datastores is supported.

The Merica and MericaStreaming sentiment analysis jobs can perform sentiment calculations the "easy way" (compute sentiment in memory) or the "hard way" (distributed sentiment calculation via Spark).

To change the calculation mode --conf spark.playground.easy=false (defaults to true). This works for streaming or batch.

Batch Tweet Processing

To do batch sentiment analysis on the tweets saved from the MericaStreaming processing described above, run any of the following:

spark-submit --class playground.Merica --master local[*] target/scala-2.10/spark-playground-assembly-*.jar
spark-submit --conf spark.playground.es.enabled=true --class playground.Merica --master local[*] target/scala-2.10/spark-playground-assembly-*.jar

Development

Just like any other SBT project, to develop in Eclipse (like the Scala IDE), initialize or update the Scala project by running:

sbt eclipse

Testing

Just like any other SBT project, to run unit tests:

sbt test

About

Playground for experimenting with Apache Spark

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages