This is a playground for experiments with Spark! Who's got two thumbs and is excited...!?
On a Mac, run brew update
and brew install apache-spark sbt
to install the latest version of
Spark and the Scala sbt
build system.
To get started playing, run spark-shell
and follow the
Spark Quick Start guide for some examples of
simple interactive processing.
The Scala experiments code is located in the standard directory layout at src/main/scala
.
To build the various experiments into a jar for submitting to spark, run sbt assembly
. For example:
sbt assembly
spark-submit --class playground.HardFeelings --master local[*] target/scala-2.11/spark-playground-assembly-*.jar
Or, to build and run with Scala 2.10, which most distributions of Spark are currently built against:
sbt ++2.10.5 assembly
spark-submit --class playground.HardFeelings --master local[*] target/scala-2.10/spark-playground-assembly-*.jar
The rest of the examples assume this project is built against Scala 2.10.
To process live tweets, copy src/main/resources/twitter.conf.example
to
src/main/resources/twitter.conf
and modify the config to specify your Twitter API credentials. If
you don't have Twitter API credentials, see
http://ampcamp.berkeley.edu/3/exercises/realtime-processing-with-spark-streaming.html for directions
how to obtain them.
To see live tweets printed out:
spark-submit --class playground.connector.Twitter --master local[*] target/scala-2.10/spark-playground-assembly-*.jar
To enable Elasticsearch in the playground, make sure Elasticsearch is running locally (Elasticsearch
REST API should be available at http://localhost:9200), then you can enable Elasticsearch support
by adding --conf spark.playground.es.enabled=true
to the spark-submit
command arguments.
See examples below.
To enable Kafka in the playground, make sure Kafka is running locally. Kafka broker expected to be
on localhost:9092
and zookeeper is expected to be on localhost:2181
.
First, make sure the Kafka topic MericaTweets
exists. To list topics, run:
kafka-topics.sh --zookeeper localhost:2181 --list
If the MericaTweets
topic does not exist, create it:
kafka-topics.sh --zookeeper localhost:2181 --create --replication-factor 1 --partitions 1 --topic MericaTweets
To listen in externally to the MericaTweets
feed from MericaStreaming
, optionally from the
beginning of the queue:
kafka-console-consumer.sh --zookeeper localhost:2181 --topic MericaTweets [--from-beginning]
To output to HDFS (or simulated HDFS via local files), add --conf spark.playground.es.enabled=true
to the spark-submit
command arguments.
Note that spark-submit will use the HADOOP_CONF_DIR
environment variable to find HDFS. When HDFS
is not properly setup, make sure this environment variable is not set. Then Spark will write to
local files rather than HDFS.
To run streaming tweet processing, you can enable or disable various external datastores, for example:
spark-submit --class playground.MericaStreaming --master local[*] target/scala-2.10/spark-playground-assembly-*.jar
spark-submit --conf spark.playground.kafka.enabled=true --class playground.MericaStreaming --master local[*] target/scala-2.10/spark-playground-assembly-*.jar
spark-submit --conf spark.playground.kafka.enabled=true --conf spark.playground.es.enabled=true --class playground.MericaStreaming --master local[*] target/scala-2.10/spark-playground-assembly-*.jar
To save the tweets for offline batch processing (see below), run with --conf spark.playground.hdfs.enabled=true
, e.g:
spark-submit --conf spark.playground.hdfs.enabled=true --class playground.MericaStreaming --master local[*] target/scala-2.10/spark-playground-assembly-*.jar
Note that any combination of the above --conf
external datastores is supported.
The Merica and MericaStreaming sentiment analysis jobs can perform sentiment calculations the "easy way" (compute sentiment in memory) or the "hard way" (distributed sentiment calculation via Spark).
To change the calculation mode --conf spark.playground.easy=false
(defaults to true).
This works for streaming or batch.
To do batch sentiment analysis on the tweets saved from the MericaStreaming processing described above, run any of the following:
spark-submit --class playground.Merica --master local[*] target/scala-2.10/spark-playground-assembly-*.jar
spark-submit --conf spark.playground.es.enabled=true --class playground.Merica --master local[*] target/scala-2.10/spark-playground-assembly-*.jar
Just like any other SBT project, to develop in Eclipse (like the Scala IDE), initialize or update the Scala project by running:
sbt eclipse
Just like any other SBT project, to run unit tests:
sbt test