rdd
Follow the setup instructions for your environment, then start the Spark shell:
$ spark-shell
Type this in the Spark shell to reduce log noise:
sc.setLogLevel("WARN")
Then open http://localhost:4040 in a browser to see the Spark shell UI.
Issue the following commands in the Spark shell:
val clickstream = sc.textFile("/user/root/clickstream/in/clickstream.csv")
// count the number of lines
clickstream.count
// print all lines (this brings the whole dataset back to the driver)
clickstream.collect
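On a large file, `collect` can overwhelm the driver with data. A safer way to inspect a few records is `take(n)`, which fetches only the first n lines. A minimal sketch, assuming the `clickstream` RDD defined above:

```scala
// take(5) fetches only the first five lines to the driver,
// unlike collect, which materializes the entire RDD locally
clickstream.take(5).foreach(println)
```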
Let's find all traffic from 'facebook.com'.
We can use the filter transformation for this.
Try the following in the Spark shell:
// apply a filter
val fb = clickstream.filter(line => line.contains("facebook.com"))
// check the Spark shell UI: has the transformation above executed yet? Why (not)?
// count the FB traffic
fb.count
// print the FB traffic
fb.collect
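The UI question above hinges on lazy evaluation: transformations like `filter` only build an execution plan, and nothing runs until an action such as `count` or `collect` is called. A minimal sketch you can try in the same shell (the small `parallelize` dataset is illustrative, not part of the lab data):

```scala
// Transformations are lazy: these two lines build a plan,
// and no job appears in the UI on port 4040 yet
val nums  = sc.parallelize(1 to 10)
val evens = nums.filter(_ % 2 == 0)

// Actions trigger execution: only now does a job show up in the UI
val howMany = evens.count   // returns 5 (the evens 2, 4, 6, 8, 10)
```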
Find the views / clicks ratio for facebook.com.
Let's load all the data in the clickstream/in directory:
val clickstream = sc.textFile("/user/root/clickstream/in/")
val fb = clickstream.filter(line => line.contains("facebook.com"))
val fbViews = fb.filter(line => line.contains("viewed"))
val fbClicks = fb.filter(line => line.contains("clicked"))
// calculate the views / clicks ratio
// (toFloat forces floating-point division instead of integer division)
println("FB views / clicks = " + (fbViews.count.toFloat / fbClicks.count))
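Note that the two `count` actions above each re-read and re-filter the input from scratch, because RDDs are recomputed per action. Caching the shared `fb` RDD lets the second count reuse the first one's work. A minimal sketch, assuming the `fb`, `fbViews`, and `fbClicks` RDDs defined above:

```scala
// Mark fb for caching: the first action below populates the cache,
// and the second action reuses it instead of re-reading the files
fb.cache()
val views  = fbViews.count
val clicks = fbClicks.count
println("FB views / clicks = " + (views.toFloat / clicks))
```

Compare the stage timings in the UI before and after caching to see the difference.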
** => Inspect the Spark shell UI (port 4040) **