Rosetta Code
A collection of MapReduce tasks translated (from Pig, Hive, MapReduce streaming, etc.) into Scalding. For fully runnable code, see the repository here.
Word Count

Pig:

tweets = LOAD 'tweets.tsv' AS (text:chararray);
-- Split each tweet into words, one row per word.
words = FOREACH tweets GENERATE FLATTEN(TOKENIZE(text)) AS word;
word_groups = GROUP words BY word;
word_counts = FOREACH word_groups GENERATE group AS word, COUNT(words) AS count;
STORE word_counts INTO 'word_counts.tsv';

Scalding:

Tsv("tweets.tsv", 'text)
  // Split each tweet on whitespace, one row per word.
  .flatMap('text -> 'word) { text : String => text.split("\\s+") }
  // Count the rows in each word group.
  .groupBy('word) { _.size }
  .write(Tsv("word_counts.tsv"))
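The same flatMap-then-group logic can be checked locally without a cluster. The following is a minimal sketch using plain Scala collections rather than Scalding; the object name `WordCountSketch` and the in-memory input are assumptions made for illustration and are not part of the original examples.

```scala
// Hypothetical helper: word count over an in-memory list of tweets.
object WordCountSketch {
  def wordCount(tweets: Seq[String]): Map[String, Int] = {
    // flatMap step: split each tweet on whitespace, one element per word.
    val words: Seq[String] = tweets.flatMap(_.split("\\s+").toSeq)
    // groupBy + size step: count the occurrences of each distinct word.
    words.groupBy(identity).map { case (word, occurrences) => word -> occurrences.size }
  }
}
```

Calling `WordCountSketch.wordCount(Seq("hello world", "hello scalding"))` counts `hello` twice and the other two words once each.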
Distributed Grep

The map function emits a line if it matches a given pattern. The reduce function is an identity function that copies the supplied intermediate data to the output. (Example taken from Google Code University.)

Pig:

%declare PATTERN '.*hello.*';
tweets = LOAD 'tweets.tsv' AS (text:chararray);
results = FILTER tweets BY (text MATCHES '$PATTERN');

Scalding:

val Pattern = ".*hello.*"

// Keep only the tweets whose text matches the pattern.
Tsv("tweets.tsv", 'text)
  .filter('text) { text : String => text.matches(Pattern) }
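As a quick local check of the filter step, here is a minimal sketch in plain Scala collections (not Scalding). The object name `GrepSketch` and the sample lines are assumptions for illustration. Note that `String.matches` succeeds only when the whole string matches, which is why the pattern is wrapped in `.*` on both sides.

```scala
// Hypothetical helper: grep over an in-memory list of lines.
object GrepSketch {
  val Pattern = ".*hello.*"

  // Map-side logic only: keep a line when the entire line matches the pattern.
  // The identity reduce described above needs no code here.
  def grep(lines: Seq[String]): Seq[String] =
    lines.filter(_.matches(Pattern))
}
```

For example, `GrepSketch.grep(Seq("hello world", "goodbye", "oh hello"))` keeps the first and last lines and drops `goodbye`.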
Inverted Index

The map function parses each document and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs, and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions. (Example taken from Google Code University.)

Pig:

tweets = LOAD 'tweets.tsv' AS (tweet_id:int, text:chararray);
-- Emit one (tweet_id, word) row per word of each tweet.
words = FOREACH tweets GENERATE tweet_id, FLATTEN(TOKENIZE(text)) AS word;
word_groups = GROUP words BY word;
inverted_index = FOREACH word_groups GENERATE group AS word, words.tweet_id;

Scalding:

val tweets = Tsv("tweets.tsv", ('id, 'text))

// Emit one (word, tweetId) pair per word of each tweet.
val wordToTweets =
  tweets
    .flatMap(('id, 'text) -> ('word, 'tweetId)) {
      fields : (Long, String) =>
        val (tweetId, text) = fields
        text.split("\\s+").map { word => (word, tweetId) }
    }

// Collect the tweet IDs for each word into a list.
val invertedIndex =
  wordToTweets.groupBy('word) { _.toList[Long]('tweetId -> 'tweetIds) }
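The map and reduce steps described above can also be sketched with plain Scala collections for local experimentation. This is an illustrative sketch, not Scalding code: the object name `InvertedIndexSketch` and the in-memory `(id, text)` pairs are assumptions. The IDs are sorted explicitly here to follow the description of the reduce function.

```scala
// Hypothetical helper: inverted index over in-memory (tweetId, text) pairs.
object InvertedIndexSketch {
  def invertedIndex(tweets: Seq[(Long, String)]): Map[String, List[Long]] = {
    // Map step: emit one (word, tweetId) pair per word of each tweet.
    val wordToTweet: Seq[(String, Long)] =
      tweets.flatMap { case (tweetId, text) =>
        text.split("\\s+").toSeq.map(word => (word, tweetId))
      }
    // Reduce step: gather the IDs for each word and sort them.
    wordToTweet
      .groupBy { case (word, _) => word }
      .map { case (word, pairs) => word -> pairs.map(_._2).sorted.toList }
  }
}
```

For instance, given `Seq((2L, "hello world"), (1L, "hello"))`, the entry for `hello` lists tweet 1 before tweet 2, and `world` maps to tweet 2 alone. A word repeated within one tweet contributes that tweet's ID once per occurrence, since the sketch does not deduplicate.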