Scalding with CDH3U2 in a Maven project
amimimor edited this page May 15, 2012 · 5 revisions
This page describes how to use Maven to build an executable jar containing a Scalding job that is ready to deploy on a CDH3U2 cluster.
Hadoop versions are not necessarily compatible with each other, so to deploy a MapReduce job on any Hadoop cluster you must make sure that the core Hadoop libraries your client code uses are identical to those found throughout the cluster. In short, client code that will be deployed as an executable jar should be built against exactly the same jars that the cluster's server nodes run.
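One way to sanity-check that match before building is to compare the version embedded in your local hadoop-core jar name with the version the cluster reports. The following is only a minimal sketch: the function names are hypothetical, and it assumes that the first line of `hadoop version` output has the form `Hadoop 0.20.2-cdh3u2` and that the jar is named `hadoop-core-<version>.jar`.

```shell
#!/bin/sh
# Hypothetical helpers; none of these names come from Hadoop or Scalding.

# Extract the version token from the text printed by `hadoop version`.
# Assumes the first line looks like "Hadoop 0.20.2-cdh3u2".
hadoop_version_of() {
  printf '%s\n' "$1" | awk 'NR == 1 && $1 == "Hadoop" { print $2 }'
}

# Extract the version from a hadoop-core jar file name.
# Assumes the name looks like "hadoop-core-0.20.2-cdh3u2.jar".
jar_version_of() {
  basename "$1" .jar | sed 's/^hadoop-core-//'
}

# Succeed only when the first version is non-empty and equals the second.
versions_match() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}
```

For example, you might capture the output of `hadoop version` on a cluster node, pass it to `hadoop_version_of`, and compare the result with `jar_version_of` applied to the jar you plan to install locally. A name comparison like this is only a first check; it does not prove the jar contents are identical.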
Prerequisites:
- Scalding source - here we used v0.5.3
- SBT - to build Scalding
- Cloudera's Hadoop (CDH) - binaries are fine, e.g. hadoop-0.20.2-cdh3u2.tar.gz. Other versions work too; just use the same version your cluster uses.
- IDE with Maven support - here I use Eclipse. There is no need for an IDE if you are a Maven wizard. I am not one of those.
Procedure:
- cd to your Scalding source directory
- edit build.sbt to exclude the hadoop-core jar from being packaged in the Scalding assembly (https://gist.github.com/238d74b081d9f2c6e5f1):
excludedJars in assembly <<= (fullClasspath in assembly) map { cp => cp filter { Set("janino-2.5.16.jar", "hadoop-core-0.20.2.jar") contains _.data.getName } }
- sbt -29 update (-29 is a flag that tells SBT to build with the Scala 2.9.1 libraries; use it if you intend to write your code in this version of Scala)
- sbt -29 assembly (creates a scalding-assembly.0.5.3.jar)
- mvn install:install-file ..... (see http://maven.apache.org/plugins/maven-install-plugin/usage.html) to install the resulting scalding-assembly.0.x.y.jar into your local Maven repository
- download Cloudera's hadoop-0.20.2-cdh3u2.tar.gz (or just download hadoop-core-cdh3u2.jar), then install the cdh3u2 hadoop-core jar into your local Maven repository with mvn install:install-file, as in the previous step (or embed Cloudera's parent pom instead)
- in your IDE, create a new project using this pom: https://gist.github.com/40f1838bbdd15cc25b21
- create the file src/assembly/job.xml with the contents of https://gist.github.com/9c5e6f04da287667983a
- create your Scala class extending Scalding's Job, e.g. "class SomethingCool(args: Args) extends Job(args)"
- mvn package
- the resulting jar is placed in your project's target folder, with a name like YOURPROJECT-0.0.1-SNAPSHOT-job.jar
- set up your Hadoop conf files (most importantly core-site.xml) and set fs.default.name to hdfs://namenode.somethingcool.com:8020/
- cd to your hadoop-0.20-cdh3u2 folder
- bin/hadoop jar YOURPROJECT-0.0.1-SNAPSHOT-job.jar com.twitter.scalding.Tool your.package.your.class --hdfs --input hdfs://namenode.somethingcool.com/user/hdfs/tmp/hello.txt --output hdfs://namenode.somethingcool.com/user/hdfs/tmp/hello_out.txt -libjars YOURPROJECT-0.0.1-SNAPSHOT-job.jar
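The two local-install steps and the core-site.xml setting above can be sketched in shell as follows. This is an illustration under assumptions, not the exact commands used for this page: the helper names are hypothetical, and the groupId, artifactId and version values you pass must match whatever the `<dependency>` entries in your pom declare. The `-Dfile`, `-DgroupId`, `-DartifactId`, `-Dversion` and `-Dpackaging` parameters are the ones documented on the install-plugin usage page linked above. The first helper only prints the command so you can review it before running it.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Print (do not run) the mvn command that installs a jar into the local
# Maven repository.
#   $1 = path to the jar   $2 = groupId   $3 = artifactId   $4 = version
# The coordinates are placeholders and must match your pom's dependencies.
install_cmd() {
  printf 'mvn install:install-file -Dfile=%s -DgroupId=%s -DartifactId=%s -Dversion=%s -Dpackaging=jar\n' \
    "$1" "$2" "$3" "$4"
}

# Write a minimal core-site.xml that sets only fs.default.name.
#   $1 = output file   $2 = namenode URI
# Overwrites $1, so point it at a scratch file first.
write_core_site() {
  cat > "$1" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>$2</value>
  </property>
</configuration>
EOF
}

# Example calls (all arguments are placeholders):
#   install_cmd "$SCALDING_JAR" com.twitter scalding 0.5.3
#   write_core_site ./core-site.xml hdfs://namenode.somethingcool.com:8020/
```

Printing the Maven command rather than executing it keeps the sketch safe to try; once the coordinates look right, you can run the printed line as is. A real core-site.xml will usually carry more properties than this one.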