
Commit

Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
Conflicts:
	project/SparkBuild.scala
	python/pyspark/context.py
MLnick committed Jan 13, 2014
2 parents eb40036 + e6ed13f commit 1c8efbc
Showing 464 changed files with 14,259 additions and 7,247 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -1,7 +1,10 @@
*~
*.swp
*.ipr
*.iml
*.iws
.idea/
sbt/*.jar
.settings
.cache
/build/
26 changes: 16 additions & 10 deletions README.md
@@ -13,20 +13,22 @@ This README file only contains basic setup instructions.
## Building

Spark requires Scala 2.10. The project is built using Simple Build Tool (SBT),
which is packaged with it. To build Spark and its example programs, run:
which can be obtained [here](http://www.scala-sbt.org). If SBT is installed, we
will use the system version of sbt; otherwise we will attempt to download it
automatically. To build Spark and its example programs, run:

sbt/sbt assembly
./sbt/sbt assembly

Once you've built Spark, the easiest way to start using it is the shell:

./spark-shell
./bin/spark-shell

Or, for the Python API, the Python shell (`./pyspark`).
Or, for the Python API, the Python shell (`./bin/pyspark`).

Spark also comes with several sample programs in the `examples` directory.
To run one of them, use `./run-example <class> <params>`. For example:
To run one of them, use `./bin/run-example <class> <params>`. For example:

./run-example org.apache.spark.examples.SparkLR local[2]
./bin/run-example org.apache.spark.examples.SparkLR local[2]

will run the Logistic Regression example locally on 2 CPUs.

@@ -36,7 +38,13 @@ All of the Spark samples take a `<master>` parameter that is the cluster URL
to connect to. This can be a mesos:// or spark:// URL, or "local" to run
locally with one thread, or "local[N]" to run locally with N threads.
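
As a hedged Scala sketch of how the `<master>` parameter is used programmatically
(this assumes the 0.9-era `SparkContext(master, appName)` constructor; the class
name and job below are purely illustrative):

    // The first constructor argument is the <master> URL: "local", "local[N]",
    // "spark://host:port" or "mesos://host:port".
    import org.apache.spark.SparkContext

    object MasterUrlExample {
      def main(args: Array[String]) {
        val sc = new SparkContext("local[2]", "MasterUrlExample") // two local worker threads
        println(sc.parallelize(1 to 100).count())                 // trivial job to exercise the context
        sc.stop()
      }
    }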

## Running tests

Testing first requires [Building](#building) Spark. Once Spark is built, tests
can be run using:

`./sbt/sbt test`

## A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
@@ -54,7 +62,7 @@ versions without YARN, use:
# Cloudera CDH 4.2.0 with MapReduce v1
$ SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt assembly

For Apache Hadoop 2.0.X, 2.1.X, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions
For Apache Hadoop 2.2.X, 2.1.X, 2.0.X, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions
with YARN, also set `SPARK_YARN=true`:

# Apache Hadoop 2.0.5-alpha
@@ -63,10 +71,8 @@ with YARN, also set `SPARK_YARN=true`:
# Cloudera CDH 4.2.0 with MapReduce v2
$ SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_YARN=true sbt/sbt assembly

When building for Hadoop 2.2.X and newer, you'll need to include the additional `new-yarn` profile:

# Apache Hadoop 2.2.X and newer
$ mvn -Dyarn.version=2.2.0 -Dhadoop.version=2.2.0 -Pnew-yarn
$ SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly

When developing a Spark application, specify the Hadoop version by adding the
"hadoop-client" artifact to your project's dependencies. For example, if you're
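Continuing the `hadoop-client` note above, a minimal sbt sketch of such a
dependency (the version shown is an assumption and should match the Hadoop
release on your cluster, e.g. the one used for `SPARK_HADOOP_VERSION`):

    // build.sbt sketch: pin hadoop-client to the cluster's Hadoop version (2.2.0 assumed here)
    libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.2.0"
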
27 changes: 0 additions & 27 deletions assembly/lib/PY4J_LICENSE.txt

This file was deleted.

1 change: 0 additions & 1 deletion assembly/lib/PY4J_VERSION.txt

This file was deleted.

Binary file removed assembly/lib/net/sf/py4j/py4j/0.7/py4j-0.7.jar
Binary file not shown.
9 changes: 0 additions & 9 deletions assembly/lib/net/sf/py4j/py4j/0.7/py4j-0.7.pom

This file was deleted.

12 changes: 0 additions & 12 deletions assembly/lib/net/sf/py4j/py4j/maven-metadata-local.xml

This file was deleted.

34 changes: 22 additions & 12 deletions assembly/pom.xml
@@ -41,33 +41,33 @@
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-bagel_2.10</artifactId>
<artifactId>spark-bagel_${scala.binary.version}</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.10</artifactId>
<artifactId>spark-mllib_${scala.binary.version}</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-repl_2.10</artifactId>
<artifactId>spark-repl_${scala.binary.version}</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<artifactId>spark-streaming_${scala.binary.version}</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>net.sf.py4j</groupId>
<artifactId>py4j</artifactId>
<version>0.7</version>
<version>0.8.1</version>
</dependency>
</dependencies>

@@ -79,7 +79,7 @@
<artifactId>maven-shade-plugin</artifactId>
<configuration>
<shadedArtifactAttached>false</shadedArtifactAttached>
<outputFile>${project.build.directory}/scala-2.10/${project.artifactId}-${project.version}-hadoop${hadoop.version}.jar</outputFile>
<outputFile>${project.build.directory}/scala-${scala.binary.version}/${project.artifactId}-${project.version}-hadoop${hadoop.version}.jar</outputFile>
<artifactSet>
<includes>
<include>*:*</include>
@@ -108,12 +108,12 @@
<transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
<resource>META-INF/services/org.apache.hadoop.fs.FileSystem</resource>
</transformer>
</transformers>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
<transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
<resource>reference.conf</resource>
</transformer>
<transformer implementation="org.apache.maven.plugins.shade.resource.DontIncludeResourceTransformer">
<resource>log4j.properties</resource>
</transformer>
</transformers>
</configuration>
</execution>
@@ -124,11 +124,21 @@

<profiles>
<profile>
<id>hadoop2-yarn</id>
<id>yarn-alpha</id>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-yarn-alpha_${scala.binary.version}</artifactId>
<version>${project.version}</version>
</dependency>
</dependencies>
</profile>
<profile>
<id>yarn</id>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-yarn_2.10</artifactId>
<artifactId>spark-yarn_${scala.binary.version}</artifactId>
<version>${project.version}</version>
</dependency>
</dependencies>
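The `_2.10` → `_${scala.binary.version}` edits above parameterize the Scala
binary version in the Maven artifact IDs. The sbt-side counterpart, shown here
as a hedged sketch with an illustrative Spark version string, is the `%%`
operator, which appends the same suffix automatically:

    // build.sbt sketch: %% appends the Scala binary suffix (e.g. _2.10) to the
    // artifact ID, mirroring the ${scala.binary.version} property used in this pom.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating"
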
11 changes: 4 additions & 7 deletions assembly/src/main/assembly/assembly.xml
@@ -39,23 +39,20 @@
</fileSet>
<fileSet>
<directory>
${project.parent.basedir}/bin/
${project.parent.basedir}/sbin/
</directory>
<outputDirectory>/bin</outputDirectory>
<outputDirectory>/sbin</outputDirectory>
<includes>
<include>**/*</include>
</includes>
</fileSet>
<fileSet>
<directory>
${project.parent.basedir}
${project.parent.basedir}/bin/
</directory>
<outputDirectory>/bin</outputDirectory>
<includes>
<include>run-example*</include>
<include>spark-class*</include>
<include>spark-shell*</include>
<include>spark-executor*</include>
<include>**/*</include>
</includes>
</fileSet>
</fileSets>
10 changes: 5 additions & 5 deletions bagel/pom.xml
@@ -34,7 +34,7 @@
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
@@ -43,18 +43,18 @@
</dependency>
<dependency>
<groupId>org.scalatest</groupId>
<artifactId>scalatest_2.10</artifactId>
<artifactId>scalatest_${scala.binary.version}</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.scalacheck</groupId>
<artifactId>scalacheck_2.10</artifactId>
<artifactId>scalacheck_${scala.binary.version}</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<outputDirectory>target/scala-2.10/classes</outputDirectory>
<testOutputDirectory>target/scala-2.10/test-classes</testOutputDirectory>
<outputDirectory>target/scala-${scala.binary.version}/classes</outputDirectory>
<testOutputDirectory>target/scala-${scala.binary.version}/test-classes</testOutputDirectory>
<plugins>
<plugin>
<groupId>org.scalatest</groupId>
@@ -38,7 +38,6 @@ class BagelSuite extends FunSuite with Assertions with BeforeAndAfter with Timeo
}
// To avoid Akka rebinding to the same port, since it doesn't unbind immediately on shutdown
System.clearProperty("spark.driver.port")
System.clearProperty("spark.hostPort")
}

test("halting by voting") {
2 changes: 1 addition & 1 deletion bin/compute-classpath.cmd
@@ -29,7 +29,7 @@ rem Load environment variables from conf\spark-env.cmd, if it exists
if exist "%FWDIR%conf\spark-env.cmd" call "%FWDIR%conf\spark-env.cmd"

rem Build up classpath
set CLASSPATH=%SPARK_CLASSPATH%;%FWDIR%conf
set CLASSPATH=%FWDIR%conf
if exist "%FWDIR%RELEASE" (
for %%d in ("%FWDIR%jars\spark-assembly*.jar") do (
set ASSEMBLY_JAR=%%d
2 changes: 1 addition & 1 deletion bin/compute-classpath.sh
@@ -26,7 +26,7 @@ SCALA_VERSION=2.10
FWDIR="$(cd `dirname $0`/..; pwd)"

# Load environment variables from conf/spark-env.sh, if it exists
if [ -e $FWDIR/conf/spark-env.sh ] ; then
if [ -e "$FWDIR/conf/spark-env.sh" ] ; then
. $FWDIR/conf/spark-env.sh
fi

7 changes: 3 additions & 4 deletions pyspark → bin/pyspark
@@ -18,7 +18,7 @@
#

# Figure out where the Scala framework is installed
FWDIR="$(cd `dirname $0`; pwd)"
FWDIR="$(cd `dirname $0`/..; pwd)"

# Export this as SPARK_HOME
export SPARK_HOME="$FWDIR"
@@ -37,7 +37,7 @@ if [ ! -f "$FWDIR/RELEASE" ]; then
fi

# Load environment variables from conf/spark-env.sh, if it exists
if [ -e $FWDIR/conf/spark-env.sh ] ; then
if [ -e "$FWDIR/conf/spark-env.sh" ] ; then
. $FWDIR/conf/spark-env.sh
fi

@@ -59,8 +59,7 @@ if [ -n "$IPYTHON_OPTS" ]; then
fi

if [[ "$IPYTHON" = "1" ]] ; then
IPYTHON_OPTS=${IPYTHON_OPTS:--i}
exec ipython "$IPYTHON_OPTS" -c "%run $PYTHONSTARTUP"
exec ipython $IPYTHON_OPTS
else
exec "$PYSPARK_PYTHON" "$@"
fi
File renamed without changes.
2 changes: 1 addition & 1 deletion pyspark2.cmd → bin/pyspark2.cmd
@@ -20,7 +20,7 @@ rem
set SCALA_VERSION=2.10

rem Figure out where the Spark framework is installed
set FWDIR=%~dp0
set FWDIR=%~dp0..\

rem Export this as SPARK_HOME
set SPARK_HOME=%FWDIR%
21 changes: 13 additions & 8 deletions run-example → bin/run-example
@@ -17,16 +17,21 @@
# limitations under the License.
#

cygwin=false
case "`uname`" in
CYGWIN*) cygwin=true;;
esac

SCALA_VERSION=2.10

# Figure out where the Scala framework is installed
FWDIR="$(cd `dirname $0`; pwd)"
FWDIR="$(cd `dirname $0`/..; pwd)"

# Export this as SPARK_HOME
export SPARK_HOME="$FWDIR"

# Load environment variables from conf/spark-env.sh, if it exists
if [ -e $FWDIR/conf/spark-env.sh ] ; then
if [ -e "$FWDIR/conf/spark-env.sh" ] ; then
. $FWDIR/conf/spark-env.sh
fi

@@ -40,25 +45,25 @@ fi
EXAMPLES_DIR="$FWDIR"/examples
SPARK_EXAMPLES_JAR=""
if [ -e "$EXAMPLES_DIR"/target/scala-$SCALA_VERSION/*assembly*[0-9Tg].jar ]; then
# Use the JAR from the SBT build
export SPARK_EXAMPLES_JAR=`ls "$EXAMPLES_DIR"/target/scala-$SCALA_VERSION/*assembly*[0-9Tg].jar`
fi
if [ -e "$EXAMPLES_DIR"/target/spark-examples*[0-9Tg].jar ]; then
# Use the JAR from the Maven build
# TODO: this also needs to become an assembly!
export SPARK_EXAMPLES_JAR=`ls "$EXAMPLES_DIR"/target/spark-examples*[0-9Tg].jar`
fi
if [[ -z $SPARK_EXAMPLES_JAR ]]; then
echo "Failed to find Spark examples assembly in $FWDIR/examples/target" >&2
echo "You need to build Spark with sbt/sbt assembly before running this program" >&2
exit 1
fi


# Since the examples JAR ideally shouldn't include spark-core (that dependency should be
# "provided"), also add our standard Spark classpath, built using compute-classpath.sh.
CLASSPATH=`$FWDIR/bin/compute-classpath.sh`
CLASSPATH="$SPARK_EXAMPLES_JAR:$CLASSPATH"

if $cygwin; then
CLASSPATH=`cygpath -wp $CLASSPATH`
export SPARK_EXAMPLES_JAR=`cygpath -w $SPARK_EXAMPLES_JAR`
fi

# Find java binary
if [ -n "${JAVA_HOME}" ]; then
RUNNER="${JAVA_HOME}/bin/java"
File renamed without changes.