Merge branch 'master' of github.com:apache/spark
Conflicts:
	core/src/main/scala/org/apache/spark/deploy/DeployWebUI.scala
	core/src/main/scala/org/apache/spark/deploy/WebUI.scala
	core/src/main/scala/org/apache/spark/deploy/master/Master.scala
	core/src/main/scala/org/apache/spark/ui/WebUI.scala
andrewor14 committed Mar 22, 2014
2 parents 60bc6d5 + d780983 commit 5dbfbb4
Showing 11,433 changed files with 247,907 additions and 176 deletions.
5 changes: 5 additions & 0 deletions assembly/pom.xml
@@ -79,6 +79,11 @@
<artifactId>spark-graphx_${scala.binary.version}</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.binary.version}</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>net.sf.py4j</groupId>
<artifactId>py4j</artifactId>
31 changes: 27 additions & 4 deletions bin/compute-classpath.sh
@@ -33,23 +33,43 @@ fi
# Build up classpath
CLASSPATH="$SPARK_CLASSPATH:$FWDIR/conf"

# Support for interacting with Hive. Since hive pulls in a lot of dependencies that might break
# existing Spark applications, it is not included in the standard spark assembly. Instead, we only
# include it in the classpath if the user has explicitly requested it by running "sbt hive/assembly".
# Hopefully we will find a way to avoid uber-jars entirely and deploy only the needed packages in
# the future.
if [ -f "$FWDIR"/sql/hive/target/scala-$SCALA_VERSION/spark-hive-assembly-*.jar ]; then
echo "Hive assembly found, including hive support. If this isn't desired run sbt hive/clean."

# Datanucleus jars do not work if only included in the uberjar as plugin.xml metadata is lost.
DATANUCLEUSJARS=$(JARS=("$FWDIR/lib_managed/jars"/datanucleus-*.jar); IFS=:; echo "${JARS[*]}")
CLASSPATH=$CLASSPATH:$DATANUCLEUSJARS

ASSEMBLY_DIR="$FWDIR/sql/hive/target/scala-$SCALA_VERSION/"
else
ASSEMBLY_DIR="$FWDIR/assembly/target/scala-$SCALA_VERSION/"
fi

# First check if we have a dependencies jar. If so, include binary classes with the deps jar
if [ -f "$FWDIR"/assembly/target/scala-$SCALA_VERSION/spark-assembly*hadoop*-deps.jar ]; then
if [ -f "$ASSEMBLY_DIR"/spark-assembly*hadoop*-deps.jar ]; then
CLASSPATH="$CLASSPATH:$FWDIR/core/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/repl/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/mllib/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/bagel/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/graphx/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/streaming/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/sql/catalyst/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/sql/core/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/sql/hive/target/scala-$SCALA_VERSION/classes"

DEPS_ASSEMBLY_JAR=`ls "$FWDIR"/assembly/target/scala-$SCALA_VERSION/spark-assembly*hadoop*-deps.jar`
DEPS_ASSEMBLY_JAR=`ls "$ASSEMBLY_DIR"/spark*-assembly*hadoop*-deps.jar`
CLASSPATH="$CLASSPATH:$DEPS_ASSEMBLY_JAR"
else
# Else use spark-assembly jar from either RELEASE or assembly directory
if [ -f "$FWDIR/RELEASE" ]; then
ASSEMBLY_JAR=`ls "$FWDIR"/jars/spark-assembly*.jar`
ASSEMBLY_JAR=`ls "$FWDIR"/jars/spark*-assembly*.jar`
else
ASSEMBLY_JAR=`ls "$FWDIR"/assembly/target/scala-$SCALA_VERSION/spark-assembly*hadoop*.jar`
ASSEMBLY_JAR=`ls "$ASSEMBLY_DIR"/spark*-assembly*hadoop*.jar`
fi
CLASSPATH="$CLASSPATH:$ASSEMBLY_JAR"
fi
@@ -62,6 +82,9 @@ if [[ $SPARK_TESTING == 1 ]]; then
CLASSPATH="$CLASSPATH:$FWDIR/bagel/target/scala-$SCALA_VERSION/test-classes"
CLASSPATH="$CLASSPATH:$FWDIR/graphx/target/scala-$SCALA_VERSION/test-classes"
CLASSPATH="$CLASSPATH:$FWDIR/streaming/target/scala-$SCALA_VERSION/test-classes"
CLASSPATH="$CLASSPATH:$FWDIR/sql/catalyst/target/scala-$SCALA_VERSION/test-classes"
CLASSPATH="$CLASSPATH:$FWDIR/sql/core/target/scala-$SCALA_VERSION/test-classes"
CLASSPATH="$CLASSPATH:$FWDIR/sql/hive/target/scala-$SCALA_VERSION/test-classes"
fi

# Add hadoop conf dir if given -- otherwise FileSystem.*, etc fail !
@@ -23,8 +23,7 @@ import javax.servlet.http.HttpServletRequest

import scala.xml.Node

import org.apache.spark.deploy.DeployWebUI
import org.apache.spark.ui.UIUtils
import org.apache.spark.ui.{UIUtils, WebUI}

private[spark] class IndexPage(parent: HistoryServer) {
private val dateFmt = new SimpleDateFormat("yyyy/MM/dd HH:mm:ss")
@@ -63,7 +62,7 @@ private[spark] class IndexPage(parent: HistoryServer) {
val startTime = if (info.started) dateFmt.format(new Date(info.startTime)) else "Not started"
val endTime = if (info.finished) dateFmt.format(new Date(info.endTime)) else "Not finished"
val difference = if (info.started && info.finished) info.endTime - info.startTime else -1L
val duration = if (difference > 0) DeployWebUI.formatDuration(difference) else "---"
val duration = if (difference > 0) WebUI.formatDuration(difference) else "---"
val logDirectory = parent.getAppId(info.logPath)
val lastUpdated = dateFmt.format(new Date(info.lastUpdated))
<tr>
@@ -51,7 +51,7 @@ private[spark] class Master(

val conf = new SparkConf

val DATE_FORMAT = new SimpleDateFormat("yyyyMMddHHmmss") // For application IDs
def createDateFormat = new SimpleDateFormat("yyyyMMddHHmmss") // For application IDs
val WORKER_TIMEOUT = conf.getLong("spark.worker.timeout", 60) * 1000
val RETAINED_APPLICATIONS = conf.getInt("spark.deploy.retainedApplications", 200)
val REAPER_ITERATIONS = conf.getInt("spark.dead.worker.persistence", 15)
@@ -671,7 +671,7 @@ private[spark] class Master(

/** Generate a new app ID given an app's submission date */
def newApplicationId(submitDate: Date): String = {
val appId = "app-%s-%04d".format(DATE_FORMAT.format(submitDate), nextAppNumber)
val appId = "app-%s-%04d".format(createDateFormat.format(submitDate), nextAppNumber)
nextAppNumber += 1
appId
}
@@ -695,7 +695,7 @@ }
}

def newDriverId(submitDate: Date): String = {
val appId = "driver-%s-%04d".format(DATE_FORMAT.format(submitDate), nextDriverNumber)
val appId = "driver-%s-%04d".format(createDateFormat.format(submitDate), nextDriverNumber)
nextDriverNumber += 1
appId
}
@@ -25,10 +25,10 @@ import scala.xml.Node
import akka.pattern.ask
import org.json4s.JValue

import org.apache.spark.deploy.{DeployWebUI, JsonProtocol}
import org.apache.spark.deploy.{JsonProtocol}
import org.apache.spark.deploy.DeployMessages.{MasterStateResponse, RequestMasterState}
import org.apache.spark.deploy.master.{ApplicationInfo, DriverInfo, WorkerInfo}
import org.apache.spark.ui.UIUtils
import org.apache.spark.ui.{WebUI, UIUtils}
import org.apache.spark.util.Utils

private[spark] class IndexPage(parent: MasterWebUI) {
@@ -169,10 +169,10 @@ private[spark] class IndexPage(parent: MasterWebUI) {
<td sorttable_customkey={app.desc.memoryPerSlave.toString}>
{Utils.megabytesToString(app.desc.memoryPerSlave)}
</td>
<td>{DeployWebUI.formatDate(app.submitDate)}</td>
<td>{WebUI.formatDate(app.submitDate)}</td>
<td>{app.desc.user}</td>
<td>{app.state.toString}</td>
<td>{DeployWebUI.formatDuration(app.duration)}</td>
<td>{WebUI.formatDuration(app.duration)}</td>
</tr>
}

@@ -56,7 +56,7 @@ private[spark] class Worker(
Utils.checkHost(host, "Expected hostname")
assert (port > 0)

val DATE_FORMAT = new SimpleDateFormat("yyyyMMddHHmmss") // For worker and executor IDs
def createDateFormat = new SimpleDateFormat("yyyyMMddHHmmss") // For worker and executor IDs

// Send a heartbeat every (heartbeat timeout) / 4 milliseconds
val HEARTBEAT_MILLIS = conf.getLong("spark.worker.timeout", 60) * 1000 / 4
@@ -319,7 +319,7 @@ private[spark] class Worker(
}

def generateWorkerId(): String = {
"worker-%s-%s-%d".format(DATE_FORMAT.format(new Date), host, port)
"worker-%s-%s-%d".format(createDateFormat.format(new Date), host, port)
}

override def postStop() {
@@ -55,7 +55,9 @@ class JobLogger(val user: String, val logDirName: String) extends SparkListener
private val jobIdToPrintWriter = new HashMap[Int, PrintWriter]
private val stageIdToJobId = new HashMap[Int, Int]
private val jobIdToStageIds = new HashMap[Int, Seq[Int]]
private val DATE_FORMAT = new SimpleDateFormat("yyyy/MM/dd HH:mm:ss")
private val dateFormat = new ThreadLocal[SimpleDateFormat]() {
override def initialValue() = new SimpleDateFormat("yyyy/MM/dd HH:mm:ss")
}
private val eventQueue = new LinkedBlockingQueue[SparkListenerEvent]

createLogDir()
@@ -128,7 +130,7 @@ class JobLogger(val user: String, val logDirName: String) extends SparkListener
var writeInfo = info
if (withTime) {
val date = new Date(System.currentTimeMillis())
writeInfo = DATE_FORMAT.format(date) + ": " + info
writeInfo = dateFormat.get.format(date) + ": " + info
}
jobIdToPrintWriter.get(jobId).foreach(_.println(writeInfo))
}
32 changes: 32 additions & 0 deletions core/src/main/scala/org/apache/spark/ui/WebUI.scala
@@ -17,6 +17,9 @@

package org.apache.spark.ui

import java.text.SimpleDateFormat
import java.util.Date

private[spark] abstract class WebUI(name: String) {
protected var serverInfo: Option[ServerInfo] = None

@@ -35,3 +38,32 @@ private[spark] abstract class WebUI(name: String) {
serverInfo.get.server.stop()
}
}

/**
* Utilities used throughout the web UI.
*/
private[spark] object WebUI {
// SimpleDateFormat is not thread-safe. Don't expose it to avoid improper use.
private val dateFormat = new ThreadLocal[SimpleDateFormat]() {
override def initialValue(): SimpleDateFormat = new SimpleDateFormat("yyyy/MM/dd HH:mm:ss")
}

def formatDate(date: Date): String = dateFormat.get.format(date)

def formatDate(timestamp: Long): String = dateFormat.get.format(new Date(timestamp))

def formatDuration(milliseconds: Long): String = {
val seconds = milliseconds.toDouble / 1000
if (seconds < 60) {
return "%.0f s".format(seconds)
}
val minutes = seconds / 60
if (minutes < 10) {
return "%.1f min".format(minutes)
} else if (minutes < 60) {
return "%.0f min".format(minutes)
}
val hours = minutes / 60
return "%.1f h".format(hours)
}
}
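(A quick sketch of what the new helpers produce; illustrative only. WebUI is private[spark], so these calls compile only from Spark-internal code.)

import java.util.Date
import org.apache.spark.ui.WebUI

// Each thread obtains its own SimpleDateFormat through the ThreadLocal above,
// so concurrent page renders cannot corrupt each other's formatting state.
WebUI.formatDate(new Date)              // e.g. "2014/03/22 14:05:31"
WebUI.formatDate(1395500731000L)        // same format, from a millisecond timestamp
WebUI.formatDuration(45 * 1000L)        // "45 s"
WebUI.formatDuration(5 * 60 * 1000L)    // "5.0 min"  (under 10 minutes keeps one decimal)
WebUI.formatDuration(45 * 60 * 1000L)   // "45 min"
WebUI.formatDuration(3 * 3600 * 1000L)  // "3.0 h"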
@@ -17,7 +17,6 @@

package org.apache.spark.ui.jobs

import java.text.SimpleDateFormat
import javax.servlet.http.HttpServletRequest

import org.eclipse.jetty.servlet.ServletContextHandler
@@ -31,7 +30,6 @@ import org.apache.spark.util.Utils
/** Web UI showing progress status of all jobs in the given SparkContext. */
private[ui] class JobProgressUI(parent: SparkUI) {
val basePath = parent.basePath
val dateFmt = new SimpleDateFormat("yyyy/MM/dd HH:mm:ss")
val live = parent.live
val sc = parent.sc

5 changes: 2 additions & 3 deletions core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala
@@ -23,13 +23,12 @@ import javax.servlet.http.HttpServletRequest
import scala.xml.Node

import org.apache.spark.ui.Page._
import org.apache.spark.ui.UIUtils
import org.apache.spark.ui.{WebUI, UIUtils}
import org.apache.spark.util.{Utils, Distribution}

/** Page showing statistics and task list for a given stage */
private[ui] class StagePage(parent: JobProgressUI) {
private val basePath = parent.basePath
private val dateFmt = parent.dateFmt
private lazy val listener = parent.listener

private def appName = parent.appName
@@ -254,7 +253,7 @@ private[ui] class StagePage(parent: JobProgressUI) {
<td>{info.status}</td>
<td>{info.taskLocality}</td>
<td>{info.host}</td>
<td>{dateFmt.format(new Date(info.launchTime))}</td>
<td>{WebUI.formatDate(new Date(info.launchTime))}</td>
<td sorttable_customkey={duration.toString}>
{formatDuration}
</td>
5 changes: 2 additions & 3 deletions core/src/main/scala/org/apache/spark/ui/jobs/StageTable.scala
@@ -23,13 +23,12 @@ import scala.collection.mutable.HashMap
import scala.xml.Node

import org.apache.spark.scheduler.{StageInfo, TaskInfo}
import org.apache.spark.ui.UIUtils
import org.apache.spark.ui.{WebUI, UIUtils}
import org.apache.spark.util.Utils

/** Page showing list of all ongoing and recently finished stages */
private[ui] class StageTable(stages: Seq[StageInfo], parent: JobProgressUI) {
private val basePath = parent.basePath
private val dateFmt = parent.dateFmt
private lazy val listener = parent.listener
private lazy val isFairScheduler = parent.isFairScheduler

@@ -82,7 +81,7 @@ private[ui] class StageTable(stages: Seq[StageInfo], parent: JobProgressUI) {
val description = listener.stageIdToDescription.get(s.stageId)
.map(d => <div><em>{d}</em></div><div>{nameLink}</div>).getOrElse(nameLink)
val submissionTime = s.submissionTime match {
case Some(t) => dateFmt.format(new Date(t))
case Some(t) => WebUI.formatDate(new Date(t))
case None => "Unknown"
}
val finishTime = s.completionTime.getOrElse(System.currentTimeMillis)
7 changes: 5 additions & 2 deletions core/src/main/scala/org/apache/spark/util/FileLogger.scala
@@ -44,7 +44,10 @@ class FileLogger(
overwrite: Boolean = true)
extends Logging {

private val DATE_FORMAT = new SimpleDateFormat("yyyy/MM/dd HH:mm:ss")
private val dateFormat = new ThreadLocal[SimpleDateFormat]() {
override def initialValue(): SimpleDateFormat = new SimpleDateFormat("yyyy/MM/dd HH:mm:ss")
}

private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir))
var fileIndex = 0

@@ -110,7 +113,7 @@ def log(msg: String, withTime: Boolean = false) {
def log(msg: String, withTime: Boolean = false) {
val writeInfo = if (!withTime) msg else {
val date = new Date(System.currentTimeMillis())
DATE_FORMAT.format(date) + ": " + msg
dateFormat.get.format(date) + ": " + msg
}
writer.foreach(_.print(writeInfo))
}
9 changes: 9 additions & 0 deletions docs/_layouts/global.html
@@ -66,6 +66,7 @@
<li><a href="python-programming-guide.html">Spark in Python</a></li>
<li class="divider"></li>
<li><a href="streaming-programming-guide.html">Spark Streaming</a></li>
<li><a href="sql-programming-guide.html">Spark SQL</a></li>
<li><a href="mllib-guide.html">MLlib (Machine Learning)</a></li>
<li><a href="bagel-programming-guide.html">Bagel (Pregel on Spark)</a></li>
<li><a href="graphx-programming-guide.html">GraphX (Graph Processing)</a></li>
@@ -79,6 +80,14 @@
<li><a href="api/pyspark/index.html">Spark Core for Python</a></li>
<li class="divider"></li>
<li><a href="api/streaming/index.html#org.apache.spark.streaming.package">Spark Streaming</a></li>
<li class="dropdown-submenu">
<a tabindex="-1" href="#">Spark SQL</a>
<ul class="dropdown-menu">
<li><a href="api/sql/core/org/apache/spark/sql/SQLContext.html">Spark SQL Core</a></li>
<li><a href="api/sql/hive/org/apache/spark/sql/hive/package.html">Hive Support</a></li>
<li><a href="api/sql/catalyst/org/apache/spark/sql/catalyst/package.html">Catalyst (Optimization)</a></li>
</ul>
</li>
<li><a href="api/mllib/index.html#org.apache.spark.mllib.package">MLlib (Machine Learning)</a></li>
<li><a href="api/bagel/index.html#org.apache.spark.bagel.package">Bagel (Pregel on Spark)</a></li>
<li><a href="api/graphx/index.html#org.apache.spark.graphx.package">GraphX (Graph Processing)</a></li>
13 changes: 13 additions & 0 deletions docs/_plugins/copy_api_dirs.rb
@@ -22,6 +22,7 @@
# Build Scaladoc for Java/Scala
core_projects = ["core", "examples", "repl", "bagel", "graphx", "streaming", "mllib"]
external_projects = ["flume", "kafka", "mqtt", "twitter", "zeromq"]
sql_projects = ["catalyst", "core", "hive"]

projects = core_projects + external_projects.map { |project_name| "external/" + project_name }

@@ -49,6 +50,18 @@
cp_r(source + "/.", dest)
end

sql_projects.each do |project_name|
source = "../sql/" + project_name + "/target/scala-2.10/api/"
dest = "api/sql/" + project_name

puts "echo making directory " + dest
mkdir_p dest

# From the rubydoc: cp_r('src', 'dest') makes src/dest, but this doesn't.
puts "cp -r " + source + "/. " + dest
cp_r(source + "/.", dest)
end

# Build Epydoc for Python
puts "Moving to python directory and building epydoc."
cd("../python")
1 change: 1 addition & 0 deletions docs/index.md
@@ -78,6 +78,7 @@ For this version of Spark (0.8.1) Hadoop 2.2.x (or newer) users will have to build
* [Java Programming Guide](java-programming-guide.html): using Spark from Java
* [Python Programming Guide](python-programming-guide.html): using Spark from Python
* [Spark Streaming](streaming-programming-guide.html): Spark's API for processing data streams
* [Spark SQL](sql-programming-guide.html): Support for running relational queries on Spark
* [MLlib (Machine Learning)](mllib-guide.html): Spark's built-in machine learning library
* [Bagel (Pregel on Spark)](bagel-programming-guide.html): simple graph processing model
* [GraphX (Graphs on Spark)](graphx-programming-guide.html): Spark's new API for graphs
1 change: 1 addition & 0 deletions docs/mllib-guide.md
@@ -29,6 +29,7 @@ The following links provide a detailed explanation of the methods and usage examples
* Gradient Descent and Stochastic Gradient Descent
* <a href="mllib-linear-algebra.html">Linear Algebra</a>
* Singular Value Decomposition
* Principal Component Analysis

# Dependencies
MLlib uses the [jblas](https://github.com/mikiobraun/jblas) linear algebra library, which itself
13 changes: 13 additions & 0 deletions docs/mllib-linear-algebra.md
@@ -59,3 +59,16 @@ val s = decomposed.S.data

println("singular values = " + s.toArray.mkString)
{% endhighlight %}


# Principal Component Analysis

Computes the top k principal component coefficients for the m-by-n data matrix X.
Rows of X correspond to observations and columns correspond to variables.
The coefficient matrix is n-by-k: each column of the returned matrix contains the coefficients
for one principal component, and the columns are in descending order of component variance.
This function centers the data and uses the singular value decomposition (SVD) algorithm.

All input and output is expected in DenseMatrix format. See "SparkPCA.scala" in the examples
directory for example usage.