Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-22648] [K8S] Spark on Kubernetes - Documentation #19946

Closed
wants to merge 13 commits into from
1 change: 1 addition & 0 deletions docs/_layouts/global.html
Original file line number Diff line number Diff line change
Expand Up @@ -99,6 +99,7 @@
<li><a href="spark-standalone.html">Spark Standalone</a></li>
<li><a href="running-on-mesos.html">Mesos</a></li>
<li><a href="running-on-yarn.html">YARN</a></li>
<li><a href="running-on-kubernetes.html">Kubernetes</a></li>
</ul>
</li>

Expand Down
6 changes: 5 additions & 1 deletion docs/building-spark.md
Original file line number Diff line number Diff line change
Expand Up @@ -49,7 +49,7 @@ To create a Spark distribution like those distributed by the
to be runnable, use `./dev/make-distribution.sh` in the project root directory. It can be configured
with Maven profile settings and so on like the direct Maven build. Example:

./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn
./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we use k8s? I kept bringing this up and that's because I can never spell Kubernetes properly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's k8s:// in the URL scheme we use and the package names are also k8s - so, users should never have to type the name.
This is one of the last few holdouts. I'd say that it's consistent here, with the use of other cluster manager names in full in their maven projects. I can change it here if you feel strongly about it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually second @rxin, I do get the spelling wrong at times :-)

Copy link
Contributor Author

@foxish foxish Dec 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you both also referring to the config options etc, which are still spark.kubernetes.*, or just the maven build target? If it's everything, it would be a fairly large change, doable certainly - but confirming how far the rename should go.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping @mridulm, @rxin - can you please confirm the scope of the renaming you were referring to here? Is it just the maven target? Changing all the config options etc would be a considerably large change at this point. Also, a point that was brought up today was - while k8s is common shorthand, it's not universal as is the full name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've filed https://issues.apache.org/jira/browse/SPARK-22853 to discuss this and unblock this PR. We should be able to reach consensus by release time. :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I don't think you need to block this pr with this.


This will build Spark distribution along with Python pip and R packages. For more information on usage, run `./dev/make-distribution.sh --help`

Expand Down Expand Up @@ -90,6 +90,10 @@ like ZooKeeper and Hadoop itself.
## Building with Mesos support

./build/mvn -Pmesos -DskipTests clean package

## Building with Kubernetes support

./build/mvn -Pkubernetes -DskipTests clean package

## Building with Kafka 0.8 support

Expand Down
7 changes: 2 additions & 5 deletions docs/cluster-overview.md
Original file line number Diff line number Diff line change
Expand Up @@ -52,11 +52,8 @@ The system currently supports three cluster managers:
* [Apache Mesos](running-on-mesos.html) -- a general cluster manager that can also run Hadoop MapReduce
and service applications.
* [Hadoop YARN](running-on-yarn.html) -- the resource manager in Hadoop 2.
* [Kubernetes (experimental)](https://github.com/apache-spark-on-k8s/spark) -- In addition to the above,
there is experimental support for Kubernetes. Kubernetes is an open-source platform
for providing container-centric infrastructure. Kubernetes support is being actively
developed in an [apache-spark-on-k8s](https://github.com/apache-spark-on-k8s/) Github organization.
For documentation, refer to that project's README.
* [Kubernetes](running-on-kubernetes.html) -- [Kubernetes](https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/)
is an open-source platform that provides container-centric infrastructure.

A third-party project (not supported by the Spark project) exists to add support for
[Nomad](https://github.com/hashicorp/nomad-spark) as a cluster manager.
Expand Down
2 changes: 2 additions & 0 deletions docs/configuration.md
Original file line number Diff line number Diff line change
Expand Up @@ -2376,6 +2376,8 @@ can be found on the pages for each mode:

#### [Mesos](running-on-mesos.html#configuration)

#### [Kubernetes](running-on-kubernetes.html#configuration)

#### [Standalone Mode](spark-standalone.html#cluster-launch-scripts)

# Environment Variables
Expand Down
Binary file added docs/img/k8s-cluster-mode.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
3 changes: 2 additions & 1 deletion docs/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -81,6 +81,7 @@ options for deployment:
* [Standalone Deploy Mode](spark-standalone.html): simplest way to deploy Spark on a private cluster
* [Apache Mesos](running-on-mesos.html)
* [Hadoop YARN](running-on-yarn.html)
* [Kubernetes](running-on-kubernetes.html)

# Where to Go from Here

Expand Down Expand Up @@ -112,7 +113,7 @@ options for deployment:
* [Mesos](running-on-mesos.html): deploy a private cluster using
[Apache Mesos](http://mesos.apache.org)
* [YARN](running-on-yarn.html): deploy Spark on top of Hadoop NextGen (YARN)
* [Kubernetes (experimental)](https://github.com/apache-spark-on-k8s/spark): deploy Spark on top of Kubernetes
* [Kubernetes](running-on-kubernetes.html): deploy Spark on top of Kubernetes

**Other Documents:**

Expand Down
578 changes: 578 additions & 0 deletions docs/running-on-kubernetes.md

Large diffs are not rendered by default.

4 changes: 3 additions & 1 deletion docs/running-on-yarn.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,9 @@ Spark application's configuration (driver, executors, and the AM when running in

There are two deploy modes that can be used to launch Spark applications on YARN. In `cluster` mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In `client` mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.

Unlike [Spark standalone](spark-standalone.html) and [Mesos](running-on-mesos.html) modes, in which the master's address is specified in the `--master` parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration. Thus, the `--master` parameter is `yarn`.
Unlike other cluster managers supported by Spark in which the master's address is specified in the `--master`
parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration.
Thus, the `--master` parameter is `yarn`.

To launch a Spark application in `cluster` mode:

Expand Down
16 changes: 16 additions & 0 deletions docs/submitting-applications.md
Original file line number Diff line number Diff line change
Expand Up @@ -127,6 +127,16 @@ export HADOOP_CONF_DIR=XXX
http://path/to/examples.jar \
1000

# Run on a Kubernetes cluster in cluster deploy mode
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master k8s://xx.yy.zz.ww:443 \
--deploy-mode cluster \
--executor-memory 20G \
--num-executors 50 \
http://path/to/examples.jar \
1000

{% endhighlight %}

# Master URLs
Expand Down Expand Up @@ -155,6 +165,12 @@ The master URL passed to Spark can be in one of the following formats:
<code>client</code> or <code>cluster</code> mode depending on the value of <code>--deploy-mode</code>.
The cluster location will be found based on the <code>HADOOP_CONF_DIR</code> or <code>YARN_CONF_DIR</code> variable.
</td></tr>
<tr><td> <code>k8s://HOST:PORT</code> </td><td> Connect to a <a href="running-on-kubernetes.html">Kubernetes</a> cluster in
<code>cluster</code> mode. Client mode is currently unsupported and will be supported in future releases.
The <code>HOST</code> and <code>PORT</code> refer to the [Kubernetes API Server](https://kubernetes.io/docs/reference/generated/kube-apiserver/).
It connects using TLS by default. In order to force it to use an unsecured connection, you can use
<code>k8s://http://HOST:PORT</code>.
</td></tr>
</table>


Expand Down
68 changes: 68 additions & 0 deletions sbin/build-push-docker-images.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,68 @@
#!/usr/bin/env bash

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# This script builds and pushes docker images when run from a release of Spark
# with Kubernetes support.

declare -A path=( [spark-driver]=kubernetes/dockerfiles/driver/Dockerfile \
[spark-executor]=kubernetes/dockerfiles/executor/Dockerfile )

function build {
docker build -t spark-base -f kubernetes/dockerfiles/spark-base/Dockerfile .
for image in "${!path[@]}"; do
docker build -t ${REPO}/$image:${TAG} -f ${path[$image]} .
done
}


function push {
for image in "${!path[@]}"; do
docker push ${REPO}/$image:${TAG}
done
}

function usage {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So likely not for this PR, but I'm wondering for the future what you think about the idea of extending this to package up a usable Python env so we can solve the dependency management issue as well?

Really excited to see the progress :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do already do that in our fork - and when we get PySpark and R submitted (in Spark 2.4 hopefully), we will extend this script.

echo "This script must be run from a runnable distribution of Apache Spark."
echo "Usage: ./sbin/build-push-docker-images.sh -r <repo> -t <tag> build"
echo " ./sbin/build-push-docker-images.sh -r <repo> -t <tag> push"
echo "for example: ./sbin/build-push-docker-images.sh -r docker.io/myrepo -t v2.3.0 push"
}

if [[ "$@" = *--help ]] || [[ "$@" = *-h ]]; then
usage
exit 0
fi

while getopts r:t: option
do
case "${option}"
in
r) REPO=${OPTARG};;
t) TAG=${OPTARG};;
esac
done

if [ -z "$REPO" ] || [ -z "$TAG" ]; then
usage
else
case "${@: -1}" in
build) build;;
push) push;;
*) usage;;
esac
fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add extra empty line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done