Spark on Yarn

vnnv01 edited this page Mar 6, 2018 · 1 revision

If you are not familiar with the basic characteristics of a PSTL Job, please refer to the Job Overview Guide.

If you are not familiar with the basics of launching PSTL jobs, please refer to the Job Launching Guide

If you are not familiar with the basics of YARN, please refer to the Yarn Documentation. If you are typically a CLI user, you may find the Yarn Commands Documentation a useful reference.

What is YARN?

Yet Another Resource Negotiator (YARN) is a cluster management technology. It was originally described by Apache as a redesigned resource manager, but it is now typified as a large-scale, distributed operating system for big data applications. Please refer to Apache YARN for more details.

Why YARN?

Using YARN as Spark's cluster manager instead of Mesos or Spark Standalone has a few advantages: the number of executors can be chosen for each job, and, most importantly, YARN is the only cluster manager that supports Spark's security features. With Spark on YARN, Spark can run against Kerberized clusters and use secure authentication between its processes. A few more benefits are found here.

Any Spark application can be submitted to a Hadoop YARN cluster by using the yarn master URL.

./bin/spark-submit --master yarn ...

Unless the class org.apache.spark.deploy.yarn.Client is on your classpath (that is, unless your Spark distribution was compiled with YARN support), you will see errors in your logs and Spark will exit:

Error: Could not load YARN classes. This copy of Spark may not have been compiled with YARN support.

More information about the master URL, spark-submit, and YARN options can be found here. For information on how to launch Spark on YARN, how to add jar files, etc., please refer to this link.
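As a concrete sketch, a typical submission to YARN in cluster mode might look like the following. The class name, jar file, and resource sizes below are placeholders for illustration, not values prescribed by this guide:

```shell
# Submit in cluster mode: the driver runs inside a YARN ApplicationMaster
# on the cluster rather than on the submitting machine.
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyPstlJob \
  --num-executors 4 \
  --executor-memory 2g \
  --executor-cores 2 \
  my-pstl-job.jar
```

Use `--deploy-mode client` instead if you want the driver to run locally, for example during interactive debugging.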

If an application needs to interact with secured Hadoop filesystems, the tokens needed to access those clusters must be requested explicitly at launch time. This is done by listing the filesystems in the spark.yarn.access.hadoopFileSystems property:

spark.yarn.access.hadoopFileSystems hdfs://ireland.example.org:8020/,webhdfs://frankfurt.example.org:50070/

Instructions for configuring Spark's external shuffle service on each NodeManager in the YARN cluster can be found here.
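For reference, the external shuffle service is registered as a YARN auxiliary service in each NodeManager's yarn-site.xml. A minimal sketch follows, using the standard property names for Spark's YARN shuffle service; verify them against the documentation for your Spark version:

```xml
<!-- yarn-site.xml on each NodeManager -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
```

The spark-&lt;version&gt;-yarn-shuffle.jar must also be on each NodeManager's classpath, and the NodeManagers restarted, for the service to start.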

On secure clusters, or whenever the application UI is disabled, it is recommended to use the Spark History Server application page as the tracking URL for running applications.

Set allowTracking to true in the Spark configuration on the application side to tell Spark to use the History Server's URL as the tracking URL when the application UI is disabled:

spark.yarn.historyServer.allowTracking=true

On the Spark History Server side, org.apache.spark.deploy.yarn.YarnProxyRedirectFilter should be added to the list of filters in spark.ui.filters; this lets the History Server stand in for the Spark web UI of running applications.
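Putting the two pieces together, the configuration might look like the following sketch; exact file locations depend on your deployment:

```properties
# spark-defaults.conf on the application (submitting) side:
spark.yarn.historyServer.allowTracking  true

# spark-defaults.conf on the Spark History Server:
spark.ui.filters  org.apache.spark.deploy.yarn.YarnProxyRedirectFilter
```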

After Spark on YARN is set up, the next step is YARN Job Management.