Spark on Yarn
If you are not familiar with the basic characteristics of a PSTL job, please refer to the Job Overview Guide.
If you are not familiar with the basics of launching PSTL jobs, please refer to the Job Launching Guide.
If you are not familiar with the basics of YARN, please refer to the YARN Documentation. If you are typically a CLI user, you may find the YARN Commands Documentation a useful reference.
Yet Another Resource Negotiator (YARN) is Hadoop's cluster management technology. It was originally described by Apache as a redesigned resource manager, but is now typified as a large-scale, distributed operating system for big data applications. Please refer to Apache YARN for more details.
Using YARN as Spark's cluster manager instead of Mesos or Spark Standalone has a few advantages: the number of executors in the cluster can be chosen, and, most importantly, YARN is the only cluster manager that supports Spark's security features. With Spark on YARN, Spark can run against Kerberized clusters and use secure authentication between its processes. A few more benefits can be found here.
All Spark applications can be submitted to a Hadoop YARN cluster by using the `yarn` master URL:

```shell
./bin/spark-submit --master yarn ...
```
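A fuller invocation might look like the following sketch. The deploy mode, resource sizes, queue name, application class, and jar name here are illustrative placeholders, not PSTL defaults:

```shell
# Submit an application to YARN in cluster mode.
# Executor counts/sizes and the queue are examples only;
# tune them for your cluster.
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 4g \
  --executor-cores 2 \
  --queue default \
  --class com.example.MyApp \
  my-app.jar
```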
Unless the class `org.apache.spark.deploy.yarn.Client` is on your classpath (that is, unless your Spark distribution was compiled with YARN support), you will see errors in your logs and Spark will exit:

```
Error: Could not load YARN classes. This copy of Spark may not have been compiled with YARN support.
```
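If you need to build Spark with YARN support yourself, the standard Apache Spark build enables it with the `yarn` Maven profile. The exact set of profiles (for example, a matching Hadoop profile) depends on your environment:

```shell
# Build a Spark distribution with YARN support enabled.
# Add a Hadoop profile/version matching your cluster as needed.
./build/mvn -Pyarn -DskipTests clean package
```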
More information about master URLs, spark-submit, and YARN options can be found here. For information on how to launch Spark on YARN, how to add jar files, etc., please refer to this link.
If an application needs to interact with secured Hadoop filesystems, the tokens needed to access those clusters must be requested explicitly at launch time. This is done by listing them in the `spark.yarn.access.hadoopFileSystems` property:

```
spark.yarn.access.hadoopFileSystems hdfs://ireland.example.org:8020/,webhdfs://frankfurt.example.org:50070/
```
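The same property can also be supplied at launch time with `--conf`. The filesystem URIs below are the illustrative hosts from the example above:

```shell
# Request delegation tokens for two secured filesystems at launch time.
./bin/spark-submit --master yarn \
  --conf spark.yarn.access.hadoopFileSystems=hdfs://ireland.example.org:8020/,webhdfs://frankfurt.example.org:50070/ \
  ...
```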
Instructions for configuring Spark's shuffle service on each NodeManager in the YARN cluster can be found here. Some important links for Spark on YARN:
- Troubleshooting Kerberos
- Running Spark Applications on YARN
- Resource Planning for Spark on Yarn
- Using Spark to integrate to Yarn
On secure clusters where the application UI is disabled, it is recommended to use the Spark History Server application page as the tracking URL for running applications. To do this, set `spark.yarn.historyServer.allowTracking` to `true` in the application's Spark configuration; this tells Spark to use the History Server's URL as the tracking URL when the application UI is unavailable:

```
spark.yarn.historyServer.allowTracking=true
```
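For example, a minimal application-side configuration (e.g. in `spark-defaults.conf`) might pair this flag with the History Server's address. The host and port below are placeholders:

```
# Use the History Server URL as the YARN tracking URL
# when the application UI is disabled.
spark.yarn.historyServer.allowTracking  true
spark.yarn.historyServer.address        historyserver.example.org:18080
```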
On the Spark History Server, `org.apache.spark.deploy.yarn.YarnProxyRedirectFilter` should be added to the list of filters in the `spark.ui.filters` configuration; this helps replace Spark's web UI with the Spark History Server.
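A minimal sketch of the corresponding History Server configuration entry:

```
# Enable the YARN proxy redirect filter on the History Server
# so requests for running applications are handled correctly.
spark.ui.filters  org.apache.spark.deploy.yarn.YarnProxyRedirectFilter
```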
After setting up Spark on YARN, the next step is YARN Job Management.