Apache Spark Features

(Since version 0.6.7) The Hadoop Plugin comes with features that should make it much easier for you to quickly run Apache Spark programs.

The main one is the ability to quickly run Spark programs on your Hadoop cluster through a gateway machine. Having a gateway machine is a common setup for Hadoop clusters, especially secure clusters (this is the setup we have at LinkedIn).

The Plugin's Spark tasks will build your project, rsync the contents of a zip archive to the gateway, and run your Spark program there.

If for some reason you need to disable these Spark features, you can pass -PdisableSparkPlugin on the Gradle command line or add disableSparkPlugin=true to your gradle.properties file.
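
For example, assuming your project invokes Gradle through the wrapper script (a plain gradle invocation works the same way), a one-off build with the Spark features disabled looks like this:

# Disable the Spark features for a single build
./gradlew build -PdisableSparkPlugin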

Set Up the Plugin

If you are using the Hadoop Plugin at LinkedIn, you don't need to do anything to set it up! By default, the Plugin comes set up out of the box to run Spark programs on the cluster gateway.

In the out-of-the-box setup, the Plugin will use the directory ~/.hadoopPlugin on your local box and the directory /export/home/${user.name}/.hadoopPlugin on the gateway for its working files.

Customize the Plugin Setup

If you wish to customize the setup, add a file called .sparkProperties to your Hadoop Plugin project with the following contents:

# Example custom setup that runs on the cluster gateway at LinkedIn. In the file <projectDir>/.sparkProperties:
sparkCacheDir=/home/your-user-name/.hadoopPlugin
sparkCommand=/export/apps/spark/latest/bin/spark-submit
 
remoteCacheDir=/export/home/your-user-name/.hadoopPlugin
remoteHostName=theGatewayNode.linkedin.com
remoteSshOpts=-q -K

We recommend adding this file to your project's .gitignore, so that every developer on the project can have their own .sparkProperties file.
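
For example, from <projectDir> you could add the entry like this:

# Keep per-developer Spark settings out of source control
echo '.sparkProperties' >> .gitignore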

  • sparkCacheDir - Directory on your local machine that will be rsync'd to the remote machine
  • sparkCommand - Command that runs Apache Spark
  • remoteCacheDir - Directory on the remote machine that will be rsync'd with your local machine
  • remoteHostName - Name of the remote machine on which you will run Spark
  • remoteSshOpts - ssh options that will be used when logging into the remote machine
Set Up the Plugin to Run Spark Locally

If you have a local install of Apache Spark, you can set up the Plugin to call your local install instead of Spark on a remote machine.

# Example setup to run on a local install of Spark. In the file <projectDir>/.sparkProperties:
sparkCacheDir=/home/<userName>/.hadoopPlugin
sparkCommand=/home/<userName>/app/spark-submit
 
# Must blank these out to override the out-of-the-box LinkedIn setup for the cluster gateway
remoteCacheDir=
remoteHostName=
remoteSshOpts=
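
Before running any jobs, it can be worth checking that sparkCommand points at a working spark-submit. A quick sanity check, using the hypothetical path from the example above:

# Should print the version banner of your local Spark install
/home/<userName>/app/spark-submit --version
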
Run Spark Jobs on the Gateway

Now that you have the Plugin set up, you can quickly run Spark programs on the gateway by configuring them with the Hadoop DSL.

showSparkJobs Task

Running this task will show all Spark jobs configured with the Hadoop DSL.
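
Assuming the Gradle wrapper, the invocation is simply:

# Lists the fully qualified names of all configured Spark jobs
./gradlew showSparkJobs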

runSparkJob -PjobName=<jobName> -PzipTaskName=<zipTaskName> Task

This task runs a Spark job configured with the Hadoop DSL.

You must pass the fully qualified name of the Spark job you want to run as the job name. The fully qualified job names may not be obvious, so run the showSparkJobs task to see them. To see examples of how to configure Spark jobs with the Hadoop DSL, see the Hadoop DSL Language Reference.

You must also pass the name of a Gradle Zip task. The archive produced by the Zip task will be uploaded to the gateway and extracted there. The zip should contain all the files necessary to run the Spark job.

Although you can use the name of any Gradle Zip task, you usually want to use the name of a Zip task generated by configuring the hadoopZip block. To find the names of these tasks, see the section on the generated Zip tasks at Hadoop-Zip-Artifact-Tasks.
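
Putting this together, a complete invocation might look like the following sketch. The job name workflow1_countByCountry and the Zip task name azkabanHadoopZip are hypothetical; substitute the names reported by showSparkJobs and by your own hadoopZip block:

# Build the zip, rsync it to the gateway, and spark-submit the job there
./gradlew runSparkJob -PjobName=workflow1_countByCountry -PzipTaskName=azkabanHadoopZip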

Known Issues

You should occasionally blow away the local directory ~/.hadoopPlugin and the directory ${remoteCacheDir} on the gateway, or they will keep growing. If you get stuck in a series of errors that suggest you have the wrong dependencies, try this as well.
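
A sketch of that cleanup, assuming the default local directory and the gateway host and cache directory from the .sparkProperties examples above (substitute your own values):

# Clear the local and remote plugin caches
rm -rf ~/.hadoopPlugin
ssh theGatewayNode.linkedin.com 'rm -rf /export/home/your-user-name/.hadoopPlugin'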

When you run on the gateway, you will see the message tcgetattr: Invalid argument go by. This message is safe to ignore.