These templates set up a production-grade environment for a Spark cluster.

The cluster can be used to run ML, HPC, and data science workloads, to perform rapid calculations on demand, and so on.


Apache Spark is a general-purpose data processing tool. It is heavily used by data engineers and data scientists to run fast queries on large amounts of data, in the terabyte range. It competes with classic Hadoop MapReduce by using the RAM available in the cluster for faster job execution. Apache Beam is a related project: it offers a unified programming model whose pipelines can run on Spark, among other runners.

Prerequisites: set environment variables

The GCP project ID is auto-configured from the GOOGLE_CLOUD_PROJECT environment variable, among several other sources.

The OAuth2 credentials are auto-configured from the GOOGLE_APPLICATION_CREDENTIALS environment variable.

This variable can be set manually after generating a JSON file with your Google account credentials:

gcloud auth application-default login

The path to gcloud creds usually has the form:


where the variable $EMAIL can be obtained with the command:

gcloud config list account --format "value(core.account)"

Tip: add GOOGLE_APPLICATION_CREDENTIALS as a permanent variable in the /etc/environment file:

sudo -H gedit /etc/environment
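
Before running anything that needs the credentials, it can help to confirm that the variable is set and points to a readable file. The following is a minimal sketch; check_adc is a hypothetical helper name and is not part of gcloud.

```shell
# check_adc: hypothetical helper, not part of gcloud.
# Succeeds only if GOOGLE_APPLICATION_CREDENTIALS is set and names a readable file.
check_adc() {
    if [ -z "${GOOGLE_APPLICATION_CREDENTIALS:-}" ]; then
        echo "GOOGLE_APPLICATION_CREDENTIALS is not set" >&2
        return 1
    fi
    if [ ! -r "$GOOGLE_APPLICATION_CREDENTIALS" ]; then
        echo "Credentials file is not readable: $GOOGLE_APPLICATION_CREDENTIALS" >&2
        return 1
    fi
    echo "Using credentials from $GOOGLE_APPLICATION_CREDENTIALS"
}
```

Calling check_adc in a new terminal is a quick way to see whether the /etc/environment change has taken effect.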

Prerequisites: install a local Spark environment for testing/debugging

(1) Find the latest stable Spark distribution on the official Spark website. We are going to use Spark 3.2.2 with Scala 2.13. Note the coupling between the Spark and Scala versions: Spark depends on Scala, so the distribution must match the versions set in your pom.

In this case we are going to download and use spark-3.2.2-bin-hadoop3.2-scala2.13.tgz

(2) Download it to a local folder and verify the SHA-512 checksum (the two outputs below must match):

cd ~
sha512sum spark-3.2.2-bin-hadoop3.2-scala2.13.tgz
cat spark-3.2.2-bin-hadoop3.2-scala2.13.tgz.sha512
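
Comparing two 128-character digests by eye is error-prone, so the comparison can be scripted. The sketch below assumes the checksum file is either in the sha512sum format ("<hex>  <name>") or in the "<name>: <HEX groups>" format; verify_sha512 is a hypothetical helper name.

```shell
# verify_sha512: hypothetical helper that compares a file's SHA-512 digest
# with the digest stored in a companion checksum file.
# Assumption: the checksum file is either "<hex>  <name>" (sha512sum style)
# or "<name>: <HEX groups>", possibly wrapped over several lines.
verify_sha512() {
    file=$1
    sumfile=$2
    actual=$(sha512sum "$file" | awk '{print $1}')
    if grep -q ':' "$sumfile"; then
        # Drop the "<name>:" prefix, remove all whitespace, lowercase the hex.
        expected=$(sed 's/^[^:]*://' "$sumfile" | tr -d ' \t\r\n' | tr 'A-F' 'a-f')
    else
        # sha512sum style: the digest is the first field of the first line.
        expected=$(awk 'NR == 1 {print tolower($1)}' "$sumfile")
    fi
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file" >&2
        return 1
    fi
}

# Usage:
#   verify_sha512 spark-3.2.2-bin-hadoop3.2-scala2.13.tgz \
#                 spark-3.2.2-bin-hadoop3.2-scala2.13.tgz.sha512
```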

(3) Extract Spark to /opt

sudo mkdir -p /opt/spark
sudo tar -xf spark*.tgz -C /opt/spark --strip-components=1
sudo chmod -R 777 /opt/spark

Check the installation:

/opt/spark/bin/spark-shell --version

The following command lists everything you can run locally:

ls /opt/spark/bin/

Note: because the files now live under /opt/spark, Spark commands have to be invoked by their full path, for example /opt/spark/bin/spark-shell.

To avoid this, add the Spark directories to the system path:

# Single quotes keep $PATH and $SPARK_HOME unexpanded until ~/.bashrc is sourced
echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> ~/.bashrc
echo 'export PYSPARK_PYTHON=/usr/bin/python3' >> ~/.bashrc

source ~/.bashrc
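
Appending with echo adds the same lines again every time the setup is repeated. One way to keep ~/.bashrc clean is to append a line only when it is not already there. This is a minimal sketch; append_once is a hypothetical helper name.

```shell
# append_once: hypothetical helper that appends a line to a file
# only when that exact line is not already present.
append_once() {
    line=$1
    file=$2
    touch "$file"
    # -F: fixed string, -x: match the whole line, -q: no output
    if ! grep -Fxq -- "$line" "$file"; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Usage (single quotes keep the variables unexpanded until the file is sourced):
#   append_once 'export SPARK_HOME=/opt/spark' "$HOME/.bashrc"
#   append_once 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' "$HOME/.bashrc"
```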

Build and run the workload job locally

mvn clean package
/opt/spark/bin/spark-submit --class net.ddp.mapreduce.PiComputeApp ./spark-core/target/spark-core-1.0-SNAPSHOT.jar

The output of the last command (the one that actually runs the job) should contain lines like these:

22/07/18 12:28:31 INFO DAGScheduler: Job 0 finished: reduce at, took 1.640840 s
22/07/18 12:28:31 INFO PiComputeApp: Analyzing result in 3685 ms
22/07/18 12:28:31 INFO PiComputeApp: Pi is roughly 3.14034
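
To check the result in a script rather than by reading the log, the estimate can be pulled out of the "Pi is roughly" line. The sketch below uses two hypothetical helpers, extract_pi and pi_is_plausible; the 0.01 tolerance is an arbitrary choice for illustration.

```shell
# extract_pi: hypothetical helper that reads job output on stdin and prints
# the number from the "Pi is roughly <number>" line.
extract_pi() {
    sed -n 's/.*Pi is roughly \([0-9.][0-9.]*\).*/\1/p'
}

# pi_is_plausible: hypothetical helper that succeeds when its argument is
# within 0.01 of 3.14159 (tolerance chosen arbitrarily for this sketch).
pi_is_plausible() {
    awk -v v="$1" 'BEGIN { d = v - 3.14159; if (d < 0) d = -d; exit !(d < 0.01) }'
}
```

Assuming the job writes its log to stderr, as Spark's default console logging usually does, the spark-submit command could be piped through these helpers with 2>&1 | extract_pi.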

Create cloud infrastructure to run Spark workload in the cloud

The default images available for running your jobs may not support the runtime you need (for example Java 11 or 17). This problem can be solved with custom images.

(1) To build a custom image that can run a Java 11 workload, start by cloning the repository (as described in the guide):

git clone

Then, from the root directory, run the generate_custom_image script (choose an existing GCS bucket to hold the image; in our case it is gs://dataproc-cluster-custom-images):

python3 \
    --image-name "custom-debian10-java11" \
    --dataproc-version "2.0-debian10" \
    --disk-size 30 \
    --customization-script <path to terraform/data/> \
    --zone "us-central1-a" \
    --gcs-bucket "gs://dataproc-cluster-custom-images" \
    --shutdown-instance-timer-sec 500

Once the script finishes, check that the image was created successfully:

gcloud compute images list --filter="name=('custom-debian10-java11')"
gcloud compute images describe custom-debian10-java11

(2) Set the image_version variable to the value extracted from the output of the last command:

  goog-dataproc-version: 2-0-49-debian10
- '1001006'
name: custom-debian10-java11

(3) Install Terraform:

sudo -v
wget -O- | gpg --dearmor | sudo tee /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update && sudo apt install terraform=1.3.3

(4) Enable Dataproc

gcloud services enable dataproc.googleapis.com

(5) Create infrastructure using Terraform:


(6) Run terraform init to download the provider and create the .terraform directory, then review the plan and apply it:

terraform init
terraform plan
terraform apply
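
The steps above depend on several command-line tools being installed. A short preflight check can report which ones are missing before anything is created in the cloud. This is a minimal sketch; missing_tools is a hypothetical helper name.

```shell
# missing_tools: hypothetical helper that prints every command from its
# argument list that cannot be found on PATH, and fails if any is missing.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

# Usage:
#   missing_tools gcloud gsutil terraform mvn || echo "Install the tools listed above first" >&2
```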

Build and run the workload job in the cloud (on Dataproc)

Here is an example of how to run an uber JAR on a Dataproc cluster.

First, the compiled uber JAR has to be uploaded to GCS, because Spark in the cloud cannot run it directly from the local filesystem:

gsutil cp ./spark-core/target/spark-core-1.0-SNAPSHOT.jar gs://dataproc-cluster-0/spark-core-1.0-SNAPSHOT.jar

Then submit the job:

gcloud dataproc jobs submit spark \
    --cluster=dataproc-cluster-0-7f7a78317a21a70a \
    --region=us-central1 \
    --class=net.ddp.mapreduce.PiComputeApp \
    --jars=gs://dataproc-cluster-0/spark-core-1.0-SNAPSHOT.jar \
    -- 1000

The job should succeed and print output similar to what we saw earlier.
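
The upload destination and the --jars value must name the same object, so it can be safer to derive both from a single variable. A minimal sketch follows; gcs_jar_uri is a hypothetical helper name, and it assumes the JAR is copied to the root of the bucket, as in the commands above.

```shell
# gcs_jar_uri: hypothetical helper that builds the GCS URI a local JAR
# will have after being copied to the root of the given bucket.
gcs_jar_uri() {
    bucket=$1
    jar_path=$2
    printf 'gs://%s/%s\n' "$bucket" "$(basename "$jar_path")"
}

# Usage:
#   JAR=./spark-core/target/spark-core-1.0-SNAPSHOT.jar
#   URI=$(gcs_jar_uri dataproc-cluster-0 "$JAR")
#   gsutil cp "$JAR" "$URI"
#   ...then pass --jars="$URI" to gcloud dataproc jobs submit spark
```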

Appendix I

To connect remotely to the master node, use the command:

gcloud compute ssh --zone "us-central1-c" "dataproc-cluster-0-7f7a78317a21a70a-m"  --tunnel-through-iap --project <project name>

To increase the performance of the tunnel, consider installing NumPy. After installing NumPy, run the following command to allow gcloud to access external packages:

export CLOUDSDK_PYTHON_SITEPACKAGES=1


