An Apache Spark container image. The image is meant to be used for creating a standalone cluster with multiple workers.
This image contains a script named start-spark (included in the PATH) that is used to initialize the master and the workers.
The custom commands require an HDFS user to be set. The user's name is read from the HDFS_USER environment variable, and the user is created automatically by the commands.
To start a master, run the following command:
start-spark master
To start a worker, run the following command:
start-spark worker [MASTER]
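For example, a master and a single worker could be started directly with docker run. This is a minimal sketch; the network name spark-net and the HDFS user name spark are illustrative assumptions, not part of the image:

# Create a user-defined bridge network so the worker can resolve the master by name.
docker network create spark-net

# Start the master, publishing its web UI (8080) and the driver port (7077).
docker run -d --name master --network spark-net \
  -e HDFS_USER=spark -p 8080:8080 -p 7077:7077 \
  derrickoswald/spark-docker start-spark master

# Start a worker that registers with the master by its container name.
docker run -d --name worker --network spark-net \
  -e HDFS_USER=spark \
  derrickoswald/spark-docker start-spark worker master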
The easiest way to create a standalone cluster with this image is by using Docker Compose. The following snippet can be used as a docker-compose.yml for a simple cluster:
version: "2"
services:
  master:
    image: derrickoswald/spark-docker
    command: start-spark master
    hostname: master
    ports:
      - "4040:4040"   # Cluster Manager Web UI
      - "6066:6066"   # Standalone Master REST port (spark.master.rest.port)
      - "7077:7077"   # Driver to Standalone Master, as in master = spark://sandbox:7077
      - "8020:8020"   # DFS Namenode IPC, e.g. hdfs dfs -fs hdfs://sandbox:8020 -ls
      - "8080:8080"   # Standalone Master Web UI
      - "8081:8081"   # Standalone Worker Web UI
      - "10000:10000" # Thriftserver JDBC port
      - "10001:10001" # Thriftserver HTTP protocol JDBC port
      - "9866:9866"   # DFS Datanode data transfer
      - "9870:9870"   # DFS Namenode Web UI
      - "9864:9864"   # DFS Datanode Web UI
  worker:
    image: derrickoswald/spark-docker
    command: start-spark worker master
    environment:
      SPARK_WORKER_CORES: 1
      SPARK_WORKER_MEMORY: 2g
    links:
      - master
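With this file in place, the cluster can be brought up in the background with:

docker-compose up -d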
The image has a volume mounted at /opt/hdfs. To maintain state between restarts, mount a volume at this location. This should be done for both the master and the workers.
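For example, named volumes could be added to the compose file above; the volume names master-hdfs and worker-hdfs are illustrative assumptions:

services:
  master:
    # ... image, command, hostname and ports as above ...
    volumes:
      - master-hdfs:/opt/hdfs
  worker:
    # ... image, command and environment as above ...
    volumes:
      - worker-hdfs:/opt/hdfs

volumes:
  master-hdfs:
  worker-hdfs:

Note that when the worker service is scaled (see below), all worker replicas share the same named volume; per-worker host bind mounts are an alternative in that case.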
If you wish to increase the number of workers, scale the worker service by running the scale command as follows:
docker-compose scale worker=2
The workers will automatically register themselves with the master.
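To check that the cluster is up, list the running containers:

docker-compose ps

Alternatively, open the Standalone Master Web UI (published on port 8080 above) at http://localhost:8080, which lists every registered worker.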