The GPU Compute mode SPANK plugin for Slurm allows users to choose the compute mode of GPUs they submit jobs to.
NVIDIA GPUs can be set to operate under different compute modes:
Default
(shared): Multiple host threads can use the device at the same time.Exclusive-process
: Only one CUDA context may be created on the device across all processes in the system.Prohibited
: No CUDA context can be created on the device.
More information is available in the CUDA Programming guide
To ensure optimal performance, process isolation and avoid situations where
multiple processes unintentionally run on the same GPU, it's often recommended
to set the GPU compute mode to Exclusive-process
.
Unfortunately, some legacy applications may require to run concurrent processes
on a single GPU, to function properly, and thus need the GPUs to be set in the
Default
(i.e. shared) compute mode.
Hence the need for a mechanism that would allow users to choose the compute mode of the GPUs their job will run on.
Requirements:
slurm-spank-lua
(submission and execution nodes)nvidia-smi
(execution nodes)
The Slurm SPANK GPU Compute Mode plugin is written in Lua
and require the
slurm-spank-lua
plugin to
work.
Once the slurm-spank-lua
plugin is installed and configured, the
gpu_cmode.lua
script can be dropped in the appropriate directory (by default,
/etc/slurm/lua.d
, or any other location specified in the plugstack.conf
configuration file).
Note that both the Slurm SPANK Lua plugin and the GPU Compute mode Lua plugin
will need to be present on both the submission host(s) (where
srun
/sbatch
commands are executed) and the execution host(s) (where the job
actually runs).
Note: The plugin defines a default GPU compute mode (exclusive
), which is
used to re-set the GPUs at the end of a job. The default mode can be changed
by editing the value of default_cmode
in the script.
The Slurm SPANK GPU Compute Mode plugin introduces a new option to srun
and
sbatch
: --gpu_cmode
.
$ srun --help
[...]
--gpu_cmode=<shared|exclusive|prohibited>
Set the GPU compute mode on the allocated GPUs to
shared, exclusive or prohibited. Default is
exclusive
[...]
--gpu_cmode=shared
$ srun --gres gpu:1 --gpu_cmode=shared "nvidia-smi --query-gpu=compute_mode --format=csv,noheader"
Default
--gpu_cmode=exclusive
$ srun --gres gpu:1 --gpu_cmode=exclusive nvidia-smi --query-gpu=compute_mode --format=csv,noheader
Exclusive_Process
--gpu_cmode=prohibited
$ srun --gres gpu:1 --gpu_cmode=prohibited nvidia-smi --query-gpu=compute_mode --format=csv,noheader
Prohibited
$ srun --gres gpu:4 --gpu_cmode=shared nvidia-smi --query-gpu=compute_mode --format=csv,noheader
Default
Default
Default
Default
$ srun -l -N 2 --ntasks-per-node=1 --gres gpu:1 --gpu_cmode=shared nvidia-smi --query-gpu=compute_mode --format=csv,noheader
1: Default
0: Default
NB: If the --gpu_cmode
option is not used, no modification will be made
to the current compute mode of the GPUs, and the site default will be used.