(3.0.0‐3.8.0) Interactive job submission through srun can fail after increasing the number of compute nodes in the cluster
ParallelCluster lets you extend the size of a cluster without requiring you to stop the compute fleet. Extending the size of a cluster includes adding new queues to the scheduler, adding new compute resources within a queue, or increasing the MaxCount of a compute resource.
According to the Slurm documentation quoted below, adding or removing nodes from a cluster requires restarting both the slurmctld daemon on the head node and the slurmd daemons on all the compute nodes.
From https://slurm.schedmd.com/slurm.conf.html#SECTION_NODE-CONFIGURATION:
The configuration of nodes (or machines) to be managed by Slurm is also specified in /etc/slurm.conf. Changes in node configuration (e.g. adding nodes, changing their processor count, etc.) require restarting both the slurmctld daemon and the slurmd daemons. All slurmd daemons must know each node in the system to forward messages in support of hierarchical communications.
The slurmctld on the head node is restarted during a cluster update operation, but the slurmd daemons running on the compute nodes are not.
While there is no impact on job submission through sbatch (nor on calls to srun within a batch job submitted via sbatch), not restarting slurmd on the compute fleet when adding new nodes to the cluster may affect direct srun interactive job submissions:
- srun jobs involving both new and old nodes, with communications from old to new nodes, are affected: if at any time an old node must propagate a Slurm RPC to a new node, this propagation fails, causing the whole srun to fail (see the sketch after this list);
- srun jobs landing only on new nodes are not impacted, because those nodes all started together and therefore know about each other;
- srun jobs landing only on old nodes are not impacted, because all those nodes already know each other;
- single-node srun jobs are not affected.
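As an illustration, here is a minimal sketch of the affected case, using hypothetical node names: q1-st-ttm-1 is a pre-existing node and q1-st-ttm-3 is a node added by increasing MaxCount without restarting slurmd on the existing nodes.
# Hypothetical example: q1-st-ttm-1 is an old node, q1-st-ttm-3 a newly added one.
# If q1-st-ttm-1 has to forward a Slurm RPC to q1-st-ttm-3, the interactive job
# fails with the fwd_tree_thread error shown below.
srun -N 2 --nodelist=q1-st-ttm-1,q1-st-ttm-3 hostname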
The error message shown when hitting this issue is:
srun: error: fwd_tree_thread: can't find address for host <hostname>, check slurm.conf
This issue affects:
- ParallelCluster 3.0.0 through 3.8.0
- Slurm scheduler
To avoid any possible issue with srun job submissions, the simplest mitigation is to stop and start the compute fleet:
pcluster update-compute-fleet -r <region> -n <cluster-name> --status STOP_REQUESTED
# Wait until all the compute nodes are DOWN
pcluster update-compute-fleet -r <region> -n <cluster-name> --status START_REQUESTED
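To know when the fleet is fully stopped before starting it again, you can poll the fleet status. A minimal sketch, assuming the describe-compute-fleet output reports a STOPPED status once all compute nodes are down:
# Sketch: poll the fleet status until it reports STOPPED, then issue START_REQUESTED
until pcluster describe-compute-fleet -r <region> -n <cluster-name> | grep -q STOPPED; do
  sleep 30
done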
Alternatively, you can follow SchedMD's guidance and restart slurmd on the active compute nodes.
You can retrieve the list of active nodes by issuing the sinfo command and filtering out the nodes that are not responding or are powered down:
[ec2-user@ip-172-31-29-2 ~]$ sinfo -t idle,alloc,allocated -h | grep -v "~" | tr -s " " | cut -d' ' -f6 > nodes.txt
[ec2-user@ip-172-31-29-2 ~]$ cat nodes.txt
q1-st-ttm-[1-2]
q2-st-tts-[1-2]
Then, using a parallel shell tool such as ClusterShell as in the example below, you can restart the slurmd daemon on each host:
[ec2-user@ip-172-31-29-2 ~]$ clush --hostfile ./nodes.txt -f 4 'sudo systemctl restart slurmd && echo "slurmd restarted on host $(hostname)"'
q1-st-ttm-1: slurmd restarted on host q1-st-ttm-1
q1-st-ttm-2: slurmd restarted on host q1-st-ttm-2
q2-st-tts-2: slurmd restarted on host q2-st-tts-2
q2-st-tts-1: slurmd restarted on host q2-st-tts-1
where -f N is the level of parallelism (fanout) you want to adopt.
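After the restart you may want to verify that slurmd is active everywhere and that multi-node interactive submissions work again. A quick sketch, reusing the same nodes.txt file:
# Verify the slurmd service is active on every node in the list
clush --hostfile ./nodes.txt -f 4 'systemctl is-active slurmd'
# Run a small multi-node interactive job to confirm RPC forwarding works
srun -N 2 hostname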