Release Slurm-GCP 6.5.6 #142

Merged
merged 2 commits into from
May 29, 2024

@tpdownes tpdownes commented May 29, 2024

6.5.6

  • Fix a bug where dynamic nodes whose hostname is set to the FQDN could not be found by the slurmsync script; in the current architecture, this affects login nodes. The resulting exception was uncaught. The fix truncates all node names to the short hostname before performing comparisons that require the short hostname.

Extends work performed in e33c07e.
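
As an illustration of the truncation described above, here is a minimal Python sketch. It is not the actual slurmsync code: the function names `short_hostname` and `find_node` and the sample node list are hypothetical, and the sketch assumes that the short hostname is everything before the first dot of the FQDN.

```python
from typing import Iterable, Optional


def short_hostname(nodename: str) -> str:
    """Return the part of a hostname before the first dot.

    A name that is already short (no dot) is returned unchanged.
    """
    return nodename.split(".", 1)[0]


def find_node(nodename: str, known_nodes: Iterable[str]) -> Optional[str]:
    """Look up a node by short hostname (hypothetical helper).

    Both sides of the comparison are truncated, so an FQDN reported by
    a login node still matches a short name in the known-node list.
    Returns None instead of raising when the node is not found.
    """
    wanted = short_hostname(nodename)
    for candidate in known_nodes:
        if short_hostname(candidate) == wanted:
            return candidate
    return None


# Hypothetical list of short node names known to the controller.
known = ["a3mega-login-001", "a3mega-a3meganodeset-0"]
match = find_node("a3mega-login-001.c.hpc-toolkit-gsc.internal", known)
```

Truncating both operands keeps the comparison symmetric, so it does not matter which side carries the FQDN.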

Observed results

✅ Test 1: confirm nodeset startup-script is working by activating a service at boot

$ srun -N64 systemctl is-active nvidia-dcgm.service
active
active
...

✅ Test 2: confirm the exception no longer appears in the slurmsync logs and the login node appears as idle in the output of sudo sinfo (the x-login partition is "root only")

$ sudo cat /var/log/slurm/slurmsync.log
2024-05-29 15:32:45,153 INFO: Restarting slurmctld to make changes take effect.
$ sudo sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
x-login      up   infinite      1   idle a3mega-login-001.c.hpc-toolkit-gsc.internal
a3mega*      up   infinite     64   idle a3mega-a3meganodeset-[0-63]
debug        up   infinite      4  idle~ a3mega-debugnodeset-[0-3]

@tpdownes tpdownes requested a review from mr0re1 May 29, 2024 15:56
@tpdownes
Member Author

Manually deploying examples/hpc-slurm.yaml from the HPC Toolkit shows no errors/exceptions in the sync log:

# cat slurmsync.log 
2024-05-29 16:49:03,412 INFO: Restarting slurmctld to make changes take effect.

@mr0re1 mr0re1 assigned tpdownes and unassigned mr0re1 May 29, 2024
@tpdownes tpdownes merged commit 48f46cb into master May 29, 2024
2 checks passed
@tpdownes tpdownes deleted the fix_hostname branch May 29, 2024 18:01