
Ensure local SSD filesystem is assembled into a RAID even upon power off/on cycles #3129

Merged
merged 2 commits into GoogleCloudPlatform:develop from localssd_fail_slurmd
Oct 18, 2024

Conversation

tpdownes
Member

@tpdownes tpdownes commented Oct 14, 2024

The solution in #2720 has a known shortcoming for the case of powering off and powering on a node within a Slurm cluster without retaining the contents of the local SSD disks. In this case, the persistent disk is retained and Slurm startup scripts do not re-run the Ansible playbook that idempotently creates and assembles the RAID before formatting it.

This PR closes that gap by replacing the relevant Ansible tasks with a SystemD unit that performs the same function idempotently. It is guaranteed to run after local filesystems are mounted and does not act if the local SSD volume was successfully mounted. It is also guaranteed to complete execution before Slurmd starts. If it fails, it does not block slurmd.service execution. This is an intentional choice as I believe we should explore general designs for blocking Slurmd upon failure to mount filesystems required for workflows.
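For context, a minimal sketch of what such a unit could look like is shown here, reconstructed from the systemctl status output in the Testing section below. The Exec commands and the ConditionPathIsMountPoint value are taken directly from that output; the ordering directives, Type=, and [Install] target are assumptions about one way to express the behavior described above, not the literal unit added by this PR.

# /etc/systemd/system/create-localssd-raid.service (sketch)
[Unit]
Description=Create and format the local SSD RAID array when it is not already mounted
# Skip entirely when the local SSD filesystem is already mounted (the reboot case)
ConditionPathIsMountPoint=!/mnt/localssd
# Ordering only (no Requires=), so a failure here does not block slurmd.service
After=local-fs.target
Before=slurmd.service

[Service]
Type=oneshot
# Commands as reported by systemctl status below; --raid-devices matches local_ssd_count
ExecStart=/usr/bin/bash -c '/usr/sbin/mdadm --create /dev/md/localssd --name=localssd --homehost=any --level=0 --raid-devices=2 /dev/disk/by-id/google-local-nvme-ssd-*'
ExecStartPost=/usr/sbin/mkfs -t ext4 -m 0 /dev/md/localssd

[Install]
WantedBy=multi-user.target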

Testing

Tests were performed on the community/examples/hpc-slurm-local-ssd.yaml example, which uses Rocky Linux 8. This should be a worst-case scenario because part of the implementation relies upon the SystemD directive ConditionPathIsMountPoint, which was not introduced until SystemD 244. It appears to have been backported to SystemD 239 in Rocky Linux 8. Manual inspection of other Linux distributions shows that all current distributions use SystemD 244 or above.
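For anyone repeating this check on another image, one way to confirm both the SystemD version and that the directive is actually recognized (my own verification approach, not something added by this PR) is:

# Report the running SystemD version (Rocky Linux 8 reports 239 plus backports)
systemctl --version | head -n1

# Parse the unit; a SystemD build that lacks the directive typically warns about an
# unknown lvalue 'ConditionPathIsMountPoint' rather than enforcing the condition
systemd-analyze verify /etc/systemd/system/create-localssd-raid.service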

Additional reboot and power off/on testing performed with this blueprint. Summary:

  • Working as intended (WAI) on Ubuntu 20.04 and 22.04, and Debian 11 and 12
  • Ansible fails to install on Rocky Linux 9 and Ubuntu 24.04

For Rocky Linux 9, our Ansible installer affirmatively quits upon seeing release 9:

Oct 18 15:28:53 r9-0 google_metadata_script_runner[1769]: startup-script: Unsupported version of centos/RHEL/Rocky

For Ubuntu 24.04, the problem traces to the new default of Python 3.12. See, e.g., https://stackoverflow.com/a/78464477:

ext_tpdownes_google_com@u22-0:~$ sudo apt install python3-distutils
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Package python3-distutils is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package 'python3-distutils' has no installation candidate

The Ansible installer is in need of modernization now that CentOS is EOL. I will ensure the documentation on this is up to date, but it's unrelated to this PR.

---

blueprint_name: test-ssd

vars:
  deployment_name: test-ssd
  project_id: toolkit-demo-zero-e913
  region: us-central1
  zone: us-central1-c


deployment_groups:
- group: primary
  modules:
  - id: network
    source: modules/network/vpc

  - id: script
    source: modules/scripts/startup-script
    settings:
      local_ssd_filesystem:
        mountpoint: /mnt/localssd

  - id: deb
    source: modules/compute/vm-instance
    use:
    - network
    - script
    settings:
      machine_type: c2-standard-4
      instance_image:
        project: debian-cloud
        family: debian-11
      local_ssd_count: 1
      name_prefix: deb

  - id: d12
    source: modules/compute/vm-instance
    use:
    - network
    - script
    settings:
      machine_type: c2-standard-4
      instance_image:
        project: debian-cloud
        family: debian-12
      local_ssd_count: 1
      name_prefix: d12

  - id: u
    source: modules/compute/vm-instance
    use:
    - network
    - script
    settings:
      machine_type: c2-standard-4
      instance_image:
        project: ubuntu-os-cloud
        family: ubuntu-2004-lts
      local_ssd_count: 2
      name_prefix: u20

  - id: u22
    source: modules/compute/vm-instance
    use:
    - network
    - script
    settings:
      machine_type: c2-standard-4
      instance_image:
        project: ubuntu-os-cloud
        family: ubuntu-2204-lts
      local_ssd_count: 2
      name_prefix: u22

  - id: u24
    source: modules/compute/vm-instance
    use:
    - network
    - script
    settings:
      machine_type: c2-standard-4
      instance_image:
        project: ubuntu-os-cloud
        family: ubuntu-2404-lts-amd64
      local_ssd_count: 2
      name_prefix: u24

  - id: r9
    source: modules/compute/vm-instance
    use:
    - network
    - script
    settings:
      machine_type: c2-standard-4
      instance_image:
        project: rocky-linux-cloud
        family: rocky-linux-9-optimized-gcp
      local_ssd_count: 1
      name_prefix: r9

Reboot (local SSD contents retained)

In this scenario, the local SSD service does not execute because its start condition fails: the mountpoint is already mounted.

● create-localssd-raid.service
   Loaded: loaded (/etc/systemd/system/create-localssd-raid.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
Condition: start condition failed at Thu 2024-10-17 21:40:45 UTC; 5min ago
           └─ ConditionPathIsMountPoint=!/mnt/localssd was not met

Power off, Power on (discards local SSD contents, so requires a reformat)

In this scenario, we see the local SSD service execute and succeed before slurmd.service starts; a quick independent check of the ordering follows the log below.

[ext_tpdownes_google_com@hpclocalss-nodeset-0 ~]$ sudo systemctl status -l create-localssd-raid.service 
● create-localssd-raid.service
   Loaded: loaded (/etc/systemd/system/create-localssd-raid.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Thu 2024-10-17 21:52:32 UTC; 53s ago
  Process: 759 ExecStartPost=/usr/sbin/mkfs -t ext4 -m 0 /dev/md/localssd (code=exited, status=0/SUCCESS)
  Process: 698 ExecStart=/usr/bin/bash -c /usr/sbin/mdadm --create /dev/md/localssd --name=localssd --homehost=any --level=0 --raid-devices=2 /dev/disk/by-id/google-local-nvme-ssd-* (code=exited, status=0/SUCCESS)
 Main PID: 698 (code=exited, status=0/SUCCESS)

Oct 17 21:52:31 hpclocalss-nodeset-0 mkfs[759]: Superblock backups stored on blocks:
Oct 17 21:52:31 hpclocalss-nodeset-0 mkfs[759]:         32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
Oct 17 21:52:31 hpclocalss-nodeset-0 mkfs[759]:         4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
Oct 17 21:52:31 hpclocalss-nodeset-0 mkfs[759]:         102400000
Oct 17 21:52:31 hpclocalss-nodeset-0 mkfs[759]: [65B blob data]
Oct 17 21:52:31 hpclocalss-nodeset-0 mkfs[759]: [62B blob data]
Oct 17 21:52:32 hpclocalss-nodeset-0 mkfs[759]: Creating journal (262144 blocks): done
Oct 17 21:52:32 hpclocalss-nodeset-0 mkfs[759]: [99B blob data]
Oct 17 21:52:32 hpclocalss-nodeset-0 systemd[1]: create-localssd-raid.service: Succeeded.
Oct 17 21:52:32 hpclocalss-nodeset-0 systemd[1]: Started create-localssd-raid.service.
[ext_tpdownes_google_com@hpclocalss-nodeset-0 ~]$ sudo systemctl status slurmd.service
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmd.service.d
           └─overrides.conf
   Active: active (running) since Thu 2024-10-17 21:52:36 UTC; 1min 15s ago
 Main PID: 1185 (slurmd)
    Tasks: 1
   Memory: 8.4M
   CGroup: /system.slice/slurmd.service
           └─1185 /usr/local/sbin/slurmd --systemd --conf-server=hpclocalss-controller:6820-6830

Oct 17 21:52:32 hpclocalss-nodeset-0 systemd[1]: Starting Slurm node daemon...
Oct 17 21:52:34 hpclocalss-nodeset-0 slurmd[1185]: slurmd: CPU frequency setting not configured for this node
Oct 17 21:52:35 hpclocalss-nodeset-0 slurmd[1185]: slurmd: pyxis: version v0.19.0
Oct 17 21:52:35 hpclocalss-nodeset-0 slurmd[1185]: slurmd: slurmd version 24.05.3 started
Oct 17 21:52:36 hpclocalss-nodeset-0 slurmd[1185]: slurmd: slurmd started on Thu, 17 Oct 2024 21:52:36 +0000
Oct 17 21:52:36 hpclocalss-nodeset-0 slurmd[1185]: slurmd: CPUs=2 Boards=1 Sockets=1 Cores=2 Threads=1 Memory=16014 TmpDisk=50988 Uptime=27 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
Oct 17 21:52:36 hpclocalss-nodeset-0 systemd[1]: Started Slurm node daemon.
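The timestamps above already demonstrate the ordering. A quick independent check (my own, assuming the unit declares ordering against slurmd.service, e.g. via Before=) is to inspect the ordering properties SystemD resolved for slurmd:

# The RAID unit's Before=slurmd.service ordering is reflected in slurmd's After= property
systemctl show -p After slurmd.service | tr ' ' '\n' | grep localssd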


Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@tpdownes tpdownes self-assigned this Oct 14, 2024
@tpdownes tpdownes changed the title Ensure slurmd fails to start if local SSD filesystem is not mounted Ensure local SSD filesystem is assembled into a RAID even upon power off/on cycles Oct 17, 2024
@tpdownes tpdownes requested a review from alyssa-sm October 17, 2024 22:08
@tpdownes tpdownes marked this pull request as ready for review October 17, 2024 22:08
When the local SSD mountpoint has not been mounted, use SystemD to create
the RAID array and format it. This addresses the known behavior of the
Slurm-GCP solution, which does not re-run startup-scripts upon
a power off/on (or reboot) cycle. During a typical power off/on cycle,
the local SSD contents are discarded and the disks must be re-assembled
and formatted.
@tpdownes tpdownes force-pushed the localssd_fail_slurmd branch from ab54cbc to fa3f3a6 on October 17, 2024 22:11
@tpdownes tpdownes assigned alyssa-sm and unassigned tpdownes Oct 17, 2024
@tpdownes tpdownes added the release-key-new-features and release-improvements labels and removed the release-key-new-features label Oct 17, 2024
@alyssa-sm alyssa-sm assigned tpdownes and unassigned alyssa-sm Oct 17, 2024
@tpdownes tpdownes merged commit efbea12 into GoogleCloudPlatform:develop Oct 18, 2024
15 of 60 checks passed
@tpdownes tpdownes deleted the localssd_fail_slurmd branch October 18, 2024 18:40
@harshthakkar01 harshthakkar01 mentioned this pull request Oct 24, 2024
tpdownes added a commit to tpdownes/hpc-toolkit that referenced this pull request Nov 4, 2024
The local SSD RAID solution is written in Ansible which will
successfully handle re-creating the RAID array and mounting it in
scenarios where the VM has been re-created and the contents of local SSD
have been discarded. The Slurm solutions do not re-run startup scripts
after the first boot using a given persistent disk. During maintenance
events, the persistent disk is retained while the local SSD disks are
discarded. PR GoogleCloudPlatform#3129 addressed re-creating, formatting and mounting the
RAID array but left a gap in setting the mode of the mounted directory
after power off/on cycles. This PR refactors mounting and mode-setting
to resolve this gap.