
Ensure local SSD filesystem is assembled into a RAID even upon power off/on cycles #3129

Merged
merged 2 commits into GoogleCloudPlatform:develop from localssd_fail_slurmd
Oct 18, 2024

Conversation

tpdownes
Member

@tpdownes tpdownes commented Oct 14, 2024

The solution in #2720 has a known shortcoming for the case of powering off and powering on a node within a Slurm cluster without retaining the contents of the local SSD disks. In this case, the persistent disk is retained and Slurm startup scripts do not re-run the Ansible playbook that idempotently creates and assembles the RAID before formatting it.

This PR closes that gap by replacing the relevant Ansible tasks with a SystemD unit that performs the same function idempotently. It is guaranteed to run after local filesystems are mounted and does not act if the local SSD volume was successfully mounted. It is also guaranteed to complete execution before Slurmd starts. If it fails, it does not block slurmd.service execution. This is an intentional choice as I believe we should explore general designs for blocking Slurmd upon failure to mount filesystems required for workflows.
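For context, a minimal sketch of what such a unit could look like is shown here, reconstructed from the systemctl status output in the Testing section below. The Exec commands and the ConditionPathIsMountPoint value are taken directly from that output; the ordering directives, Type=, and [Install] target are assumptions about one way to express the behavior described above, not the literal unit added by this PR.

# /etc/systemd/system/create-localssd-raid.service (sketch)
[Unit]
Description=Create and format the local SSD RAID array when it is not already mounted
# Skip entirely when the local SSD filesystem is already mounted (the reboot case)
ConditionPathIsMountPoint=!/mnt/localssd
# Ordering only (no Requires=), so a failure here does not block slurmd.service
After=local-fs.target
Before=slurmd.service

[Service]
Type=oneshot
# Commands as reported by systemctl status below; --raid-devices matches local_ssd_count
ExecStart=/usr/bin/bash -c '/usr/sbin/mdadm --create /dev/md/localssd --name=localssd --homehost=any --level=0 --raid-devices=2 /dev/disk/by-id/google-local-nvme-ssd-*'
ExecStartPost=/usr/sbin/mkfs -t ext4 -m 0 /dev/md/localssd

[Install]
WantedBy=multi-user.target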

Testing

Tests were performed on the community/examples/hpc-slurm-local-ssd.yaml example, which uses Rocky Linux 8. This should be a worst-case scenario because part of the implementation relies upon the SystemD directive ConditionPathIsMountPoint, which was not introduced until SystemD 244. It appears to have been backported to SystemD 239 in Rocky Linux 8. Manual inspection of other Linux distributions shows that all current distributions use SystemD 244 or above.
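For anyone repeating this check on another image, one way to confirm both the SystemD version and that the directive is actually recognized (my own verification approach, not something added by this PR) is:

# Report the running SystemD version (Rocky Linux 8 reports 239 plus backports)
systemctl --version | head -n1

# Parse the unit; a SystemD build that lacks the directive typically warns about an
# unknown lvalue 'ConditionPathIsMountPoint' rather than enforcing the condition
systemd-analyze verify /etc/systemd/system/create-localssd-raid.service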

Additional reboot and power off/on testing performed with this blueprint. Summary:

  • Working as intended (WAI) on Ubuntu 20.04 and 22.04, and Debian 11 and 12
  • Ansible fails to install on Rocky Linux 9 and Ubuntu 24.04

For Rocky Linux 9, our Ansible installer affirmatively quits upon seeing release 9:

Oct 18 15:28:53 r9-0 google_metadata_script_runner[1769]: startup-script: Unsupported version of centos/RHEL/Rocky

For Ubuntu 24.04, the problem traces to the new default of Python 3.12. See, e.g., https://stackoverflow.com/a/78464477:

ext_tpdownes_google_com@u22-0:~$ sudo apt install python3-distutils
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Package python3-distutils is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package 'python3-distutils' has no installation candidate

The Ansible installer is in need of modernization now that CentOS is EOL. I will ensure the documentation on this is up to date, but it's unrelated to this PR.

---

blueprint_name: test-ssd

vars:
  deployment_name: test-ssd
  project_id: toolkit-demo-zero-e913
  region: us-central1
  zone: us-central1-c


deployment_groups:
- group: primary
  modules:
  - id: network
    source: modules/network/vpc

  - id: script
    source: modules/scripts/startup-script
    settings:
      local_ssd_filesystem:
        mountpoint: /mnt/localssd

  - id: deb
    source: modules/compute/vm-instance
    use:
    - network
    - script
    settings:
      machine_type: c2-standard-4
      instance_image:
        project: debian-cloud
        family: debian-11
      local_ssd_count: 1
      name_prefix: deb

  - id: d12
    source: modules/compute/vm-instance
    use:
    - network
    - script
    settings:
      machine_type: c2-standard-4
      instance_image:
        project: debian-cloud
        family: debian-12
      local_ssd_count: 1
      name_prefix: d12

  - id: u
    source: modules/compute/vm-instance
    use:
    - network
    - script
    settings:
      machine_type: c2-standard-4
      instance_image:
        project: ubuntu-os-cloud
        family: ubuntu-2004-lts
      local_ssd_count: 2
      name_prefix: u20

  - id: u22
    source: modules/compute/vm-instance
    use:
    - network
    - script
    settings:
      machine_type: c2-standard-4
      instance_image:
        project: ubuntu-os-cloud
        family: ubuntu-2204-lts
      local_ssd_count: 2
      name_prefix: u22

  - id: u24
    source: modules/compute/vm-instance
    use:
    - network
    - script
    settings:
      machine_type: c2-standard-4
      instance_image:
        project: ubuntu-os-cloud
        family: ubuntu-2404-lts-amd64
      local_ssd_count: 2
      name_prefix: u24

  - id: r9
    source: modules/compute/vm-instance
    use:
    - network
    - script
    settings:
      machine_type: c2-standard-4
      instance_image:
        project: rocky-linux-cloud
        family: rocky-linux-9-optimized-gcp
      local_ssd_count: 1
      name_prefix: r9

Reboot (local SSD contents retained)

In this scenario, the local SSD service does not execute because its start condition fails: the mountpoint is already mounted.

● create-localssd-raid.service
   Loaded: loaded (/etc/systemd/system/create-localssd-raid.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
Condition: start condition failed at Thu 2024-10-17 21:40:45 UTC; 5min ago
           └─ ConditionPathIsMountPoint=!/mnt/localssd was not met

Power off, Power on (discards local SSD contents, so requires a reformat)

In this scenario, we see the local SSD service execute and succeed before slurmd.service starts; a quick independent check of the ordering follows the log below.

[ext_tpdownes_google_com@hpclocalss-nodeset-0 ~]$ sudo systemctl status -l create-localssd-raid.service 
● create-localssd-raid.service
   Loaded: loaded (/etc/systemd/system/create-localssd-raid.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Thu 2024-10-17 21:52:32 UTC; 53s ago
  Process: 759 ExecStartPost=/usr/sbin/mkfs -t ext4 -m 0 /dev/md/localssd (code=exited, status=0/SUCCESS)
  Process: 698 ExecStart=/usr/bin/bash -c /usr/sbin/mdadm --create /dev/md/localssd --name=localssd --homehost=any --level=0 --raid-devices=2 /dev/disk/by-id/google-local-nvme-ssd-* (code=exited, status=0/SUCCESS)
 Main PID: 698 (code=exited, status=0/SUCCESS)

Oct 17 21:52:31 hpclocalss-nodeset-0 mkfs[759]: Superblock backups stored on blocks:
Oct 17 21:52:31 hpclocalss-nodeset-0 mkfs[759]:         32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
Oct 17 21:52:31 hpclocalss-nodeset-0 mkfs[759]:         4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
Oct 17 21:52:31 hpclocalss-nodeset-0 mkfs[759]:         102400000
Oct 17 21:52:31 hpclocalss-nodeset-0 mkfs[759]: [65B blob data]
Oct 17 21:52:31 hpclocalss-nodeset-0 mkfs[759]: [62B blob data]
Oct 17 21:52:32 hpclocalss-nodeset-0 mkfs[759]: Creating journal (262144 blocks): done
Oct 17 21:52:32 hpclocalss-nodeset-0 mkfs[759]: [99B blob data]
Oct 17 21:52:32 hpclocalss-nodeset-0 systemd[1]: create-localssd-raid.service: Succeeded.
Oct 17 21:52:32 hpclocalss-nodeset-0 systemd[1]: Started create-localssd-raid.service.
[ext_tpdownes_google_com@hpclocalss-nodeset-0 ~]$ sudo systemctl status slurmd.service
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmd.service.d
           └─overrides.conf
   Active: active (running) since Thu 2024-10-17 21:52:36 UTC; 1min 15s ago
 Main PID: 1185 (slurmd)
    Tasks: 1
   Memory: 8.4M
   CGroup: /system.slice/slurmd.service
           └─1185 /usr/local/sbin/slurmd --systemd --conf-server=hpclocalss-controller:6820-6830

Oct 17 21:52:32 hpclocalss-nodeset-0 systemd[1]: Starting Slurm node daemon...
Oct 17 21:52:34 hpclocalss-nodeset-0 slurmd[1185]: slurmd: CPU frequency setting not configured for this node
Oct 17 21:52:35 hpclocalss-nodeset-0 slurmd[1185]: slurmd: pyxis: version v0.19.0
Oct 17 21:52:35 hpclocalss-nodeset-0 slurmd[1185]: slurmd: slurmd version 24.05.3 started
Oct 17 21:52:36 hpclocalss-nodeset-0 slurmd[1185]: slurmd: slurmd started on Thu, 17 Oct 2024 21:52:36 +0000
Oct 17 21:52:36 hpclocalss-nodeset-0 slurmd[1185]: slurmd: CPUs=2 Boards=1 Sockets=1 Cores=2 Threads=1 Memory=16014 TmpDisk=50988 Uptime=27 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
Oct 17 21:52:36 hpclocalss-nodeset-0 systemd[1]: Started Slurm node daemon.
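The timestamps above already demonstrate the ordering. A quick independent check (my own, assuming the unit declares ordering against slurmd.service, e.g. via Before=) is to inspect the ordering properties SystemD resolved for slurmd:

# The RAID unit's Before=slurmd.service ordering is reflected in slurmd's After= property
systemctl show -p After slurmd.service | tr ' ' '\n' | grep localssd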


Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@tpdownes tpdownes self-assigned this Oct 14, 2024
@tpdownes tpdownes changed the title Ensure slurmd fails to start if local SSD filesystem is not mounted Ensure local SSD filesystem is assembled into a RAID even upon power off/on cycles Oct 17, 2024
@tpdownes tpdownes requested a review from alyssa-sm October 17, 2024 22:08
@tpdownes tpdownes marked this pull request as ready for review October 17, 2024 22:08
When the local SSD mountpoint has not been mounted, use SystemD to create
the RAID array and format it. This addresses the known behavior of the
Slurm-GCP solution, which does not re-run startup-scripts upon
a power off/on (or reboot) cycle. During a typical power off/on cycle,
the local SSD contents are discarded and the disks must be re-assembled
and formatted.
@tpdownes tpdownes force-pushed the localssd_fail_slurmd branch from ab54cbc to fa3f3a6 on October 17, 2024 22:11
@tpdownes tpdownes assigned alyssa-sm and unassigned tpdownes Oct 17, 2024
@tpdownes tpdownes added the release-key-new-features and release-improvements labels and removed the release-key-new-features label Oct 17, 2024
@alyssa-sm alyssa-sm assigned tpdownes and unassigned alyssa-sm Oct 17, 2024
@tpdownes tpdownes merged commit efbea12 into GoogleCloudPlatform:develop Oct 18, 2024
15 of 60 checks passed
@tpdownes tpdownes deleted the localssd_fail_slurmd branch October 18, 2024 18:40
@harshthakkar01 harshthakkar01 mentioned this pull request Oct 24, 2024
tpdownes added a commit to tpdownes/hpc-toolkit that referenced this pull request Nov 4, 2024
The local SSD RAID solution is written in Ansible which will
successfully handle re-creating the RAID array and mounting it in
scenarios where the VM has been re-created and the contents of local SSD
have been discarded. The Slurm solutions do not re-run startup scripts
after the first boot using a given persistent disk. During maintenance
events, the persistent disk is retained while the local SSD disks are
discarded. PR GoogleCloudPlatform#3129 addressed re-creating, formatting and mounting the
RAID array but left a gap in setting the mode of the mounted directory
after power off/on cycles. This PR refactors mounting and mode-setting
to resolve this gap.