Refactor mount/mode setting for local SSD RAID #3214

Merged

Conversation

tpdownes
Member

@tpdownes tpdownes commented Nov 4, 2024

The local SSD RAID solution is written in Ansible, which successfully handles re-creating the RAID array and mounting it in scenarios where the VM has been re-created and the contents of local SSD have been discarded. The Slurm solutions, however, do not re-run startup scripts after the first boot using a given persistent disk. During maintenance events, the persistent disk is retained while the local SSD disks are discarded. PR #3129 addressed re-creating, formatting and mounting the RAID array but left a gap in setting the mode of the mounted directory after power off/on cycles. This PR refactors mounting and mode-setting to resolve this gap.
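
For context on why the mode has to be re-applied: once a filesystem is mounted, the permissions seen at the mount point are those of the mounted filesystem's root directory, so a freshly formatted array typically comes up with default permissions regardless of what was set before. The following is a minimal shell sketch of an idempotent mode-setting step, not the Ansible code in this PR. The helper names and the default mode of 1777 are hypothetical, and it assumes GNU coreutils `stat`.

```shell
#!/bin/sh
# Sketch only: the PR implements this logic in Ansible. The function names
# and the 1777 default are assumptions for illustration.

# Normalize an octal mode string so that "0755" and "755" compare equal.
normalize_mode() {
  printf '%o\n' "0$1"
}

# Set the mode of a directory only when it differs from the wanted mode.
# Prints "changed" or "ok" so that repeated runs are easy to audit.
ensure_dir_mode() (
  dir=$1
  want=$(normalize_mode "${2:-1777}")
  have=$(stat -c '%a' "$dir") || exit 1
  if [ "$have" != "$want" ]; then
    chmod "$want" "$dir" || exit 1
    echo "changed"
  else
    echo "ok"
  fi
)
```

Calling `ensure_dir_mode /mnt/localssd 1777` after each mount would be safe to repeat on every boot, since it only calls `chmod` when the current mode differs.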

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@tpdownes tpdownes marked this pull request as ready for review November 4, 2024 21:02
@tpdownes tpdownes added the release-bugfix Added to release notes under the "Bug fixes" heading. label Nov 4, 2024
@tpdownes tpdownes force-pushed the refactor_mount_perms_ssd branch from 97db986 to e1455af Compare November 4, 2024 22:34
@tpdownes
Member Author

tpdownes commented Nov 4, 2024

Noting: PR-test-ml-a3-megagpu-slurm passed before squashing commits prior to merge.

@tpdownes
Member Author

tpdownes commented Nov 4, 2024

Testing with

#!/bin/bash

set -ex
mountpoint /mnt/localssd
ls -lhd /mnt/localssd
cat /etc/enroot/enroot.conf

and

srun --container-image=alpine grep PRETTY /etc/os-release

both demonstrated end-to-end functionality of enroot on local SSD volumes after first boot, reboot, power off/on, and VM re-creation.
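
The manual checks above print the mount status and mode for a human to read. As a hedged sketch, the same checks could be turned into a pass/fail function for automated runs. The function name `check_mount_mode` is hypothetical, the expected mode is passed in by the caller rather than taken from this PR, and GNU coreutils `stat` is assumed.

```shell
#!/bin/sh
# Sketch only: a pass/fail variant of the manual checks. The function name
# is an assumption; the expected mode must be supplied by the caller.

# Succeed only if DIR is a mount point and its mode equals the expected mode.
check_mount_mode() (
  dir=$1
  # Normalize so that "0755" and "755" are treated as the same mode.
  want=$(printf '%o' "0$2")
  if ! mountpoint -q "$dir"; then
    echo "FAIL: $dir is not a mount point" >&2
    exit 1
  fi
  have=$(stat -c '%a' "$dir") || exit 1
  if [ "$have" != "$want" ]; then
    echo "FAIL: $dir has mode $have, expected $want" >&2
    exit 1
  fi
  echo "PASS: $dir is mounted with mode $have"
)
```

For example, `check_mount_mode /mnt/localssd 1777` would exit non-zero after a power off/on cycle if the array had been remounted without its mode being restored, assuming 1777 is the intended mode.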

@tpdownes tpdownes merged commit 5a62dda into GoogleCloudPlatform:develop Nov 4, 2024
9 of 56 checks passed
@tpdownes tpdownes deleted the refactor_mount_perms_ssd branch November 4, 2024 22:56
@rohitramu rohitramu mentioned this pull request Nov 20, 2024