Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the way Pyxis and Enroot are configured #2820

Merged

Conversation

gmarciani
Copy link
Contributor

@gmarciani gmarciani commented Oct 15, 2024

Description of changes

Fix the way Pyxis and Enroot are configured.

  1. Pyxis is disabled by default. In particular, the Enroot, SPANK and Pyxis config files required to enable it are stored in /opt/parallelcluster/examples folder so that they are ineffective but can be used by the user to enable Pyxis by moving them to the expected location or using custom ones.

  2. Moved Pyxis and Enroot configuration to build time (there was no reason to configure Pyxis and Enroot at runtime)

  3. Skip Enroot installation if Enroot is already installed.

  4. Skip Pyxis installation if Pyxis is already installed.

  5. The sample configuration (not the real ones) provided for Pyxis uses runtime path to /run/pyxis. As per documentation a tmpfs should be used.

  6. The sample configuration (not the real ones) provided for Enroot uses the following paths, as suggested in documentation
    1. Using tmpfs storage for ENROOT_RUNTIME_PATH and ENROOT_DATA_PATH
    2. Using a persistent local storage for ENROOT_CACHE_PATH and ENROOT_CONFIG_PATH.

  7. We do not create any directory used in the Pyxis or Enroot sample configuration. The user is supposed to create the desired directories.

  8. Minor: Moved Pyxis attributes from platform cookbook to slurm cookbook because Pyxis is a SLURM plugin so it would be conceptually wrong to have its attributes defined in platform cookbook.

  9. Added missing unit tests.

Set the label skip-recursive-deletion-check because it detects a false positive in this PR: the recursions (recursive: true) in this PR are not on the delete but on the creation of directories.

User Experience

With this change, Pyxis is disabled by default.
Assuming that the user wants to use the configurations provided as examples, he needs to execute the below script in the head node as a OnNodeConfigured custom action:

#!/bin/bash
set -e

echo "Executing $0"

# Configure Enroot
ENROOT_PERSISTENT_DIR="/var/enroot"
ENROOT_VOLATILE_DIR="/run/enroot"

sudo mkdir -p $ENROOT_PERSISTENT_DIR
sudo chmod 1777 $ENROOT_PERSISTENT_DIR
sudo mkdir -p $ENROOT_VOLATILE_DIR
sudo chmod 1777 $ENROOT_VOLATILE_DIR
sudo mv /opt/parallelcluster/examples/enroot/enroot.conf /etc/enroot/enroot.conf
sudo chmod 0644 /etc/enroot/enroot.conf

# Configure Pyxis
PYXIS_RUNTIME_DIR="/run/pyxis"

sudo mkdir -p $PYXIS_RUNTIME_DIR
sudo chmod 1777 $PYXIS_RUNTIME_DIR

sudo mkdir -p /opt/slurm/etc/plugstack.conf.d/
sudo mv /opt/parallelcluster/examples/spank/plugstack.conf /opt/slurm/etc/
sudo mv /opt/parallelcluster/examples/pyxis/pyxis.conf /opt/slurm/etc/plugstack.conf.d/
sudo -i scontrol reconfigure

and the following script on every compute node as OnNodeStart custom action:

#!/bin/bash
set -e

echo "Executing $0"

# Configure Enroot
ENROOT_PERSISTENT_DIR="/var/enroot"
ENROOT_VOLATILE_DIR="/run/enroot"

sudo mkdir -p $ENROOT_PERSISTENT_DIR
sudo chmod 1777 $ENROOT_PERSISTENT_DIR
sudo mkdir -p $ENROOT_VOLATILE_DIR
sudo chmod 1777 $ENROOT_VOLATILE_DIR
sudo mv /opt/parallelcluster/examples/enroot/enroot.conf /etc/enroot/enroot.conf
sudo chmod 0644 /etc/enroot/enroot.conf

# Configure Pyxis
PYXIS_RUNTIME_DIR="/run/pyxis"

sudo mkdir -p $PYXIS_RUNTIME_DIR
sudo chmod 1777 $PYXIS_RUNTIME_DIR

Note: Pyxis and Enroot paths are defined as cookbook attributes, so if necessary such paths can be customized at build time injecting custom attributes:

default['cluster']['enroot']['temporary_dir'] = '/run/enroot'
default['cluster']['enroot']['persistent_dir'] = '/var/enroot'
default['cluster']['pyxis']['runtime_path'] = '/run/pyxis'

Tests

Manually tested on AL2, verifying that Pyxis works as expected once the user applies the the steps required to enable it.

In particular, I've created the cluster with:

  • OnNodeConfigured custom action on the head node with the script above.
  • OnNodestart custom action on the compute nodes with the script above.

The standard job submission works from both login nodes and head node:

[ec2-user@ip-27-6-32-134 ~]$ sbatch -N 3 --wrap='srun hostname'

The containerized job submission works both from login nodes and head node:

[ec2-user@ip-27-6-86-56 ~]$ srun -N 2 --container-image docker://ubuntu:22.04 hostname
pyxis: imported docker image: docker://ubuntu:22.04
pyxis: imported docker image: docker://ubuntu:22.04
q1-st-cr1-1
q1-dy-cr1-1

[ec2-user@ip-27-6-86-56 ~]$ sbatch -N 2 --wrap='srun --container-image docker://ubuntu:22.04 hostname'
Submitted batch job 2
[ec2-user@ip-27-6-86-56 ~]$ cat slurm-2.out
pyxis: imported docker image: docker://ubuntu:22.04
q1-st-cr1-1
pyxis: imported docker image: docker://ubuntu:22.04
q1-dy-cr1-1

References

  1. Pyxis: https://github.com/NVIDIA/pyxis/wiki/Setup#slurm-plugstack-configuration
  2. Enroot: https://github.com/NVIDIA/pyxis/wiki/Setup#enroot-configuration-example + https://github.com/NVIDIA/enroot/blob/master/doc/configuration.md#runtime-configuration
  3. tmpfiles Config: https://www.freedesktop.org/software/systemd/man/latest/tmpfiles.d.html

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@gmarciani gmarciani force-pushed the wip/mgiacomo/3.11.1/fix-pyxis-1014-1 branch 4 times, most recently from a0505f9 to d319cb6 Compare October 15, 2024 14:20
@gmarciani gmarciani added the skip-recursive-deletion-check Skip the checks regarding the use of recursive deletion. label Oct 15, 2024
@gmarciani gmarciani marked this pull request as ready for review October 15, 2024 15:35
@gmarciani gmarciani requested review from a team as code owners October 15, 2024 15:35
chmod 1777 /tmp/enroot/data

chmod 1777 ${SHARED_DIR}/enroot
directory node['cluster']['enroot']['persistent_dir'] do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to keep the mode as 1777? or 755?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm aligned with your security concern. However, we should use 1777 by design because enroot writes in there as whatever user submitting a containerized job.

Example of files written by enroot in /var/enroot after a containerized job is submitted as ec2-user:

[root@q1-st-cr1-1 ~]# tree /var/enroot
tree /var/enroot
/var/enroot
└── cache
    └── group-1000
        ├── 6414378b647780fee8fd903ddb9541d134a1947ce092d08bdeb23a54cb3684ac
        └── 97271d29cb7956f0908cfb1449610a2cd9cb46b004ac8af25f0255663eb364ba

2 directories, 2 files
[root@q1-st-cr1-1 ~]# ls -la /var/enroot
total 0
drwxrwxrwt  3 root     root      19 Oct 15 15:26 .
drwxr-xr-x 22 root     root     307 Oct 15 11:22 ..
drwx------  3 ec2-user ec2-user  24 Oct 15 15:26 cache
[root@q1-st-cr1-1 ~]# ls -la /var/enroot/cache
total 0
drwx------ 3 ec2-user ec2-user  24 Oct 15 15:26 .
drwxrwxrwt 3 root     root      19 Oct 15 15:26 ..
drwx------ 3 ec2-user ec2-user 170 Oct 15 15:26 group-1000
[root@q1-st-cr1-1 ~]# ls -la /var/enroot/cache/group-1000/
total 29848
drwx------ 3 ec2-user ec2-user      170 Oct 15 15:26 .
drwx------ 3 ec2-user ec2-user       24 Oct 15 15:26 ..
-rw-r----- 1 ec2-user ec2-user 30559455 Oct 15 15:26 6414378b647780fee8fd903ddb9541d134a1947ce092d08bdeb23a54cb3684ac
-rw-r----- 1 ec2-user ec2-user      806 Oct 15 15:26 97271d29cb7956f0908cfb1449610a2cd9cb46b004ac8af25f0255663eb364ba
drwx------ 2 ec2-user ec2-user        6 Oct 15 15:31 .tokens.1000

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed

himani2411
himani2411 previously approved these changes Oct 15, 2024
@gmarciani gmarciani force-pushed the wip/mgiacomo/3.11.1/fix-pyxis-1014-1 branch from 683d613 to 7b77b65 Compare October 15, 2024 16:25
@gmarciani gmarciani force-pushed the wip/mgiacomo/3.11.1/fix-pyxis-1014-1 branch 6 times, most recently from 373cd69 to db107cf Compare October 16, 2024 10:30
  1. Pyxis is disabled by default. In particular, the Enroot, SPANK and Pyxis config files required to enable it are stored in `/opt/parallelcluster/examples` folder so that they are ineffective but can be used by the user to enable Pyxis by simply moving them to the expected location.

  2. Moved Pyxis and Enroot configuration to build time (there was no reason to configure Pyxis and Enroot at runtime)

  3. Skip Enroot installation if Enroot is already installed.

  4. Skip Pyxis installation if Pyxis is already installed.

  5. The sample configurations provided for Pyxis uses runtime path to `/run/pyxis`. As per [documentation](https://github.com/NVIDIA/pyxis/wiki/Setup#slurm-plugstack-configuration) a tmpfs should be used.

  6. The sample configuration provided for Enroot uses the following paths, as suggested in [documentation](https://github.com/NVIDIA/pyxis/wiki/Setup#enroot-configuration-example)
    1. Using tmpfs storage for `ENROOT_RUNTIME_PATH` and `ENROOT_DATA_PATH`
    2. Using a persistent local storage for `ENROOT_CACHE_PATH` and `ENROOT_CONFIG_PATH`.

  7. We do not create any directory used in the Pyxis or Enroot sample configuration. The user is supposed to create the desired directories.

  8. *Minor*: Moved Pyxis attributes from platform cookbook to slurm cookbook because Pyxis is a SLURM plugin so it would be conceptually wrong to have its attributes defined in platform cookbook.

  9. Added missing unit tests.

Signed-off-by: Giacomo Marciani <[email protected]>
@gmarciani gmarciani force-pushed the wip/mgiacomo/3.11.1/fix-pyxis-1014-1 branch from db107cf to 1d98760 Compare October 16, 2024 11:59
Copy link
Contributor

@demartinofra demartinofra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚀

@gmarciani gmarciani merged commit a38b423 into aws:release-3.11 Oct 16, 2024
28 of 30 checks passed
@gmarciani gmarciani deleted the wip/mgiacomo/3.11.1/fix-pyxis-1014-1 branch October 16, 2024 15:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.x bug skip-changelog-update skip-recursive-deletion-check Skip the checks regarding the use of recursive deletion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants