Fix the way Pyxis and Enroot are configured #2820
Force-pushed from a0505f9 to d319cb6.
chmod 1777 /tmp/enroot/data
chmod 1777 ${SHARED_DIR}/enroot
directory node['cluster']['enroot']['persistent_dir'] do
Do we want to keep the mode as 1777, or use 755?
I share your security concern. However, we need 1777 by design, because enroot writes into that directory as whichever user submits a containerized job.
Example of files written by enroot in /var/enroot after a containerized job is submitted as ec2-user:
[root@q1-st-cr1-1 ~]# tree /var/enroot
/var/enroot
└── cache
    └── group-1000
        ├── 6414378b647780fee8fd903ddb9541d134a1947ce092d08bdeb23a54cb3684ac
        └── 97271d29cb7956f0908cfb1449610a2cd9cb46b004ac8af25f0255663eb364ba
2 directories, 2 files
[root@q1-st-cr1-1 ~]# ls -la /var/enroot
total 0
drwxrwxrwt 3 root root 19 Oct 15 15:26 .
drwxr-xr-x 22 root root 307 Oct 15 11:22 ..
drwx------ 3 ec2-user ec2-user 24 Oct 15 15:26 cache
[root@q1-st-cr1-1 ~]# ls -la /var/enroot/cache
total 0
drwx------ 3 ec2-user ec2-user 24 Oct 15 15:26 .
drwxrwxrwt 3 root root 19 Oct 15 15:26 ..
drwx------ 3 ec2-user ec2-user 170 Oct 15 15:26 group-1000
[root@q1-st-cr1-1 ~]# ls -la /var/enroot/cache/group-1000/
total 29848
drwx------ 3 ec2-user ec2-user 170 Oct 15 15:26 .
drwx------ 3 ec2-user ec2-user 24 Oct 15 15:26 ..
-rw-r----- 1 ec2-user ec2-user 30559455 Oct 15 15:26 6414378b647780fee8fd903ddb9541d134a1947ce092d08bdeb23a54cb3684ac
-rw-r----- 1 ec2-user ec2-user 806 Oct 15 15:26 97271d29cb7956f0908cfb1449610a2cd9cb46b004ac8af25f0255663eb364ba
drwx------ 2 ec2-user ec2-user 6 Oct 15 15:31 .tokens.1000
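The effect of mode 1777 discussed above can be reproduced in a scratch directory. This is a minimal sketch, not the cookbook's code: `demo_root` and `enroot_dir` are hypothetical names, and a temporary directory stands in for /var/enroot. The leading `1` sets the sticky bit, so every user can create entries in the directory but only an entry's owner (or root) can delete or rename it.

```shell
#!/bin/sh
# Hypothetical scratch location standing in for /var/enroot.
demo_root="$(mktemp -d)"
enroot_dir="$demo_root/enroot"
mkdir "$enroot_dir"

# 1777 = sticky bit (1) + read/write/execute for owner, group and others (777).
chmod 1777 "$enroot_dir"

# Keep only the 10-character permission string from the long listing.
perms="$(ls -ld "$enroot_dir" | cut -c1-10)"
echo "$perms"   # drwxrwxrwt, matching the listing of /var/enroot above
```

With mode 755 instead, only the directory owner could create the per-user `cache` subdirectory, which is why a job submitted as ec2-user would fail to write there.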
Agreed
Force-pushed from 683d613 to 7b77b65.
Force-pushed from 373cd69 to db107cf.
1. Pyxis is disabled by default. In particular, the Enroot, SPANK and Pyxis config files required to enable it are stored in the `/opt/parallelcluster/examples` folder, so they are ineffective but can be used to enable Pyxis by simply moving them to the expected location.
2. Moved Pyxis and Enroot configuration to build time (there was no reason to configure Pyxis and Enroot at runtime).
3. Skip Enroot installation if Enroot is already installed.
4. Skip Pyxis installation if Pyxis is already installed.
5. The sample configuration provided for Pyxis uses the runtime path `/run/pyxis`. As per the [documentation](https://github.com/NVIDIA/pyxis/wiki/Setup#slurm-plugstack-configuration), a tmpfs should be used.
6. The sample configuration provided for Enroot uses the following paths, as suggested in the [documentation](https://github.com/NVIDIA/pyxis/wiki/Setup#enroot-configuration-example):
   1. Using tmpfs storage for `ENROOT_RUNTIME_PATH` and `ENROOT_DATA_PATH`.
   2. Using persistent local storage for `ENROOT_CACHE_PATH` and `ENROOT_CONFIG_PATH`.
7. We do not create any directory used in the Pyxis or Enroot sample configuration. The user is expected to create the desired directories.
8. *Minor*: Moved Pyxis attributes from the platform cookbook to the slurm cookbook, because Pyxis is a SLURM plugin and it would be conceptually wrong to define its attributes in the platform cookbook.
9. Added missing unit tests.

Signed-off-by: Giacomo Marciani <[email protected]>
Force-pushed from db107cf to 1d98760.
LGTM 🚀
Description of changes
Fix the way Pyxis and Enroot are configured.
1. Pyxis is disabled by default. In particular, the Enroot, SPANK and Pyxis config files required to enable it are stored in the `/opt/parallelcluster/examples` folder, so they are ineffective but can be used to enable Pyxis, either by moving them to the expected location or by using custom ones.
2. Moved Pyxis and Enroot configuration to build time (there was no reason to configure Pyxis and Enroot at runtime).
3. Skip Enroot installation if Enroot is already installed.
4. Skip Pyxis installation if Pyxis is already installed.
5. The sample configuration provided for Pyxis (a sample only, not an active configuration) uses the runtime path `/run/pyxis`. As per the documentation, a tmpfs should be used.
6. The sample configuration provided for Enroot (a sample only, not an active configuration) uses the following paths, as suggested in the documentation:
   1. Using tmpfs storage for `ENROOT_RUNTIME_PATH` and `ENROOT_DATA_PATH`.
   2. Using persistent local storage for `ENROOT_CACHE_PATH` and `ENROOT_CONFIG_PATH`.
7. We do not create any directory used in the Pyxis or Enroot sample configuration. The user is expected to create the desired directories.
8. *Minor*: Moved Pyxis attributes from the platform cookbook to the slurm cookbook, because Pyxis is a SLURM plugin and it would be conceptually wrong to define its attributes in the platform cookbook.
9. Added missing unit tests.
The label `skip-recursive-deletion-check` is set because that check reports a false positive in this PR: the `recursive: true` occurrences here apply to directory creation, not deletion.
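The "skip installation if already installed" behavior listed above can be illustrated with a small guard. This is only a sketch of the idea, not the cookbook's actual check, which lives in the Chef resources and is not shown here: `is_installed` and `install_if_missing` are hypothetical helper names, and detecting the tool by looking for a command on `PATH` is an assumption.

```shell
#!/bin/sh
# Succeeds (exit status 0) when the named command is found on PATH.
is_installed() {
  command -v "$1" >/dev/null 2>&1
}

# Hypothetical wrapper: skips the install step when the tool already exists.
install_if_missing() {
  tool="$1"
  if is_installed "$tool"; then
    echo "$tool already installed, skipping"
    return 0
  fi
  echo "$tool not found, installing"
  # The real install commands would go here (omitted in this sketch).
}

install_if_missing enroot
```

A guard of this kind keeps the build idempotent: running it again on an image that already ships the tool leaves the existing installation untouched.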
User Experience
With this change, Pyxis is disabled by default.
Assuming the user wants to use the configurations provided as examples, they need to run the script below on the head node as an OnNodeConfigured custom action:
and the following script on every compute node as an OnNodeStart custom action:
Note: Pyxis and Enroot paths are defined as cookbook attributes, so if necessary they can be customized at build time by injecting custom attributes:
Tests
Manually tested on AL2, verifying that Pyxis works as expected once the user applies the steps required to enable it.
In particular, I created the cluster with:
1. An `OnNodeConfigured` custom action on the head node with the script above.
2. An `OnNodeStart` custom action on the compute nodes with the script above.

The standard job submission works from both login nodes and the head node:
The containerized job submission works from both login nodes and the head node:
References
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.