Fix for AWS ECR issue not released? #189

Open
gallanik opened this issue Apr 17, 2024 · 6 comments

Comments

@gallanik

gallanik commented Apr 17, 2024

I have successfully integrated (proof-of-concept) enroot with AWS ECR after installing enroot from source (referring to #159).
Just wanted to check when this can be officially released. We have a production deployment of DGX SuperPOD and would prefer to install a stable release rather than build from source. There hasn't been a release in more than a year (the last one was Feb 8, 2023).
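
For anyone reproducing the integration, here is a minimal sketch of the credentials entry involved. It assumes the netrc-style format that enroot reads from its `.credentials` file and the usual private ECR hostname pattern `<account>.dkr.ecr.<region>.amazonaws.com`. The helper names `ecr_region` and `ecr_credentials_line` are hypothetical and are not part of enroot or the AWS CLI; the token would come from something like `aws ecr get-login-password --region <region>`.

```python
import re

# Hypothetical helpers for illustration only (not part of enroot or the AWS CLI).
# Assumption: private ECR hosts look like <account>.dkr.ecr.<region>.amazonaws.com
ECR_HOST = re.compile(r"^\d+\.dkr\.ecr\.([a-z0-9-]+)\.amazonaws\.com$")


def ecr_region(registry: str) -> str:
    """Return the AWS region embedded in an ECR registry hostname."""
    match = ECR_HOST.match(registry)
    if match is None:
        raise ValueError(f"not an ECR registry hostname: {registry!r}")
    return match.group(1)


def ecr_credentials_line(registry: str, token: str) -> str:
    """Format one netrc-style entry; ECR logins use the fixed user name AWS."""
    ecr_region(registry)  # validate the hostname before formatting anything
    return f"machine {registry} login AWS password {token}"
```

ECR tokens are short-lived, so an entry written this way needs to be regenerated periodically rather than stored once.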

@RHudsonH

I'm struggling with this as well. Any news? Looks like it's been more than a year since the fix was in main.

@zyndagj

zyndagj commented Apr 19, 2024

Assuming the main branch passes all legacy tests, it would be great if a release could be tagged with this fix.

@3XX0
Member

3XX0 commented Apr 19, 2024

The fix is not correct; see the discussion in #159 (comment).
We need a proper fix, and we need to make sure it is validated.

@3XX0
Member

3XX0 commented Apr 19, 2024

Just committed 6425a53 which hopefully does the trick

@gallanik
Author

I will test this, but the earlier fix also worked without any issues. Any plans on validating and releasing this soon?

@astrophys

astrophys commented Sep 3, 2024

I tested commit 6425a53 as a hotfix on a BCM-10 cluster and still got the error:

user@head01:~$ srun --container-image=docker://123456789.dkr.ecr.us-west-2.amazonaws.com#hello-world --nodelist=node02 --pty bash
pyxis: importing docker image: docker://123456789.dkr.ecr.us-west-2.amazonaws.com#hello-world
slurmstepd: error: pyxis: child 737010 failed with error code: 1
slurmstepd: error: pyxis: failed to import docker image
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis:     [INFO] Querying registry for permission grant
slurmstepd: error: pyxis:     [INFO] Authenticating with user: AWS
slurmstepd: error: pyxis:     [INFO] Using credentials from file: /home/user/.config/enroot/.credentials
slurmstepd: error: pyxis:     [INFO] Fetching image manifest list
slurmstepd: error: pyxis:     [ERROR] Could not process JSON input
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: node02: task 0: Exited with exit code 1

Based on conversations I've had with others, it seems like ECR has some non-standard registry configuration.
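
Since the log fails at "Fetching image manifest list" with "Could not process JSON input", one way to narrow this down is to look at what the registry actually returns for the manifest request. The sketch below only builds that request from the image URI; it does not claim to mirror how enroot does it internally. It assumes the `docker://[USER@]REGISTRY#IMAGE[:TAG]` URI form used in the `srun` command above, a default tag of `latest`, and the standard registry v2 manifest endpoint. `parse_image_uri` and `manifest_request` are hypothetical names.

```python
from typing import Dict, NamedTuple, Optional, Tuple

# Media types a client can offer when asking for a manifest or manifest list.
MANIFEST_TYPES = [
    "application/vnd.docker.distribution.manifest.list.v2+json",
    "application/vnd.docker.distribution.manifest.v2+json",
    "application/vnd.oci.image.index.v1+json",
    "application/vnd.oci.image.manifest.v1+json",
]


class ImageRef(NamedTuple):
    user: Optional[str]
    registry: Optional[str]
    image: str
    tag: str


def parse_image_uri(uri: str) -> ImageRef:
    """Split docker://[USER@]REGISTRY#IMAGE[:TAG] into its parts.

    Assumption: a missing tag means "latest". URIs without a registry are
    accepted, but a user name is only recognized in front of a registry.
    """
    scheme = "docker://"
    if not uri.startswith(scheme):
        raise ValueError(f"expected a docker:// URI, got {uri!r}")
    rest = uri[len(scheme):]
    user: Optional[str] = None
    registry: Optional[str] = None
    if "#" in rest:
        prefix, rest = rest.split("#", 1)
        name, sep, host = prefix.rpartition("@")
        registry = host
        user = name if sep else None
    image, sep, tag = rest.rpartition(":")
    if not sep:
        image, tag = rest, "latest"
    return ImageRef(user, registry, image, tag)


def manifest_request(ref: ImageRef) -> Tuple[str, Dict[str, str]]:
    """Return the registry v2 manifest URL and the Accept header to send."""
    if ref.registry is None:
        raise ValueError("this sketch only handles URIs with an explicit registry")
    url = f"https://{ref.registry}/v2/{ref.image}/manifests/{ref.tag}"
    return url, {"Accept": ", ".join(MANIFEST_TYPES)}
```

Sending the resulting URL and header with valid credentials (for example via `curl`) and checking whether the response body is JSON at all would show whether the failure comes from authentication, from the response format, or from something else.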
