[Bugfix] Dockerfile.rocm #141

Merged
merged 3 commits into from
Aug 16, 2024

Conversation

zstreet87

@zstreet87 zstreet87 commented Aug 15, 2024

The ARG and the ENV need different names for the Dockerfile to function properly.

@gshtras
Collaborator

gshtras commented Aug 15, 2024

@zstreet87 could you describe what the issue is?
Also, I think it's better to use ARG_PYTORCH_ROCM_ARCH instead of PYTORCH_ROCM_ARCH_TMP

@zstreet87
Author

As the Dockerfile is now, that env var doesn't get overwritten properly, so its value is "gfx900;gfx906;gfx908;gfx90a;gfx1030;gfx1100;gfx1101;gfx940;gfx941;gfx942"

Dockerfile doesn't work when using the same variable name for various things.
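
For readers following along, the rename under discussion can be sketched roughly as below. This is a minimal illustration, not the actual Dockerfile.rocm: the base image and the default architecture value are placeholders chosen for the example, and only the names ARG_PYTORCH_ROCM_ARCH and PYTORCH_ROCM_ARCH come from this thread.

```dockerfile
# Placeholder base image (assumption); the real Dockerfile.rocm builds on a different one.
FROM ubuntu:20.04

# Build argument with its own name, so it cannot collide with an
# environment variable of the same name inherited from the base image.
# The default value here is illustrative only.
ARG ARG_PYTORCH_ROCM_ARCH="gfx90a;gfx942"

# The environment variable keeps the name that the rest of the build reads.
ENV PYTORCH_ROCM_ARCH=${ARG_PYTORCH_ROCM_ARCH}
```

With this layout the override would be passed as `--build-arg ARG_PYTORCH_ROCM_ARCH=...` rather than `--build-arg PYTORCH_ROCM_ARCH=...`.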

@gshtras
Collaborator

gshtras commented Aug 15, 2024

Do you have a container built from this Dockerfile, or another scenario to reproduce?
From our experience this does not happen. Also, a simple test shows the same:

$ cat Dockerfile.test
FROM ubuntu:20.04
ARG MY_VAR="abc"
ENV MY_VAR=${MY_VAR}
CMD ["/bin/bash"]

$ docker build -f Dockerfile.test --build-arg MY_VAR=test123 -t test_docker .
[+] Building 0.1s (5/5) FINISHED  docker:default
 => [internal] load build definition from Dockerfile.test  0.1s
 => => transferring dockerfile: 116B  0.0s
 => [internal] load metadata for docker.io/library/ubuntu:20.04  0.0s
 => [internal] load .dockerignore  0.0s
 => => transferring context: 2B  0.0s
 => CACHED [1/1] FROM docker.io/library/ubuntu:20.04  0.0s
 => exporting to image  0.0s
 => => exporting layers  0.0s
 => => writing image sha256:6ec0cca46e5cd9611df4144b598b9e56313dc6894aefd4e3d2691a5156e766e0  0.0s
 => => naming to docker.io/library/test_docker  0.0s

$ docker run -it --rm test_docker
root@7e7081e1597a:/# echo $MY_VAR
test123

$ docker build -f Dockerfile.test -t test_docker .
[+] Building 0.1s (5/5) FINISHED  docker:default
 => [internal] load build definition from Dockerfile.test  0.0s
 => => transferring dockerfile: 116B  0.0s
 => [internal] load metadata for docker.io/library/ubuntu:20.04  0.0s
 => [internal] load .dockerignore  0.0s
 => => transferring context: 2B  0.0s
 => CACHED [1/1] FROM docker.io/library/ubuntu:20.04  0.0s
 => exporting to image  0.0s
 => => exporting layers  0.0s
 => => writing image sha256:7fb58b8c36eef9150bd1bcb9df1ffbd91b138f90f12c4ecfbf779668226da1b0  0.0s
 => => naming to docker.io/library/test_docker  0.0s

$ docker run -it --rm test_docker
root@3360a90a3bc4:/# echo $MY_VAR
abc

@zstreet87
Author

zstreet87 commented Aug 15, 2024

Unfortunately, I don't. This comes from a client who hit the error and then fixed it by changing the name as I have done.

Hmm, interesting. The one difference I can think of from your test case is that in the Dockerfile, that env var is already set by the base image.
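
One way to test that hypothesis is to extend the earlier Dockerfile.test so the variable is already set before the ARG/ENV pair. The sketch below uses a local first stage as a stand-in for a base image that exports the variable; the stage name `base` and the value `from_base` are made up for illustration. No expected output is given, since how each builder resolves this case is exactly what is in question.

```dockerfile
# Stand-in for a base image that already exports MY_VAR (hypothetical value).
FROM ubuntu:20.04 AS base
ENV MY_VAR="from_base"

# Same ARG/ENV pattern as Dockerfile.test, now on top of a stage
# where MY_VAR is already present in the environment.
FROM base
ARG MY_VAR="abc"
ENV MY_VAR=${MY_VAR}
CMD ["/bin/bash"]
```

Building this once with `--build-arg MY_VAR=test123` and once without, then running `echo $MY_VAR` in the container, would show whether the inherited value, the ARG default, or the build argument wins under a given builder.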

@gshtras
Collaborator

gshtras commented Aug 15, 2024

Just tried using TERM (one of the few vars already present in ubuntu:20.04), and the results are the same.
I did find an issue similar to what you're describing: moby/moby#34494
But it seems to apply to a rather old Docker version. Which version do you use?

@zstreet87
Author

Okay the plot thickens...

This error was found on the AAC cluster, which uses podman, so the command was:
podman build -f Dockerfile.rocm -t vllm_test .

Perhaps podman still has this issue?

@Alexei-V-Ivanov-AMD

IMHO supporting podman is potentially a good feature. My reason is that podman containers can run rootless, which adds stability and security (at the very least to the CI pipeline).

That is some extra motivation to explore this issue and find a resolution.

@zstreet87
Author

Oh yes, I agree. As more people use vllm in cluster settings, odds are they will use podman instead of docker because of the permission issue @Alexei-V-Ivanov-AMD brought up.

Very quirky thing to stumble upon though, for sure. I asked the client to test the minimal reproducer with podman and will update with what they find.

@gshtras
Collaborator

gshtras commented Aug 15, 2024

I'm OK with this change if podman still has this issue and this solves it.

@zstreet87
Author

Appreciate it, @gshtras! Please merge.

@gshtras gshtras merged commit c1860d6 into main Aug 16, 2024
13 checks passed
@gshtras gshtras deleted the dockerfile_fix branch August 16, 2024 17:53
@mawong-amd

mawong-amd commented Aug 27, 2024

@zstreet87 @Alexei-V-Ivanov-AMD On further reflection I'm currently of the opinion that this change should be reverted.

First, Docker explicitly supports aliasing ARG with ENV: see the official example here (also included below). This is not a bug.

FROM ubuntu
ARG CONT_IMG_VER
ENV CONT_IMG_VER=${CONT_IMG_VER:-v1.0.0}
RUN echo $CONT_IMG_VER

If Podman fails to implement the Docker spec correctly, this is not an issue we should fix in vLLM, particularly because developers are used to using PYTORCH_ROCM_ARCH as the build argument and because Podman is not officially supported by vLLM. So this PR provides speculative benefits at best.
