bug: vllm bundle has invalid kubernetes spec and won't properly deploy on internal infrastructure #965

Closed
joelmccoy opened this issue Aug 29, 2024 · 5 comments · Fixed by #937 or #974
Assignees
Labels
possible-bug 🐛 Something may not be working

Comments

@joelmccoy
Environment

  1. OS and Architecture: RHEL8 Kubernetes on rke2
  2. App or Package Name: vllm
  3. App or Package Version: 0.11.0
  4. Kubernetes Distribution: rke2
  5. Kubernetes Version: rke2 1.28.12
  6. Other: This bug is showing up in our internal Defense Unicorns Infrastructure

Steps to reproduce

  1. Deploy vllm bundle 0.11.0

Expected result

vllm zarf package should deploy successfully

Actual Result

zarf deployment hangs and is unable to create the data injection target

Visual Proof (screenshots, videos, text, etc)

image

W0829 00:07:19.536249     407 warnings.go:70] unknown field "spec.template.spec.containers[0].securityContext.fsGroup"
W0829 00:07:19.536261     407 warnings.go:70] unknown field "spec.template.spec.initContainers[0].securityContext.fsGroup"
  •  Processing helm chart vllm-model:0.11.0 from Zarf-generated helm chart
 NOTE  Using config file
mkdir: can't create directory '/data/.model': Permission denied
command terminated with exit code 1
 WARNING  Unable to create the data injection target directory /data/.model in pod
 vllm-model-5c99c4fc68-xpql5
 NOTE  Using config file
mkdir: can't create directory '/data/.model': Permission denied
command terminated with exit code 1
 WARNING  Unable to create the data injection target directory /data/.model in pod
 vllm-model-5c99c4fc68-xpql5

Additional Context

The Kubernetes manifest spec is actually invalid, and that may be causing this. fsGroup is not a valid field in a container-level securityContext, whether under containers or initContainers, so it is reported as an unknown field in the warnings above and ignored, which may be why the directory can't be created. fsGroup should instead be defined at the pod level, in spec.template.spec.securityContext.
Improper definitions can be found here, here and here.
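
For illustration, here is a minimal sketch of the placement described above. It is not copied from the vllm chart: the container names, images, claim name, and the 65532 IDs are hypothetical placeholders, and only the position of fsGroup is the point.

```yaml
# Hypothetical Deployment; every name, image, and ID is a placeholder.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-model
spec:
  selector:
    matchLabels:
      app: vllm-model
  template:
    metadata:
      labels:
        app: vllm-model
    spec:
      # Pod-level securityContext: fsGroup is a valid field here.
      securityContext:
        fsGroup: 65532
      initContainers:
        - name: data-loader
          image: example.com/data-loader:latest
          # Container-level securityContext: fsGroup is NOT a valid field here.
          securityContext:
            runAsUser: 65532
            runAsGroup: 65532
          volumeMounts:
            - name: model-data
              mountPath: /data
      containers:
        - name: model
          image: example.com/vllm:latest
          securityContext:
            runAsUser: 65532
            runAsGroup: 65532
          volumeMounts:
            - name: model-data
              mountPath: /data
      volumes:
        - name: model-data
          persistentVolumeClaim:
            claimName: model-data-pvc
```

Note that fsGroup only changes ownership on volume types that support it, so moving the field should clear the warnings but may not fix the mkdir failure on its own.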

@joelmccoy joelmccoy added the possible-bug 🐛 Something may not be working label Aug 29, 2024
@justinthelaw
Contributor

justinthelaw commented Aug 29, 2024

PR #937 is a WIP that will solve the manifest warning issues.

The data injection issue is a known issue that only occurs in environments that don't use local-path (e.g., variants like Rancher's LPP) as the StorageClass, and is not directly related to the manifest warnings.

A recommended fix, separate from PR #937, to the actual issue you are encountering is to mount the volume to a directory that has had its permissions changed in the Docker manifest OR to provide root to the data injection's job container. These possible fixes are also WIP, I think? cc: @YrrepNoj @CollectiveUnicorn
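
As a rough sketch of the second option, assuming the container that receives the data injection is an ordinary container in the pod spec, running it as root would look something like this. The container name, image, and volume name are hypothetical, and clusters that enforce non-root policies would reject it.

```yaml
# Hypothetical pod spec fragment (sits under spec.template.spec).
containers:
  - name: data-loader                        # placeholder name
    image: example.com/data-loader:latest    # placeholder image
    securityContext:
      runAsUser: 0          # root, so creating directories under /data is permitted
      runAsNonRoot: false
    volumeMounts:
      - name: model-data    # placeholder volume name
        mountPath: /data
```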

@joelmccoy
Author

Thanks for the clarification!

It sounds like there are no immediate hot fixes we could apply for 0.11.0? If so, would you suggest rolling back to 0.9.2? It looks like the zarf data injection was introduced in 0.10.0.

@justinthelaw
Contributor

justinthelaw commented Aug 29, 2024

For now, please roll back to 0.9.2. The team needs to discuss the solution at stand-up later today, as the best place to test the solution is on our staging environment that uses EFS (which I don't currently have access to). I've also been out for a little bit, so they may already have thought about or completed the fix!

The ultimate fix may come in the form of another minor version, due to the nature of this breaking change and also some other changes we may wrap in.

@joelmccoy
Author

> as the best place to test the solution is on our staging environment that uses EFS (which I don't currently have access to).

I'd also be happy to get you and anyone else set up in this environment.

@joelmccoy
Author

related: zarf-dev/zarf#2263

I'm not sure there is a known workaround for using zarf data injection as a nonroot user :/

@justinthelaw justinthelaw changed the title vllm bundle has invalid kubernetes spec and won't properly deploy on internal infrastructure bug: vllm bundle has invalid kubernetes spec and won't properly deploy on internal infrastructure Sep 4, 2024
@justinthelaw justinthelaw self-assigned this Sep 4, 2024