nvidia-device-plugin failed to run on GPU nodes created by Node Auto-Provisioning #407

hongchaodeng opened this issue Sep 24, 2024 · 1 comment

How to reproduce

  • Create a GKE cluster in Standard mode
  • Enable Node Auto-Provisioning with L4 GPU capacity
  • Try to create a pod with an nvidia.com/gpu resource request (a minimal example is sketched after this list)
  • The pod will be stuck in the PodInitializing state
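
The original report does not include the pod spec; a minimal sketch of the kind of pod described in step 3 might look like the following (the pod name, container image, and node selector value are illustrative assumptions, not taken from the issue):

$ cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                                  # hypothetical name, for illustration only
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-l4   # assumed to match the L4 capacity from step 2
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04    # any image works here; only the resource request matters
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1                         # the GPU request that should trigger auto-provisioning
EOF

With Node Auto-Provisioning enabled, scheduling a pod like this is what prompts GKE to create the L4 GPU node in the first place.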

Analysis

When Node Auto-Provisioning in GKE creates GPU nodes, it schedules the nvidia-gpu-device-plugin workload onto them. Only after that workload's init containers (including nvidia-driver-installer) complete successfully does the node report allocatable nvidia.com/gpu resources. In this case the pod is stuck in initialization, so no pod with a GPU request can be scheduled onto the node.

$ kubectl -n kube-system get pod

kube-system       nvidia-gpu-device-plugin-small-cos-4fpl4                         0/2     Init:0/2   0          16m

$ kubectl -n kube-system logs nvidia-gpu-device-plugin-small-cos-hzvds -c nvidia-driver-installer

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   598  100   598    0     0   273k      0 --:--:-- --:--:-- --:--:--  291k
GPU driver auto installation is disabled.
Waiting for GPU driver libraries to be available.
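
One way to confirm the symptom described in the analysis is to inspect the node's allocatable resources directly (<gpu-node-name> below is a placeholder):

$ kubectl describe node <gpu-node-name> | grep -A 8 "Allocatable:"

On a healthy GPU node the Allocatable list includes an nvidia.com/gpu entry; while the device plugin pod is stuck in its init containers, that entry is missing (or stays at 0), so GPU pods remain unschedulable.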

MeCode4Food commented Dec 26, 2024

#402 might be related
