nvidia-device-plugin failed to run on GPU nodes created by Node Auto-Provisioning #407

hongchaodeng opened this issue Sep 24, 2024 · 1 comment

How to reproduce

  • Create a GKE cluster in Standard mode
  • Enable Node Auto-Provisioning with L4 GPU capacity
  • Try to create a pod with an nvidia.com/gpu resource request (a minimal example is sketched after this list)
  • The pod will be stuck in the PodInitializing state
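
The original report does not include the pod spec; a minimal sketch of the kind of pod described in step 3 might look like the following (the pod name, container image, and node selector value are illustrative assumptions, not taken from the issue):

$ cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                                  # hypothetical name, for illustration only
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-l4   # assumed to match the L4 capacity from step 2
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04    # any image works here; only the resource request matters
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1                         # the GPU request that should trigger auto-provisioning
EOF

With Node Auto-Provisioning enabled, scheduling a pod like this is what prompts GKE to create the L4 GPU node in the first place.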

Analysis

When Node Auto-Provisioning in GKE creates GPU nodes, it schedules the nvidia-gpu-device-plugin workload onto them. Only after that workload's init containers (including nvidia-driver-installer) complete successfully does the node report allocatable nvidia.com/gpu resources. In this case the pod is stuck in initialization, so no pod with a GPU request can be scheduled onto the node.

$ kubectl -n kube-system get pod

kube-system       nvidia-gpu-device-plugin-small-cos-4fpl4                         0/2     Init:0/2   0          16m

$ kubectl -n kube-system logs nvidia-gpu-device-plugin-small-cos-hzvds -c nvidia-driver-installer

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   598  100   598    0     0   273k      0 --:--:-- --:--:-- --:--:--  291k
GPU driver auto installation is disabled.
Waiting for GPU driver libraries to be available.
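
One way to confirm the symptom described in the analysis is to inspect the node's allocatable resources directly (<gpu-node-name> below is a placeholder):

$ kubectl describe node <gpu-node-name> | grep -A 8 "Allocatable:"

On a healthy GPU node the Allocatable list includes an nvidia.com/gpu entry; while the device plugin pod is stuck in its init containers, that entry is missing (or stays at 0), so GPU pods remain unschedulable.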

MeCode4Food commented Dec 26, 2024

#402 might be related
