
Node Auto-Provisioning failing for certain GPU nodes (T4) #402

Open
agam opened this issue Aug 19, 2024 · 7 comments
agam commented Aug 19, 2024

How to re-create

A Job marked as requiring nvidia.com/gpu, if it results in a new node being spun up in GKE via Node Auto-Provisioning, will fail to get scheduled on that node.
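
For reference, a minimal Job along these lines reproduces it (the name and image below are just illustrative, not the actual workload):

kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-repro                     # hypothetical name; any GPU job behaves the same
spec:
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
      containers:
      - name: cuda
        image: nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1         # this request is what makes NAP spin up a GPU node
EOF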

Why is this bad

  • Using GPU nodes with Node-Auto-Provisioning in GKE is broken (at least for T4s, not sure which other GPU types are affected)
  • It feels strange that such a core "elasticity behavior" is unacknowledged -- hoping this issue gets attention and results in at least an ETA for the fix

Details on the error

The provisioned node has an nvidia-device-plugin pod.
This pod includes an nvidia-driver-installer init container.
That init container is stuck on startup:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0 100   720  100   720    0     0   113k      0 --:--:-- --:--:-- --:--:--  117k
GPU driver auto installation is disabled.
Waiting for GPU driver libraries to be available.

As a result, the kubelet never registers the nvidia.com/gpu resource, which means that the job (which triggered the node in the first place!) can't get its pods scheduled on it.
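
This is easy to confirm (node and pod names below are placeholders):

# nvidia.com/gpu never shows up under the node's Capacity/Allocatable:
kubectl describe node <affected-node> | grep -i 'nvidia.com/gpu'
# ...and the triggering pods sit Pending with "Insufficient nvidia.com/gpu" in their events:
kubectl describe pod <pending-pod> | grep -i 'insufficient'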

Prior context:

This is based on the following issue, whose fix appears to have regressed (and which I cannot reopen):

#356


agam commented Aug 28, 2024

fwiw I'm seeing this on A100 nodes too now


jian-mo commented Sep 21, 2024

I'm seeing this on T4 as well


itssubhodiproy commented Sep 29, 2024

It's not working for L4 GPUs either. I'm using the COS image and didn't disable automatic installation:

failed to get driver version with error: Failed to download and read GPU driver versions proto with error: failed to download gpu_driver_versions.bin from GCS bucket with error: failed to download gpu_driver_versions.bin artifact from bucket: cos-tools, object: 18244.151.14/lakitu/gpu_driver_versions.bin to /root/home/kubernetes/bin/nvidia/gpu_driver_versions.bin with error: failed to create the reader from GCS client: googleapi: got HTTP response code 403 with body: <?xml version='1.0' encoding='UTF-8'?><Error><Code>AccessDenied</Code><Message>Access denied.</Message></Error>
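
One way to narrow this down (assuming cos-tools is meant to be world-readable, which I believe it is): fetch the same object from outside the node via the public GCS endpoint and compare the status code:

# Object path copied from the error message above; a 200 here but a 403 on the
# node would point at the node's network/permission setup rather than a missing artifact.
curl -sI https://storage.googleapis.com/cos-tools/18244.151.14/lakitu/gpu_driver_versions.bin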

AssenDimitrov commented

Seems like this is still a problem.... T4 in London:

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 759 100 759 0 0 328k 0 --:--:-- --:--:-- --:--:-- 370k
GPU driver auto installation is disabled.
Waiting for GPU driver libraries to be available

stuck like this forever....


MeCode4Food commented Dec 19, 2024

I'm seeing this for T4 GPU nodes on my end as well in asia-southeast.

GPU driver auto installation is disabled.
Waiting for GPU driver libraries to be available


MeCode4Food commented Dec 19, 2024

> curl -H "Metadata-Flavor:Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/kube-labels
# prettified for clarity
cloud.google.com/gke-accelerator=nvidia-tesla-t4,
cloud.google.com/gke-boot-disk=pd-balanced,
cloud.google.com/gke-container-runtime=containerd,
cloud.google.com/gke-cpu-scaling-level=2,
cloud.google.com/gke-logging-variant=DEFAULT,
cloud.google.com/gke-max-pods-per-node=110,
cloud.google.com/gke-memory-gb-scaling-level=7,
cloud.google.com/gke-nodepool=nap-n1-standard-2-gpu1-xxxxxxxx,
cloud.google.com/gke-os-distribution=cos,
cloud.google.com/gke-provisioning=standard,
cloud.google.com/gke-stack-type=IPV4,
cloud.google.com/machine-family=n1,
cloud.google.com/private-node=false

From the following init container's command:

...
  initContainers:
  - command:
    - bash
    - -c
    - |
      LABELS=$( curl --retry 5 -H "Metadata-Flavor:Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/kube-labels || exit 1 )
      IFS=,; for label in $LABELS; do
        IFS==; read -r LABEL VALUE <<< "$label"
        if [[ "${LABEL}" == "cloud.google.com/gke-gpu-driver-version" ]]; then
          GPU_DRIVER_VERSION=$VALUE
        fi
      done
      if [[ "${GPU_DRIVER_VERSION}" == "latest" ]]; then
        echo "latest" > /etc/nvidia/gpu_driver_version_config.txt
        /cos-gpu-installer install --version=latest || exit 1
      elif [[ "${GPU_DRIVER_VERSION}" == "default" ]]; then
        echo "default" > /etc/nvidia/gpu_driver_version_config.txt
        /cos-gpu-installer install || exit 1
      else
        echo "disabled" > /etc/nvidia/gpu_driver_version_config.txt
        echo "GPU driver auto installation is disabled."
      fi
      echo "Waiting for GPU driver libraries to be available."
      while ! [[ -f /usr/local/nvidia/lib64/libcuda.so ]]; do
        sleep 5
      done
      echo "GPU driver is installed."
      echo "InitContainer succeeded. Start nvidia-gpu-device-plugin container."
      exit 0

It looks like the label cloud.google.com/gke-gpu-driver-version is missing from the metadata response, so the script falls through to the "disabled" branch and then waits forever for a driver that never gets installed.
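
Given that, one thing that might be worth trying (untested here; whether NAP propagates this selector into the node's kube-labels is an assumption on my part): add that same label as a node selector on the workload, so an auto-provisioned node hopefully comes up with cloud.google.com/gke-gpu-driver-version set and the init container takes the install branch instead of the disabled one. Roughly:

kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-job                       # hypothetical name
spec:
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
        # the label the init container script checks for; "default" or "latest"
        cloud.google.com/gke-gpu-driver-version: default
      containers:
      - name: cuda
        image: nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1
EOF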

AssenDimitrov commented

The workaround for me was to use the Ubuntu node image and install the driver manually by applying the example from the docs: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers
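
For reference, the Ubuntu path in that doc comes down to applying Google's driver-installer DaemonSet; the manifest URL below is the one the doc referenced when I looked, so double-check it against the current page:

# Installs the NVIDIA driver on Ubuntu GKE nodes via a DaemonSet (verify the URL against the doc above)
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml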
