
Node Auto-Provisioning failing for certain GPU nodes (T4) #402

Open
agam opened this issue Aug 19, 2024 · 7 comments
agam commented Aug 19, 2024

How to re-create

A Job marked as requiring nvidia.com/gpu, if it results in a new node being spun up in GKE via Node Auto-Provisioning, will fail to get scheduled on that node.
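
For reference, a minimal Job along these lines reproduces it (the name and image below are just illustrative, not the actual workload):

kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-repro                     # hypothetical name; any GPU job behaves the same
spec:
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
      containers:
      - name: cuda
        image: nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1         # this request is what makes NAP spin up a GPU node
EOF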

Why is this bad

  • Using GPU nodes with Node-Auto-Provisioning in GKE is broken (at least for T4s, not sure which other GPU types are affected)
  • It feels strange that such a core "elasticity behavior" is unacknowledged -- hoping this issue gets attention and results in at least an ETA for the fix

Details on the error

The provisioned node has an nvidia-device-plugin pod.
This pod includes an nvidia-driver-installer init container.
That init container is stuck on startup:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0 100   720  100   720    0     0   113k      0 --:--:-- --:--:-- --:--:--  117k
GPU driver auto installation is disabled.
Waiting for GPU driver libraries to be available.

As a result, the kubelet never registers the nvidia.com/gpu resource, which means that the job (which triggered the node in the first place!) can't get its pods scheduled on it.
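
This is easy to confirm (node and pod names below are placeholders):

# nvidia.com/gpu never shows up under the node's Capacity/Allocatable:
kubectl describe node <affected-node> | grep -i 'nvidia.com/gpu'
# ...and the triggering pods sit Pending with "Insufficient nvidia.com/gpu" in their events:
kubectl describe pod <pending-pod> | grep -i 'insufficient'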

Prior context:

This is based on the following issue, whose fix appears to have regressed (and which I cannot reopen):

#356


agam commented Aug 28, 2024

fwiw I'm seeing this on A100 nodes too now


jian-mo commented Sep 21, 2024

I'm seeing this on T4 as well


itssubhodiproy commented Sep 29, 2024

It's not working for L4 GPUs either. I'm using the COS image and didn't disable automatic installation:

failed to get driver version with error: Failed to download and read GPU driver versions proto with error: failed to download gpu_driver_versions.bin from GCS bucket with error: failed to download gpu_driver_versions.bin artifact from bucket: cos-tools, object: 18244.151.14/lakitu/gpu_driver_versions.bin to /root/home/kubernetes/bin/nvidia/gpu_driver_versions.bin with error: failed to create the reader from GCS client: googleapi: got HTTP response code 403 with body: <?xml version='1.0' encoding='UTF-8'?><Error><Code>AccessDenied</Code><Message>Access denied.</Message></Error>
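
One way to narrow this down (assuming cos-tools is meant to be world-readable, which I believe it is): fetch the same object from outside the node via the public GCS endpoint and compare the status code:

# Object path copied from the error message above; a 200 here but a 403 on the
# node would point at the node's network/permission setup rather than a missing artifact.
curl -sI https://storage.googleapis.com/cos-tools/18244.151.14/lakitu/gpu_driver_versions.bin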

AssenDimitrov commented

Seems like this is still a problem.... T4 in London:

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 759 100 759 0 0 328k 0 --:--:-- --:--:-- --:--:-- 370k
GPU driver auto installation is disabled.
Waiting for GPU driver libraries to be available

stuck like this forever....


MeCode4Food commented Dec 19, 2024

I'm seeing this for T4 GPU nodes on my end as well in asia-southeast.

GPU driver auto installation is disabled.
Waiting for GPU driver libraries to be available


MeCode4Food commented Dec 19, 2024

> curl -H "Metadata-Flavor:Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/kube-labels
# prettified for clarity
cloud.google.com/gke-accelerator=nvidia-tesla-t4,
cloud.google.com/gke-boot-disk=pd-balanced,
cloud.google.com/gke-container-runtime=containerd,
cloud.google.com/gke-cpu-scaling-level=2,
cloud.google.com/gke-logging-variant=DEFAULT,
cloud.google.com/gke-max-pods-per-node=110,
cloud.google.com/gke-memory-gb-scaling-level=7,
cloud.google.com/gke-nodepool=nap-n1-standard-2-gpu1-xxxxxxxx,
cloud.google.com/gke-os-distribution=cos,
cloud.google.com/gke-provisioning=standard,
cloud.google.com/gke-stack-type=IPV4,
cloud.google.com/machine-family=n1,
cloud.google.com/private-node=false

From the following init container's command:

...
  initContainers:
  - command:
    - bash
    - -c
    - |
      LABELS=$( curl --retry 5 -H "Metadata-Flavor:Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/kube-labels || exit 1 )
      IFS=,; for label in $LABELS; do
        IFS==; read -r LABEL VALUE <<< "$label"
        if [[ "${LABEL}" == "cloud.google.com/gke-gpu-driver-version" ]]; then
          GPU_DRIVER_VERSION=$VALUE
        fi
      done
      if [[ "${GPU_DRIVER_VERSION}" == "latest" ]]; then
        echo "latest" > /etc/nvidia/gpu_driver_version_config.txt
        /cos-gpu-installer install --version=latest || exit 1
      elif [[ "${GPU_DRIVER_VERSION}" == "default" ]]; then
        echo "default" > /etc/nvidia/gpu_driver_version_config.txt
        /cos-gpu-installer install || exit 1
      else
        echo "disabled" > /etc/nvidia/gpu_driver_version_config.txt
        echo "GPU driver auto installation is disabled."
      fi
      echo "Waiting for GPU driver libraries to be available."
      while ! [[ -f /usr/local/nvidia/lib64/libcuda.so ]]; do
        sleep 5
      done
      echo "GPU driver is installed."
      echo "InitContainer succeeded. Start nvidia-gpu-device-plugin container."
      exit 0

It looks like the label cloud.google.com/gke-gpu-driver-version is missing from the metadata response, so the script falls through to the "disabled" branch and then waits forever for a driver that never gets installed.
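
Given that, one thing that might be worth trying (untested here; whether NAP propagates this selector into the node's kube-labels is an assumption on my part): add that same label as a node selector on the workload, so an auto-provisioned node hopefully comes up with cloud.google.com/gke-gpu-driver-version set and the init container takes the install branch instead of the disabled one. Roughly:

kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-job                       # hypothetical name
spec:
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
        # the label the init container script checks for; "default" or "latest"
        cloud.google.com/gke-gpu-driver-version: default
      containers:
      - name: cuda
        image: nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1
EOF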

AssenDimitrov commented

The workaround for me was to use the Ubuntu node image and install the driver manually by applying the example from the docs: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers
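
For reference, the Ubuntu path in that doc comes down to applying Google's driver-installer DaemonSet; the manifest URL below is the one the doc referenced when I looked, so double-check it against the current page:

# Installs the NVIDIA driver on Ubuntu GKE nodes via a DaemonSet (verify the URL against the doc above)
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml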
