nvidia-driver-installer failing to install cuda libraries on some pods #139

Open · chrisroat opened this issue Jun 22, 2020 · 3 comments

chrisroat commented Jun 22, 2020

I have many pods running on the same cluster/node pool, which has the nvidia-driver-installer daemonset installed. A small fraction of them (a few percent) have workloads that fail due to a missing libcuda.so.1. When I check manually, I find that the /usr/local/nvidia directory is not present. See below, showing two pods: one set up incorrectly, the other set up correctly.

Worst case, if this is not easily resolvable, is there some way to automatically detect and remove the pods/nodes that get set up incorrectly? (A sketch of one possible check follows the output below.)

xxxx@cloudshell:~ (xxx)$ kubectl exec dask-worker-38dee33fc0e144d0bb7d34cdbefe3741-567h6 --namespace dask-gateway -c dask-worker -- ls /usr/local/nvidia            
ls: cannot access '/usr/local/nvidia': No such file or directory
command terminated with exit code 2
xxxx@cloudshell:~ (xxx)$ kubectl exec dask-worker-38dee33fc0e144d0bb7d34cdbefe3741-2p8kg --namespace dask-gateway -c dask-worker -- ls /usr/local/nvidia
NVIDIA-Linux-x86_64-418.67_77-12371-208-0.cos
bin
bin-workdir
drivers
drivers-workdir
lib64
lib64-workdir
nvidia-installer.log
share
vulkan
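
One possible way to detect this automatically (a sketch only; the busybox image, the hostPath location, and the pod/container names below are assumptions about a standard GKE COS setup, not verified against this cluster): an initContainer that blocks the workload until libcuda.so.1 is actually present, and fails the pod visibly if it never shows up.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-worker                     # hypothetical name
spec:
  initContainers:
  - name: wait-for-nvidia-driver
    image: busybox:1.36
    command:
    - sh
    - -c
    # Poll for the driver library for up to 5 minutes, then fail the pod so the
    # problem surfaces here instead of as a crash in the workload container.
    - |
      for i in $(seq 1 60); do
        [ -e /usr/local/nvidia/lib64/libcuda.so.1 ] && exit 0
        sleep 5
      done
      echo "libcuda.so.1 not found; driver install likely incomplete" >&2
      exit 1
    volumeMounts:
    - name: nvidia-install-dir-host
      mountPath: /usr/local/nvidia
      readOnly: true
  containers:
  - name: dask-worker
    image: my-dask-gpu-image           # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1
  volumes:
  - name: nvidia-install-dir-host
    hostPath:
      # Assumption: this is where the GKE nvidia-driver-installer puts the
      # driver on COS nodes; adjust if the node image differs.
      path: /home/kubernetes/bin/nvidia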
chrisroat changed the title "nvidia-driver-installer failing to install on some workers" → "nvidia-driver-installer failing to install cuda libraries on some pods" on Jun 22, 2020

allanlei commented

In my experience, this is due to the GPU not being ready to use yet. It usually happens when

resources:
  limits:
    nvidia.com/gpu: 1

is not specified. The same problem occurs when trying to access a GPU in Docker without the --gpus all flag.

From what I remember, the driver daemonset sets the cloud.google.com/gke-accelerator label only after it has completed the installation, which can take a couple of minutes.

Setting

nodeSelector:
  cloud.google.com/gke-accelerator: XXX

resources:
  limits:
    nvidia.com/gpu: 1

should help.
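
For reference, a minimal sketch of what that looks like combined in a single pod spec (the accelerator type, pod name, and image below are placeholders, not values from this issue):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod                 # placeholder name
spec:
  nodeSelector:
    # Placeholder: use the accelerator type of your node pool.
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
  containers:
  - name: worker
    image: my-gpu-image:latest  # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1       # required so the device plugin exposes the driver libraries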

chrisroat (Author) commented

Thanks for the suggestion. The specification I use is correct, and most pods are set up correctly and do run successfully. It's only an occasional problem.

The issue turns out to be with preemptible nodes, which can restart so quickly that the system does not correctly set up the GPU (via the daemonset, I think). Here is the note from Google support:

"Since this period was very short, it means that api-server and k8s-scheduler were not aware that the node was preempted in the first place (this is a known issue in GKE with preemptible VMs). Since after the preemption, the node started with the same name, the workloads that were scheduled on the node were simply restarted by the kubelet."

The workaround provided by Google is to add a node termination handler which shuts down the pods gracefully on node termination.
https://github.com/GoogleCloudPlatform/k8s-node-termination-handler

It seems to be working for my case.

chrisroat (Author) commented Jul 1, 2021

This is still an intermittent issue for me.

I upgraded to GKE 1.21 in the hope that Graceful Node Shutdown, which is enabled in GKE 1.21, would mean the end of the Node Termination Handler, as indicated in its README.

But I still get crashes because of the missing libcuda.so.1.
