nvidia-driver-installer crash loop during GKE scale ups #132
Comments
I can't repro this with 1.15 on GKE. I scaled the instances from 1 to 2 and saw that the driver is running fine on the new node.
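For reference, the scale-up was along these lines (cluster, node pool, and zone names below are placeholders, not the actual values):

```sh
# Resize the GPU node pool from 1 to 2 nodes (names and zone are placeholders)
gcloud container clusters resize repro-cluster \
  --node-pool gpu-pool \
  --num-nodes 2 \
  --zone us-central1-a

# Then confirm the installer pod on the new node completed its init container
kubectl get pods -n kube-system -o wide | grep nvidia-driver-installer
```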
Are you still seeing this? If so, can you please provide repro steps?
We are unable to locate the exact root cause since the repro steps are missing here.
This can be reproduced under Ubuntu but not on COS. I just rolled a new cluster with Ubuntu and got this error.
Can you provide the GKE version and the OS where you reproduced this error?
I used the rapid channel with Kubernetes 1.19. I'm not sure of the Ubuntu version; I just selected it from the drop-down in the create-cluster wizard. The GPU I used was an NVIDIA T4.
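For anyone trying to reproduce this from the CLI rather than the console wizard, a rough equivalent of those selections would be something like the following (cluster name, zone, and machine type are assumptions, not my exact settings):

```sh
# Hypothetical CLI equivalent of the create-cluster wizard selections above
gcloud container clusters create gpu-repro \
  --release-channel rapid \
  --zone us-central1-a \
  --num-nodes 1 \
  --machine-type n1-standard-4 \
  --image-type UBUNTU \
  --accelerator type=nvidia-tesla-t4,count=1
```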
I encountered the same issue.

What's weird is that /root/home doesn't even exist on the node, so I have no idea why the link is failing to be created by the pod. I tried updating to the latest version of the daemonset and it didn't help.
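One thing worth noting (this is an assumption based on how the preloaded manifest has typically been set up, so double-check against the actual daemonset): the installer container mounts the host's root filesystem at /root via a hostPath volume, so /root/home/kubernetes/bin/nvidia inside the pod would correspond to /home/kubernetes/bin/nvidia on the node itself. That would explain why /root/home isn't visible when SSH'd into the node directly:

```sh
# On the node (e.g. `gcloud compute ssh <node-name>`), the host-side path that
# the installer pod reports as /root/home/kubernetes/bin/nvidia would be:
ls -la /home/kubernetes/bin/nvidia
```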
Getting the same issue here. I can't get any logs out of the pod, but describing it shows the error.
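In case it helps with gathering more detail, the commands below are what I'd use to pull the pod events and any logs from the previous failed run (the label and init-container name assume the stock daemonset manifest):

```sh
# Pod events and init-container exit codes for the installer pods
kubectl -n kube-system describe pods -l k8s-app=nvidia-driver-installer

# Logs from the last failed run of the init container, if any were captured
kubectl -n kube-system logs <nvidia-driver-installer-pod-name> \
  -c nvidia-driver-installer --previous
```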
We've been using the nvidia-driver-installer on Ubuntu node groups via GKE v1.15, per the official How-to GPU instructions specified here. The daemonset deployed via daemonset-preloaded.yaml appeared to work correctly for some time.
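For completeness, the daemonset was applied roughly as in the how-to; the Ubuntu manifest URL below is my best recollection of the one in the public container-engine-accelerators repo, so adjust it if your setup pins a different path or revision:

```sh
# Apply the driver-installer DaemonSet for Ubuntu node images, per the GKE GPU how-to
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml
```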
However, we started noticing issues last Friday when new nodes were added to the node group via cluster autoscaling: the nvidia-driver-installer daemonset pods scheduled onto these new nodes began to crash loop, as their initContainers were exiting with non-zero exit codes.

Upon examining pod logs, the failed pods all contain the same lines as their last output before exiting; see here for the full log output from one of the failed pods.
I've logged into one of the nodes and manually removed the /root/home/kubernetes/bin/nvidia folder (which is presumably created by the very first instance of the nvidia-driver-installer pod scheduled to a node when it comes up), but the folder re-appears and the daemonset pods continue to crash loop. Nodes whose daemonset pods are in this state don't have the drivers correctly installed, and jobs that require them fail to import CUDA due to driver issues.

We've been experiencing this issue for 4 days now with nodes that receive live production traffic. Not every node that scales up hits this problem, but most do. If a node comes up and its nvidia-driver-installer pod begins to crash, we've had no luck bringing it out of that state. Instead we've manually marked the node as unschedulable and brought it down, hoping the next one to come up won't experience the same problem.
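For reference, the cleanup attempt and the workaround above look roughly like this (the node name is a placeholder, and the rm path assumes the directory from the pod logs is the container-side view of /home/kubernetes/bin/nvidia on the host):

```sh
# The removal attempt: clear the driver directory on the host (it re-appears
# once the installer pod retries, so this has not helped so far)
gcloud compute ssh gke-prod-gpu-pool-node-placeholder -- \
  sudo rm -rf /home/kubernetes/bin/nvidia

# The workaround: mark the node unschedulable and take it out of rotation
kubectl cordon gke-prod-gpu-pool-node-placeholder
kubectl drain gke-prod-gpu-pool-node-placeholder --ignore-daemonsets
```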
From our perspective, nothing has changed in our cluster configuration, node group configuration, or K8s manifests that would cause this issue to start occurring. We did experience something similar in mid-December, but it resolved itself within a few hours and we didn't think much of it. I'm happy to provide more logs or detailed information about the errors on request!
Any thoughts about what could be causing this?