Nvidia container toolkit daemonset pod fails with ErrImagePull #388
Comments
@Ankitasw do you know which tag the image sha was referring to? Also, when you say GPU operator, which version exactly do you mean? |
@elezar Sorry for the typo, it's v1.6.2. This is the operator/validator configuration we are using:

```yaml
operator:
  defaultRuntime: containerd
validator:
  repository: nvcr.io/nvidia/k8s
  image: cuda-sample
  version: vectoradd-cuda10.2
  imagePullPolicy: IfNotPresent
```
|
But I think the issue is in the init container, as the init container is using the above-mentioned sha:

```
Controlled By:  DaemonSet/nvidia-container-toolkit-daemonset
Init Containers:
  driver-validation:
    Container ID:
    Image:         nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      export SYS_LIBRARY_PATH=$(ldconfig -v 2>/dev/null | grep -v '^[[:space:]]' | cut -d':' -f1 | tr '[[:space:]]' ':'); export NVIDIA_LIBRARY_PATH=/run/nvidia/driver/usr/lib/x86_64-linux-gnu/:/run/nvidia/driver/usr/lib64; export LD_LIBRARY_PATH=${SYS_LIBRARY_PATH}:${NVIDIA_LIBRARY_PATH}; echo ${LD_LIBRARY_PATH}; export PATH=/run/nvidia/driver/usr/bin/:${PATH}; until nvidia-smi; do echo waiting for nvidia drivers to be loaded; sleep 5; done
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /run/nvidia from nvidia-install-path (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5jf99 (ro)
```
|
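For readability, the Args above amount to the following script (restated with comments added; the commands and paths are exactly those shown in the describe output):

```sh
# Collect the host's library directories reported by ldconfig.
export SYS_LIBRARY_PATH=$(ldconfig -v 2>/dev/null | grep -v '^[[:space:]]' | cut -d':' -f1 | tr '[[:space:]]' ':')

# Library paths installed by the driver container under /run/nvidia/driver.
export NVIDIA_LIBRARY_PATH=/run/nvidia/driver/usr/lib/x86_64-linux-gnu/:/run/nvidia/driver/usr/lib64
export LD_LIBRARY_PATH=${SYS_LIBRARY_PATH}:${NVIDIA_LIBRARY_PATH}
echo ${LD_LIBRARY_PATH}

# Make the driver container's binaries (nvidia-smi) visible on PATH.
export PATH=/run/nvidia/driver/usr/bin/:${PATH}

# Block until the NVIDIA driver responds; the toolkit container starts only after this init container exits.
until nvidia-smi; do
  echo waiting for nvidia drivers to be loaded
  sleep 5
done
```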
From the source we see that the init container image is pinned to a specific digest, which may have been removed. There does not seem to be a way to override this in the version you are using. |
I tried upgrading the operator to the latest version, but the pods under the gpu-operator-resources namespace were not coming up at all, so I couldn't check for any errors either, since the pods were not there.

```yaml
containers:
  - name: gpu-operator
    image: nvcr.io/nvidia/gpu-operator:v1.11.1
    imagePullPolicy: IfNotPresent
    command: ["gpu-operator"]
    args:
      - "--zap-time-encoding=epoch"
    env:
      - name: WATCH_NAMESPACE
        value: ""
      - name: OPERATOR_NAME
        value: "gpu-operator"
      - name: OPERATOR_NAMESPACE
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
```
|
I will try upgrading the operator to the latest version again and let you know. |
@elezar I am getting the below error, because of which the pods are not coming up after using the latest operator version. Could you please help resolve this issue?
Update: I think I missed `runtimeClass: nvidia` in the operator manifest. |
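For reference, a minimal sketch of where that setting lives, assuming the "operator manifest" here is the ClusterPolicy custom resource (field names follow the v1.11.x chart defaults):

```yaml
# Sketch only -- assumes "operator manifest" refers to the ClusterPolicy CR.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  operator:
    defaultRuntime: containerd
    runtimeClass: nvidia   # RuntimeClass used by the operands when containerd is the runtime
```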
I will let @shivamerla respond here as he would have more information as to what could be the issue there. |
I found the resolution for it, but am now getting some other errors 😄 Maybe the newer version has more validations, so I am modifying the manifests based on that. I will keep this issue open for some time until I am successfully able to use the recent operator version. |
Yes, since you are maintaining manifests for these, please pull in the recent ones from |
@shivamerla do you mean the cluster policy crd? |
Basically all the manifests you have here need to be updated for the latest version (Roles, NFD manifests, the CRD, and the CR from values.yaml). |
Below is the template from helm for v1.11.1, for reference.
|
The cluster was deployed successfully and the CUDA vector-add test passed, but the above pod fails with the below error:
@shivamerla could you please help resolve this? |
Also for future reference, where can I find this template for gpu operator resources? |
For the plugin-validator error: it needs an available GPU on the node, so this can happen if other pods are consuming the GPUs. If you want to disable plugin validation, you can set it as below with the validator component in the ClusterPolicy.
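The exact snippet from this comment is not reproduced above; as a rough sketch, disabling the validation workload is commonly done through an environment variable on the plugin validator (the WITH_WORKLOAD name is an assumption here, verify it against the validator documentation for your operator version):

```yaml
# Sketch only -- WITH_WORKLOAD is assumed; check the validator docs for your release.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: "false"   # skip launching a GPU workload pod during plugin validation
```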
For getting templates for each release you can run
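The command itself is not shown above; a sketch of one way to do it, assuming the chart is pulled from NVIDIA's public helm repository:

```sh
# Render the chart manifests for a specific release locally (chart/repo names assumed).
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm template gpu-operator nvidia/gpu-operator \
  --version v1.11.1 \
  --namespace gpu-operator-resources \
  > gpu-operator-v1.11.1-rendered.yaml
```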
|
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist

- Do you have i2c_core and ipmi_msghandler loaded on the nodes?
- Did you apply the CRD? (kubectl describe clusterpolicies --all-namespaces)

1. Issue or feature description
We are using GPU operator v1.6.2 in one of our E2E tests in cluster-api-provider-aws. It was working two days ago, but it has now started failing: the nvidia-container-toolkit-daemonset pod fails to come up with the below error.
Has anything changed recently that could be causing this issue?
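A minimal way to surface the image-pull error on the failing pod, assuming the gpu-operator-resources namespace used elsewhere in this thread (the label selector is an assumption, adjust it to match your pods):

```sh
# Inspect the operand pods and the reason for the image pull failure.
kubectl -n gpu-operator-resources get pods --show-labels
kubectl -n gpu-operator-resources describe pod \
  -l app=nvidia-container-toolkit-daemonset   # label assumed; adjust to your daemonset pods
kubectl -n gpu-operator-resources get events --sort-by=.lastTimestamp
```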
2. Steps to reproduce the issue
Reference manifest used.