
NVIDIA Container Toolkit Daemonset Fails with ImagePullBackOff #3782

Closed
kacole2 opened this issue Oct 18, 2022 · 8 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
needs-priority
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
triage/needs-information: Indicates an issue needs more information in order to work on it.
triage/unresolved: Indicates an issue that can not or will not be resolved.

Comments

@kacole2

kacole2 commented Oct 18, 2022

/kind bug

What steps did you take and what happened:
I am attempting to install the NVIDIA GPU components by running kubectl apply -f clusterpolicy-crd.yaml and kubectl apply -f gpu-operator-components.yaml. The nvidia-container-toolkit-daemonset pod is failing with Init:ImagePullBackOff, which in turn causes the nvidia-driver-daemonset to fail with CrashLoopBackOff. Does this have anything to do with NVIDIA/gpu-operator#388?

What did you expect to happen:
The container images should pull successfully.

Anything else you would like to add:
Clean environment using Tanzu Kubernetes Grid.

kendrickc@kendrickc-a01 clusterconfigs % kubectl get pods -A
NAMESPACE                NAME                                                                 READY   STATUS                  RESTARTS       AGE
default                  gpu-operator-ff8587768-j8gp4                                         1/1     Running                 0              8m2s
default                  gpu-operator-node-feature-discovery-master-bd7745d8d-xdl6m           1/1     Running                 0              8m2s
default                  gpu-operator-node-feature-discovery-worker-8c7dv                     1/1     Running                 0              8m3s
default                  gpu-operator-node-feature-discovery-worker-hnh2t                     1/1     Running                 0              8m3s
default                  gpu-operator-node-feature-discovery-worker-j5lcz                     1/1     Running                 0              8m3s
gpu-operator-resources   nvidia-container-toolkit-daemonset-9wrhl                             0/1     Init:ImagePullBackOff   0              7m13s
gpu-operator-resources   nvidia-container-toolkit-daemonset-ft5dw                             0/1     Init:ImagePullBackOff   0              7m13s
gpu-operator-resources   nvidia-driver-daemonset-mc2v7                                        0/1     CrashLoopBackOff        5 (2m3s ago)   7m28s
gpu-operator-resources   nvidia-driver-daemonset-swppw                                        0/1     CrashLoopBackOff        5 (2m1s ago)   7m28s
kube-system              antrea-agent-2f8ld                                                   2/2     Running                 0              76m
kube-system              antrea-agent-7fjlb                                                   2/2     Running                 0              84m
kube-system              antrea-agent-sfmmj                                                   2/2     Running                 0              84m
kube-system              antrea-controller-f9b4b78c-mhvqp                                     1/1     Running                 0              84m
kube-system              coredns-6544689bdd-mbsn5                                             1/1     Running                 0              86m
kube-system              coredns-6544689bdd-r25nk                                             1/1     Running                 0              86m
kube-system              ebs-csi-controller-5454b74dd9-czpbb                                  6/6     Running                 0              84m
kube-system              ebs-csi-node-7v76t                                                   3/3     Running                 0              76m
kube-system              ebs-csi-node-wfphw                                                   3/3     Running                 0              84m
kube-system              etcd-ip-10-187-27-68.us-east-2.compute.internal                      1/1     Running                 0              86m
kube-system              kube-apiserver-ip-10-187-27-68.us-east-2.compute.internal            1/1     Running                 0              86m
kube-system              kube-controller-manager-ip-10-187-27-68.us-east-2.compute.internal   1/1     Running                 0              86m
kube-system              kube-proxy-h2cgv                                                     1/1     Running                 0              76m
kube-system              kube-proxy-ptsxg                                                     1/1     Running                 0              86m
kube-system              kube-proxy-xj8db                                                     1/1     Running                 0              85m
kube-system              kube-scheduler-ip-10-187-27-68.us-east-2.compute.internal            1/1     Running                 0              86m
kube-system              metrics-server-7f4d846c4d-k87jb                                      1/1     Running                 0              84m
kube-system              snapshot-controller-57f9bf5d55-dvplw                                 1/1     Running                 0              84m
tanzu-system             secretgen-controller-69ff6c4b75-c2gqf                                1/1     Running                 0              84m
tkg-system               kapp-controller-7f69c8b6b-dzzgs                                      2/2     Running                 0              86m
tkg-system               tanzu-capabilities-controller-manager-954bb6b47-ddcjr                1/1     Running                 0              86m
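
For reference, a quick way to surface the underlying pull error is to describe one of the failing pods and check recent events (a diagnostic sketch; the pod name and namespace are taken from the listing above):

kubectl -n gpu-operator-resources describe pod nvidia-container-toolkit-daemonset-9wrhl
kubectl -n gpu-operator-resources get events --sort-by=.lastTimestamp | grep -i pull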

Environment:
Machine type: g4dn.8xlarge

  • Cluster-api-provider-aws version: Core Cluster API (v1.1.5), Cluster API Provider AWS (v1.2.0)
  • Kubernetes version: (use kubectl version): Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.3", GitCommit:"434bfd82814af038ad94d62ebe59b133fcb50506", GitTreeState:"clean", BuildDate:"2022-10-12T10:47:25Z", GoVersion:"go1.19.2", Compiler:"gc", Platform:"darwin/amd64"}
  • OS (e.g. from /etc/os-release): Ubuntu 20.04
@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 18, 2022
@k8s-ci-robot
Contributor

@kacole2: This issue is currently awaiting triage.

If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kacole2
Author

kacole2 commented Oct 19, 2022

In the gpu-operator-components.yaml file I changed the registry to Docker Hub, and the image pulled and ran. Not sure why it isn't pulling from NVIDIA's registry.

  toolkit:
    repository: docker.io/nvidia
    image: container-toolkit
    version: 1.4.7-ubuntu18.04
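
As a usage note: after editing the toolkit section, reapplying the manifest (the same file as in the original steps) should let the operator reconcile the daemonset with the new image location:

kubectl apply -f gpu-operator-components.yaml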

@Ankitasw
Member

We have an E2E test running with the current changes for GPU instances in CAPA clusters; if there were an issue, that test would fail.

@Ankitasw
Member

/triage needs-information

@k8s-ci-robot k8s-ci-robot added the triage/needs-information Indicates an issue needs more information in order to work on it. label Nov 16, 2022
@Ankitasw
Member

Ankitasw commented Jan 2, 2023

We use the NVIDIA registry in our tests, and it works fine:

  toolkit:
    enabled: true
    repository: nvcr.io/nvidia/k8s
    image: container-toolkit
    version: v1.11.0
    imagePullPolicy: IfNotPresent
    securityContext:
      privileged: true
      seLinuxOptions:
        level: s0

Maybe it is an environment problem?
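
One way to narrow that down is to check whether the node itself can pull the toolkit image straight from nvcr.io (a sketch, assuming containerd with crictl available on the node; the image reference is assembled from the values above):

crictl pull nvcr.io/nvidia/k8s/container-toolkit:v1.11.0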

@dlipovetsky
Contributor

dlipovetsky commented Mar 6, 2023

CAPA has e2e tests that cover GPU functionality, and there is no evidence this issue was related to CAPA.

/triage unresolved

@k8s-ci-robot k8s-ci-robot added the triage/unresolved Indicates an issue that can not or will not be resolved. label Mar 6, 2023
@dlipovetsky
Contributor

/close

@k8s-ci-robot
Contributor

@dlipovetsky: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
