
NVIDIA Container Toolkit Daemonset Fails with ImagePullBackOff #3782

Closed
kacole2 opened this issue Oct 18, 2022 · 8 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
needs-priority
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
triage/needs-information: Indicates an issue needs more information in order to work on it.
triage/unresolved: Indicates an issue that can not or will not be resolved.

Comments

@kacole2

kacole2 commented Oct 18, 2022

/kind bug

What steps did you take and what happened:
I am attempting to install the NVIDIA GPU components by running kubectl apply -f clusterpolicy-crd.yaml and kubectl apply -f gpu-operator-components.yaml. The nvidia-container-toolkit-daemonset pod is failing with Init:ImagePullBackOff, which in turn causes the nvidia-driver-daemonset to fail with CrashLoopBackOff. Does this have anything to do with NVIDIA/gpu-operator#388?

What did you expect to happen:
The container images should pull successfully.

Anything else you would like to add:
Clean environment using Tanzu Kubernetes Grid.

kendrickc@kendrickc-a01 clusterconfigs % kubectl get pods -A
NAMESPACE                NAME                                                                 READY   STATUS                  RESTARTS       AGE
default                  gpu-operator-ff8587768-j8gp4                                         1/1     Running                 0              8m2s
default                  gpu-operator-node-feature-discovery-master-bd7745d8d-xdl6m           1/1     Running                 0              8m2s
default                  gpu-operator-node-feature-discovery-worker-8c7dv                     1/1     Running                 0              8m3s
default                  gpu-operator-node-feature-discovery-worker-hnh2t                     1/1     Running                 0              8m3s
default                  gpu-operator-node-feature-discovery-worker-j5lcz                     1/1     Running                 0              8m3s
gpu-operator-resources   nvidia-container-toolkit-daemonset-9wrhl                             0/1     Init:ImagePullBackOff   0              7m13s
gpu-operator-resources   nvidia-container-toolkit-daemonset-ft5dw                             0/1     Init:ImagePullBackOff   0              7m13s
gpu-operator-resources   nvidia-driver-daemonset-mc2v7                                        0/1     CrashLoopBackOff        5 (2m3s ago)   7m28s
gpu-operator-resources   nvidia-driver-daemonset-swppw                                        0/1     CrashLoopBackOff        5 (2m1s ago)   7m28s
kube-system              antrea-agent-2f8ld                                                   2/2     Running                 0              76m
kube-system              antrea-agent-7fjlb                                                   2/2     Running                 0              84m
kube-system              antrea-agent-sfmmj                                                   2/2     Running                 0              84m
kube-system              antrea-controller-f9b4b78c-mhvqp                                     1/1     Running                 0              84m
kube-system              coredns-6544689bdd-mbsn5                                             1/1     Running                 0              86m
kube-system              coredns-6544689bdd-r25nk                                             1/1     Running                 0              86m
kube-system              ebs-csi-controller-5454b74dd9-czpbb                                  6/6     Running                 0              84m
kube-system              ebs-csi-node-7v76t                                                   3/3     Running                 0              76m
kube-system              ebs-csi-node-wfphw                                                   3/3     Running                 0              84m
kube-system              etcd-ip-10-187-27-68.us-east-2.compute.internal                      1/1     Running                 0              86m
kube-system              kube-apiserver-ip-10-187-27-68.us-east-2.compute.internal            1/1     Running                 0              86m
kube-system              kube-controller-manager-ip-10-187-27-68.us-east-2.compute.internal   1/1     Running                 0              86m
kube-system              kube-proxy-h2cgv                                                     1/1     Running                 0              76m
kube-system              kube-proxy-ptsxg                                                     1/1     Running                 0              86m
kube-system              kube-proxy-xj8db                                                     1/1     Running                 0              85m
kube-system              kube-scheduler-ip-10-187-27-68.us-east-2.compute.internal            1/1     Running                 0              86m
kube-system              metrics-server-7f4d846c4d-k87jb                                      1/1     Running                 0              84m
kube-system              snapshot-controller-57f9bf5d55-dvplw                                 1/1     Running                 0              84m
tanzu-system             secretgen-controller-69ff6c4b75-c2gqf                                1/1     Running                 0              84m
tkg-system               kapp-controller-7f69c8b6b-dzzgs                                      2/2     Running                 0              86m
tkg-system               tanzu-capabilities-controller-manager-954bb6b47-ddcjr                1/1     Running                 0              86m
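
For reference, a quick way to surface the underlying pull error is to describe one of the failing pods and check recent events (a diagnostic sketch; the pod name and namespace are taken from the listing above):

kubectl -n gpu-operator-resources describe pod nvidia-container-toolkit-daemonset-9wrhl
kubectl -n gpu-operator-resources get events --sort-by=.lastTimestamp | grep -i pull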

Environment:
Machine type: g4dn.8xlarge

  • Cluster-api-provider-aws version: Core Cluster API (v1.1.5), Cluster API Provider AWS (v1.2.0)
  • Kubernetes version: (use kubectl version): Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.3", GitCommit:"434bfd82814af038ad94d62ebe59b133fcb50506", GitTreeState:"clean", BuildDate:"2022-10-12T10:47:25Z", GoVersion:"go1.19.2", Compiler:"gc", Platform:"darwin/amd64"}
  • OS (e.g. from /etc/os-release): Ubuntu 20.04
@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 18, 2022
@k8s-ci-robot
Contributor

@kacole2: This issue is currently awaiting triage.

If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kacole2
Author

kacole2 commented Oct 19, 2022

In the gpu-operator-components.yaml file I changed the registry to Docker Hub, and the image pulled and ran. Not sure why it isn't pulling from NVIDIA's registry.

  toolkit:
    repository: docker.io/nvidia
    image: container-toolkit
    version: 1.4.7-ubuntu18.04
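
As a usage note: after editing the toolkit section, reapplying the manifest (the same file as in the original steps) should let the operator reconcile the daemonset with the new image location:

kubectl apply -f gpu-operator-components.yaml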

@Ankitasw
Member

We have an E2E test running with the current changes for GPU instances in CAPA clusters; if there were an issue, that test would fail.

@Ankitasw
Member

/triage needs-information

@k8s-ci-robot k8s-ci-robot added the triage/needs-information Indicates an issue needs more information in order to work on it. label Nov 16, 2022
@Ankitasw
Member

Ankitasw commented Jan 2, 2023

We use the NVIDIA registry in our tests, and it works fine:

  toolkit:
    enabled: true
    repository: nvcr.io/nvidia/k8s
    image: container-toolkit
    version: v1.11.0
    imagePullPolicy: IfNotPresent
    securityContext:
      privileged: true
      seLinuxOptions:
        level: s0

Maybe it is an environment problem?
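
One way to narrow that down is to check whether the node itself can pull the toolkit image straight from nvcr.io (a sketch, assuming containerd with crictl available on the node; the image reference is assembled from the values above):

crictl pull nvcr.io/nvidia/k8s/container-toolkit:v1.11.0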

@dlipovetsky
Contributor

dlipovetsky commented Mar 6, 2023

CAPA has e2e tests that cover GPU functionality, and there is no evidence this issue was related to CAPA.

/triage unresolved

@k8s-ci-robot k8s-ci-robot added the triage/unresolved Indicates an issue that can not or will not be resolved. label Mar 6, 2023
@dlipovetsky
Contributor

/close

@k8s-ci-robot
Contributor

@dlipovetsky: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
