Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nvidia container toolkit daemonset pod fails with ErrImagePull #388

Closed
4 of 5 tasks
Ankitasw opened this issue Aug 10, 2022 · 16 comments
Closed
4 of 5 tasks

Nvidia container toolkit daemonset pod fails with ErrImagePull #388

Ankitasw opened this issue Aug 10, 2022 · 16 comments

Comments

@Ankitasw
Copy link

Ankitasw commented Aug 10, 2022

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node?
  • Are you running Kubernetes v1.13+?
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)

1. Issue or feature description

We are using GPU operator v1.6.2 in one of our E2E tests in cluster-api-provider-aws, it was working 2 days back, but now it has started failing, as nvidia-container-toolkit-daemonset pod failed to come up with below error:

Failed to pull image "nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59": rpc error: code = NotFound desc = failed to pull and unpack image "nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59": failed to resolve reference "nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59": nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59: not found

Is there anything changed recently which could be causing this issue?

2. Steps to reproduce the issue

Reference manifest used.

@elezar
Copy link
Member

elezar commented Aug 10, 2022

@Ankitasw do you know which tag the image sha was referring to? I know that we recently saw that the cuda:11.1-base image stopped working. My understanding is that the image tags that do not explicitly specify the base OS (e.g. ubuntu18.04) were deprecated.

When you say GPU operator v0.6.1 do you mean v1.6.0?

@Ankitasw
Copy link
Author

Ankitasw commented Aug 10, 2022

When you say GPU operator v0.6.1 do you mean v1.6.0?

@elezar Sorry for typo, its v1.6.2.

This is the operator validator we are using:

 operator:
    defaultRuntime: containerd
    validator:
      repository: nvcr.io/nvidia/k8s
      image: cuda-sample
      version: vectoradd-cuda10.2
      imagePullPolicy: IfNotPresent

@Ankitasw
Copy link
Author

Ankitasw commented Aug 10, 2022

But i think the issue is coming in init container, as init container is using above mentioned sha.

Controlled By:  DaemonSet/nvidia-container-toolkit-daemonset
Init Containers:
  driver-validation:
    Container ID:  
    Image:         nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      export SYS_LIBRARY_PATH=$(ldconfig -v 2>/dev/null | grep -v '^[[:space:]]' | cut -d':' -f1 | tr '[[:space:]]' ':');   export NVIDIA_LIBRARY_PATH=/run/nvidia/driver/usr/lib/x86_64-linux-gnu/:/run/nvidia/driver/usr/lib64; export LD_LIBRARY_PATH=${SYS_LIBRARY_PATH}:${NVIDIA_LIBRARY_PATH}; echo ${LD_LIBRARY_PATH}; export PATH=/run/nvidia/driver/usr/bin/:${PATH}; until nvidia-smi; do echo waiting for nvidia drivers to be loaded; sleep 5; done
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /run/nvidia from nvidia-install-path (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5jf99 (ro)

@elezar
Copy link
Member

elezar commented Aug 10, 2022

From the source we see:

		// using cuda 11.0 base image(11.0-base-ubi8) as default for initContainer on RHEL/OCP as this is minimal and used as base image for most gpu-operator images.
		// Also, this allows us to not having separate configurable for initContainer repo/image/tag etc.
		initContainerImage := "nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59"

Which may have been removed.

There does not seem to be a way to override this in v1.6.2. Would upgrading the GPU Operator version be an option, or is there some restriction that requires this version to be used?

@Ankitasw
Copy link
Author

Ankitasw commented Aug 10, 2022

I tried upgrading the operator to latest version, but the pods under namespace gpu-operator-resources were not coming up at all, so couldn't check for any errors as well since pods were not there.

 containers:
        - name: gpu-operator
          image: nvcr.io/nvidia/gpu-operator:v1.11.1
          imagePullPolicy: IfNotPresent
          command: ["gpu-operator"]
          args:
            - "--zap-time-encoding=epoch"
          env:
            - name: WATCH_NAMESPACE
              value: ""
            - name: OPERATOR_NAME
              value: "gpu-operator"
            - name: OPERATOR_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name

@Ankitasw
Copy link
Author

I will again try to upgrade the operator to use latest version and let you know.

@Ankitasw
Copy link
Author

Ankitasw commented Aug 10, 2022

@elezar getting below error because of which pods are not coming up after using latest operator version. Could you please help resolving this issue?

W0810 10:52:11.580070 1 reflector.go:324] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: failed to list *v1beta1.RuntimeClass: runtimeclasses.node.k8s.io is forbidden: User "system:serviceaccount:default:gpu-operator" cannot list resource "runtimeclasses" in API group "node.k8s.io" at the cluster scope
E0810 10:52:11.580094 1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: Failed to watch *v1beta1.RuntimeClass: failed to list *v1beta1.RuntimeClass: runtimeclasses.node.k8s.io is forbidden: User "system:serviceaccount:default:gpu-operator" cannot list resource "runtimeclasses" in API group "node.k8s.io" at the cluster scope

Update: I think I missed runtimeClass: nvidia in the operator manifest.

@elezar
Copy link
Member

elezar commented Aug 10, 2022

I will let @shivamerla respond here as he would have more information as to what could be the issue there.

@Ankitasw
Copy link
Author

I got the resolution of it, but getting some other errors 😄 maybe newer version has more validations, so I am modifying manifests based on that. Will keep this issue open for sometime till I am successfully able to use recent operator version.

@shivamerla
Copy link
Contributor

Yes, since you are maintaining manifests for these, please pull in the recent ones from release-1.11 branch.

@Ankitasw
Copy link
Author

Ankitasw commented Aug 10, 2022

@shivamerla do you mean the cluster policy crd?

@shivamerla
Copy link
Contributor

basically all manifests you have here needs to be updated for latest version.(Roles, NFD manifests, CRD and CR from values.yaml).

@shivamerla
Copy link
Contributor

shivamerla commented Aug 10, 2022

Below is the template from helm for reference for v1.11.1.

---
# Source: gpu-operator/charts/node-feature-discovery/templates/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: node-feature-discovery
  labels:
    helm.sh/chart: node-feature-discovery-0.10.1
    app.kubernetes.io/name: node-feature-discovery
    app.kubernetes.io/instance: release-name
    app.kubernetes.io/version: "v0.10.1"
    app.kubernetes.io/managed-by: Helm
---
# Source: gpu-operator/templates/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpu-operator
  labels:
    app.kubernetes.io/component: "gpu-operator"
---
# Source: gpu-operator/charts/node-feature-discovery/templates/nfd-worker-conf.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: release-name-node-feature-discovery-worker-conf
  labels:
    helm.sh/chart: node-feature-discovery-0.10.1
    app.kubernetes.io/name: node-feature-discovery
    app.kubernetes.io/instance: release-name
    app.kubernetes.io/version: "v0.10.1"
    app.kubernetes.io/managed-by: Helm
data:
  nfd-worker.conf: |-
    sources:
      pci:
        deviceClassWhitelist:
        - "02"
        - "0200"
        - "0207"
        - "0300"
        - "0302"
        deviceLabelFields:
        - vendor
---
# Source: gpu-operator/charts/node-feature-discovery/templates/nodefeaturerule-crd.yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  annotations:
    controller-gen.kubebuilder.io/version: v0.7.0
  creationTimestamp: null
  name: nodefeaturerules.nfd.k8s-sigs.io
spec:
  group: nfd.k8s-sigs.io
  names:
    kind: NodeFeatureRule
    listKind: NodeFeatureRuleList
    plural: nodefeaturerules
    singular: nodefeaturerule
  scope: Cluster
  versions:
  - name: v1alpha1
    schema:
      openAPIV3Schema:
        description: NodeFeatureRule resource specifies a configuration for feature-based
          customization of node objects, such as node labeling.
        properties:
          apiVersion:
            description: 'APIVersion defines the versioned schema of this representation
              of an object. Servers should convert recognized schemas to the latest
              internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources'
            type: string
          kind:
            description: 'Kind is a string value representing the REST resource this
              object represents. Servers may infer this from the endpoint the client
              submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'
            type: string
          metadata:
            type: object
          spec:
            description: NodeFeatureRuleSpec describes a NodeFeatureRule.
            properties:
              rules:
                description: Rules is a list of node customization rules.
                items:
                  description: Rule defines a rule for node customization such as
                    labeling.
                  properties:
                    labels:
                      additionalProperties:
                        type: string
                      description: Labels to create if the rule matches.
                      type: object
                    labelsTemplate:
                      description: LabelsTemplate specifies a template to expand for
                        dynamically generating multiple labels. Data (after template
                        expansion) must be keys with an optional value (<key>[=<value>])
                        separated by newlines.
                      type: string
                    matchAny:
                      description: MatchAny specifies a list of matchers one of which
                        must match.
                      items:
                        description: MatchAnyElem specifies one sub-matcher of MatchAny.
                        properties:
                          matchFeatures:
                            description: MatchFeatures specifies a set of matcher
                              terms all of which must match.
                            items:
                              description: FeatureMatcherTerm defines requirements
                                against one feature set. All requirements (specified
                                as MatchExpressions) are evaluated against each element
                                in the feature set.
                              properties:
                                feature:
                                  type: string
                                matchExpressions:
                                  additionalProperties:
                                    description: "MatchExpression specifies an expression
                                      to evaluate against a set of input values. It
                                      contains an operator that is applied when matching
                                      the input and an array of values that the operator
                                      evaluates the input against. \n NB: CreateMatchExpression
                                      or MustCreateMatchExpression() should be used
                                      for     creating new instances. NB: Validate()
                                      must be called if Op or Value fields are modified
                                      or if a new     instance is created from scratch
                                      without using the helper functions."
                                    properties:
                                      op:
                                        description: Op is the operator to be applied.
                                        enum:
                                        - In
                                        - NotIn
                                        - InRegexp
                                        - Exists
                                        - DoesNotExist
                                        - Gt
                                        - Lt
                                        - GtLt
                                        - IsTrue
                                        - IsFalse
                                        type: string
                                      value:
                                        description: Value is the list of values that
                                          the operand evaluates the input against.
                                          Value should be empty if the operator is
                                          Exists, DoesNotExist, IsTrue or IsFalse.
                                          Value should contain exactly one element
                                          if the operator is Gt or Lt and exactly
                                          two elements if the operator is GtLt. In
                                          other cases Value should contain at least
                                          one element.
                                        items:
                                          type: string
                                        type: array
                                    required:
                                    - op
                                    type: object
                                  description: MatchExpressionSet contains a set of
                                    MatchExpressions, each of which is evaluated against
                                    a set of input values.
                                  type: object
                              required:
                              - feature
                              - matchExpressions
                              type: object
                            type: array
                        required:
                        - matchFeatures
                        type: object
                      type: array
                    matchFeatures:
                      description: MatchFeatures specifies a set of matcher terms
                        all of which must match.
                      items:
                        description: FeatureMatcherTerm defines requirements against
                          one feature set. All requirements (specified as MatchExpressions)
                          are evaluated against each element in the feature set.
                        properties:
                          feature:
                            type: string
                          matchExpressions:
                            additionalProperties:
                              description: "MatchExpression specifies an expression
                                to evaluate against a set of input values. It contains
                                an operator that is applied when matching the input
                                and an array of values that the operator evaluates
                                the input against. \n NB: CreateMatchExpression or
                                MustCreateMatchExpression() should be used for     creating
                                new instances. NB: Validate() must be called if Op
                                or Value fields are modified or if a new     instance
                                is created from scratch without using the helper functions."
                              properties:
                                op:
                                  description: Op is the operator to be applied.
                                  enum:
                                  - In
                                  - NotIn
                                  - InRegexp
                                  - Exists
                                  - DoesNotExist
                                  - Gt
                                  - Lt
                                  - GtLt
                                  - IsTrue
                                  - IsFalse
                                  type: string
                                value:
                                  description: Value is the list of values that the
                                    operand evaluates the input against. Value should
                                    be empty if the operator is Exists, DoesNotExist,
                                    IsTrue or IsFalse. Value should contain exactly
                                    one element if the operator is Gt or Lt and exactly
                                    two elements if the operator is GtLt. In other
                                    cases Value should contain at least one element.
                                  items:
                                    type: string
                                  type: array
                              required:
                              - op
                              type: object
                            description: MatchExpressionSet contains a set of MatchExpressions,
                              each of which is evaluated against a set of input values.
                            type: object
                        required:
                        - feature
                        - matchExpressions
                        type: object
                      type: array
                    name:
                      description: Name of the rule.
                      type: string
                    vars:
                      additionalProperties:
                        type: string
                      description: Vars is the variables to store if the rule matches.
                        Variables do not directly inflict any changes in the node
                        object. However, they can be referenced from other rules enabling
                        more complex rule hierarchies, without exposing intermediary
                        output values as labels.
                      type: object
                    varsTemplate:
                      description: VarsTemplate specifies a template to expand for
                        dynamically generating multiple variables. Data (after template
                        expansion) must be keys with an optional value (<key>[=<value>])
                        separated by newlines.
                      type: string
                  required:
                  - name
                  type: object
                type: array
            required:
            - rules
            type: object
        required:
        - spec
        type: object
    served: true
    storage: true
status:
  acceptedNames:
    kind: ""
    plural: ""
  conditions: []
  storedVersions: []
---
# Source: gpu-operator/charts/node-feature-discovery/templates/clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: release-name-node-feature-discovery
  labels:
    helm.sh/chart: node-feature-discovery-0.10.1
    app.kubernetes.io/name: node-feature-discovery
    app.kubernetes.io/instance: release-name
    app.kubernetes.io/version: "v0.10.1"
    app.kubernetes.io/managed-by: Helm
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  # when using command line flag --resource-labels to create extended resources
  # you will need to uncomment "- nodes/status"
  # - nodes/status
  verbs:
  - get
  - patch
  - update
  - list
- apiGroups:
  - nfd.k8s-sigs.io
  resources:
  - nodefeaturerules
  verbs:
  - get
  - list
  - watch
---
# Source: gpu-operator/templates/role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  creationTimestamp: null
  name: gpu-operator
  labels:
    app.kubernetes.io/component: "gpu-operator"
    
rules:
- apiGroups:
  - config.openshift.io
  resources:
  - proxies
  verbs:
  - get
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - roles
  - rolebindings
  - clusterroles
  - clusterrolebindings
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - pods
  - services
  - endpoints
  - persistentvolumeclaims
  - events
  - configmaps
  - secrets
  - serviceaccounts
  - nodes
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - get
  - list
  - create
  - watch
  - update
- apiGroups:
  - apps
  resources:
  - deployments
  - daemonsets
  - replicasets
  - statefulsets
  verbs:
  - '*'
- apiGroups:
  - monitoring.coreos.com
  resources:
  - servicemonitors
  - prometheusrules
  verbs:
  - get
  - list
  - create
  - watch
  - update
- apiGroups:
  - nvidia.com
  resources:
  - '*'
  verbs:
  - '*'
- apiGroups:
  - scheduling.k8s.io
  resources:
  - priorityclasses
  verbs:
  - get
  - list
  - watch
  - create
- apiGroups:
  - security.openshift.io
  resources:
  - securitycontextconstraints
  verbs:
  - '*'
- apiGroups:
  - policy
  resources:
  - podsecuritypolicies
  verbs:
  - use
  resourceNames:
  - gpu-operator-restricted
- apiGroups:
  - policy
  resources:
  - podsecuritypolicies
  verbs:
  - create
  - get
  - update
  - list
- apiGroups:
  - config.openshift.io
  resources:
  - clusterversions
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  - coordination.k8s.io
  resources:
  - configmaps
  - leases
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - node.k8s.io
  resources:
  - runtimeclasses
  verbs:
  - get
  - list
  - create
  - update
  - watch
- apiGroups:
  - image.openshift.io
  resources:
  - imagestreams
  verbs:
  - get
  - list
  - watch
---
# Source: gpu-operator/charts/node-feature-discovery/templates/clusterrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: release-name-node-feature-discovery
  labels:
    helm.sh/chart: node-feature-discovery-0.10.1
    app.kubernetes.io/name: node-feature-discovery
    app.kubernetes.io/instance: release-name
    app.kubernetes.io/version: "v0.10.1"
    app.kubernetes.io/managed-by: Helm
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: release-name-node-feature-discovery
subjects:
- kind: ServiceAccount
  name: node-feature-discovery
  namespace: default
---
# Source: gpu-operator/templates/rolebinding.yaml
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: gpu-operator
  labels:
    app.kubernetes.io/component: "gpu-operator"
    
subjects:
- kind: ServiceAccount
  name: gpu-operator
  namespace: default
- kind: ServiceAccount
  name: node-feature-discovery
  namespace: default
roleRef:
  kind: ClusterRole
  name: gpu-operator
  apiGroup: rbac.authorization.k8s.io
---
# Source: gpu-operator/charts/node-feature-discovery/templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: release-name-node-feature-discovery-master
  labels:
    helm.sh/chart: node-feature-discovery-0.10.1
    app.kubernetes.io/name: node-feature-discovery
    app.kubernetes.io/instance: release-name
    app.kubernetes.io/version: "v0.10.1"
    app.kubernetes.io/managed-by: Helm
    role: master
spec:
  type: ClusterIP
  ports:
    - port: 8080
      targetPort: grpc
      protocol: TCP
      name: grpc
  selector:
    app.kubernetes.io/name: node-feature-discovery
    app.kubernetes.io/instance: release-name
---
# Source: gpu-operator/charts/node-feature-discovery/templates/worker.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name:  release-name-node-feature-discovery-worker
  labels:
    helm.sh/chart: node-feature-discovery-0.10.1
    app.kubernetes.io/name: node-feature-discovery
    app.kubernetes.io/instance: release-name
    app.kubernetes.io/version: "v0.10.1"
    app.kubernetes.io/managed-by: Helm
    role: worker
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-feature-discovery
      app.kubernetes.io/instance: release-name
      role: worker
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-feature-discovery
        app.kubernetes.io/instance: release-name
        role: worker
      annotations:
        {}
    spec:
      dnsPolicy: ClusterFirstWithHostNet
      serviceAccountName: node-feature-discovery
      securityContext:
        {}
      containers:
      - name: worker
        securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
              - ALL
            readOnlyRootFilesystem: true
            runAsNonRoot: true
        image: "k8s.gcr.io/nfd/node-feature-discovery:v0.10.1"
        imagePullPolicy: IfNotPresent
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        resources:
            {}
        command:
        - "nfd-worker"
        args:
        - "--server=release-name-node-feature-discovery-master:8080"
        volumeMounts:
        - name: host-boot
          mountPath: "/host-boot"
          readOnly: true
        - name: host-os-release
          mountPath: "/host-etc/os-release"
          readOnly: true
        - name: host-sys
          mountPath: "/host-sys"
          readOnly: true
        - name: host-usr-lib
          mountPath: "/host-usr/lib"
          readOnly: true
        - name: source-d
          mountPath: "/etc/kubernetes/node-feature-discovery/source.d/"
          readOnly: true
        - name: features-d
          mountPath: "/etc/kubernetes/node-feature-discovery/features.d/"
          readOnly: true
        - name: nfd-worker-conf
          mountPath: "/etc/kubernetes/node-feature-discovery"
          readOnly: true
      volumes:
        - name: host-boot
          hostPath:
            path: "/boot"
        - name: host-os-release
          hostPath:
            path: "/etc/os-release"
        - name: host-sys
          hostPath:
            path: "/sys"
        - name: host-usr-lib
          hostPath:
            path: "/usr/lib"
        - name: source-d
          hostPath:
            path: "/etc/kubernetes/node-feature-discovery/source.d/"
        - name: features-d
          hostPath:
            path: "/etc/kubernetes/node-feature-discovery/features.d/"
        - name: nfd-worker-conf
          configMap:
            name: release-name-node-feature-discovery-worker-conf
            items:
              - key: nfd-worker.conf
                path: nfd-worker.conf
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/master
          operator: Equal
          value: ""
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Equal
          value: present
---
# Source: gpu-operator/charts/node-feature-discovery/templates/master.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name:  release-name-node-feature-discovery-master
  labels:
    helm.sh/chart: node-feature-discovery-0.10.1
    app.kubernetes.io/name: node-feature-discovery
    app.kubernetes.io/instance: release-name
    app.kubernetes.io/version: "v0.10.1"
    app.kubernetes.io/managed-by: Helm
    role: master
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: node-feature-discovery
      app.kubernetes.io/instance: release-name
      role: master
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-feature-discovery
        app.kubernetes.io/instance: release-name
        role: master
      annotations:
        {}
    spec:
      serviceAccountName: node-feature-discovery
      securityContext:
        {}
      containers:
        - name: master
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
              - ALL
            readOnlyRootFilesystem: true
            runAsNonRoot: true
          image: "k8s.gcr.io/nfd/node-feature-discovery:v0.10.1"
          imagePullPolicy: IfNotPresent
          livenessProbe:
            exec:
              command:
              - "/usr/bin/grpc_health_probe"
              - "-addr=:8080"
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            exec:
              command:
              - "/usr/bin/grpc_health_probe"
              - "-addr=:8080"
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 10
          ports:
          - containerPort: 8080
            name: grpc
          env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                fieldPath: spec.nodeName
          command:
            - "nfd-master"
          resources:
            {}
          args:
            - "--extra-label-ns=nvidia.com"
            ## By default, disable NodeFeatureRules controller for other than the default instances
            - "-featurerules-controller=true"
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: node-role.kubernetes.io/master
                operator: In
                values:
                - ""
            weight: 1
          - preference:
              matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: In
                values:
                - ""
            weight: 1
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/master
          operator: Equal
          value: ""
        - effect: NoSchedule
          key: node-role.kubernetes.io/control-plane
          operator: Equal
          value: ""
---
# Source: gpu-operator/templates/operator.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-operator
  labels:
    app.kubernetes.io/component: "gpu-operator"
    
spec:
  replicas: 1
  selector:
    matchLabels:
      
      app.kubernetes.io/component: "gpu-operator"
      app: "gpu-operator"
  template:
    metadata:
      labels:
        
        app.kubernetes.io/component: "gpu-operator"
        app: "gpu-operator"
      annotations:
        openshift.io/scc: restricted-readonly
    spec:
      serviceAccountName: gpu-operator
      priorityClassName: system-node-critical
      containers:
      - name: gpu-operator
        image: nvcr.io/nvidia/gpu-operator:v1.11.1
        imagePullPolicy: IfNotPresent
        command: ["gpu-operator"]
        args:
        - --leader-elect
        env:
        - name: WATCH_NAMESPACE
          value: ""
        - name: OPERATOR_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        volumeMounts:
          - name: host-os-release
            mountPath: "/host-etc/os-release"
            readOnly: true
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8081
          initialDelaySeconds: 15
          periodSeconds: 20
        readinessProbe:
          httpGet:
            path: /readyz
            port: 8081
          initialDelaySeconds: 5
          periodSeconds: 10
        resources:
          limits:
            cpu: 500m
            memory: 350Mi
          requests:
            cpu: 200m
            memory: 100Mi
        ports:
          - name: metrics
            containerPort: 8080
      volumes:
        - name: host-os-release
          hostPath:
            path: "/etc/os-release"
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: node-role.kubernetes.io/master
                operator: In
                values:
                - ""
            weight: 1
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/master
          operator: Equal
          value: ""
---
# Source: gpu-operator/templates/clusterpolicy.yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
  namespace: default
  labels:
    app.kubernetes.io/component: "gpu-operator"
    
spec:
  operator:
    defaultRuntime: docker
    runtimeClass: nvidia
    initContainer:
      repository: nvcr.io/nvidia
      image: cuda
      version: "11.6.0-base-ubi8"
      imagePullPolicy: IfNotPresent
  daemonsets:
    tolerations: 
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
    priorityClassName: system-node-critical
  validator:
    repository: nvcr.io/nvidia/cloud-native
    image: gpu-operator-validator
    version: "v1.11.1"
    imagePullPolicy: IfNotPresent
    plugin:
      env: 
        - name: WITH_WORKLOAD
          value: "true"

  mig:
    strategy: single
  psp:
    enabled: false
  driver:
    enabled: true
    repository: nvcr.io/nvidia
    image: driver
    version: "515.48.07"
    imagePullPolicy: IfNotPresent
    rdma:
      enabled: false
      useHostMofed: false
    manager:
      repository: nvcr.io/nvidia/cloud-native
      image: k8s-driver-manager
      version: "v0.4.1"
      imagePullPolicy: IfNotPresent
      env: 
        - name: ENABLE_AUTO_DRAIN
          value: "true"
        - name: DRAIN_USE_FORCE
          value: "false"
        - name: DRAIN_POD_SELECTOR_LABEL
          value: ""
        - name: DRAIN_TIMEOUT_SECONDS
          value: 0s
        - name: DRAIN_DELETE_EMPTYDIR_DATA
          value: "false"
    repoConfig: 
      configMapName: ""
    certConfig: 
      name: ""
    licensingConfig: 
      configMapName: ""
      nlsEnabled: false
    virtualTopology: 
      config: ""
    kernelModuleConfig: 
      name: ""
    rollingUpdate:
      maxUnavailable: "1"
  toolkit:
    enabled: true
    repository: nvcr.io/nvidia/k8s
    image: container-toolkit
    version: "v1.10.0-ubuntu20.04"
    imagePullPolicy: IfNotPresent
  devicePlugin:
    repository: nvcr.io/nvidia
    image: k8s-device-plugin
    version: "v0.12.2-ubi8"
    imagePullPolicy: IfNotPresent
    env: 
      - name: PASS_DEVICE_SPECS
        value: "true"
      - name: FAIL_ON_INIT_ERROR
        value: "true"
      - name: DEVICE_LIST_STRATEGY
        value: envvar
      - name: DEVICE_ID_STRATEGY
        value: uuid
      - name: NVIDIA_VISIBLE_DEVICES
        value: all
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: all
    config: 
      default: ""
      name: ""
  dcgm:
    enabled: false
    repository: nvcr.io/nvidia/cloud-native
    image: dcgm
    version: "2.4.5-1-ubuntu20.04"
    imagePullPolicy: IfNotPresent
    hostPort: 5555
  dcgmExporter:
    repository: nvcr.io/nvidia/k8s
    image: dcgm-exporter
    version: "2.4.5-2.6.7-ubuntu20.04"
    imagePullPolicy: IfNotPresent
    env: 
      - name: DCGM_EXPORTER_LISTEN
        value: :9400
      - name: DCGM_EXPORTER_KUBERNETES
        value: "true"
      - name: DCGM_EXPORTER_COLLECTORS
        value: /etc/dcgm-exporter/dcp-metrics-included.csv
  gfd:
    repository: nvcr.io/nvidia
    image: gpu-feature-discovery
    version: "v0.6.1-ubi8"
    imagePullPolicy: IfNotPresent
    env: 
      - name: GFD_SLEEP_INTERVAL
        value: 60s
      - name: GFD_FAIL_ON_INIT_ERROR
        value: "true"
  migManager:
    enabled: true
    repository: nvcr.io/nvidia/cloud-native
    image: k8s-mig-manager
    version: "v0.4.2-ubuntu20.04"
    imagePullPolicy: IfNotPresent
    env: 
      - name: WITH_REBOOT
        value: "false"
    config: 
      name: ""
    gpuClientsConfig: 
      name: ""
  nodeStatusExporter:
    enabled: false
    repository: nvcr.io/nvidia/cloud-native
    image: gpu-operator-validator
    version: "v1.11.1"
    imagePullPolicy: IfNotPresent

@Ankitasw
Copy link
Author

Ankitasw commented Aug 10, 2022

The cluster was deployed successfully with CUDA vector add test successful,

gpu-operator-resources nvidia-device-plugin-validator-9tfhw 0/1 UnexpectedAdmissionError 0 2m30s

but above pod fails with below error:

Warning Unhealthy 2m30s (x10 over 4m) kubelet Startup probe failed:
Warning NodeNotReady 30s node-controller Node is not ready

@shivamerla could you please help resolve this?

@Ankitasw
Copy link
Author

Ankitasw commented Aug 10, 2022

Below is the template from helm for reference for v1.11.1.

Also for future reference, where can I find this template for gpu operator resources?

@shivamerla
Copy link
Contributor

shivamerla commented Aug 10, 2022

For the plugin-validator error, it needs an available GPU on the node, if you have other pods consuming GPUs that might happen. If you want to disable plugin validation you can set as below with validator component in ClusterPolicy.

  validator:
    repository: nvcr.io/nvidia/cloud-native
    image: gpu-operator-validator
    version: "v1.11.1"
    imagePullPolicy: IfNotPresent
    plugin:
      env: 
        - name: WITH_WORKLOAD
          value: "false"

For getting templates for each release you can run

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
   && helm repo update
helm template nvidia/gpu-operator --version=v1.11.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants