✨ Support GPU nodes with "nvidia-gpu" flavor #1002

Merged
merged 1 commit into kubernetes-sigs:master from the nvidia-gpu-flavor branch on Oct 29, 2020

Conversation

mboersma
Contributor

@mboersma mboersma commented Oct 18, 2020

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds the nvidia-gpu flavor to support Azure N-series SKUs with NVIDIA GPUs. Creating a workload cluster from that flavor provides nvidia.com/gpu schedulable resources on agent nodes.
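
For illustration only (not code from this PR), a workload consumes one of those GPUs by requesting the nvidia.com/gpu resource in its pod spec; the pod name, image, and command below are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: gpu-smoke-test
    image: nvidia/cuda:11.0-base   # placeholder image, assumed for illustration
    command: ["nvidia-smi"]        # prints the GPU that the device plugin exposed
    resources:
      limits:
        nvidia.com/gpu: 1          # schedules the pod onto a GPU-enabled agent node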

Which issue(s) this PR fixes:

Fixes #426

Special notes for your reviewer:

Many thanks to @sozercan for figuring out the essential commands and containerd config used here!

Note that NVv4-series GPUs are not supported. (Those VMs use an AMD GPU and are only supported on Windows.)

TODOs:

  • squashed commits
  • includes documentation
  • adds e2e tests

Release note:

✨ Support GPU nodes with "nvidia-gpu" flavor

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 18, 2020
@k8s-ci-robot k8s-ci-robot added area/provider/azure Issues or PRs related to azure provider sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 18, 2020
@mboersma mboersma force-pushed the nvidia-gpu-flavor branch 2 times, most recently from bf62412 to b20cb18 Compare October 19, 2020 15:27
useExperimentalRetryJoin: true
postKubeadmCommands:
# Install the NVIDIA device plugin for Kubernetes
- KUBECONFIG=/etc/kubernetes/admin.conf kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml
Contributor

why run the nvidia install script on the control plane nodes if only the worker nodes are GPU enabled?

Contributor Author

@mboersma mboersma Oct 19, 2020

Because the manifest is just a DaemonSet that schedules itself onto GPU agent nodes, it could be applied from anywhere (the original version of this PR had the user do it). I run it on the control plane nodes because I know kubectl and a kubeconfig are available there, which I don't think is true on the agent nodes.
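
For context, the manifest in question defines a DaemonSet roughly along these lines (a sketch of the upstream file, not an exact copy; the image tag in particular is assumed):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu    # tolerate the taint commonly placed on GPU nodes
        operator: Exists
        effect: NoSchedule
      containers:
      - name: nvidia-device-plugin-ctr
        image: nvidia/k8s-device-plugin:1.0.0-beta6   # tag assumed; check the upstream manifest
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins  # where kubelet discovers device plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

Applying it from a control plane node (or anywhere else with a kubeconfig) is equivalent; the kubelet on each GPU node is what ultimately registers the nvidia.com/gpu resource.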

@mboersma
Contributor Author

/retest

Contributor

@devigned devigned left a comment

lgtm

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 19, 2020
@devigned
Contributor

/approve cancel

@k8s-ci-robot k8s-ci-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 19, 2020
@devigned
Contributor

@CecileRobertMichon looks like GH code review approve is triggering the /approve behavior. Just a heads up.

Also, my canceling the approval is just to give others a chance to comment.

/assign @CecileRobertMichon

secret:
name: ${CLUSTER_NAME}-md-0-azure-json
key: worker-node-azure.json
- path: /etc/containerd/nvidia-config.toml
Contributor

was this file taken from somewhere?

Contributor Author

Yes, from @sozercan's gist at https://gist.github.com/sozercan/51a569cf173ef7e57a375978af8edf26, which he linked to in #426. Not sure if it has other origins.
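
For readers following along: a drop-in of this kind typically makes nvidia-container-runtime the default runtime for containerd's CRI plugin. The snippet below is a generic sketch of that pattern as a files: entry (containerd config v2 syntax), not the exact contents of the gist or of this PR:

files:
- path: /etc/containerd/nvidia-config.toml
  owner: root:root
  permissions: "0644"
  content: |
    version = 2
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
      runtime_type = "io.containerd.runc.v2"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
        BinaryName = "/usr/bin/nvidia-container-runtime"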

@CecileRobertMichon
Contributor

I tried deploying a GPU cluster using tilt with the nvidia-gpu flavor and calico is not coming up on the worker nodes:

k --kubeconfig ./kubeconfig get pods -A -o wide  
NAMESPACE     NAME                                                              READY   STATUS              RESTARTS   AGE     IP                NODE                                      NOMINATED NODE   READINESS GATES
kube-system   calico-kube-controllers-59d7f84b55-jwh22                          1/1     Running             0          12m     192.168.150.131   nvidia-gpu-template-control-plane-lswkl   <none>           <none>
kube-system   calico-node-6q975                                                 1/1     Running             0          12m     10.0.0.4          nvidia-gpu-template-control-plane-lswkl   <none>           <none>
kube-system   calico-node-cxlrz                                                 0/1     Init:0/3            0          9m7s    10.1.0.5          nvidia-gpu-template-md-0-zd5gx            <none>           <none>
kube-system   calico-node-d9frf                                                 0/1     Init:0/3            0          9m19s   10.1.0.4          nvidia-gpu-template-md-0-95bmz            <none>           <none>
kube-system   coredns-66bff467f8-56mt8                                          1/1     Running             0          12m     192.168.150.130   nvidia-gpu-template-control-plane-lswkl   <none>           <none>
kube-system   coredns-66bff467f8-fz8d9                                          1/1     Running             0          12m     192.168.150.129   nvidia-gpu-template-control-plane-lswkl   <none>           <none>
kube-system   etcd-nvidia-gpu-template-control-plane-lswkl                      1/1     Running             0          12m     10.0.0.4          nvidia-gpu-template-control-plane-lswkl   <none>           <none>
kube-system   kube-apiserver-nvidia-gpu-template-control-plane-lswkl            1/1     Running             0          12m     10.0.0.4          nvidia-gpu-template-control-plane-lswkl   <none>           <none>
kube-system   kube-controller-manager-nvidia-gpu-template-control-plane-lswkl   1/1     Running             0          12m     10.0.0.4          nvidia-gpu-template-control-plane-lswkl   <none>           <none>
kube-system   kube-proxy-5mzdj                                                  0/1     ContainerCreating   0          9m7s    10.1.0.5          nvidia-gpu-template-md-0-zd5gx            <none>           <none>
kube-system   kube-proxy-htkg2                                                  0/1     ContainerCreating   0          9m19s   10.1.0.4          nvidia-gpu-template-md-0-95bmz            <none>           <none>
kube-system   kube-proxy-l6pvz                                                  1/1     Running             0          12m     10.0.0.4          nvidia-gpu-template-control-plane-lswkl   <none>           <none>
kube-system   kube-scheduler-nvidia-gpu-template-control-plane-lswkl            1/1     Running             0          12m     10.0.0.4          nvidia-gpu-template-control-plane-lswkl   <none>           <none>

I'm seeing this when describing the pod:

  Warning  FailedCreatePodSandBox  5m58s                  kubelet, nvidia-gpu-template-md-0-zd5gx  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/0d282c496afbc8ec5cf57125fcd88e99d85ef43d7668ab7a92d82378a02fe29b/log.json: no such file or directory): exec: "nvidia-container-runtime": executable file not found in $PATH: unknown

This is with VM size Standard_NV6 (because that's where I had quota) and Standard_LRS storage type.
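
For anyone reproducing this: the SKU and disk type come from the AzureMachineTemplate for the md-0 machine deployment, roughly like the following in the v1alpha3 API (the disk size and other required fields here are illustrative, not taken from this PR):

apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: AzureMachineTemplate
metadata:
  name: ${CLUSTER_NAME}-md-0
spec:
  template:
    spec:
      vmSize: Standard_NV6          # N-series SKU with an NVIDIA GPU
      osDisk:
        osType: Linux
        diskSizeGB: 128             # illustrative value
        managedDisk:
          storageAccountType: Standard_LRS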

@mboersma
Contributor Author

I also had to use Standard_NV6 for testing because I didn't have quota for the other types. (I do now.) I've made a bunch of GPU-enabled clusters with this code but haven't seen that error (yet).

"nvidia-container-runtime": executable file not found in $PATH

Do you have access to the nodes? Could you see if nvidia-smi works there and if the nvidia-plugin daemonset is running?

@CecileRobertMichon
Contributor

I see

 k --kubeconfig ./kubeconfig get daemonsets.apps -A
NAMESPACE     NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
kube-system   calico-node                      3         3         1       3            1           kubernetes.io/os=linux   33m
kube-system   kube-proxy                       3         3         1       3            1           kubernetes.io/os=linux   33m
kube-system   nvidia-device-plugin-daemonset   0         0         0       0            0           <none>                   33m

@CecileRobertMichon
Contributor

Ha

[  182.744200] cloud-init[1793]: [2020-10-20 19:55:04] Reading package lists...
[  182.744558] cloud-init[1793]: [2020-10-20 19:55:04] Building dependency tree...
[  182.744926] cloud-init[1793]: [2020-10-20 19:55:04] Reading state information...
[  182.745302] cloud-init[1793]: [2020-10-20 19:55:04] E: Unable to locate package nvidia-container-runtime

from cloud init logs on one of the nodes
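
For anyone hitting the same error: nvidia-container-runtime lives in NVIDIA's apt repository, which has to be registered before the install step runs. One plausible shape for that, expressed as kubeadm pre-commands, is sketched below; the distribution string (ubuntu18.04) is assumed, and the exact commands this PR ended up with may differ:

preKubeadmCommands:
  # Register NVIDIA's apt repository so the nvidia-container-runtime package resolves
  - curl -fsSL https://nvidia.github.io/nvidia-container-runtime/gpgkey | apt-key add -
  - curl -fsSL https://nvidia.github.io/nvidia-container-runtime/ubuntu18.04/nvidia-container-runtime.list > /etc/apt/sources.list.d/nvidia-container-runtime.list
  - apt-get update
  - apt-get install -y nvidia-container-runtime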

@mboersma
Contributor Author

mboersma commented Oct 23, 2020

I added an e2e test spec for the nvidia-gpu flavor following the pattern of machinepool and friends. Some things still to be considered here:

  • Use Standard_LRS storage in conjunction with Standard_NV6 for the least expensive test SKU with a GPU
  • Skip this spec entirely by default and set it up as a periodic job
  • Investigate whether this test subscription has access to N-series SKUs in multiple regions and restrict accordingly
  • Should we add a GPU-enabled node pool to an existing spec instead of building a separate cluster?

@CecileRobertMichon
Contributor

Skip this spec entirely by default and set it up as a periodic job

I would also set up a presubmit job ("e2e-full" or something like that) to optionally run the whole spec on PRs.

@mboersma mboersma force-pushed the nvidia-gpu-flavor branch 2 times, most recently from f7f366f to 2a2c913 Compare October 27, 2020 20:36
@mboersma mboersma changed the title [WIP] ✨ Support GPU nodes with "nvidia-gpu" flavor ✨ Support GPU nodes with "nvidia-gpu" flavor Oct 27, 2020
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 27, 2020
@mboersma
Contributor Author

/retest

Contributor

@devigned devigned left a comment

Hate to do this, but 1 super small item to address. Outside of that, lgtm.

templates/test/prow-nvidia-gpu/cni-resource-set.yaml (review comment, outdated and resolved)
@mboersma
Contributor Author

/test ?

@k8s-ci-robot
Contributor

@mboersma: The following commands are available to trigger jobs:

  • /test pull-cluster-api-provider-azure-test
  • /test pull-cluster-api-provider-azure-build
  • /test pull-cluster-api-provider-azure-e2e
  • /test pull-cluster-api-provider-azure-e2e-full
  • /test pull-cluster-api-provider-azure-capi-e2e
  • /test pull-cluster-api-provider-azure-verify
  • /test pull-cluster-api-provider-azure-conformance-v1alpha3
  • /test pull-cluster-api-provider-azure-apidiff
  • /test pull-cluster-api-provider-azure-coverage

Use /test all to run the following jobs:

  • pull-cluster-api-provider-azure-test
  • pull-cluster-api-provider-azure-build
  • pull-cluster-api-provider-azure-e2e
  • pull-cluster-api-provider-azure-e2e-full
  • pull-cluster-api-provider-azure-verify
  • pull-cluster-api-provider-azure-apidiff
  • pull-cluster-api-provider-azure-coverage

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mboersma
Contributor Author

/test pull-cluster-api-provider-azure-e2e-full

@mboersma
Contributor Author

Looks like the GPU-enabled cluster provisioned and passed:

...
STEP: Waiting for the workload nodes to exist
INFO: Waiting for the machine pools to be provisioned
STEP: creating a Kubernetes client to the workload cluster
STEP: running a CUDA vector calculation job
STEP: waiting for job default/cuda-vector-add to be complete
STEP: creating Azure clients with the workload cluster's subscription
STEP: verifying EnableAcceleratedNetworking for the primary NIC of each VM
STEP: Dumping logs from the "capz-e2e-796qgc" workload cluster
...
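
The cuda-vector-add job referenced in that log is essentially the standard CUDA smoke test from the Kubernetes GPU scheduling docs; a sketch (image tag assumed):

apiVersion: batch/v1
kind: Job
metadata:
  name: cuda-vector-add
  namespace: default
spec:
  backoffLimit: 4
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: cuda-vector-add
        image: k8s.gcr.io/cuda-vector-add:v0.1   # well-known CUDA vector-add test image; tag assumed
        resources:
          limits:
            nvidia.com/gpu: 1    # the job only completes if a GPU is actually schedulable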

Contributor

@devigned devigned left a comment

/lgtm

/assign @CecileRobertMichon

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 29, 2020
Contributor

@CecileRobertMichon CecileRobertMichon left a comment

/approve

I commented on kubernetes/test-infra#19715 (comment) after it merged: the e2e-full job should not run by default on PRs (right now it's being auto-triggered because of the runIfChanged value). This requires a follow-up to test-infra.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 29, 2020
@mboersma
Contributor Author

mboersma commented Oct 29, 2020

requires a follow up to test-infra

Yes, I saw--thanks for catching that. I'll make a PR to fix it.

Update: see kubernetes/test-infra#19756

@k8s-ci-robot k8s-ci-robot merged commit 272261b into kubernetes-sigs:master Oct 29, 2020
@k8s-ci-robot k8s-ci-robot added this to the v0.4.10 milestone Oct 29, 2020
@mboersma mboersma deleted the nvidia-gpu-flavor branch October 29, 2020 18:03
Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/provider/azure: Issues or PRs related to azure provider.
  • cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lgtm: "Looks good to me", indicates that a PR is ready to be merged.
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • sig/cluster-lifecycle: Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.
  • size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for GPU nodes
4 participants