✨ Support GPU nodes with "nvidia-gpu" flavor #1002
Conversation
Force-pushed bf62412 to b20cb18
useExperimentalRetryJoin: true
postKubeadmCommands:
  # Install the NVIDIA device plugin for Kubernetes
  - KUBECONFIG=/etc/kubernetes/admin.conf kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml
Why run the nvidia install script on the control plane nodes if only the worker nodes are GPU-enabled?
Because the manifest is just a DaemonSet that will schedule itself on GPU agent nodes, it could be installed from anywhere (the original version of this PR had the user do it). And I know I have kubectl and a kubeconfig on the control plane nodes, but I don't think that's true on the agent nodes.
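For context, the upstream manifest is essentially a kube-system DaemonSet along these lines (an abbreviated, approximate excerpt for illustration only; the names and image tag here are assumptions and may not match the current upstream file):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset  # upstream name, may change
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
        # lets the pod land on GPU nodes tainted with nvidia.com/gpu
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: nvidia-device-plugin-ctr
          image: nvidia/k8s-device-plugin:1.0.0-beta6  # illustrative tag
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
```

On nodes without a GPU the plugin simply finds no devices to advertise, which is why applying it cluster-wide from the control plane is harmless.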
Force-pushed b20cb18 to c1086c4
/retest
lgtm
/approve cancel
@CecileRobertMichon looks like GH code review approve is triggering the
Also, me canceling approve is just to give others a chance to comment.
/assign @CecileRobertMichon
secret:
  name: ${CLUSTER_NAME}-md-0-azure-json
  key: worker-node-azure.json
- path: /etc/containerd/nvidia-config.toml
was this file taken from somewhere?
Yes, from @sozercan's gist at https://gist.github.com/sozercan/51a569cf173ef7e57a375978af8edf26 which he linked to in #426. Not sure if it has other origins.
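For readers without access to the gist, the file configures containerd to use the NVIDIA runtime, roughly like this (an approximate sketch; plugin keys and paths vary across containerd versions, so treat every value here as an assumption rather than the exact contents of the gist):

```toml
# /etc/containerd/nvidia-config.toml (illustrative sketch)
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd]
  # make the NVIDIA runtime the default so GPU pods need no RuntimeClass
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  # path assumes nvidia-container-runtime is installed during node setup
  BinaryName = "/usr/bin/nvidia-container-runtime"
```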
Force-pushed c1086c4 to a82c2b0
I tried deploying a GPU cluster using Tilt with the nvidia-gpu flavor, but Calico is not coming up on the worker nodes:
I'm seeing this when describing the pod:
This is with VM size
Force-pushed a82c2b0 to a563124
I also had to use
Do you have access to the nodes? Could you see if
I see
Ha
from cloud-init logs on one of the nodes
I added an e2e test spec for the
I would suggest setting up a presubmit job, "e2e-full" or something like that, to run the whole spec optionally on PRs.
Force-pushed f7f366f to 2a2c913
/retest
Hate to do this, but one super small item to address. Outside of that, lgtm.
Force-pushed 2a2c913 to 140bbc3
/test ?
@mboersma: The following commands are available to trigger jobs:
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test pull-cluster-api-provider-azure-e2e-full
Looks like the GPU-enabled cluster provisioned and passed:
/lgtm
/assign @CecileRobertMichon
/approve
I commented on kubernetes/test-infra#19715 (comment) after it merged: the e2e-full job should not run by default on PRs (right now it's being auto-triggered because of the runIfChanged value). This requires a follow-up in test-infra.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: CecileRobertMichon
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Yes, I saw. Thanks for catching that. I'll make a PR to fix it. Update: see kubernetes/test-infra#19756
What type of PR is this?
/kind feature
What this PR does / why we need it:
Adds the nvidia-gpu flavor to support Azure N-series SKUs with NVIDIA GPUs. Creating a workload cluster from that flavor provides nvidia.com/gpu schedulable resources on agent nodes.
Which issue(s) this PR fixes:
Fixes #426
Special notes for your reviewer:
Many thanks to @sozercan for figuring out the essential commands and containerd config used here!
Note that NVv4-series GPUs are not supported. (Those VMs use an AMD GPU and are only supported on Windows.)
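To illustrate what "nvidia.com/gpu schedulable resources" means in practice, a workload could request a GPU like this (a minimal sketch, not part of this PR; the pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test  # hypothetical name
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-sample
      image: example.com/cuda-sample:latest  # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1  # schedules the pod onto a GPU-enabled agent node
```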
TODOs:
Release note: