-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add API for CDI --devices flag in Docker and Podman for mapping GPUs #3290
base: main
Are you sure you want to change the base?
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: lukeogg The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Hi @lukeogg. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/ok-to-test
Looks like this should be fine to podman, and I think the docker version handling makes sense.
Some nice markdown cleanup too. ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Some minor comments on the CDI device and documentation side of things.
@@ -192,6 +196,17 @@ func validatePortMappings(portMappings []PortMapping) error { | |||
return nil | |||
} | |||
|
|||
func validateDevices(devices []string) error { | |||
for _, device := range devices { | |||
device := strings.TrimSpace(device) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be able to use "github.com/container-orchestrated-devices/container-device-interface/pkg/parser"
and call parser.IsQualifiedName(device)
here. This is used in docker/cli#4084 and significantly reduces the dependencies pulled in.
It should also be fine to allow podman
and / or docker
to perform this validation though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll give that a try.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update to what Evan suggests to reduce the dependencies pulled in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BenTheElder What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@klueska I am using parser.ParseQualifiedName(device) instead of parser.ParseQualifiedName(device). The results are the same except you get the additional error message about why your device string is invalid. I think this is preferable. ParseQualifiedName() is called in IsQualifiedName().
This doesn't change the dependencies at all. There are tests on github.com/container-orchestrated-devices/container-device-interface/pkg/parser pulling in the additional packages. Maybe we should put the tests in a different package? @elezar wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can update the tests in that package to use a different import for comparison or to be in the parser_test
package if require. As a matter of interest, which package that is equivalent to require
would be preferred?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't personally have a preference - I haven't found any with fewer dependencies. I was thinking use the parser_test
package to keep things very lightweight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have created cncf-tags/container-device-interface#149 with an update to the test package. This should prevent github.com/stretchr/testify/require
from getting pulled in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, a few things came up back in the real world and here with #3277 etc., the dependency discussion wound up forked in https://github.com/kubernetes-sigs/kind/pull/3290/files#r1304873501
I'm going to try to get #3335 after which when this PR is rebased and go mod tidy
it should be clearer what we're actually talking about for the new dependencies.
See linked comment for stance on deps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently for example we have a lot of CRI inspired types with in-tree dependency-free code instead, to avoid creating dep hell for our users. That may not be so reasonable here.
containerPath: /var/run/nvidia-container-devices/all | ||
{{< /codeFromInline >}} | ||
|
||
Note: this method only support adding `all` GPUs to a single node. If you want to add specific GPUs to specific nodes, you will need to use the `devices` API. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know whether this is strictly true. Using /var/run/nvidia-container-devices/0
or /var/run/nvidia-container-devices/{{DEVICE_UUID}}
should allow individual devices to be made available.
cc @klueska
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned in #3257 (comment), we can't achieve any finer granularity than all
(regardless of whether we do volume mounts or use CDI) because the node's container is started with --privileged. Even if you tell it to selectively inject just a single GPU, the node will be able to see all of them because --privileged
pulls in all device nodes under /dev
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing some testing today and this does allow access to all GPUs. I'll update the documentation. It is fairly convenient with Kind, however, as you don't need to inject drivers or install them on the node images.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, this seems reasonable. The only real issue I see is the limitation around not being able to inject individual GPUs because of the --privileged flag.
I'm assuming this will trip up many users thinking they should be able to selectively give different nodes access to different GPUs (based on their CDI device name), but then suddenly all nodes have access to all GPUs.
// Append CDI device args (used for GPU support) | ||
if len(node.CDIDevices) > 0 { | ||
// Check for docker > 25 | ||
ver := Version() | ||
if ver != "dev" || strings.Split(ver, ".")[0] < "25" { | ||
return nil, errors.Errorf("using devices api in kind requires Docker >= v25, but found %q", ver) | ||
} | ||
|
||
// Append args for each device | ||
for _, device := range node.CDIDevices { | ||
args = append(args, "--device", strings.TrimSpace(device)) | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will also allow passing "standard" devices via the CDIDevices
setting? It looks like there is a new validate()
call that will prevent this though. Is the validate call made sometime before this is done here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, validation is called as part of the initial passing for the node configuration. parser.ParseQualifiedName(device)
as part of this.
@@ -192,6 +196,17 @@ func validatePortMappings(portMappings []PortMapping) error { | |||
return nil | |||
} | |||
|
|||
func validateDevices(devices []string) error { | |||
for _, device := range devices { | |||
device := strings.TrimSpace(device) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update to what Evan suggests to reduce the dependencies pulled in.
- role: control-plane | ||
- role: worker | ||
devices: | ||
- "nvidia.com/gpu=0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this will not give you access to just GPU 0, but rather all GPUs on the system.
This is because --privileged is passed to the docker call that creates the node, thus giving it access to all devices under /dev (including all GPU devices).
Individual GPU injection will only be supported once if/when kind is able to start nodes without --privileged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, this reduces the value of adding an API for CDI devices in my opinion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, this is the same for all other mechanisms too. The --privileged
flag will mean that any container will have access to all devices and the only value add is that the driver libraries are also injected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, this reduces the value of adding an API for CDI devices in my opinion.
This isn't true. The use of the device mounts uses our "legacy" stack which has a number of shortcomings that using CDI addresses. It also means that users can access devices from other vendors such as Intel or their own device / resource definitions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the only value add is that the driver libraries are also injected.
I don't supposed we can just trigger driver library injection directly in a simple way and reduce the API surface to "yes I'd like the GPU drivers"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spent today putting this together:
https://github.com/klueska/kind-with-gpus-examples
It should be a good starting point for using GPUs with kind independent of this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've also pinged the docker maintainers that helped us push CDI through to see what their thoughts are here:
https://dockercommunity.slack.com/archives/C04MZQZJE94/p1712086101448089
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was pointed to this on that thread:
moby/moby#39702 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If moby/moby#47663 becomes available we'll have to version detect to use it to avoid breaking all of the users on older docker installs, and I'll be pretty wary of the branching codepath behaving differently vs --privileged.
We have only once required a newer docker version (when we started to require cgroupns=private be available) which caused a lot of consternation even though that change was aimed at mitigating the runc misc issues that caused kind to very flaky/broken ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would've been nice when kind was being built, but now we have to consider that people undoubtedly have test environments that depend on being in a --privileged node :/
containerPath: /var/run/nvidia-container-devices/all | ||
{{< /codeFromInline >}} | ||
|
||
Note: this method only support adding `all` GPUs to a single node. If you want to add specific GPUs to specific nodes, you will need to use the `devices` API. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This limitation of all
is true even in the CDI case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, this reduces the value of adding an API for CDI devices in my opinion.
Same here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @lukeogg. Some minor comments / suggestions from my side.
if ver != "dev" || strings.Split(ver, ".")[0] < "25" { | ||
return nil, errors.Errorf("using devices api in kind requires Docker >= v25, but found %q", ver) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: This also requires that the Daemon be configured in experimental mode.
What about dropping this check entirely and relying on the error reported by the Docker daemon. Should be something along the lines of "no driver for "cdi"". Or is this UX to confusing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer this approach, rather than make kind to have awareness of the provider versions
@@ -33,6 +33,16 @@ func IsAvailable() bool { | |||
return strings.HasPrefix(lines[0], "Docker version") | |||
} | |||
|
|||
// Version gets the version of docker available on the system | |||
func Version() string { | |||
cmd := exec.Command("docker", "version", "--format", "'{{.Server.Version}}'") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use docker info
we can also extract information as to whether experimental
is enabled.
Does changing this function to assertCDISupported()
or something similar make sense here? We could then check the combination of server version and experimental to see whether it is supported.
@@ -212,6 +212,13 @@ func runArgsForNode(node *config.Node, clusterIPFamily config.ClusterIPFamily, n | |||
args..., | |||
) | |||
|
|||
// Append CDI device args (used for GPU support) | |||
if len(node.CDIDevices) > 0 { | |||
for _, device := range node.CDIDevices { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is a compatibility check similar to that for docker also required here? CDI support was only really added with Podman 4.1.0
. As I've stated before, I think it's also fine to rely on the error returned by the CLI itself, but that may not be the nicest UX.
// Append CDI device args (used for GPU support) | ||
if len(node.CDIDevices) > 0 { | ||
for _, device := range node.CDIDevices { | ||
args = append(args, "--device", strings.TrimSpace(device)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having to remember to call strings.TrimSpace
whereever the elements of node.CDIDevices
are used, can we do that somewhere once?
|
||
As a pre-requisite you install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) installed on the host. | ||
|
||
Using `devices` for GPU support requires Docker v25 or later. A [CDI specification](https://github.com/container-orchestrated-devices/container-device-interface) will need to be generated for your device. For Nvidia GPU devices see notes [here.](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#container-device-interface-cdi-support) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using `devices` for GPU support requires Docker v25 or later. A [CDI specification](https://github.com/container-orchestrated-devices/container-device-interface) will need to be generated for your device. For Nvidia GPU devices see notes [here.](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#container-device-interface-cdi-support) | |
Using `devices` for GPU support requires Docker v25 or Podman v4.1.0 or later. A [CDI specification](https://github.com/container-orchestrated-devices/container-device-interface) will need to be generated for your device. For NVIDIA GPU devices see notes [here.](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#container-device-interface-cdi-support) |
|
||
Using `devices` for GPU support requires Docker v25 or later. A [CDI specification](https://github.com/container-orchestrated-devices/container-device-interface) will need to be generated for your device. For Nvidia GPU devices see notes [here.](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#container-device-interface-cdi-support) | ||
|
||
GPU devices can be mapped to Kind node copntainers with the devices API: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Is it Kind
or KinD
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
KIND
or kind
. #3290 (comment)
|
||
Steps to enable this: | ||
|
||
1. Add nvidia as your default runtime in `/etc/docker/daemon.json` If you have the [NVIDIA Container Toolkit installed](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) this can be done with: `sudo nvidia-ctk runtime configure --runtime=docker --set-as-default` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1. Add nvidia as your default runtime in `/etc/docker/daemon.json` If you have the [NVIDIA Container Toolkit installed](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) this can be done with: `sudo nvidia-ctk runtime configure --runtime=docker --set-as-default` | |
1. Add `nvidia` as your default runtime in `/etc/docker/daemon.json` If you have the [NVIDIA Container Toolkit installed](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) this can be done with: `sudo nvidia-ctk runtime configure --runtime=docker --set-as-default` |
- role: control-plane | ||
- role: worker | ||
devices: | ||
- "nvidia.com/gpu=0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, this reduces the value of adding an API for CDI devices in my opinion.
This isn't true. The use of the device mounts uses our "legacy" stack which has a number of shortcomings that using CDI addresses. It also means that users can access devices from other vendors such as Intel or their own device / resource definitions.
worker or control-plane (in HA mode), | ||
KIND runs `kubeadm join` which can be configured using the | ||
KIND runs `kubeadm join` which can be configured using the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kind
vs KinD
vs KIND
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docs consistently use "KIND" or "kind" but not "KinD". We're not using the acronym anymore as it may be podman or in the future something else. Generally just kind
like the CLI, but sometimes "KIND" or kind
(code formatted) is less ambiguously the tool vs the english word.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should put a note in https://kind.sigs.k8s.io/docs/contributing/development/#documentation
@@ -5,6 +5,7 @@ go 1.16 | |||
require ( | |||
github.com/BurntSushi/toml v1.0.0 | |||
github.com/alessio/shellescape v1.4.1 | |||
github.com/container-orchestrated-devices/container-device-interface v0.6.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
regarding dependency management, I'm working on #3335 to switch the modules to 1.17+, which will enable https://go.dev/ref/mod#graph-pruning and make the transitive dependencies clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just glancing at the go.sum changes, there's a concerningly large set of deps, but we can't tell with pre 1.17 go.mod files what is actually in the binary.
It remains important that kind
the go libraries and CLI are trivially embeddable with minimal dependencies which also must meet kubernetes / CNCF's (allowed licenses) dependency standards. We have a number of users using kind directly in test tools etc and we don't want to bring in more dependencies.
https://github.com/cncf-tags/container-device-interface/blob/main/go.mod has a lot of deps for what I would expect to just be a spec/parser.
I will get back to this in the next week or so. Been out for a bit. |
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@lukeogg: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
This PR adds support for passing GPU parameters to Nvidia Container Toolkit through the CDI specification. Although there is a way to map all GPUs to a single node with device mounts, more granularity is desired. This PR will support mapping various device combinations to different nodes as needed.
Would resolve Issue 3164
All GPUs mapped to a single control-plane:
Specific GPUs mapped to specific worker nodes based on index: