
Dynamic Resource Allocation #1231

Open · 5 tasks
jonathan-innis opened this issue May 3, 2024 · 5 comments

Labels: kind/feature (Categorizes issue or PR as related to a new feature.), triage/accepted (Indicates an issue or PR is ready to be actively worked on.)

Comments

@jonathan-innis (Member) commented May 3, 2024

Description

What problem are you trying to solve?

If you haven't heard, there's a lot of buzz in the community about this thing called "Dynamic Resource Allocation" (DRA). Effectively, it's a change to the existing Kubernetes resource model that lets users select against node hardware surfaced through ResourceSlice objects associated with a node. Users create a ResourceClaim and select devices by their attributes using Common Expression Language (CEL) expressions.
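
To make the model concrete, here's a minimal Go sketch. The types are simplified stand-ins for illustration only, not the real `resource.k8s.io` API (whose shape is still changing across alpha releases): a ResourceSlice publishes a node's devices along with their attributes, and a ResourceClaim carries a CEL expression that selects devices by those attributes.

```go
package main

import "fmt"

// Simplified, illustrative shapes only -- not the real resource.k8s.io types.
// A ResourceSlice publishes the devices a node's driver exposes, each with a
// set of attributes; a ResourceClaim selects devices with a CEL expression
// evaluated against those attributes.
type Device struct {
	Name       string
	Attributes map[string]string
}

type ResourceSlice struct {
	NodeName string
	Driver   string
	Devices  []Device
}

type ResourceClaim struct {
	Name string
	// CEL expression evaluated per device.
	Selector string
}

func main() {
	slice := ResourceSlice{
		NodeName: "node-a",
		Driver:   "gpu.example.com",
		Devices: []Device{
			{Name: "gpu-0", Attributes: map[string]string{"vendor": "nvidia", "memory": "40Gi"}},
		},
	}
	claim := ResourceClaim{
		Name:     "training-gpu",
		Selector: `device.attributes["vendor"] == "nvidia" && device.attributes["memory"] == "40Gi"`,
	}
	fmt.Printf("slice %s/%s; claim %q selects with: %s\n", slice.NodeName, slice.Driver, claim.Name, claim.Selector)
}
```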

The proposal for this change is documented here; there is a ton of discussion of the use-cases and the implications throughout the Kubernetes project.

The change to the resource model is of particular importance to Karpenter, since we rely deeply on this resource model to know whether a pod is eligible to schedule against an instance type, which we can think of as a "theoretical" node. Effectively, Karpenter now needs to be aware of the ResourceSlice and ResourceClaim concepts to know which instance types have the hardware required to schedule a set of pods. As Karpenter schedules against these ResourceSlices, it needs to simulate a pod taking up that hardware and rule out an instance type once the hardware can no longer fit the pods scheduling against it.
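
As a rough sketch of that simulation, continuing with the hypothetical types above (this is not Karpenter's actual scheduler code): every claim carried by the pods must be satisfiable by the devices the instance type's expected ResourceSlices would expose, and each match consumes a device so later claims can no longer use it.

```go
// InstanceType stands in for Karpenter's notion of a launchable node shape;
// Slices holds the devices we expect the node to expose once it comes up.
type InstanceType struct {
	Name   string
	Slices []ResourceSlice
}

// simulate consumes one matching device per claim and returns false as soon
// as a claim cannot be satisfied -- the signal to rule this instance type
// out. The match predicate is where CEL evaluation plugs in (see the sketch
// after the task list below).
func simulate(it InstanceType, claims []ResourceClaim, match func(Device, ResourceClaim) bool) bool {
	var remaining []Device
	for _, s := range it.Slices {
		remaining = append(remaining, s.Devices...)
	}
	for _, c := range claims {
		satisfied := false
		for i, d := range remaining {
			if match(d, c) {
				remaining = append(remaining[:i], remaining[i+1:]...) // device is now taken
				satisfied = true
				break
			}
		}
		if !satisfied {
			return false
		}
	}
	return true
}
```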

This has some relation to #751, but I think we can decouple them for now. DRA only requires that we know what the resource model would look like if the node were to launch; it doesn't necessitate that we allow users to specify arbitrary resources.

CloudProviders can first-class a set of resources they know will appear in the ResourceSlices when the node comes up and hand those back in the GetInstanceTypes call for the scheduler to reason about. Solid use-cases for this are things like NVIDIA GPUs, whose hardware is well-known before the instance launches, or AWS's Inferentia accelerators.
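
A hypothetical sketch of what that first-classing could look like, again using the simplified types from the first sketch (the instance type, driver name, and attributes are illustrative; Karpenter's cloudprovider.InstanceType has no such field today):

```go
// wellKnownSlices returns the ResourceSlices a provider already knows an
// instance type will expose, so the scheduler can reason about the hardware
// before any node exists. A p4d.24xlarge, for example, always carries eight
// NVIDIA A100 GPUs.
func wellKnownSlices(instanceTypeName string) []ResourceSlice {
	switch instanceTypeName {
	case "p4d.24xlarge":
		devices := make([]Device, 8)
		for i := range devices {
			devices[i] = Device{
				Name:       fmt.Sprintf("gpu-%d", i),
				Attributes: map[string]string{"vendor": "nvidia", "model": "a100", "memory": "40Gi"},
			}
		}
		return []ResourceSlice{{Driver: "gpu.nvidia.com", Devices: devices}}
	default:
		return nil
	}
}
```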

Tasks

I want to build out a set of tasks that can be taken up to get a PoC for this working. Ideally, someone could build this out with Kwok, and then we could apply the same changes to the Azure and AWS providers.

  • Add ResourceSlice to the CloudProvider InstanceType model
  • Return ResourceSlices from the GetInstanceTypes() call in Kwok
  • Handle adding ResourceClaims to pod requirements
  • Handle ResourceClaim/ResourceSlice compatibility with CEL resolution
  • Handle simulating ResourceClaims against ResourceSlices through CEL (the tricky bit; see the sketch after this list)
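
For the CEL piece, here's a minimal standalone sketch using github.com/google/cel-go. The `device.attributes` variable shape is an assumption for illustration; the KEP defines its own CEL environment and variables, and a real implementation would cache compiled programs rather than rebuild the environment per call.

```go
package main

import (
	"fmt"
	"log"

	"github.com/google/cel-go/cel"
)

// matches compiles a claim's CEL selector and evaluates it against a single
// device's attribute map, returning whether the device satisfies the claim.
func matches(selector string, attrs map[string]string) (bool, error) {
	env, err := cel.NewEnv(
		// Expose the device to the expression as a dynamic map.
		cel.Variable("device", cel.MapType(cel.StringType, cel.DynType)),
	)
	if err != nil {
		return false, err
	}
	ast, iss := env.Compile(selector)
	if iss.Err() != nil {
		return false, iss.Err()
	}
	prg, err := env.Program(ast)
	if err != nil {
		return false, err
	}
	out, _, err := prg.Eval(map[string]any{
		"device": map[string]any{"attributes": attrs},
	})
	if err != nil {
		return false, err
	}
	matched, ok := out.Value().(bool)
	if !ok {
		return false, fmt.Errorf("selector did not evaluate to a bool")
	}
	return matched, nil
}

func main() {
	ok, err := matches(
		`device.attributes["vendor"] == "nvidia"`,
		map[string]string{"vendor": "nvidia", "memory": "40Gi"},
	)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("device matches claim:", ok) // true
}
```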

Working Group

Separately, if you are interested in attending the Working Group and contributing to other use-cases around DRA, the meeting log is here, and the official working group charter and meeting times are here.

The YouTube Playlist for previous meetings can also be found here.

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@jonathan-innis jonathan-innis added the kind/feature Categorizes issue or PR as related to a new feature. label May 3, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label May 3, 2024
@jonathan-innis (Member, Author):

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 3, 2024
@jonathan-innis (Member, Author) commented May 3, 2024

IMO, it makes a lot of sense to build out a staging/dra branch for the PoC work here. We can start building out the changes and collaborate on them without pulling them into the main branch. This is definitely going to be important since the DRA stuff is in beta and still in flux.

@uniemimu commented May 6, 2024

> This is definitely going to be important since the DRA stuff is in beta and still in flux.

DRA is alpha. DRA beta ETA is 1.32. Starting the work aligned with KEP 4381 makes sense.

@jonathan-innis (Member, Author):

Update: There is another KEP (probably the more up-to-date one) that proposes a bunch of changes in 1.31: kubernetes/enhancements#4709. I'd encourage folks who are interested to take a look and consider how it fits in with Karpenter's scheduling logic.

As @uniemimu called out, the current target is 1.32 for the API that is proposed in the KEP to go to beta.

@jonathan-innis (Member, Author) commented Jul 8, 2024

FYI: Anyone who is interested in developing this PoC can use the SIG-provided example driver for testing changes: https://github.com/kubernetes-sigs/dra-example-driver (structured-parameters branch)
