Dynamic Resource Allocation #1231
Comments
/triage accepted
IMO, it makes a lot of sense to build out a
DRA is alpha; the beta ETA is 1.32. Starting the work aligned with KEP 4381 makes sense.
Update: There is another KEP (probably the more up-to-date one) that proposes a bunch of changes in 1.31: kubernetes/enhancements#4709. I'd encourage folks who are interested to take a look at it and consider how it fits in with Karpenter's scheduling logic. As @uniemimu called out, the current target for the API proposed in the KEP to go to beta is 1.32.
FYI: Anyone who is interested in developing this PoC can use the SIG-provided example driver for testing changes: https://github.com/kubernetes-sigs/dra-example-driver (structured-parameters branch)
Description
What problem are you trying to solve?
If you haven't heard, there's a lot of buzz in the community about this thing called "Dynamic Resource Allocation" (DRA). Effectively, it's a change to the existing Kubernetes resource model that allows users to select against resources surfaced through a `ResourceSlice` object associated with a node, which exposes that node's hardware. Users create a `ResourceClaim` and perform attribute-based selection using the Common Expression Language (CEL). The proposal for this change is documented here, where there is a ton of discussion about the use-cases and the implications throughout the Kubernetes project.
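To make the selection model concrete, here's a minimal, self-contained Go sketch of attribute-based device selection. All type and field names here are hypothetical stand-ins, not the real `resource.k8s.io` API, and the real API expresses selectors as CEL strings rather than Go predicates:

```go
package main

import "fmt"

// DeviceAttributes is a hypothetical stand-in for the attributes a
// ResourceSlice publishes for each device on a node.
type DeviceAttributes map[string]string

// ResourceSlice is a simplified stand-in for the real API object: a
// node name plus the devices (and their attributes) that node exposes.
type ResourceSlice struct {
	NodeName string
	Devices  []DeviceAttributes
}

// ResourceClaim is a simplified stand-in: the real API uses a CEL
// expression as the selector; here we model it as a Go predicate.
type ResourceClaim struct {
	Name     string
	Selector func(DeviceAttributes) bool
}

// Matches reports whether any device in the slice satisfies the claim.
func Matches(slice ResourceSlice, claim ResourceClaim) bool {
	for _, dev := range slice.Devices {
		if claim.Selector(dev) {
			return true
		}
	}
	return false
}

func main() {
	slice := ResourceSlice{
		NodeName: "node-a",
		Devices: []DeviceAttributes{
			{"vendor": "example.com", "model": "gpu-large", "memory": "80Gi"},
		},
	}
	claim := ResourceClaim{
		Name: "needs-large-gpu",
		// Roughly what a CEL selector over device attributes expresses.
		Selector: func(a DeviceAttributes) bool { return a["model"] == "gpu-large" },
	}
	fmt.Println(Matches(slice, claim)) // true
}
```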
The change to the resource model is of particular importance to Karpenter, since we rely deeply on this resource model to know whether a pod is eligible to schedule against an instance type, which we can think of as a "theoretical" node. Effectively, Karpenter now needs to be aware of the `ResourceSlice` and `ResourceClaim` concepts to know which instance types have the hardware required to schedule a set of pods. As Karpenter performs scheduling against these `ResourceSlices`, it needs to simulate a pod taking up that hardware and rule out an instance type when the hardware can no longer fit the pods scheduling against it.

This has some relation to #751, but I think we can decouple them for now. DRA only requires that we know what the model would look like if the node were to launch; it doesn't necessitate that we allow users to specify arbitrary resources.
CloudProviders can first-class a set of resources they know will appear in the `ResourceSlices` when the node comes up and hand that back in the `GetInstanceTypes` call for the scheduler to reason about. Some solid use-cases are things like NVIDIA GPUs, whose hardware is well-known before launching the instance type, or AWS's Inferentia accelerators.

Tasks
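One way a provider could surface this is sketched below. The interface shape and all names are hypothetical illustrations of the idea, not the actual signature of Karpenter's cloudprovider package:

```go
package main

import "fmt"

// SimulatedSlice is a hypothetical projection of the ResourceSlice a
// node of this instance type would publish after launch.
type SimulatedSlice struct {
	Devices []map[string]string
}

// InstanceTypeInfo pairs an instance type with the hardware the cloud
// provider already knows it will expose (e.g. NVIDIA GPUs, Inferentia).
type InstanceTypeInfo struct {
	Name   string
	Slices []SimulatedSlice
}

// CloudProvider sketches the idea: GetInstanceTypes hands back
// well-known hardware for the scheduler to reason about up front.
type CloudProvider interface {
	GetInstanceTypes() []InstanceTypeInfo
}

// fakeProvider is a toy implementation for illustration.
type fakeProvider struct{}

func (fakeProvider) GetInstanceTypes() []InstanceTypeInfo {
	return []InstanceTypeInfo{
		{
			Name: "gpu-instance-large",
			Slices: []SimulatedSlice{{
				Devices: []map[string]string{
					{"vendor": "nvidia.com", "model": "gpu-large"},
				},
			}},
		},
	}
}

func main() {
	var cp CloudProvider = fakeProvider{}
	for _, it := range cp.GetInstanceTypes() {
		fmt.Println(it.Name, "slices:", len(it.Slices))
	}
}
```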
I want to build out a set of tasks that can be taken up to get a PoC for this working. Ideally, someone could build this out with kwok, and then we could apply the same changes to the Azure and AWS providers.
Working Group
Separately, if you are interested in attending the Working Group and contributing to other use-cases around DRA, the log is here, and the official working group charter and meeting times are here.
The YouTube Playlist for previous meetings can also be found here.