[EKS] [request]: Use Dynamic Resource Allocation for EKS #2314
Comments
Kubernetes code freeze for 1.32 is tomorrow (Friday 8th November 2024). As of #127511, the DRA feature gate is as follows:
Note that the default is xref: #512 |
Even if both the FeatureGate and API for Dynamic Resource Allocation (DRA) are enabled, support for the beta API from DRA GPU drivers like NVIDIA and Intel is still required. This is because there are some changes from the alpha API, making it impractical for users to use the alpha API at this stage. This support will likely be implemented as soon as Kubernetes 1.32 is released. However, there are several KEPs derived from DRA that have been proposed, so it cannot yet be considered stable.
In particular, using the current DRA for cases that involve partitioning devices, such as MIG, deviates from the original "Dynamic" meaning of DRA.
Furthermore, DRA is not yet integrated with Cluster Autoscaler and Karpenter, which makes it hard to use in production. |
Do we have any updates regarding that? Is it possible to use the Dynamic Resource Allocation (DRA) on EKS with version 1.30 or 1.31? |
Let's clarify which part of DRA this feature is about. In my opinion, it should focus on enabling the v1beta1 API in 1.32. Enabling older alpha APIs in past Kubernetes releases makes no sense anymore. Enabling alpha features in 1.32 and beyond might make sense, but is probably too risky because they are alpha.
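As a rough illustration of what "enabling the v1beta1 API in 1.32" involves on a self-managed control plane, the sketch below builds the command-line flags per component. It assumes the `DynamicResourceAllocation` feature gate and the `resource.k8s.io/v1beta1` API group named in the upstream DRA documentation for 1.32; the helper `dra_flags` is hypothetical and only formats strings. On EKS the control-plane flags are not user-configurable, which is the point of this request.

```python
# Minimal sketch: flags needed to turn on beta DRA in Kubernetes 1.32.
# Assumption: the feature gate name and API group below match the upstream
# DRA docs for 1.32. `dra_flags` is a hypothetical helper, not a real tool.

DRA_FEATURE_GATE = "DynamicResourceAllocation"
DRA_BETA_API = "resource.k8s.io/v1beta1"

# Components mentioned in this issue as needing the feature gate.
COMPONENTS = (
    "kube-apiserver",
    "kube-scheduler",
    "kube-controller-manager",
    "kubelet",
)


def dra_flags(component: str) -> list[str]:
    """Return the DRA-related flags for one Kubernetes component."""
    if component not in COMPONENTS:
        raise ValueError(f"unknown component: {component}")
    flags = [f"--feature-gates={DRA_FEATURE_GATE}=true"]
    # Only the API server serves API groups, so only it gets --runtime-config.
    if component == "kube-apiserver":
        flags.append(f"--runtime-config={DRA_BETA_API}=true")
    return flags


if __name__ == "__main__":
    for name in COMPONENTS:
        print(name, " ".join(dra_flags(name)))
```

Both switches are required: the feature gate alone does not make the API server serve the beta API group.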
Those will come shortly after the 1.32 release. We need the new v1beta1 API to be released before the drivers can be updated.
What got promoted to beta is the core DRA with structured parameters. Promotion to beta means:
I think core DRA is now stable, even if work on additional features continues. |
If one day Kubernetes drops the API, clusters that rely on the behavior get a breaking change. Beta APIs are rarely dropped but it can happen. |
According to https://kubernetes.io/docs/reference/using-api/deprecation-policy/, a beta API remains available for three releases, then gets deprecated and removed after another three releases. Removing the API sooner would be a break of those stability guarantees. Waiting one release after beta graduation makes no difference regarding API availability. The current state (API group off, feature gate off) won't change. Some additional, future bugs might be fixed, but it's also possible that those bugs won't be found unless it gets enabled. |
Or perhaps more precisely, can remain available that long. It may get replaced or removed sooner, but then the "must remain available for three releases" kicks in. We've been very careful with the API design of DRA. I don't see a reason why it should get replaced by a v1beta2. I also don't think it's likely that it gets removed outright. |
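To make the release arithmetic from the two comments above concrete, here is a small sketch. It assumes the latest-case reading described there (three minor releases until deprecation, three more until removal), counts minor releases only, and ignores any calendar-time conditions in the policy. The function name `beta_api_timeline` is hypothetical.

```python
# Minimal sketch of the beta API timeline discussed above.
# Assumption: a three-release window before deprecation and another
# three-release window before removal, counted in minor versions only.


def beta_api_timeline(introduced_minor: int, window: int = 3) -> dict[str, str]:
    """Return the latest-case deprecation/removal versions for a beta API."""
    deprecated_by = introduced_minor + window
    removed_after = deprecated_by + window
    return {
        "introduced": f"1.{introduced_minor}",
        "deprecated_by": f"1.{deprecated_by}",
        "may_be_removed_in": f"1.{removed_after}",
    }


if __name__ == "__main__":
    # v1beta1 of resource.k8s.io ships in Kubernetes 1.32.
    print(beta_api_timeline(32))
```

Under these assumptions, waiting one release before enabling the API does not change how long it stays available, which matches the argument made above.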
If DRA is not yet stable enough to use, is the best approach for now to use HPA and Karpenter? If there is another good way, please recommend it. |
@pohly, thanks for chiming in on this discussion! I really appreciate everything you guys have done so far. My main concern is the same as what you all discussed in kubernetes/kubernetes#127511 (comment): EKS doesn't offer a feature that lets users opt in to the FeatureGate / Beta API. If it’s going to be enabled by default on all clusters starting with EKS 1.32, I’m against it. I think @sftim is suggesting waiting for one minor version because there’s always a chance of unexpected impacts on existing workloads. It’s fine if people who want to use it have the option, but for users who need a stable production environment it feels a bit risky: DRA is a game changer for Kubernetes with a broad scope of changes.
I think it would be better to explain the challenges in more detail. |
That's the key question regarding "enabled by default". My two cents: I see the risk as pretty low, because kube-scheduler, kube-controller-manager and kubelet pretty much don't do anything related to DRA when the feature isn't used. |
This will be critical for Spring Boot apps to account for the ridiculous CPU spike that happens at startup. We go from 2-3 full cores down to 10 millicores once the app is stable. Setting the request to 10 millicores is good for the longer term but is a nightmare when a large number of pods are starting up at the same time. |
I think you're thinking of a different Kubernetes feature, @matthenry87 |
Ah hrm, you are correct. I just need to be able to write some sort of controller that adjusts the resource requests of a pod's containers without triggering a pod restart. Sorry for my misattribution. |
Community Note
Tell us about your request
According to https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/, Kubernetes v1.29 includes cluster-level API support for dynamic resource allocation (DRA).
For the benefit of customers using EKS, I would like some way to enable dynamic resource allocation (DRA) on EKS clusters.
Which service(s) is this request for?
EKS
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
It is difficult to add the --feature-gates env because the kube-apiserver, kube-scheduler, kube-controller-manager and kubelet
Additional context
I think many customers want to allocate resources appropriately, so I would appreciate it if you could apply it quickly
Attachments
How to enable dynamic resource allocation(DRA)