[EKS] [request]: Use Dynamic Resource Allocation for EKS #2314

Open
junhwani opened this issue Mar 26, 2024 · 14 comments
Labels
EKS Amazon Elastic Kubernetes Service Proposed Community submitted issue

Comments

@junhwani

junhwani commented Mar 26, 2024

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request
According to https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/, Kubernetes v1.29 includes cluster-level API support for dynamic resource allocation (DRA).
For the benefit of customers using EKS, I would like EKS to provide some way to enable DRA.

Which service(s) is this request for?
EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
It is difficult to set --feature-gates because the flag has to be set on the kube-apiserver, kube-scheduler, kube-controller-manager and kubelet, and EKS users cannot change the flags of the managed control plane components.
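
For reference, a minimal sketch of what enabling DRA would involve on a self-managed cluster, where the component flags can be edited. The helper name `componentFlags` is hypothetical, and the API group version passed in (`resource.k8s.io/v1beta1` here) is an assumption that depends on the Kubernetes release; this only prints flags and does not configure anything.

```go
package main

import "fmt"

// componentFlags returns, per component, the command-line flags that would
// turn DRA on. groupVersion is the resource.k8s.io API version to serve
// (an assumption; the right value depends on the Kubernetes release).
func componentFlags(groupVersion string) map[string][]string {
	gate := "--feature-gates=DynamicResourceAllocation=true"
	return map[string][]string{
		// Only the API server also needs the API group switched on.
		"kube-apiserver":          {gate, "--runtime-config=" + groupVersion + "=true"},
		"kube-scheduler":          {gate},
		"kube-controller-manager": {gate},
		"kubelet":                 {gate},
	}
}

func main() {
	flags := componentFlags("resource.k8s.io/v1beta1")
	order := []string{"kube-apiserver", "kube-scheduler", "kube-controller-manager", "kubelet"}
	for _, c := range order {
		fmt.Println(c, flags[c])
	}
}
```

On EKS the first three components belong to the managed control plane, which is why these flags cannot simply be added by the cluster owner.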

Additional context
I think many customers want to allocate resources appropriately, so I would appreciate it if this could be enabled soon.

Attachments
How to enable dynamic resource allocation(DRA)

@junhwani junhwani added the Proposed Community submitted issue label Mar 26, 2024
@mikestef9 mikestef9 added the EKS Amazon Elastic Kubernetes Service label Mar 26, 2024
@dims
Member

dims commented Nov 7, 2024

Kubernetes code freeze for 1.32 is tomorrow (Friday 8th November 2024). As of #127511, the DRA feature gate is defined as follows:

	DynamicResourceAllocation: {
		{Version: version.MustParse("1.26"), Default: false, PreRelease: featuregate.Alpha},
		{Version: version.MustParse("1.32"), Default: false, PreRelease: featuregate.Beta},
	}

Note that the default is false for 1.32.
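
To make the reading of that list concrete, here is a simplified sketch of how a versioned gate definition resolves for a given cluster version: the newest entry whose version is not newer than the cluster applies. The types and the `effective` helper are hypothetical stand-ins, not the Kubernetes `featuregate` package, and the sketch assumes major version 1 throughout.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// spec mirrors one entry of the versioned feature gate list quoted above.
type spec struct {
	Version    string // release in which this entry takes effect, e.g. "1.32"
	Default    bool
	PreRelease string
}

// draGate copies the two entries from the comment above.
var draGate = []spec{
	{Version: "1.26", Default: false, PreRelease: "Alpha"},
	{Version: "1.32", Default: false, PreRelease: "Beta"},
}

// minor extracts N from a "1.N" version string, or -1 if it cannot be parsed.
func minor(v string) int {
	parts := strings.Split(v, ".")
	if len(parts) < 2 {
		return -1
	}
	n, err := strconv.Atoi(parts[1])
	if err != nil {
		return -1
	}
	return n
}

// effective returns the newest entry whose Version is <= clusterVersion.
// The bool is false when the gate did not exist yet at that version.
func effective(specs []spec, clusterVersion string) (spec, bool) {
	var best spec
	found := false
	for _, s := range specs {
		if minor(s.Version) <= minor(clusterVersion) &&
			(!found || minor(s.Version) > minor(best.Version)) {
			best, found = s, true
		}
	}
	return best, found
}

func main() {
	for _, v := range []string{"1.25", "1.31", "1.32"} {
		s, ok := effective(draGate, v)
		if !ok {
			fmt.Printf("%s: gate not present\n", v)
			continue
		}
		fmt.Printf("%s: %s, default=%t\n", v, s.PreRelease, s.Default)
	}
}
```

In this model a 1.32 cluster resolves to the Beta entry, and its default is still false, so the gate has to be switched on explicitly.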

xref: #512

@toVersus

Even if both the FeatureGate and API for Dynamic Resource Allocation (DRA) are enabled, support for the beta API from DRA GPU drivers like NVIDIA and Intel is still required. This is because there are some changes from the alpha API, making it impractical for users to use the alpha API at this stage. This support will likely be implemented as soon as Kubernetes 1.32 is released. However, there are several KEPs derived from DRA that have been proposed, so it cannot yet be considered stable.

In particular, when using the current DRA for cases that involve partitioning devices like MIGs, it deviates from the original "Dynamic" meaning of DRA.

Note: This partitioning is static. Dynamically reconfiguring a card to match demand is not part of this KEP. It's covered by the "partitionable devices" extension.

Furthermore, DRA is not yet integrated with Cluster Autoscaler and Karpenter, which makes it hard to use in production.

@aleksy-zalenski

Do we have any updates regarding that? Is it possible to use the Dynamic Resource Allocation (DRA) on EKS with version 1.30 or 1.31?

@pohly

pohly commented Nov 13, 2024

Let's clarify which part of DRA this feature is about. In my opinion, it should focus on enabling the v1beta1 API in 1.32. Enabling older alpha APIs in past Kubernetes releases makes no sense anymore. Enabling alpha features in 1.32 and beyond might make sense, but is probably too risky because they are alpha.

Even if both the FeatureGate and API for Dynamic Resource Allocation (DRA) are enabled, support for the beta API from DRA GPU drivers like NVIDIA and Intel is still required.

Those will come shortly after the 1.32 release. We need the new v1beta1 API to be released before the drivers can be updated.

However, there are several KEPs derived from DRA that have been proposed, so it cannot yet be considered stable.

What got promoted to beta is the core DRA with structured parameters. Promotion to beta means:

  • The API is guaranteed to remain available for several releases until it either gets promoted to GA or superseded by another beta. In both cases, the v1beta1 remains available, which makes it safe to rely on it.
  • We will fix whatever bugs are found in the implementation and backport those fixes.

I think core DRA is now stable, even if work on additional features continues.

@sftim

sftim commented Nov 13, 2024

The API is guaranteed to remain available for several releases until it either gets promoted to GA or superseded by another beta. In both cases, the v1beta1 remains available, which makes it safe to rely on it.

If one day Kubernetes drops the API, clusters that rely on the behavior get a breaking change. Beta APIs are rarely dropped but it can happen.

@pohly

pohly commented Nov 13, 2024

According to https://kubernetes.io/docs/reference/using-api/deprecation-policy/, a beta API remains available for three releases, then gets deprecated and removed after another three releases. Removing the API sooner would be a break of those stability guarantees.

Waiting one release after beta graduation makes no difference regarding API availability. The current state (API group off, feature gate off) won't change. Some additional, future bugs might be fixed, but it's also possible that those bugs won't be found unless it gets enabled.
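
Whether a given cluster is in the "API group off" state can be checked from its API discovery document. The sketch below assumes the JSON comes from something like `kubectl get --raw /apis` and only parses it; the function name `servesGroupVersion` and the sample payload are illustrative, not taken from a real EKS cluster.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// apiGroupList holds the fields of an APIGroupList that this check needs.
type apiGroupList struct {
	Groups []struct {
		Name     string `json:"name"`
		Versions []struct {
			Version string `json:"version"`
		} `json:"versions"`
	} `json:"groups"`
}

// servesGroupVersion reports whether the discovery document lists the
// given API group at the given version.
func servesGroupVersion(discovery []byte, group, version string) (bool, error) {
	var list apiGroupList
	if err := json.Unmarshal(discovery, &list); err != nil {
		return false, err
	}
	for _, g := range list.Groups {
		if g.Name != group {
			continue
		}
		for _, v := range g.Versions {
			if v.Version == version {
				return true, nil
			}
		}
	}
	return false, nil
}

func main() {
	// Illustrative payload for a cluster that does not serve resource.k8s.io.
	sample := []byte(`{"kind":"APIGroupList","groups":[
		{"name":"apps","versions":[{"groupVersion":"apps/v1","version":"v1"}]}]}`)
	ok, err := servesGroupVersion(sample, "resource.k8s.io", "v1beta1")
	fmt.Println(ok, err)
}
```

A positive result only shows that the API group is served; the feature gate on the scheduler, controller manager and kubelet is a separate switch.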

@pohly

pohly commented Nov 13, 2024

a beta API remains available for three releases

Or perhaps more precisely, can remain available that long. It may get replaced or removed sooner, but then the "must remain available for three releases" kicks in.

We've been very careful with the API design of DRA. I don't see a reason why it should get replaced by a v1beta2. I also don't think it's likely that it gets removed outright.
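
As a rough illustration of the arithmetic behind that reading of the deprecation policy, counting in minor releases only and ignoring the policy's alternative time-based limit: the helper names are hypothetical, and the three-release window is taken from the comments above rather than computed from the policy text.

```go
package main

import "fmt"

// window is the "three releases" figure quoted from the deprecation policy.
const window = 3

// latestDeprecation returns the minor release by which a beta API introduced
// in minor release `introduced` would have to be deprecated.
func latestDeprecation(introduced int) int { return introduced + window }

// earliestRemoval returns the first minor release that could stop serving a
// beta API that was deprecated in minor release `deprecated`.
func earliestRemoval(deprecated int) int { return deprecated + window }

func main() {
	introduced := 32 // v1beta1 of resource.k8s.io arrives in 1.32
	dep := latestDeprecation(introduced)
	fmt.Printf("deprecated no later than 1.%d\n", dep)
	fmt.Printf("served at least until 1.%d\n", earliestRemoval(dep))

	// If the API were deprecated sooner, the removal clock starts sooner too.
	fmt.Printf("if deprecated in 1.33: served at least until 1.%d\n", earliestRemoval(33))
}
```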

@junhwani
Author

If DRA is not stable to use, is the best way to use HPA and Karpenter now?

If there is another good way, please recommend it.

@toVersus

@pohly, thanks for chiming in on this discussion! I really appreciate everything you all have done so far. My main concern is the same one you discussed in kubernetes/kubernetes#127511 (comment): EKS doesn't offer a way for users to opt in to the feature gate or the beta API.

If it’s going to be enabled by default on all clusters starting with EKS 1.32, I’m against it. I think @sftim is suggesting waiting for one minor version because there’s always a chance of unexpected impacts on existing workloads. It’s fine if people who want to use it have the option, but for users needing a stable production environment, DRA is a game changer for Kubernetes with a broad scope of changes, which makes it feel a bit risky.

If DRA is not stable to use, is the best way to use HPA and Karpenter now?

I think it would be better to explain the challenges in more detail.

@pohly

pohly commented Nov 14, 2024

there’s always a chance of unexpected impacts on existing workloads.

That's the key question regarding "enabled by default". My two cents: I see the risk as pretty low, because kube-scheduler, kube-controller-manager and kubelet pretty much don't do anything related to DRA when the feature isn't used.

@matthenry87

This will be critical for Spring Boot apps to account for the ridiculous CPU spike that happens at startup. We go from 2-3 full cores down to 10 millicores once the app is stable. Setting the request to the 10milli is good for longer term but is a nightmare when higher number of pods are starting up at the same time.

@sftim

sftim commented Dec 13, 2024

This will be critical for Spring Boot apps to account for the ridiculous CPU spike that happens at startup. We go from 2-3 full cores down to 10 millicores once the app is stable. Setting the request to the 10milli is good for longer term but is a nightmare when higher number of pods are starting up at the same time.

I think you're thinking of a different Kubernetes feature, @matthenry87

@matthenry87

This will be critical for Spring Boot apps to account for the ridiculous CPU spike that happens at startup. We go from 2-3 full cores down to 10 millicores once the app is stable. Setting the request to the 10milli is good for longer term but is a nightmare when higher number of pods are starting up at the same time.

I think you're thinking of a different Kubernetes feature, @matthenry87

Ah hrm, you are correct. I just need to be able to write some sort of controller that adjusts the resource requests of a pod's containers without triggering a pod restart. Sorry for my misattribution.
