Karpenter workload consolidation/defragmentation #1091
Comments
What about https://github.com/kubernetes-sigs/descheduler, which already implements some of this?
I've been using descheduler for this, but mind that one can only use […]. Also mind that this is really sub-optimal compared to what Cluster Autoscaler (CAS) does: with CAS I can set a threshold of 70-80% and it will really condense the cluster. CAS gets scheduling mostly correct because it simulates kube-scheduler. I think the CAS approach is the correct one. I need to really condense the cluster, and the only way to do that is to simulate what will happen after evictions; at the very least, evicting pods that will just reschedule is meaningless. I also think changing any instance types is overkill. If I have one remaining node with some CPUs to spare, that is fine. At minimum I would like a strategy that tries to put pods on existing nodes first, and then checks whether another instance type fits the remaining node (the one with the lowest utilisation). A fancier thing to do, which I think comes much later, is to look at the overall memory vs CPU balance in the cluster. E.g. if the cluster shifts from generally CPU-bound to memory-bound, it would be nice if Karpenter could adjust for that. But then we hit the knapsack-like problem, which can get a bit tricky to work out.
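For readers who haven't used the descheduler: as far as I understand, the strategy aimed at compacting pods onto fewer nodes is HighNodeUtilization (LowNodeUtilization does the opposite and spreads load). A rough sketch of such a policy, assuming the v1alpha1 DeschedulerPolicy format; the thresholds are only illustrative:

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "HighNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        # Nodes whose usage is below all of these thresholds are treated
        # as under-utilized; their pods are evicted in the hope that the
        # scheduler packs them onto the remaining nodes.
        thresholds:
          "cpu": 20
          "memory": 20
          "pods": 20
```

Note that the descheduler only evicts; whether the evicted pods actually end up on fewer nodes depends on the scheduler's scoring (e.g. a MostAllocated-style configuration).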
Currently Karpenter will scale down empty nodes automatically. However, it does not actively move pods around to create empty nodes.
I'd like Karpenter to terminate (or request termination of) a node when it has a low density of pods and there is another node which could take those pods (#1491).
Karpenter looks exciting, but for large-scale K8s cluster deployments this is pretty much a prerequisite. Is there any discussion or design document about the possible approaches that could be taken for bin packing the existing workload?
We're currently laying the foundation by implementing the remaining scheduling spec (affinity/anti-affinity). After that, we plan to make rapid progress on defrag design. I expect we'll start small (e.g. one node at a time, simple compactions) and get more sophisticated over time. This is pending design, though. In the short term, you can combine […].
In the end my requirement is the same as the others in this thread, but one reason for it, which I have not yet seen captured, is that minimizing (to some reasonable limit) the number of hosts reduces the cost of anything billed per host (e.g., DataDog APM).
Hi, this is really important; Karpenter needs to have the same functionality as cluster-autoscaler here. This is preventing me from switching to Karpenter.
We also need this feature; our cluster's cost increased when we migrated to Karpenter due to the increased node count. In our scenario, we use Karpenter in an EMR on EKS cluster, where EMR creates CRs (jobbatch) on the EKS cluster and those CRs create pods, so we cannot simply add PodDisruptionBudgets for those workloads.
We also need this feature. In our scenario, we would like to terminate under-utilized nodes by actively moving pods around to create empty nodes. Any idea about its release date?
Am I understanding correctly: this could in theory terminate pods that had recently been spun up on an existing node? Are there any workarounds right now to move affected pods to another node so there is no interruption? Or would this only be achieved with the tracked feature?
We've laid down most of the groundwork for this feature with in-flight nodes, not binding pods, topology, affinity, etc. This is one of our highest priorities right now, and we'll be sure to make early builds available to y'all. If you're interested in discussing the design, feel free to drop by https://github.com/aws/karpenter/blob/main/WORKING_GROUP.md
Running the previous helm command produces a YAML file with these images, which cause the webhook to fail; after swapping out the images it worked fine.

$ ag image:
karpenter.yaml
305: image: public.ecr.aws/karpenter/controller:v0.13.2@sha256:af463b2ab0a9b7b1fdf0991ee733dd8bcf5eabf80907f69ceddda28556aead31
344: image: public.ecr.aws/karpenter/webhook:v0.13.2@sha256:e10488262a58173911d2b17d6ef1385979e33334807efd8783e040aa241dd239

The webhook then fails with:

error Status:Failure,Message:mutation failed: cannot decode incoming new object: json: unknown field \"consolidation\",Reason:BadRequest,Details:nil,Code:400,}"}
@dennisme I can't reproduce this:
@tzneal yeppp, it was a local helm version issue and the OCI registry vs […].
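For anyone else hitting the webhook error above: it typically means the running Karpenter webhook is older than the chart/CRD that introduced the consolidation field, so aligning the chart and image versions resolves it. Once the versions match, enabling the feature is a single Provisioner field. A minimal sketch, assuming the v1alpha5 Provisioner API (the name and providerRef are placeholders):

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Let Karpenter remove nodes whose pods fit elsewhere and replace nodes
  # with cheaper instance types where possible. As I understand it, this
  # cannot be combined with ttlSecondsAfterEmpty on the same Provisioner.
  consolidation:
    enabled: true
  providerRef:
    name: default      # hypothetical AWSNodeTemplate
```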
@offzale: I think you want to increase your provisioner.spec.ttlSecondsAfterEmpty to longer than your cron-job period. This will keep the idle nodes around from the 'last' cron-job run. Alternatively, maybe shutting them down and recreating them is actually the right thing to do? This depends on the time intervals involved, the cost of idle resources, the desired 'cold' responsiveness, and instance shutdown/boot-up overhead. The point is that I don't think there's a general crystal-ball strategy here that works for everyone; unfortunately, you will need to tune it based on what you know about your predictable future workload and your desired response-delay vs cost tradeoff.
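A minimal sketch of the ttlSecondsAfterEmpty approach described above, again assuming the v1alpha5 Provisioner API; the name, providerRef, and the 30-minute value are placeholders to be tuned against the actual cron period:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: batch          # hypothetical provisioner name
spec:
  # Keep empty nodes around for 30 minutes before terminating them,
  # e.g. slightly longer than the gap between cron-job runs so the next
  # run reuses warm capacity instead of waiting for fresh nodes.
  ttlSecondsAfterEmpty: 1800
  providerRef:
    name: default      # hypothetical AWSNodeTemplate
```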
I had a couple of things I needed to solve before testing Karpenter, and once I solved them, here is the first comparison. I'm testing it by replacing an ASG with ~60 m5n.8xlarge instances. The total CPU request of all the pods is ~1800 cores and the total memory request is 6.8TB. With Karpenter, the allocatable CPU is 2400 cores (510 unallocated cores) and the allocatable memory is 13.8TB (6.18TB unallocated memory). Here is the diversity of the nodes with Karpenter:
Thanks for the info @liorfranko. What does your provisioner look like? Karpenter implements two forms of consolidation. The first is where it will delete a node if the pods on that node can run elsewhere. Due to the anti-affinity rules on your pods, it sounds like this isn't possible. The second is where it will replace a node with a cheaper node if possible. This should be happening in your case unless the provisioner is overly constrained to larger types only. Since you've got a few 2xlarge types there, that doesn't appear to be the case either. That looks to be 85 nodes that Karpenter has launched. Do your workloads have preferred anti-affinities or node selectors?
Here is the provisioner spec:
Here is an example of possible consolidation:
The pod […]
What does the […]
I think it's related to several PDBs that were configured with […]
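For context on why PDBs matter here: consolidation works by evicting pods and letting them reschedule, and (as far as I know) Karpenter will not violate a PodDisruptionBudget, so a budget that allows zero voluntary disruptions effectively pins its pods' nodes. A generic example of such a blocking budget (names and labels are hypothetical):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  # maxUnavailable: 0 forbids any voluntary eviction, so nodes running
  # these pods cannot be drained for consolidation.
  maxUnavailable: 0
  selector:
    matchLabels:
      app: my-app
```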
It had almost no effect: the total number of cores decreased by 20 and the memory by 200GB. I think the problem is related to the chosen instances.
On the other hand, I see many […]
Both of the above can be replaced with an […]
I looked at your provisioner and it looks like these are spot nodes. We currently don't replace spot nodes with smaller spot nodes. The reasoning for this is that we don't have a way of knowing whether the node we would replace it with is as available as, or more available than, the node being replaced. By restricting the instance size, we could potentially be moving you from a node that you're likely to keep for a while to a node that will be terminated in a short period of time. If these were on-demand nodes, then we would replace them with smaller instance types.
Thanks @tzneal. Do you know when it will be supported? Or could you at least let me choose to enable it? And is the reason for choosing the c6i.12xl over the m5.8xl the probability of interruptions?
For spot, we use the capacity-optimized-prioritized strategy when we launch the node. The strategies are documented here, but essentially it trades a slightly more expensive node for a lower chance of interruption. Node size is also not always related to cost: I just checked the spot pricing for us-east-1 at https://aws.amazon.com/ec2/spot/pricing/ and the r6a.8xlarge was actually cheaper than the m5.8xlarge.
Thanks @tzneal for all the information. So far, everything works well.
Implements cluster consolidation via:
- removing nodes if their workloads can run on other nodes
- replacing nodes with cheaper instances

Fixes #1091
@tzneal In which version will this feature be released?
The on-demand-only constraint is also a disappointment for us, as we have clusters running solely spot instances.
@kahirokunn you want to replace spot nodes with cheaper ones that could actually be reclaimed sooner than the nodes your workloads are currently on?
@FernandoMiguel I'm not sure how complicated this is to implement, but there's an AWS page here showing "Frequency of interruption" per instance type, and with some types (e.g. […]
Not sure what workloads you run, but I tend to prefer my hosts not to change often, hence why deeper pools are preferred.
I'm using Karpenter for non-sensitive workloads with a lot of peaks during the day that sometimes create huge servers, which is good for me compared to having many small ones. But after an hour, when the peak is over, the cluster usually needs to be consolidated into smaller instances, because it is left with huge servers that are mostly idle.
@dekelev I don't run Karpenter, but I have a similar problem with "plain" EKS and (managed) node groups. I've solved it like this: https://github.com/matti/k8s-nodeflow/blob/main/k8s-nodeflow.sh. This runs in the cluster and ensures that all machines are constantly drained within N seconds. It requires proper configuration of PodDisruptionBudgets, and https://github.com/kubernetes-sigs/descheduler is also recommended to "consolidate" low-utilization nodes. I think this, or something similar, could work with Karpenter. Btw, my email is in my GitHub profile if you want to have a call or something about this; my plan is to develop these things further at some point, and having valid use cases would be helpful for me.
+1, I'd love a flag to enable consolidation for Spot instances too. Workloads vary wildly, and I agree that by default this flag should be […]
@FernandoMiguel Spot instances naturally go with workloads that tolerate interruptions; I guess no one will use spot when the workload is allergic to them. However, I could picture a case where Karpenter tries to schedule a smaller node that is unavailable, falls back to a bigger instance, and thus runs into a loop. Not sure if Karpenter does a preflight check in such a case?
The original problem that forced us not to replace spot nodes with smaller spot nodes is that the straightforward approach eliminates the utility of the capacity-optimized-prioritized strategy we use for launching spot nodes, which considers pool depth and leads to launching nodes that may be slightly bigger than necessary but are less likely to be reclaimed. If consolidation did replace spot nodes, we would often launch a spot node, get one that's slightly too big due to pool depth and interruption rate, and then immediately replace it with a cheaper instance type with a higher interruption rate. This turns our capacity-optimized-prioritized strategy into a Karpenter-operated lowest-price strategy, which isn't recommended for use with spot. I'll be locking this issue as it was for the original consolidation feature and it's difficult to track new feature requests on closed issues. Feel free to open a new one specifically for spot replacement for that discussion.
Tell us about your request
As a cluster admin, I want Karpenter to consolidate the application workloads by moving pods onto fewer worker nodes and scaling down the cluster, so that I can improve the cluster's resource utilization rate.
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
In an under-utilized cluster, application pods are spread across worker nodes with an excess amount of resources. This wasteful situation can be improved by carefully packing the pods onto a smaller number of right-sized worker nodes. The current version of Karpenter does not support rearranging pods to continuously improve cluster utilization. The workload consolidation feature is the important missing component needed to complete the cluster scaling lifecycle management loop.
This workload consolidation feature is nontrivial because of the following coupled problems.
The pod packing problem determines which pods should be hosted together on the same worker node according to their taints and constraints. The goal is to produce fewer, well-balanced groups of pods that can be hosted by worker nodes of just the right size.
Given the pod packing solution, the instance type selection problem determines which combination of instance types should be used to host the pods after the rearrangement.
The above problems are deeply coupled, so that the solution to one affects the other. Together, they form a variant of the bin packing problem, which is NP-complete. A practical solution will implement a fast heuristic that exploits the special structure of the problem for specific use cases and user preferences. Therefore, thorough discussions with customers are important.
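As a rough illustration of the difficulty (and not a description of Karpenter's actual algorithm), the coupled packing and instance-selection problem can be written as a variable-sized bin packing program, here simplified to two resource dimensions and with scheduling constraints only sketched; all symbols below are notation introduced for this sketch:

```latex
\begin{aligned}
\min_{x,\,y}\quad & \sum_{j \in N} c_j\, y_j
  && \text{total cost of nodes kept or launched} \\
\text{s.t.}\quad & \sum_{j \in N} x_{ij} = 1 \quad \forall i \in P
  && \text{every pod is placed exactly once} \\
& \sum_{i \in P} r_i^{\mathrm{cpu}}\, x_{ij} \le C_j^{\mathrm{cpu}}\, y_j \quad \forall j \in N
  && \text{CPU capacity} \\
& \sum_{i \in P} r_i^{\mathrm{mem}}\, x_{ij} \le C_j^{\mathrm{mem}}\, y_j \quad \forall j \in N
  && \text{memory capacity} \\
& x_{ij} = 0 \ \text{whenever pod } i \text{ cannot schedule on node } j
  && \text{taints, (anti-)affinity, topology} \\
& x_{ij},\, y_j \in \{0,1\}
\end{aligned}
```

Here P is the set of pods, N the set of candidate nodes (one per allowed instance type and copy), r_i a pod's resource requests, C_j a node's allocatable capacity, and c_j its price. Even this stripped-down form contains classic bin packing, which is NP-complete, so a practical implementation has to rely on the heuristics described above.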
Are you currently working around this issue?
Currently Karpenter will scale down empty nodes automatically. However, it does not actively move pods around to create empty nodes.
Additional context
Currently, the workload consolidation feature is in the design phase. We should gather input from customers about their objectives and preferences.
Attachments
If you think you might have additional information that you'd like to include via an attachment, please do - we'll take a look. (Remember to remove any personally-identifiable information.)
Community Note