Parameterize Multinode + Single node consolidation timeout #1733
Labels
kind/feature
Categorizes issue or PR as related to a new feature.
triage/accepted
Indicates an issue or PR is ready to be actively worked on.
Description
Note
This is similar to #903 but distinct in that this happens to our largest clusters regardless of scale up/scale down activity. #1031 was opened but closed due to needing an RFC, and I would like to work on putting that together.
What problem are you trying to solve?
Provide the ability to configure the values for multinode and single node consolidation timeouts.
In large clusters with somewhat complex nodepool setups and anti-affinity rules, we're consistently running into timeouts in the consolidation process. The impact is that our clusters are overprovisioned because nodes aren't being taken offline.
We're also profiling Karpenter to identify where the bottleneck is in the code. As a stopgap, this would be a nice feature to have.
It would be really helpful for us to be able to configure these values at runtime and set them to values that would allow the consolidation process to finish. It may not be as fast as with the default values, but having it finish more slowly is preferable to having the process time out and never complete.
How important is this feature to you?
This is very important, as Karpenter is causing a fairly large uptick in spend for large clusters because consolidation can't complete before the timeout.