
Parameterize Multinode + Single node consolidation timeout #1733

Open
Pokom opened this issue Oct 2, 2024 · 2 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@Pokom

Pokom commented Oct 2, 2024

Description

Note

This is similar to #903 but distinct in that it happens in our largest clusters regardless of scale-up/scale-down activity. #1031 was opened but closed because it needed an RFC, and I would like to work on putting that together.

What problem are you trying to solve?

Provide the ability to configure the values for the multi-node and single-node consolidation timeouts.

In large clusters with somewhat complex nodepool setups and anti-affinity rules, we're consistently running into timeouts during the consolidation process. The impact is that our clusters are overprovisioned because nodes aren't being taken offline.

We're also profiling Karpenter to identify where the bottleneck is in the code; as a stopgap, this would be a nice feature to have.

It would be really useful for us to be able to configure these values at runtime and set them to values that allow the consolidation process to finish. It may not be as fast as with the default values, but finishing slower is preferable to the process timing out and never completing.
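To illustrate the mechanism (this is not Karpenter's actual implementation), here is a minimal Go sketch of a consolidation pass bounded by a fixed timeout; all names and durations are hypothetical. When the cluster is large enough, every pass hits the deadline before a decision is found, so nothing is ever consolidated:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// consolidate is a stand-in for a multi-node consolidation pass: it keeps
// evaluating candidate node sets until it finds a decision or the context
// expires. All names here are hypothetical, for illustration only.
func consolidate(ctx context.Context, candidates []string) (string, error) {
	for _, c := range candidates {
		select {
		case <-ctx.Done():
			// In a large cluster the pass can hit this branch on every
			// attempt, so a consolidation decision is never returned.
			return "", fmt.Errorf("consolidation timed out: %w", ctx.Err())
		default:
			// Simulate the expensive scheduling simulation per candidate.
			time.Sleep(50 * time.Millisecond)
			_ = c
		}
	}
	return candidates[0], nil
}

func main() {
	// A fixed, non-configurable timeout: too short for a large cluster.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	decision, err := consolidate(ctx, []string{"node-a", "node-b", "node-c", "node-d"})
	fmt.Println(decision, err)
}
```

Making the deadline configurable would let operators trade a slower pass for one that actually completes.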

How important is this feature to you?

This is very important, as Karpenter is causing a fairly large uptick in spend for our large clusters because consolidation can't complete fast enough.

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@Pokom Pokom added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 2, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Oct 2, 2024
@njtran
Contributor

njtran commented Oct 14, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 14, 2024
@Pokom
Author

Pokom commented Oct 15, 2024

@njtran are you open to an outside contribution for this issue? I should have some time this week to get a PR ready.

Pokom added a commit to grafana/karpenter that referenced this issue Oct 15, 2024
Adds two options for the following timeouts
- multinodeconsolidation
- singlenodeconsolidation

These are exposed in the following ways:
- `--multi-node-consolidation-timeout` or
  `MULTI_NODE_CONSOLIDATION_TIMEOUT`
- `--single-node-consolidation-timeout` or
  `SINGLE_NODE_CONSOLIDATION_TIMEOUT`

The primary way of testing this was by building the image and running it
within dev and production clusters in Grafana Labs' fleet.

---

- refs kubernetes-sigs#1733

Signed-off-by: pokom <[email protected]>
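A rough sketch of how the two options described in the commit message might be wired up with Go's standard flag package, falling back to the environment variables. Only the flag and environment-variable names come from the commit; the option struct, defaults, and precedence are assumptions for illustration, not the actual PR:

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"time"
)

// Options holds the two proposed consolidation timeouts. Field names and
// default values here are assumptions, not Karpenter's real options struct.
type Options struct {
	MultiNodeConsolidationTimeout  time.Duration
	SingleNodeConsolidationTimeout time.Duration
}

// envDuration returns the duration parsed from an environment variable,
// or the given default if the variable is unset or invalid.
func envDuration(key string, def time.Duration) time.Duration {
	if v, ok := os.LookupEnv(key); ok {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return def
}

func main() {
	opts := Options{}
	fs := flag.NewFlagSet("karpenter", flag.ExitOnError)

	// CLI flags take precedence; the env vars provide the defaults.
	fs.DurationVar(&opts.MultiNodeConsolidationTimeout, "multi-node-consolidation-timeout",
		envDuration("MULTI_NODE_CONSOLIDATION_TIMEOUT", 1*time.Minute),
		"Maximum time to spend on a multi-node consolidation attempt")
	fs.DurationVar(&opts.SingleNodeConsolidationTimeout, "single-node-consolidation-timeout",
		envDuration("SINGLE_NODE_CONSOLIDATION_TIMEOUT", 3*time.Minute),
		"Maximum time to spend on a single-node consolidation attempt")

	_ = fs.Parse(os.Args[1:])
	fmt.Printf("multi-node: %s, single-node: %s\n",
		opts.MultiNodeConsolidationTimeout, opts.SingleNodeConsolidationTimeout)
}
```

The parsed durations would then replace the hard-coded deadlines passed to the consolidation controllers.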