
Parameterize Multinode + Single node consolidation timeout #1733

Open
Pokom opened this issue Oct 2, 2024 · 2 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@Pokom

Pokom commented Oct 2, 2024

Description

Note

This is similar to #903 but distinct in that it happens in our largest clusters regardless of scale-up/scale-down activity. #1031 was opened but closed because it needed an RFC, and I would like to work on putting that together.

What problem are you trying to solve?

Provide the ability to configure the values for the multi-node and single-node consolidation timeouts.

In large clusters with somewhat complex nodepool setups and anti-affinity rules, we're consistently running into timeouts during the consolidation process. The impact is that our clusters are overprovisioned because nodes aren't being taken offline.

We're also profiling Karpenter to identify where the bottleneck is in the code; as a stopgap, this would be a nice feature to have.

It would be really useful for us to be able to configure these values at runtime and set them to values that allow the consolidation process to finish. It may not be as fast as with the default values, but finishing slower is preferable to the process timing out and never completing.
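To illustrate the mechanism (this is not Karpenter's actual implementation), here is a minimal Go sketch of a consolidation pass bounded by a fixed timeout; all names and durations are hypothetical. When the cluster is large enough, every pass hits the deadline before a decision is found, so nothing is ever consolidated:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// consolidate is a stand-in for a multi-node consolidation pass: it keeps
// evaluating candidate node sets until it finds a decision or the context
// expires. All names here are hypothetical, for illustration only.
func consolidate(ctx context.Context, candidates []string) (string, error) {
	for _, c := range candidates {
		select {
		case <-ctx.Done():
			// In a large cluster the pass can hit this branch on every
			// attempt, so a consolidation decision is never returned.
			return "", fmt.Errorf("consolidation timed out: %w", ctx.Err())
		default:
			// Simulate the expensive scheduling simulation per candidate.
			time.Sleep(50 * time.Millisecond)
			_ = c
		}
	}
	return candidates[0], nil
}

func main() {
	// A fixed, non-configurable timeout: too short for a large cluster.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	decision, err := consolidate(ctx, []string{"node-a", "node-b", "node-c", "node-d"})
	fmt.Println(decision, err)
}
```

Making the deadline configurable would let operators trade a slower pass for one that actually completes.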

How important is this feature to you?

This is very important, as Karpenter is causing a fairly large uptick in spend for our large clusters because consolidation can't complete fast enough.

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@Pokom Pokom added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 2, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Oct 2, 2024
@njtran
Contributor

njtran commented Oct 14, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 14, 2024
@Pokom
Author

Pokom commented Oct 15, 2024

@njtran are you open to an outside contribution for this issue? I should have some time this week to get a PR ready.

Pokom added a commit to grafana/karpenter that referenced this issue Oct 15, 2024
Adds two options for the following timeouts
- multinodeconsolidation
- singlenodeconsolidation

These are exposed in the following ways:
- `--multi-node-consolidation-timeout` or
  `MULTI_NODE_CONSOLIDATION_TIMEOUT`
- `--single-node-consolidation-timeout` or
  `SINGLE_NODE_CONSOLIDATION_TIMEOUT`

The primary way of testing this was by building the image and running it
within dev and production clusters in Grafana Labs' fleet.

---

- refs kubernetes-sigs#1733

Signed-off-by: pokom <[email protected]>
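A rough sketch of how the two options described in the commit message might be wired up with Go's standard flag package, falling back to the environment variables. Only the flag and environment-variable names come from the commit; the option struct, defaults, and precedence are assumptions for illustration, not the actual PR:

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"time"
)

// Options holds the two proposed consolidation timeouts. Field names and
// default values here are assumptions, not Karpenter's real options struct.
type Options struct {
	MultiNodeConsolidationTimeout  time.Duration
	SingleNodeConsolidationTimeout time.Duration
}

// envDuration returns the duration parsed from an environment variable,
// or the given default if the variable is unset or invalid.
func envDuration(key string, def time.Duration) time.Duration {
	if v, ok := os.LookupEnv(key); ok {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return def
}

func main() {
	opts := Options{}
	fs := flag.NewFlagSet("karpenter", flag.ExitOnError)

	// CLI flags take precedence; the env vars provide the defaults.
	fs.DurationVar(&opts.MultiNodeConsolidationTimeout, "multi-node-consolidation-timeout",
		envDuration("MULTI_NODE_CONSOLIDATION_TIMEOUT", 1*time.Minute),
		"Maximum time to spend on a multi-node consolidation attempt")
	fs.DurationVar(&opts.SingleNodeConsolidationTimeout, "single-node-consolidation-timeout",
		envDuration("SINGLE_NODE_CONSOLIDATION_TIMEOUT", 3*time.Minute),
		"Maximum time to spend on a single-node consolidation attempt")

	_ = fs.Parse(os.Args[1:])
	fmt.Printf("multi-node: %s, single-node: %s\n",
		opts.MultiNodeConsolidationTimeout, opts.SingleNodeConsolidationTimeout)
}
```

The parsed durations would then replace the hard-coded deadlines passed to the consolidation controllers.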