fix: Consider unready nodes as in flight #2224
Merged
Fixes #2164
Description
During large scale-ups, nodes sometimes flip-flop between ready and not ready as things come online. This change considers unready nodes as in flight to avoid overscaling.
In the case of hardware, networking, or other failure, a node can transition from ready to unready. Previously, Kubernetes would evict the pods on the node after 5 minutes (the default), triggering additional scale-out. Instead, Karpenter will treat the not-ready taints on such a node as ephemeral and consider the node in flight. Intervention (automated or otherwise) is required to remove these nodes.
This can also be viewed as a safety feature. If some issue caused nodes to consistently transition Ready -> NotReady, Karpenter would previously keep scaling out up to the provisioner limits. With this change, it instead waits for the nodes to recover or for an operator to intervene. Additional automation (e.g. kubernetes-sigs/karpenter#750) can address this case.
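To illustrate the idea only (this is not the code in this PR), here is a minimal Go sketch of counting unready nodes as in flight when sizing a scale-out. The `Node` type, `isInFlight`, `additionalNodes`, and the fixed `podsPerNode` capacity are all hypothetical simplifications. In particular, the sketch assumes every in-flight node can absorb a full node's worth of pending pods.

```go
package main

import "fmt"

// Node is a hypothetical, simplified stand-in for a cluster node.
type Node struct {
	Name  string
	Ready bool
}

// isInFlight reports whether a node should be counted as capacity that is
// still coming online. In this sketch, any node that is not ready qualifies.
func isInFlight(n Node) bool {
	return !n.Ready
}

// additionalNodes returns how many new nodes to launch for pendingPods.
// Assumptions: podsPerNode > 0, and each in-flight node will eventually
// fit podsPerNode of the pending pods.
func additionalNodes(pendingPods, podsPerNode int, nodes []Node) int {
	inFlight := 0
	for _, n := range nodes {
		if isInFlight(n) {
			inFlight++
		}
	}
	// Pods expected to land on in-flight nodes do not need new capacity.
	remaining := pendingPods - inFlight*podsPerNode
	if remaining <= 0 {
		return 0
	}
	// Round up to whole nodes.
	return (remaining + podsPerNode - 1) / podsPerNode
}

func main() {
	nodes := []Node{
		{Name: "a", Ready: true},
		{Name: "b", Ready: false}, // flapping or still starting up
	}
	fmt.Println(additionalNodes(15, 10, nodes))
}
```

Without the in-flight accounting, the same 15 pending pods would call for two new nodes; counting node `b` as in flight reduces that to one. The trade-off is that a node that never recovers keeps holding back scale-out until it is removed.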
How was this change tested?
TEST_FILTER=TestUtilization make e2etests
Does this change impact docs?
WIP
Release Note
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.