Imbalance with balance similar node groups #2892

Closed
JulienBalestra opened this issue Mar 4, 2020 · 5 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@JulienBalestra
Contributor

JulienBalestra commented Mar 4, 2020

In the current AWS implementation, for example, comparing node templates against real-world nodes produces imbalanced scale-ups when the --balance-similar-node-groups option is used.

For example, with 3 ASGs (1 per zone) and one ASG set to 0, the autoscaler always scales up the ASGs that have real-world nodes instead of balancing toward the under-provisioned ASG (the one at 0).

I suppose this is due to how we build the node template and compare it to the real-world ones.
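
To make that concrete, here is a minimal sketch (made-up capacity numbers, not autoscaler code) of why a nodeInfo built from an ASG template rarely matches one built from a live node: the template reports raw instance capacity, while a real node's allocatable has kubelet and system reservations subtracted.

```go
package main

import "fmt"

// Hypothetical, simplified nodeInfo: the real autoscaler compares many more
// attributes (labels, allocatable resources within small tolerances, etc.).
type nodeInfo struct {
	allocatableMilliCPU int64
	allocatableMemMiB   int64
}

// similar stands in for the node group comparator; an exact-match check is
// enough to show the failure mode.
func similar(a, b nodeInfo) bool {
	return a.allocatableMilliCPU == b.allocatableMilliCPU &&
		a.allocatableMemMiB == b.allocatableMemMiB
}

func main() {
	template := nodeInfo{4000, 16384} // synthetic, from the ASG's instance type
	realNode := nodeInfo{3920, 15246} // after kubelet/system reservations
	// Prints false: the empty node group is judged dissimilar, so balancing skips it.
	fmt.Println(similar(template, realNode))
}
```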

To avoid this issue, we've been running a set of experimental changes.

These changes have been running smoothly for a few months and have given us the expected results.
I'd like to open a conversation about why we initially chose a mixed comparison of real and templated nodes.

This is also a potential issue when users change the instance type of a running ASG (#2840).

Let me know if you need any details I might have forgotten.

@MaciekPytel
Contributor

At a glance, this basically always uses TemplateNodeInfo, right? This was discussed at length in #1021.

tl;dr TemplateNodeInfo is always wrong. Even in a controlled and uniform env like GKE it's impossible to get it 100% right (and, believe me, I tried). If you ever have a pod that barely fits or barely doesn't fit on a node (within TemplateNodeInfo's margin of error) and you use a patch that always relies on TemplateNodeInfo, the effects are devastating (including infinite scale-up and/or pods remaining pending forever).

I think a possibly better solution to the original problem is to add logic where, if you have multiple ~identical NodeGroups, you use an existing node in one of the groups as a template for all of them (overriding only the zone label).
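
A minimal sketch of that suggestion, assuming standard Kubernetes node types (the function name is made up, not autoscaler code):

```go
package sketch

import (
	apiv1 "k8s.io/api/core/v1"
)

// templateFromSibling copies a real node from a ~identical node group and
// overrides only the zone labels, so the copy can stand in as the template
// for an empty group in another zone.
func templateFromSibling(sibling *apiv1.Node, targetZone string) *apiv1.Node {
	tpl := sibling.DeepCopy()
	if tpl.Labels == nil {
		tpl.Labels = map[string]string{}
	}
	tpl.Labels["topology.kubernetes.io/zone"] = targetZone
	// Older clusters may still carry the deprecated beta zone label.
	tpl.Labels["failure-domain.beta.kubernetes.io/zone"] = targetZone
	return tpl
}
```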

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 4, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 4, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

bpineau added a commit to DataDog/autoscaler that referenced this issue Oct 13, 2020
Use real-world nodeInfo from similar groups when BalanceSimilarNodeGroups.

Where available, the CA evaluates node groups' capacities and labels using
real-world nodes, for optimal accuracy (accounting for kubelet- and
system-reserved resources, for instance).

It falls back to synthetic nodeInfos inferred from ASG/MIG/VMSS templates
when no real nodes are available yet (i.e. upscaling from a zero/empty node
group).

That asymmetric evaluation can prevent `BalanceSimilarNodeGroups` from
working reliably when upscaling from zero:
* The first upscaled nodeGroup will get a node (its capacity initially evaluated from the ASG/MIG/VMSS template, then from the first real node and its runtime labels)
* Other empty nodeGroups will likely be considered dissimilar to the first one: we're comparing ASG templates with real-world nodeInfos
* With the `least-waste` expander, for instance, the CA might favor the nodegroup evaluated from a real-world node (whose accounted reserved resources make it score as the best usage) over the others

This change set implements [Maciek Pytel's suggestion (from a previous attempt at fixing this issue)](kubernetes#2892 (comment)):
we try to use real-world node examples from ~similar node groups before a
last-resort fallback to synthetic nodeInfos built from ASG templates.

We compare nodeInfos using the `FindSimilarNodeGroups` processor primitives
in order to leverage per-cloud-provider comparators (e.g. label exclusions).

We look for similar nodegroups using synthetic nodeInfos from ASG/MIG
templates (where available): comparing real-world nodeInfos to a synthetic
nodeInfo from the empty nodegroup wouldn't work (that's the root cause of
this issue). When ~similar nodegroups are found that way, we use their
nodeInfos built from real-world nodes as models for the empty nodegroups
(but keep the original location-related labels).

Tested with AWS ASGs, GCP MIGs and Azure VMSS.
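
A minimal sketch of that flow, with stand-in types and helper logic (not the actual patch or the autoscaler's API):

```go
package sketch

// nodeInfo and nodeGroup are simplified stand-ins for the autoscaler's types.
type nodeInfo struct {
	labels      map[string]string
	allocatable map[string]int64
}

type nodeGroup struct {
	id       string
	template nodeInfo  // synthetic, inferred from the ASG/MIG/VMSS
	realNode *nodeInfo // nil while the group is empty
}

// similarTemplates stands in for the FindSimilarNodeGroups comparators.
func similarTemplates(a, b nodeInfo) bool {
	for k, v := range a.allocatable {
		if b.allocatable[k] != v {
			return false
		}
	}
	return true
}

// nodeInfoForEmptyGroup compares template to template (so real-node
// reservations can't skew the similarity check), then borrows a real node's
// info from a ~similar group, keeping the empty group's own location labels.
// It falls back to the synthetic template as a last resort.
func nodeInfoForEmptyGroup(empty nodeGroup, all []nodeGroup) nodeInfo {
	for _, other := range all {
		if other.id == empty.id || other.realNode == nil {
			continue
		}
		if !similarTemplates(empty.template, other.template) {
			continue
		}
		borrowed := nodeInfo{
			labels:      map[string]string{},
			allocatable: other.realNode.allocatable,
		}
		for k, v := range other.realNode.labels {
			borrowed.labels[k] = v
		}
		// Keep location-related labels from the empty group's own template.
		for _, k := range []string{
			"topology.kubernetes.io/zone",
			"topology.kubernetes.io/region",
		} {
			if v, ok := empty.template.labels[k]; ok {
				borrowed.labels[k] = v
			}
		}
		return borrowed
	}
	return empty.template
}
```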
bpineau added a commit to DataDog/autoscaler that referenced this issue Oct 13, 2020
This is a second attempt at kubernetes#1021 (it might also provide an
alternate solution for kubernetes#2892 and kubernetes#3608).

Per the kubernetes#1021 discussion, a flag might be acceptable if it defaults
to false and documents the limitations (not supported by all cloud providers)
and the risks of using it (loss of accuracy, upscaling unusable nodes, or
leaving pods pending).

Some use cases include:
* Balancing similar node groups when upscaling from zero (but I believe kubernetes#3608 is a better approach)
* Edited/updated ASG/MIG taints and labels
* Updated instance types

Per previous discussion, the latter two cases could be covered in the long
term by custom node-shape discovery Processors (open to discussing that
option too).
bpineau added a commit to DataDog/autoscaler that referenced this issue Dec 14, 2020
Comparing synthetic NodeInfos obtained from a nodegroup's TemplateInfo()
to NodeInfos obtained from real-world nodes is bound to fail, even with
kube reservations provided through nodegroup labels/annotations (for
instance, kernel memory reservation is hard to predict).

This makes `balance-similar-node-groups` likely to misbehave when
`scale-up-from-zero` is enabled (and a first nodegroup gets a real node),
for instance.

Following [Maciek Pytel's suggestion](kubernetes#3608 (comment)) (from
discussions on a previous attempt at solving this), we can implement a
NodeInfo processor that improves template-generated NodeInfos whenever a
node was created from a similar nodegroup.

We're storing the node's virtual origin through its machineID, which works
fine but is a bit ugly (suggestions welcome).

Tested that this solves balance-similar-node-groups + scale-up-from-zero
with various instance types on AWS and GCP.

Previous attempts to solve that issue/discussions:
* kubernetes#2892 (comment)
* kubernetes#3608 (comment)
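
A minimal sketch of the "virtual origin through machineID" detail mentioned above; the marker value and helper names are made up, not the actual patch:

```go
package sketch

import apiv1 "k8s.io/api/core/v1"

// copiedNodeMachineID is a hypothetical marker value; the actual patch may
// use a different scheme.
const copiedNodeMachineID = "copied-from-similar-nodegroup"

// markCopied tags a node built from a similar group's real node so later
// iterations can tell it apart from an observed node.
func markCopied(n *apiv1.Node) {
	n.Status.NodeInfo.MachineID = copiedNodeMachineID
}

func isCopied(n *apiv1.Node) bool {
	return n.Status.NodeInfo.MachineID == copiedNodeMachineID
}
```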
bpineau added a commit to DataDog/autoscaler that referenced this issue Apr 8, 2021
This is a third attempt at kubernetes#1021 (it might also provide an alternate solution for kubernetes#2892, kubernetes#3608, and kubernetes#3609).
Some use cases include: balancing similar node groups when upscaling from zero, edited/updated ASG/MIG taints and labels, and updated instance types.

Per the kubernetes#1021 discussion, a flag might be acceptable if it defaults to false and documents the limitations (not supported by all cloud providers) and the risks of using it (loss of accuracy, upscaling unusable nodes, or leaving pods pending).

Per the kubernetes#3609 discussion, using a NodeInfo processor is preferred.

The fate of `GetNodeInfosForGroups()` (and the opportunity to split NodeInfoProcessor into NodeInfoProcessor and NodeInfoDecoratorProcessor) is left open for discussion, as I'd like to collect feedback on refactoring plans here.
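
For illustration, an opt-in flag of that shape could look like the sketch below; the flag name and wording are hypothetical, not the autoscaler's actual CLI.

```go
package main

import "flag"

// Hypothetical opt-in flag: defaults to false and spells out the risks, per
// the kubernetes#1021 discussion.
var useTemplateNodeInfo = flag.Bool(
	"scale-up-from-template", false,
	"If true, build nodeInfos from node group templates instead of real-world nodes. "+
		"Not supported by all cloud providers; may lose accuracy, scale up unusable "+
		"nodes, or leave pods pending.")

func main() {
	flag.Parse()
	_ = *useTemplateNodeInfo // consumed by the (omitted) autoscaling logic
}
```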