Node pool scale up timeout #1133

Closed
aermakov-zalando opened this issue Aug 9, 2018 · 30 comments
Labels
area/cluster-autoscaler kind/bug Categorizes issue or PR as related to a bug.

Comments

@aermakov-zalando
Contributor

The autoscaler has a timeout for non-ready nodes which forces it to kill those nodes and potentially select a different node pool in the next iteration. However, in the situation where the node pool cannot scale up at all it'll happily wait forever, keeping pods in Pending state without trying to compensate.

For example, setting up multiple AWS Spot node pools with different instance types, or a Spot pool plus an On-Demand pool, doesn't really work. We'd expect CA to scale up one of the ASGs, detect a few minutes later that there are still no nodes coming up (because the corresponding Spot pool doesn't have capacity) and fall back to another pool. What actually happens is that CA scales up the node pool by increasing its desired capacity and then does nothing at all other than printing Upcoming 1 nodes/Failed to find readiness information for ....

@MaciekPytel
Contributor

CA should time out the scale-up (IIRC after 15 minutes) and put the node group that failed to scale up in a 'backoff' state. At that point it should try to scale up again, ignoring this node group. You can see this by looking for one of the following:

  • Scale-up timed out for node group <name> after <time> in logs
  • ScaleUpTimedOut event on cluster-autoscaler-status configmap (in kube-system ns)
  • Node group showing scale-up status as "Backoff" in status configmap
  • failed_scale_ups_total metric increasing

If you can reproduce it and don't see any of the above, it's a bug. In that case, can you provide some more details (especially the CA version you're using)?

@aermakov-zalando
Contributor Author

We're running version 1.2.2, but I can try with 1.3.1.

@MaciekPytel
Contributor

It shouldn't make any difference; this was added earlier than 1.2 (I don't remember exactly, but probably in the 1.1 timeframe?).

So you're saying it's stuck on Upcoming 1 node for more than 15 minutes? Can you provide a log of initial scale-up, a loop immediately after and another loop after 15+ minutes?

@MaciekPytel
Contributor

As @aleksandra-malinowska pointed out to me, the timeout is effectively reset if there is another scale-up on the same node group (i.e. CA only notices it if the last of multiple overlapping scale-ups times out). So that may be another thing to look for in the logs.

@aermakov-zalando
Contributor Author

aermakov-zalando commented Aug 9, 2018

The logs are in this gist. Please ignore all ASGs and pods in 1b; the nodes there were constantly being created and destroyed by Spot termination.

It looks like the scale-up timeout is working just fine and the pool is marked as unhealthy. However, in the next loop iteration CA doesn't consider another group that would've fit the same pod (scale_up.go:178] No need for any nodes in nodepool-default-worker-m4-splitaz-aws-ACCT-eu-central-1-kube-aws-test-aermakov64-AutoScalingGroup1a-LJECHLYTYAKY in 03-next-iteration.txt). It only tries to scale it up after I manually disable the unhealthy node group by setting its max size to 0.

@aermakov-zalando
Contributor Author

aermakov-zalando commented Aug 9, 2018

I think the problem is that building upcomingNodes in

upcomingNodes := make([]*schedulercache.NodeInfo, 0)

ignores whether the node group is healthy or not, so the loop in

for _, nodeGroup := range nodeGroups {

thinks that the pod could be scheduled on the perma-upcoming node and doesn't consider other groups, but I could also be completely wrong.

@aermakov-zalando
Contributor Author

It seems that an easy fix would be to completely ignore upcoming nodes from unhealthy groups in ScaleUp(). It doesn't look like the list is used for anything other than scheduling estimation, so it shouldn't affect anything else. WDYT?
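
A minimal sketch of that idea, with simplified stand-in types (nodeInfo, clusterState and buildUpcomingNodes below are illustrative names, not the real cluster-autoscaler internals): when building the list of upcoming nodes used for scheduling estimation, skip any group the cluster state considers unhealthy or backed off, so nodes that will never arrive stop masking pending pods.

package main

import "fmt"

// nodeInfo stands in for schedulercache.NodeInfo.
type nodeInfo struct{ group string }

// clusterState stands in for the cluster state registry.
type clusterState struct {
	upcoming  map[string]int  // node group -> number of upcoming nodes
	unhealthy map[string]bool // node groups currently backed off / unhealthy
}

func (cs *clusterState) GetUpcomingNodes() map[string]int  { return cs.upcoming }
func (cs *clusterState) IsNodeGroupHealthy(id string) bool { return !cs.unhealthy[id] }

// buildUpcomingNodes mirrors the loop in ScaleUp(), but drops upcoming nodes
// that belong to unhealthy groups.
func buildUpcomingNodes(cs *clusterState, templates map[string]*nodeInfo) []*nodeInfo {
	upcomingNodes := make([]*nodeInfo, 0)
	for group, count := range cs.GetUpcomingNodes() {
		if !cs.IsNodeGroupHealthy(group) {
			// The group timed out scaling up; its "upcoming" nodes will never
			// arrive, so ignore them when estimating where pods could fit.
			continue
		}
		template := templates[group]
		for i := 0; i < count; i++ {
			upcomingNodes = append(upcomingNodes, template)
		}
	}
	return upcomingNodes
}

func main() {
	cs := &clusterState{
		upcoming:  map[string]int{"spot-pool": 1, "ondemand-pool": 0},
		unhealthy: map[string]bool{"spot-pool": true},
	}
	templates := map[string]*nodeInfo{
		"spot-pool":     {group: "spot-pool"},
		"ondemand-pool": {group: "ondemand-pool"},
	}
	fmt.Println(len(buildUpcomingNodes(cs, templates))) // prints 0: the stuck spot node is ignored
}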

@MaciekPytel
Contributor

MaciekPytel commented Aug 9, 2018

CA handles this by resizing the node group back to its original size after a timed-out scale-up. This is done in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L180. However, going through the code it looks like this may not work for a timed-out scale-from-0. It would be signified by lines in the log looking like Readiness for node group <group> not found, which I do see in your log.

I'll try to reproduce later to confirm this theory (not enough time this week, sorry), but I'm fairly confident that's what's happening. If I'm right it's a bug in clusterstate. I have an idea how to fix it, but clusterstate is not the easiest thing to reason about and I need to have some time to dig into it to make sure I'm not breaking anything.
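
To illustrate the suspected gap, here is a purely hypothetical sketch (groupState and fixGroupSize are made-up names, not the real clusterstate code): the resize-back step needs per-group readiness information to know what size to return to, and for a group scaled up from 0 that information may simply not exist, so the correction never fires.

package main

import "fmt"

// groupState is a made-up summary of what the cluster state tracks per node group.
type groupState struct {
	targetSize      int
	registeredNodes int
	readinessKnown  bool // false when "Readiness for node group <group> not found"
	scaleUpTimedOut bool
}

// fixGroupSize returns the size the group should be shrunk back to after a
// timed-out scale-up, or -1 if no correction is applied.
func fixGroupSize(g groupState) int {
	if !g.scaleUpTimedOut {
		return -1
	}
	if !g.readinessKnown {
		// No readiness information to compare against, so the resize-back is
		// skipped and the group keeps its inflated target size -- the
		// scale-from-0 case discussed in this issue.
		return -1
	}
	return g.registeredNodes
}

func main() {
	fromZero := groupState{targetSize: 1, registeredNodes: 0, readinessKnown: false, scaleUpTimedOut: true}
	fmt.Println(fixGroupSize(fromZero)) // prints -1: nothing gets corrected
}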

@aleksandra-malinowska aleksandra-malinowska added area/cluster-autoscaler kind/bug Categorizes issue or PR as related to a bug. labels Aug 13, 2018
@jrake-revelant

@MaciekPytel did you have a chance to look into it yet?

@szuecs
Member

szuecs commented Sep 27, 2018

@MaciekPytel could you reproduce this?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 26, 2018
@MaciekPytel
Contributor

/remove-lifecycle stale
/sigh I keep meaning to go back and fix this..

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 27, 2018
@MaciekPytel
Contributor

cc: @losipiuk

@phspagiari

👍 Using 1.2.4 (K8s 1.10.12) we had problems using spot and on-demand node pools. CA tries (ad infinitum) to request a spot instance (even with the price-too-low state) and never falls back to the next node pool (on-demand).

@losipiuk
Contributor

With recent code changes, cloudprovider.NodeGroup.Nodes() returns Instance objects. The intention of this method is to return an Instance object not only for running nodes but also for those which are being created or deleted. Additionally, for nodes which are being created (with Status.State==InstanceCreating), ErrorInfo can be provided. The core logic now reacts to errors specified via ErrorInfo and will retract from the scale-up of a given node group (and back off the node group) immediately, instead of waiting 15 minutes for the timeout.

If you want to implement it for AWS, you may take a look at the implementation for GCE, which is already in.

Note: for the CA logic to react to the error, the ErrorClass must be set to OutOfResourcesErrorClass. It will probably be extended to OtherErrorClass too.
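
For illustration, a minimal sketch (not the actual AWS implementation) of what a provider's NodeGroup.Nodes() could return for a spot request that cannot be fulfilled, using the Instance/InstanceStatus/InstanceErrorInfo types mentioned above; the placeholder id, error code and message below are invented for the example.

package example

import "k8s.io/autoscaler/cluster-autoscaler/cloudprovider"

// nodesWithFailedSpotRequest returns one node that is actually running plus a
// placeholder for an instance that never materialised because the spot request
// could not be fulfilled. Surfacing the failure via ErrorInfo lets the core
// logic back off the node group immediately instead of waiting the full
// max-node-provision-time.
func nodesWithFailedSpotRequest() []cloudprovider.Instance {
	return []cloudprovider.Instance{
		{
			Id: "aws:///eu-central-1a/i-0123456789abcdef0",
			Status: &cloudprovider.InstanceStatus{
				State: cloudprovider.InstanceRunning,
			},
		},
		{
			// Invented placeholder id for the unfulfilled spot request.
			Id: "aws:///eu-central-1a/placeholder-spot-pool-0",
			Status: &cloudprovider.InstanceStatus{
				State: cloudprovider.InstanceCreating,
				ErrorInfo: &cloudprovider.InstanceErrorInfo{
					ErrorClass:   cloudprovider.OutOfResourcesErrorClass,
					ErrorCode:    "placeholder-cannot-be-fulfilled", // illustrative code only
					ErrorMessage: "spot request could not be fulfilled (price-too-low)",
				},
			},
		},
	}
}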

@viggeh
Contributor

viggeh commented Mar 8, 2019

Circling back to @aermakov-zalando's point: does it not make sense to skip upcoming nodes from unhealthy groups? We should be able to assume that the node is no longer upcoming and not increment upcomingNodes. That would make cluster-autoscaler fall back on a different ASG for the pod that cannot be scheduled.

@davidquarles

@viggeh Are you still planning on tackling this in a follow-up PR? We're hitting this and could take a stab at the recommended implementation, if that's helpful.

@viggeh
Contributor

viggeh commented Apr 2, 2019

@davidquarles Sorry for the late reply. I've not had the time to take a proper look at this and don't expect to do so in the next couple of weeks. If you can take the problem on that would be awesome!

@davidquarles

@MaciekPytel @mwielgus @viggeh I'm no AWS expert, but AFAICT instances aren't added to the ASG when spot requests are made. With an open, unfulfillable spot request triggered by an ASG, I see that the ASG's desired number of instances has been incremented, and the spot request shows up as an associated scaling activity, but there are no new instances under the ASG. Is the aforementioned strategy for solving this still valid? If so, would I just use AwsNodeGroup.TemplateNodeInfo to create fake nodes (with Status.State==InstanceCreating and corresponding ErrorInfo)?

@mvisonneau

I'll add that this is also valid for any failure reason other than spot requests not being fulfilled, e.g. a corrupted LC or LT config (AMI not found, missing subnet or SG, etc.).

@max-rocket-internet

CA handles this by resizing the node group back to original size after timed-out scale-up.

I don't see this behaviour. In my test the ASG is just left at the new value even though the new instance didn't get provisioned.

@max-rocket-internet

Just to add my test results to this issue, if it helps any...

I have 3 ASGs (2 spot, 1 normal). I have unschedulable pods, and CA triggers a scale-up:

I0417 13:42:04.765399       1 scale_up.go:427] Best option to resize: eu01-stg-spot-2
I0417 13:42:04.765417       1 scale_up.go:431] Estimated 1 nodes needed in eu01-stg-spot-2
I0417 13:42:04.765439       1 scale_up.go:533] Final scale-up plan: [{eu01-stg-spot-2 2->3 (max: 20)}]

But the spot request is not fulfilled, so the instance is not created. Then max-node-provision-time passes:

W0417 14:03:32.422046       1 clusterstate.go:198] Scale-up timed out for node group eu01-stg-spot-1 after 15m8.316405684s
W0417 14:03:32.422109       1 clusterstate.go:221] Disabling scale-up for node group eu01-stg-spot-1 until 2019-04-17 14:08:32.247733959 +0000 UTC m=+4217.943746338
W0417 14:03:32.532256       1 scale_up.go:329] Node group eu01-stg-spot-1 is not ready for scaleup - backoff

Now at this point I would expect CA to immediately choose one of the other two available ASGs, but it does not:

I0417 14:09:23.451114       1 scale_up.go:412] No need for any nodes in eu01-stg
I0417 14:09:23.451516       1 scale_up.go:412] No need for any nodes in eu01-stg-spot-1
I0417 14:09:23.452883       1 scale_up.go:412] No need for any nodes in eu01-stg-spot-2

And pods are left unschedulable.

@choseh

choseh commented Apr 30, 2019

We're running into the same issues here. I know it's only been like 13 days, but did you happen to find a workaround @max-rocket-internet?

@max-rocket-internet

@choseh nope 😐

@baxor

baxor commented May 1, 2019

Please address this issue.

@MaxWinterstein

MaxWinterstein commented Jun 17, 2019

EDIT: This was not a fault on the Kubernetes side, I simply ran into an IP address quota on the Google side...


Stumbled across this too. I tried to use a scale-to-zero pool alongside my existing cluster for new GitLab CI/CD workers. Running on GKE with 1.13.6-gke.0.

Sadly there is no scaling above 1 node, even though 5 are allowed. The main cluster is currently running on 4 nodes.

Tried with and without preemptible nodes (I guess the AWS term for this is Spot).

pod description:

Events:
  Type     Reason             Age                From                Message
  ----     ------             ----               ----                -------
  Warning  FailedScheduling   16s (x2 over 16s)  default-scheduler   0/5 nodes are available: 4 node(s) didn't match node selector, 5 Insufficient cpu.
  Normal   NotTriggerScaleUp  4s (x2 over 15s)   cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 Insufficient cpu, 1 in backoff after failed scale-up

@jaypipes
Contributor

AFAICT, #2235 properly handles the bug described here, and that PR has been merged. Can we close this bug out now?

@aermakov-zalando
Contributor Author

@jaypipes The fix in #2235 is AWS specific, so other cloud providers would still be affected by this. Or is the idea here that every cloud provider will have to implement the same logic to work around an issue in the clusterstate?

@jaypipes
Contributor

> @jaypipes The fix in #2235 is AWS specific, so other cloud providers would still be affected by this. Or is the idea here that every cloud provider will have to implement the same logic to work around an issue in the clusterstate?

While the fix in #2235 is indeed for the AWS cloud provider, the solution is the one recommended by @MaciekPytel and @losipiuk for all cloud providers that suffer from the "I don't have an actual Instance (yet)" problem with their autoscaling API. See here for @MaciekPytel's recommendation:

#2008 (comment)

Note that he mentions that the managed instance groups API in GCE does the whole placeholder thing behind the scenes, which is essentially what was implemented for the AWS cloud provider in #2235.

@aermakov-zalando
Contributor Author

I see. We only really care about AWS anyway, so I'll just close the issue. Thanks for the update!
