
cluster autoscaler failed to scale up when AWS couldn't start a new instance #1996

Closed
piontec opened this issue May 7, 2019 · 20 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@piontec
Contributor

piontec commented May 7, 2019

We're running a recent cluster-autoscaler, built from the master branch, with AWS EKS and k8s 1.12.7, on a cluster based on AWS spot instances.

Recently we had a situation where CAS failed to scale up the cluster. During scale-up, 2 ASGs were selected to increase capacity, but AWS had no spot capacity in them, so no new instances were actually started (they were stuck in a state like requested instances = 5, running instances = 0). We're using the --max-node-provision-time=10m option, but new nodes in other ASGs (which had capacity) were not started within this time. CAS was stuck in this state for over an hour.

We found the following in CAS logs:

...
I0501 12:50:26.228744 1 scale_up.go:264] Pod blah-r-ebbf1d5c18a2f81565-7489bfcc6c-7zxp2 is unschedulable
...
I0501 12:50:26.228818 1 scale_up.go:303] Upcoming 3 nodes
...
Pod cg-service-r-1885076a5de3ca3fee-55f44f94f4-tcvcr can't be scheduled on central-eks-ondemand-workers-asg-r4.4xlarge, predicate failed: PodFitsResources predicate mismatch, reason: Insufficient ephemeral-storage
...
I0501 12:50:26.228846 1 scale_up.go:341] Skipping node group central-eks-canary-asg-r5.12xlarge - max size reached
...
I0501 12:50:26.265802 1 scale_up.go:411] No need for any nodes in central-eks-workers-asg-m4.16xlarge
I0501 12:50:26.359310 1 scale_up.go:411] No need for any nodes in central-eks-workers-asg-m5.12xlarge
...
W0501 12:50:26.359343 1 scale_up.go:325] Node group central-eks-workers-asg-m5a.12xlarge is not ready for scaleup - unhealthy

As you can see, we had some groups that couldn't be used (the unhealthy ones were those where AWS had no spot capacity), but some of them were perfectly fine (the ones with "No need for any nodes...").

This situation is hard to reproduce, but can someone please help me review the code?

If an instance is not coming up for more than MaxNodeProvisionTime, it is added to the LongUnregistered counter here.
Then, when calculating upcoming nodes, LongUnregistered is subtracted, as can be seen in clusterstate.go, which is used in scale_up.go.
In our case newNodes := ar.CurrentTarget - (readiness.Ready + readiness.Unready + readiness.LongNotStarted + readiness.LongUnregistered) should correctly evaluate to 0, which means no new nodes are coming.
Yet, our logs show "3 upcoming nodes".

Is it possible that we had CurrentTarget = X, Ready = Unready = LongNotStarted = LongUnregistered = 0, which set the value to 0, but later it was updated to CurrentTarget = X, Ready = Unready = 0, LongNotStarted = LongUnregistered = X, so the result was negative and a check prevented the counter from updating?
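For reference, here is a minimal sketch of the upcoming-nodes calculation being discussed (a paraphrase of the logic in clusterstate.go, not the exact source; the types and function names below are simplified stand-ins). The point of interest is that a non-positive result is simply skipped:

```go
package main

import "fmt"

// Readiness and AcceptableRange mirror the fields referenced in the quoted
// calculation; they are simplified stand-ins for the structs in clusterstate.go.
type Readiness struct {
	Ready, Unready, LongNotStarted, LongUnregistered int
}

type AcceptableRange struct {
	CurrentTarget int
}

// upcomingNodes paraphrases the calculation quoted above: target size minus
// everything already accounted for, with non-positive results skipped.
func upcomingNodes(ar AcceptableRange, readiness Readiness) (int, bool) {
	newNodes := ar.CurrentTarget - (readiness.Ready + readiness.Unready +
		readiness.LongNotStarted + readiness.LongUnregistered)
	if newNodes <= 0 {
		// Nothing (or a negative number) to report for this group.
		return 0, false
	}
	return newNodes, true
}

func main() {
	// The failure mode from this issue: the ASG wants 3 instances, AWS never
	// created them, and they are not counted as LongUnregistered either,
	// so they keep showing up as "upcoming" indefinitely.
	n, _ := upcomingNodes(AcceptableRange{CurrentTarget: 3}, Readiness{})
	fmt.Println("upcoming:", n) // upcoming: 3
}
```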

@MaciekPytel
Contributor

First of all - most CA flags were added at an early development stage, either for CA developer testing or as a safety mechanism for disabling new features. Basically they exist ONLY for historical reasons. No one tests or maintains them, so changing them may break things in all sorts of unexpected ways. Changing --max-node-provision-time is probably particularly unsafe - there is a const MaxNodeStartupTime in clusterstate and I would not be surprised at all if something relies on those values being the same.

In principle a node should never be counted as both NotStarted (meaning the node object exists in Kubernetes, but the status is not ready) and Unregistered (meaning the node has not registered at all). Logically those are mutually exclusive states. The calculation is different, but in both cases there are 15-minute timeouts involved - maybe something relies on both timeouts being the same? It shouldn't matter, but AFAIK no one has tested this in a very long time.

Finally, the primary way CA deals with LongUnregistered nodes is by trying to delete them. The logs you pasted don't cover that, but it should happen at the start of every loop, and it would be worth checking in the logs why they weren't deleted for an hour.

@piontec
Contributor Author

piontec commented May 8, 2019

Thanks for the update, we'll switch back to a 15min max-node-provision-time just in case.
About the logs: the problem is we got nothing like that in the logs. Everything looks pretty normal; the only important thing I missed above is:

I0501 12:50:26.935803 1 scale_up.go:419] No expansion options

which is only possible if the Estimator computed that no new nodes are needed...

Also, we're starting CAS with the following set of flags - are some others unsafe to use too?

        - ./cluster-autoscaler
        - --v=6
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=priority
        - --max-node-provision-time=10m
        - --scan-interval=20s
        - --scale-down-utilization-threshold=0.8
        - --scale-down-delay-after-add=30m
        - --scale-down-delay-after-delete=30m
        - --scale-down-delay-after-failure=30m
        - --scale-down-unneeded-time=15m
        - --node-group-auto-discovery=asg:tag=kubernetes.io/cluster/central-eks

@MaciekPytel
Contributor

MaciekPytel commented May 8, 2019

The other flags should be fine - those are commonly used options (I guess your expander is not widely used yet :) ).

I think I figured it out though, and it's not the flags. The handling of unregistered nodes assumes some representation of non-registered instances is returned by NodeGroup.Nodes(). If it's not, the nodes will never come up as LongUnregistered, because that's tracked individually for each identifier returned by NodeGroup.Nodes().

You can check how many nodes in each state (LongUnregistered, LongNotStarted, etc.) you have in each NodeGroup by looking at the CA status configmap (kubectl get configmap cluster-autoscaler-status -o yaml -n kube-system). I bet you'll find out that in your case you have no unregistered nodes.

I can't test this theory as I have no access to AWS, but I'm pretty sure that's it. The way to fix it would be to change the AWS cloudprovider so that NodeGroup.Nodes() returns some representation for non-existing spot instances (note - those identifiers must later allow deleting the non-existing instances, or just resizing the ASG back down with NodePool.DeleteInstances()).

cc: @Jeffwan
@mvisonneau - I think the above is the proper way to fix the issue you're trying to address with #1980.

edit: to clarify - you don't need to implement InstanceStatus for the non-existing spots. The timeout-based error handling will kick in even if it's always nil. You just need to return a cloudprovider.Instance with some unique Id for each non-existing instance (the Id must be consistent between the loops). I suspect something like "-not-created-1" through "-not-created-N" could do the trick.
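A minimal sketch of that idea (hedged: the real AWS provider types and method signatures differ; the Instance type and padWithPlaceholders helper below are illustrative stand-ins) - pad the list returned by Nodes() with stable placeholder IDs for instances the ASG has requested but AWS never created:

```go
package main

import "fmt"

// Instance mirrors the shape of cloudprovider.Instance for the purpose of
// this sketch; the real struct lives in the cluster-autoscaler cloudprovider
// package and also carries an optional *InstanceStatus.
type Instance struct {
	Id string
}

// padWithPlaceholders returns the running instances plus one synthetic entry
// per instance that the ASG has requested but AWS never created. The IDs are
// derived from the ASG name and an index, so they stay stable across
// autoscaler loops and can eventually be counted as LongUnregistered.
func padWithPlaceholders(asgName string, running []Instance, desired int) []Instance {
	out := append([]Instance{}, running...)
	for i := len(running); i < desired; i++ {
		out = append(out, Instance{
			Id: fmt.Sprintf("%s-not-created-%d", asgName, i+1),
		})
	}
	return out
}

func main() {
	// The ASG wants 5 instances, but the spot request produced none.
	nodes := padWithPlaceholders("central-eks-spot-asg", nil, 5)
	for _, n := range nodes {
		fmt.Println(n.Id) // central-eks-spot-asg-not-created-1 ... -5
	}
}
```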

@piontec
Contributor Author

piontec commented May 8, 2019

Thanks! Your assumption that LongNotStarted/LongUnregistered is always 0 fits my logs - in that case, the number of upcoming nodes will always be equal to the 'Wanted' size of the ASG, even if nothing is coming up.
The problem is that this case is hard to reproduce; I would have to mock the AWS API... anyway, I'll try to play with this.
Thanks!

@mvisonneau

@piontec whilst I was looking into that, I managed to reproduce the behaviour by configuring launch templates with $0.01 capped spot requests.

@piontec
Contributor Author

piontec commented May 8, 2019

Aaah, that hint is gold, thanks!

@piontec
Contributor Author

piontec commented May 9, 2019

@MaciekPytel I need one more hint. I was able to reproduce the bug, and it seems you're right. In general, artificial node IDs for spots can solve this, but there's one problem: when CAS is trying to delete such a "non-existing" node (I call them placeholders), I need to remove the placeholder from the asg struct, but I also have to mark the asg as "unhealthy", so CAS won't use it anymore on the next iteration. What's the best way to make sure this ASG won't be used right now, but still might be used in the future (meaning later, but not for the next few iterations)?

edit: It seems func (csr *ClusterStateRegistry) RegisterFailedScaleUp is there for exactly that reason, right?

@MaciekPytel
Contributor

It should do exactly what you want out of the box - except the NodeGroup won't be listed as 'unhealthy'; it will be 'backed off'.

More specifically - CA should automatically back off from scaling up a given NodeGroup after a failed scale-up. The node group won't be considered 'unhealthy', but it will go on exponential backoff for scale-up. You can find this in the logs by looking for "Disabling scale-up for node group" messages, or you can see it live by checking the status configmap - backed-off groups will have their ScaleUp status listed as Backoff (or something similar, I don't remember exactly).
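For illustration only (this is not the ClusterStateRegistry API; it is a hedged sketch of the per-group exponential backoff concept described above, and every name and duration below is made up), the mechanism could look roughly like this:

```go
package main

import (
	"fmt"
	"time"
)

// backoffEntry tracks when a node group may be tried again and how long the
// next penalty should be. This is a conceptual stand-in, not CA's real
// backoff implementation.
type backoffEntry struct {
	until    time.Time
	duration time.Duration
}

type scaleUpBackoff struct {
	entries      map[string]backoffEntry
	initial, max time.Duration
}

func newScaleUpBackoff() *scaleUpBackoff {
	return &scaleUpBackoff{
		entries: map[string]backoffEntry{},
		initial: 5 * time.Minute,
		max:     30 * time.Minute,
	}
}

// RegisterFailure doubles the backoff for the group (capped at max) and
// records the time until which scale-up should be skipped.
func (b *scaleUpBackoff) RegisterFailure(group string, now time.Time) {
	d := b.initial
	if prev, ok := b.entries[group]; ok {
		d = prev.duration * 2
		if d > b.max {
			d = b.max
		}
	}
	b.entries[group] = backoffEntry{until: now.Add(d), duration: d}
}

// IsBackedOff reports whether the group should be skipped for scale-up now.
func (b *scaleUpBackoff) IsBackedOff(group string, now time.Time) bool {
	e, ok := b.entries[group]
	return ok && now.Before(e.until)
}

func main() {
	bo := newScaleUpBackoff()
	now := time.Now()
	bo.RegisterFailure("central-eks-spot-asg", now)
	fmt.Println(bo.IsBackedOff("central-eks-spot-asg", now))                      // true
	fmt.Println(bo.IsBackedOff("central-eks-spot-asg", now.Add(10*time.Minute))) // false
}
```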

@piontec
Contributor Author

piontec commented May 10, 2019

OK, the patch is ready; I manually tested it against a "low bid ASG" and it worked fine. Please review, @MaciekPytel

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 8, 2019
@ghost

ghost commented Aug 13, 2019

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 13, 2019
@jaypipes
Contributor

Note that #2008 was reworked as #2235, and the latter PR has been merged. Is it safe to close this issue now, @piontec and @MaciekPytel?

@MaciekPytel
Contributor

I would suggest cherry-picking the fix first, but otherwise I think it's ok to close.

@drewhemm
Contributor

Hi,

Do we know if #2235 also handles on-demand instance launch failures due to lack of capacity?

Launching a new EC2 instance. Status Reason: We currently do not have sufficient c5n.18xlarge capacity in the Availability Zone you requested (us-west-2b). Our system will be working on provisioning additional capacity. You can currently get c5n.18xlarge capacity by not specifying an Availability Zone in your request or choosing us-west-2a, us-west-2c. Launching EC2 instance failed.

I am running v1.15.3. I will try to see if there is anything useful in the logs.

@piontec
Contributor Author

piontec commented Dec 18, 2019

My guess is it should, but I don't have time to mock the whole setup and test it. For spot instances the request was never fulfilled, and that's exactly the case covered here - that's why I think you should be good.

BTW, @MaciekPytel should we close it now?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 17, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 16, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tayritenour

This is still an issue, even in the latest version of the cluster autoscaler, v1.21.0
