
cluster autoscaler failed to scale up when AWS couldn't start a new instance #1996

Closed
piontec opened this issue May 7, 2019 · 20 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@piontec
Contributor

piontec commented May 7, 2019

We're running a recent cluster-autoscaler, built from the master branch, with AWS EKS and k8s 1.12.7, on a cluster based on AWS spot instances.

Recently we had a situation where CAS failed to scale up the cluster. During scale-up, 2 ASGs were selected to increase capacity, but AWS had no spot capacity in them, so no new instances were actually started (they were stuck in a state like requested instances = 5, running instances = 0). We're using the --max-node-provision-time=10m option, but new nodes in other ASGs (which had capacity) were not started within this time. CAS was stuck in this state for over an hour.

We found the following in CAS logs:

...
I0501 12:50:26.228744 1 scale_up.go:264] Pod blah-r-ebbf1d5c18a2f81565-7489bfcc6c-7zxp2 is unschedulable
...
I0501 12:50:26.228818 1 scale_up.go:303] Upcoming 3 nodes
...
Pod cg-service-r-1885076a5de3ca3fee-55f44f94f4-tcvcr can't be scheduled on central-eks-ondemand-workers-asg-r4.4xlarge, predicate failed: PodFitsResources predicate mismatch, reason: Insufficient ephemeral-storage
...
I0501 12:50:26.228846 1 scale_up.go:341] Skipping node group central-eks-canary-asg-r5.12xlarge - max size reached
...
I0501 12:50:26.265802 1 scale_up.go:411] No need for any nodes in central-eks-workers-asg-m4.16xlarge
I0501 12:50:26.359310 1 scale_up.go:411] No need for any nodes in central-eks-workers-asg-m5.12xlarge
...
W0501 12:50:26.359343 1 scale_up.go:325] Node group central-eks-workers-asg-m5a.12xlarge is not ready for scaleup - unhealthy

As you can see, we had some groups that couldn't be used (the unhealthy ones were those where AWS had no spot capacity), but some of them were perfectly fine (the ones with "No need for any nodes...").

This situation is hard to reproduce, but can someone please help me review the code?

If an instance is not coming up for more than MaxNodeProvisionTime, it is added to the LongUnregistered counter here.
Then, when calculating upcoming nodes, LongUnregistered is subtracted, as can be seen in clusterstate.go, which is used in scale_up.go.
In our case newNodes := ar.CurrentTarget - (readiness.Ready + readiness.Unready + readiness.LongNotStarted + readiness.LongUnregistered) should correctly evaluate to 0, which means no new nodes are coming.
Yet, our logs show "3 upcoming nodes".

Is it possible that we had CurrentTarget = X, Ready = Unready = LongNotStarted = LongUnregistered = 0, which set the value to 0, but later it was updated to CurrentTarget = X, Ready = Unready = 0, LongNotStarted = LongUnregistered = X, so the result was negative and a check prevented the counter from updating?
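For reference, here is a minimal sketch of the upcoming-nodes calculation being discussed (a paraphrase of the logic in clusterstate.go, not the exact source; the types and function names below are simplified stand-ins). The point of interest is that a non-positive result is simply skipped:

```go
package main

import "fmt"

// Readiness and AcceptableRange mirror the fields referenced in the quoted
// calculation; they are simplified stand-ins for the structs in clusterstate.go.
type Readiness struct {
	Ready, Unready, LongNotStarted, LongUnregistered int
}

type AcceptableRange struct {
	CurrentTarget int
}

// upcomingNodes paraphrases the calculation quoted above: target size minus
// everything already accounted for, with non-positive results skipped.
func upcomingNodes(ar AcceptableRange, readiness Readiness) (int, bool) {
	newNodes := ar.CurrentTarget - (readiness.Ready + readiness.Unready +
		readiness.LongNotStarted + readiness.LongUnregistered)
	if newNodes <= 0 {
		// Nothing (or a negative number) to report for this group.
		return 0, false
	}
	return newNodes, true
}

func main() {
	// The failure mode from this issue: the ASG wants 3 instances, AWS never
	// created them, and they are not counted as LongUnregistered either,
	// so they keep showing up as "upcoming" indefinitely.
	n, _ := upcomingNodes(AcceptableRange{CurrentTarget: 3}, Readiness{})
	fmt.Println("upcoming:", n) // upcoming: 3
}
```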

@MaciekPytel
Contributor

First of all - most CA flags were added at an early development stage, either for CA developer testing or as a safety mechanism for disabling new features. Basically they exist ONLY for historical reasons. No one tests or maintains them, so changing them may break things in all sorts of unexpected ways. Changing --max-node-provision-time is probably particularly unsafe - there is a const MaxNodeStartupTime in clusterstate and I would not be surprised at all if something relies on those values being the same.

In principle a node should never be counted as both NotStarted (meaning the node object exists in Kubernetes, but the status is not ready) and Unregistered (meaning the node has not registered at all). Logically those are mutually exclusive states. The calculation is different, but in both cases there are 15-minute timeouts involved - maybe something relies on both timeouts being the same? It shouldn't matter, but AFAIK no one has tested this in a very long time.

Finally, the primary way CA deals with LongUnregistered nodes is by trying to delete them. The logs you pasted don't cover that, but it should happen at the start of every loop, and it would be worth checking in the logs why they weren't deleted for an hour.

@piontec
Contributor Author

piontec commented May 8, 2019

Thanks for the update, we'll switch back to a 15min max-node-provision-time just in case.
About the logs: the problem is we got nothing like that in the logs. Everything looks pretty normal; the only important thing I missed above is:

I0501 12:50:26.935803 1 scale_up.go:419] No expansion options

which is only possible if the Estimator computed that no new nodes are needed...

Also, we're starting CAS with the following set of flags - are some others unsafe to use too?

        - ./cluster-autoscaler
        - --v=6
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=priority
        - --max-node-provision-time=10m
        - --scan-interval=20s
        - --scale-down-utilization-threshold=0.8
        - --scale-down-delay-after-add=30m
        - --scale-down-delay-after-delete=30m
        - --scale-down-delay-after-failure=30m
        - --scale-down-unneeded-time=15m
        - --node-group-auto-discovery=asg:tag=kubernetes.io/cluster/central-eks

@MaciekPytel
Contributor

MaciekPytel commented May 8, 2019

The other flags should be fine - those are commonly used options (I guess your expander is not widely used yet :) ).

I think I figured it out though, and it's not the flags. The handling of unregistered nodes assumes some representation of non-registered instances is returned by NodeGroup.Nodes(). If it's not, the nodes will never come up as LongUnregistered, because that's tracked individually for each identifier returned by NodeGroup.Nodes().

You can check how many nodes in each state (LongUnregistered, LongNotStarted, etc.) you have in each NodeGroup by looking at the CA status configmap (kubectl get configmap cluster-autoscaler-status -o yaml -n kube-system). I bet you'll find out that in your case you have no unregistered nodes.

I can't test this theory as I have no access to AWS, but I'm pretty sure that's it. The way to fix it would be to change the AWS cloudprovider so that NodeGroup.Nodes() returns some representation for non-existing spot instances (note - those identifiers must later allow deleting the non-existing instances, or just resizing the ASG back down with NodePool.DeleteInstances()).

cc: @Jeffwan
@mvisonneau - I think the above is the proper way to fix the issue you're trying to address with #1980.

edit: to clarify - you don't need to implement InstanceStatus for the non-existing spots. The timeout-based error handling will kick in even if it's always nil. You just need to return a cloudprovider.Instance with some unique Id for each non-existing instance (the Id must be consistent between the loops). I suspect something like "-not-created-1" through "-not-created-N" could do the trick.
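A minimal sketch of that idea (hedged: the real AWS provider types and method signatures differ; the Instance type and padWithPlaceholders helper below are illustrative stand-ins) - pad the list returned by Nodes() with stable placeholder IDs for instances the ASG has requested but AWS never created:

```go
package main

import "fmt"

// Instance mirrors the shape of cloudprovider.Instance for the purpose of
// this sketch; the real struct lives in the cluster-autoscaler cloudprovider
// package and also carries an optional *InstanceStatus.
type Instance struct {
	Id string
}

// padWithPlaceholders returns the running instances plus one synthetic entry
// per instance that the ASG has requested but AWS never created. The IDs are
// derived from the ASG name and an index, so they stay stable across
// autoscaler loops and can eventually be counted as LongUnregistered.
func padWithPlaceholders(asgName string, running []Instance, desired int) []Instance {
	out := append([]Instance{}, running...)
	for i := len(running); i < desired; i++ {
		out = append(out, Instance{
			Id: fmt.Sprintf("%s-not-created-%d", asgName, i+1),
		})
	}
	return out
}

func main() {
	// The ASG wants 5 instances, but the spot request produced none.
	nodes := padWithPlaceholders("central-eks-spot-asg", nil, 5)
	for _, n := range nodes {
		fmt.Println(n.Id) // central-eks-spot-asg-not-created-1 ... -5
	}
}
```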

@piontec
Contributor Author

piontec commented May 8, 2019

Thanks! Your assumption that LongNotStarted/LongUnregistered is always 0 fits my logs - in that case, the number of upcoming nodes will always be equal to the 'Wanted' size of the ASG, even if nothing is coming up.
The problem is that this case is hard to reproduce; I would have to mock the AWS API... anyway, I'll try to play with this.
Thanks!

@mvisonneau

@piontec whilst I was looking into that, I managed to reproduce the behaviour by configuring launch templates with $0.01 capped spot requests.

@piontec
Contributor Author

piontec commented May 8, 2019

Aaah, that hint is gold, thanks!

@piontec
Contributor Author

piontec commented May 9, 2019

@MaciekPytel I need one more hint. I was able to reproduce the bug, and it seems you're right. In general, artificial node IDs for spots can solve this, but there's one problem: when CAS is trying to delete such a "non-existing" node (I call them placeholders), I need to remove the placeholder from the asg struct, but I also have to mark the asg as "unhealthy", so CAS won't use it anymore on the next iteration. What's the best way to make sure this ASG won't be used right now, but still might be used in the future (meaning later, but not for the next few iterations)?

edit: It seems func (csr *ClusterStateRegistry) RegisterFailedScaleUp is there for exactly that reason, right?

@MaciekPytel
Contributor

It should do exactly what you want out of the box - except the NodeGroup won't be listed as 'unhealthy'; it will be 'backed off'.

More specifically - CA should automatically back off from scaling up a given NodeGroup after a failed scale-up. The node group won't be considered 'unhealthy', but it will go on exponential backoff for scale-up. You can find this in the logs by looking for "Disabling scale-up for node group" messages, or you can see it live by checking the status configmap - backed-off groups will have their ScaleUp status listed as Backoff (or something similar, I don't remember exactly).
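For illustration only (this is not the ClusterStateRegistry API; it is a hedged sketch of the per-group exponential backoff concept described above, and every name and duration below is made up), the mechanism could look roughly like this:

```go
package main

import (
	"fmt"
	"time"
)

// backoffEntry tracks when a node group may be tried again and how long the
// next penalty should be. This is a conceptual stand-in, not CA's real
// backoff implementation.
type backoffEntry struct {
	until    time.Time
	duration time.Duration
}

type scaleUpBackoff struct {
	entries      map[string]backoffEntry
	initial, max time.Duration
}

func newScaleUpBackoff() *scaleUpBackoff {
	return &scaleUpBackoff{
		entries: map[string]backoffEntry{},
		initial: 5 * time.Minute,
		max:     30 * time.Minute,
	}
}

// RegisterFailure doubles the backoff for the group (capped at max) and
// records the time until which scale-up should be skipped.
func (b *scaleUpBackoff) RegisterFailure(group string, now time.Time) {
	d := b.initial
	if prev, ok := b.entries[group]; ok {
		d = prev.duration * 2
		if d > b.max {
			d = b.max
		}
	}
	b.entries[group] = backoffEntry{until: now.Add(d), duration: d}
}

// IsBackedOff reports whether the group should be skipped for scale-up now.
func (b *scaleUpBackoff) IsBackedOff(group string, now time.Time) bool {
	e, ok := b.entries[group]
	return ok && now.Before(e.until)
}

func main() {
	bo := newScaleUpBackoff()
	now := time.Now()
	bo.RegisterFailure("central-eks-spot-asg", now)
	fmt.Println(bo.IsBackedOff("central-eks-spot-asg", now))                      // true
	fmt.Println(bo.IsBackedOff("central-eks-spot-asg", now.Add(10*time.Minute))) // false
}
```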

@piontec
Contributor Author

piontec commented May 10, 2019

OK, the patch is ready; I manually tested it against a "low bid ASG" and it worked fine. Please review, @MaciekPytel

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 8, 2019
@ghost

ghost commented Aug 13, 2019

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 13, 2019
@jaypipes
Contributor

Note that #2008 was reworked as #2235, and the latter PR has been merged. Is it safe to close this issue now, @piontec and @MaciekPytel?

@MaciekPytel
Contributor

I would suggest cherry-picking the fix first, but otherwise I think it's ok to close.

@drewhemm
Contributor

Hi,

Do we know if #2235 also handles on-demand instance launch failures due to lack of capacity?

Launching a new EC2 instance. Status Reason: We currently do not have sufficient c5n.18xlarge capacity in the Availability Zone you requested (us-west-2b). Our system will be working on provisioning additional capacity. You can currently get c5n.18xlarge capacity by not specifying an Availability Zone in your request or choosing us-west-2a, us-west-2c. Launching EC2 instance failed.

I am running v1.15.3. I will try to see if there is anything useful in the logs.

@piontec
Contributor Author

piontec commented Dec 18, 2019

My guess is it should, but I don't have time to mock the whole setup and test it. For spot instances the request was never fulfilled, and that's exactly the case covered here - that's why I think you should be good.

BTW, @MaciekPytel should we close it now?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 17, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 16, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tayritenour

This is still an issue, even in the latest version of the cluster autoscaler, v1.21.0
