cluster autoscaler failed to scale up when AWS couldn't start a new instance #1996
First of all - most CA flags were added in an early development stage for CA developer testing or as a safety mechanism for disabling new features. Basically they exist ONLY for historical reasons. No one tests or maintains them, so changing them may break things in all sorts of unexpected ways. In principle a node should never be counted as both NotStarted (meaning the node object exists in Kubernetes, but the status is not ready) and Unregistered (meaning the node has not registered at all); logically those are mutually exclusive states. The calculation is different, but in both cases there are 15 minute timeouts involved - maybe something relies on both timeouts being the same? It shouldn't matter, but AFAIK no one has tested this in a very long time. Finally, the primary way CA deals with LongUnregistered nodes is trying to delete them. The logs you pasted don't cover that, but it should happen at the start of every loop, and it would be worth checking in the logs why they weren't deleted for an hour.
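To make the distinction above concrete, here is a minimal Go sketch (not the actual clusterstate.go code; the helper and its names are hypothetical) of how the two states are mutually exclusive: whether an instance counts as "unregistered" or "not started" depends on whether a Kubernetes Node object exists for its provider ID, and both states only become "Long*" after a timeout (15 minutes by default).

```go
package main

import (
	"fmt"
	"time"
)

type nodeState string

const (
	stateReady        nodeState = "Ready"
	stateNotStarted   nodeState = "NotStarted"   // Node object exists, but is not Ready
	stateUnregistered nodeState = "Unregistered" // instance known to the cloud provider, no Node object
)

// classify is a hypothetical helper: registered maps provider instance ID to
// whether the corresponding Node object is Ready.
func classify(instanceID string, registered map[string]bool, seenSince time.Time, timeout time.Duration) (nodeState, bool) {
	ready, ok := registered[instanceID]
	pastTimeout := time.Since(seenSince) > timeout // both states rely on a timeout before becoming Long*
	switch {
	case !ok:
		return stateUnregistered, pastTimeout // would become LongUnregistered
	case !ready:
		return stateNotStarted, pastTimeout // would become LongNotStarted
	default:
		return stateReady, false
	}
}

func main() {
	registered := map[string]bool{"i-ready": true, "i-booting": false}
	for _, id := range []string{"i-ready", "i-booting", "i-never-created"} {
		state, long := classify(id, registered, time.Now().Add(-20*time.Minute), 15*time.Minute)
		fmt.Printf("%s: %s (past timeout: %v)\n", id, state, long)
	}
}
```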
Thanks for the update, we'll switch back to 15 min.
which is only possible if the Estimator computed that no new nodes are needed... Also, we're starting CAS with the following set of flags - are any of the others unsafe to use too?
The other flags should be fine - those are commonly used options (I guess your expander is not widely used yet :) ). I think I figured it out though, and it's not the flags. The handling of unregistered nodes assumes some representation of non-registered instances is returned by NodeGroup.Nodes(). If it's not, the nodes will never come up as LongUnregistered, because that's tracked individually for each identifier returned by NodeGroup.Nodes(). You can check how many nodes in each state (LongUnregistered, LongNotStarted, etc.) you have in each NodeGroup by looking at the CA status configmap. I can't test this theory as I have no access to AWS, but I'm pretty sure that's it. The way to fix it would be to change the AWS cloudprovider so NodeGroup.Nodes() returns some representation for non-existing spot instances (note - those identifiers must later allow deleting the non-existing instances, or just resizing the ASG back down with NodePool.DeleteInstances()). cc: @Jeffwan
edit: to clarify - you don't need to implement InstanceStatus for the non-existing spots. The timeout-based error handling will kick in even if it's always nil. You just need to return a cloudprovider.Instance with some unique Id for each non-existing instance (the Id must be consistent between the loops). I suspect something like "-not-created-1" through "-not-created-N" could do the trick.
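A minimal sketch of the placeholder idea described above, assuming a simplified asg struct (the field names here are assumptions, not the actual aws_manager.go code): Nodes() returns real instances first, then stable synthetic IDs for the capacity AWS never delivered, so clusterstate can eventually mark them LongUnregistered. The Instance type mirrors the shape of cloudprovider.Instance (ID plus optional status, left nil here).

```go
package main

import "fmt"

// Instance mirrors the shape of cloudprovider.Instance: an ID and an optional
// status. The status can stay nil; timeout-based handling still kicks in.
type Instance struct {
	Id string
}

type asg struct {
	Name               string
	DesiredCapacity    int
	RunningInstanceIDs []string // provider IDs of instances AWS actually created
}

// Nodes returns one Instance per desired node: real instances first, then
// placeholder IDs for requested-but-never-created (e.g. unfulfilled spot) capacity.
func (a *asg) Nodes() []Instance {
	instances := make([]Instance, 0, a.DesiredCapacity)
	for _, id := range a.RunningInstanceIDs {
		instances = append(instances, Instance{Id: id})
	}
	for i := len(a.RunningInstanceIDs); i < a.DesiredCapacity; i++ {
		// The ID only has to be unique and consistent between loops so the
		// clusterstate code can track it and time it out as LongUnregistered.
		instances = append(instances, Instance{Id: fmt.Sprintf("%s-not-created-%d", a.Name, i+1)})
	}
	return instances
}

func main() {
	a := &asg{Name: "spot-asg", DesiredCapacity: 5, RunningInstanceIDs: []string{"aws:///us-east-1a/i-0abc"}}
	for _, inst := range a.Nodes() {
		fmt.Println(inst.Id)
	}
}
```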
Thanks! Your assumption, that
@piontec whilst I was looking into that, I managed to reproduce the behaviour by configuring launch templates with $0.01 capped spot requests.
Aaah, that hint is gold, thanks!
@MaciekPytel I need one more hint. I was able to reproduce the bug and it seems you're right. In general, artificial node IDs for spots can solve this, but there's one problem: when CAS is trying to delete such a "non-existing" node (I call them placeholders), I need to remove the placeholder from the asg struct, but I also have to mark the ASG as "unhealthy", so CAS won't use it anymore on the next iteration. What's the best way to make sure this ASG won't be used right now, but still might be used in the future (understood as later, but not for the next few iterations)?
edit: It seems
It should do exactly what you want out of the box, except the NodeGroup won't be listed as 'unhealthy', it will be 'backed off'. More specifically - CA should automatically back off from scaling up a given NodeGroup after a failed scale-up. The node group won't be considered 'unhealthy', but it will go on exponential backoff for scale-up. You can find this in logs by looking for
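One way the deletion path discussed above could look, as a hedged Go sketch (hypothetical names, not the code of the eventual PR): when asked to delete a placeholder, simply lower the ASG's desired capacity instead of terminating a real instance, and rely on CA's built-in scale-up backoff to keep the group out of rotation for a while.

```go
package main

import (
	"fmt"
	"strings"
)

type asg struct {
	Name            string
	DesiredCapacity int
}

// isPlaceholder matches the synthetic "<asg-name>-not-created-N" IDs from the
// earlier sketch; a real implementation might track placeholders explicitly.
func isPlaceholder(id string) bool {
	return strings.Contains(id, "-not-created-")
}

// DeleteNodes shrinks desired capacity for placeholders, since there is no real
// instance to terminate; real instances would go through the normal termination path.
func (a *asg) DeleteNodes(ids []string) error {
	for _, id := range ids {
		if isPlaceholder(id) {
			a.DesiredCapacity-- // equivalent to resizing the ASG back down
			continue
		}
		fmt.Printf("would terminate real instance %s\n", id)
	}
	return nil
}

func main() {
	a := &asg{Name: "spot-asg", DesiredCapacity: 5}
	_ = a.DeleteNodes([]string{"spot-asg-not-created-1", "spot-asg-not-created-2"})
	fmt.Println("desired capacity now:", a.DesiredCapacity)
}
```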
OK, the patch is ready; I manually tested it against a "low bid ASG" and it worked fine. Please review, @MaciekPytel
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
Note that #2008 was reworked as #2235, and the latter PR has been merged. Is it safe to close this issue now, @piontec and @MaciekPytel?
I would suggest cherry-picking the fix first, but otherwise I think it's ok to close.
Hi, do we know if #2235 also handles on-demand instance launch failures due to lack of capacity?
I am running v1.15.3. I will try to see if there is anything useful in the logs.
My guess is that it should, but I don't have time to mock the whole setup and test it. For spot instances the request was never fulfilled, and that's exactly the case covered - that's why I think you should be good. BTW, @MaciekPytel should we close it now?
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue.
This is still an issue, even in the latest version of the cluster autoscaler, v1.21.0.
We're using a recent cluster-autoscaler, built from the master branch. It's used with AWS EKS and k8s 1.12.7. We're using it to run a cluster based on AWS spot instances.
Recently we had a situation where CAS failed to scale up the cluster. During scale-up, 2 ASGs were selected to increase the capacity, but AWS had no spot capacity in them, so no new instances were actually started (they were stuck in a state like requested instances = 5, running instances = 0). We're using the --max-node-provision-time=10m option, but new nodes in other ASGs (which had capacity) were not started within this time. CAS was stuck in this state for over an hour.
We found the following in CAS logs:
As you can see, we had some groups that couldn't be used (the unhealthy ones were those where AWS had no spot capacity), but some of them were perfectly fine (the ones with "No need for any nodes...").
This situation is hard to reproduce, but can someone please help me review the code?
If an instance is not coming up for more than MaxNodeProvisionTime, it is added to the LongUnregistered counter here. Then, when calculating upcoming nodes, LongUnregistered is subtracted, as can be seen in clusterstate.go, which is used in scale_up.go.
In our case, newNodes := ar.CurrentTarget - (readiness.Ready + readiness.Unready + readiness.LongNotStarted + readiness.LongUnregistered) should correctly evaluate to 0, which means no new nodes are coming. Yet, our logs show "3 upcoming nodes".
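For illustration, a small Go sketch of the arithmetic quoted above with made-up numbers matching the spot scenario: if the never-created instances are not returned by NodeGroup.Nodes(), nothing ever becomes LongUnregistered, so the upcoming count stays inflated.

```go
package main

import "fmt"

// upcoming mirrors the formula quoted above from clusterstate.go.
func upcoming(currentTarget, ready, unready, longNotStarted, longUnregistered int) int {
	return currentTarget - (ready + unready + longNotStarted + longUnregistered)
}

func main() {
	// ASG asked for 5 spot instances, AWS started none.
	// Without placeholders the missing instances are invisible to clusterstate:
	fmt.Println(upcoming(5, 0, 0, 0, 0)) // 5 -> CA keeps waiting for "upcoming" nodes
	// With placeholders that time out after MaxNodeProvisionTime:
	fmt.Println(upcoming(5, 0, 0, 0, 5)) // 0 -> CA can move on to other ASGs
}
```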
Is it possible that we had CurrentTarget = X, Ready = Unready = LongNotStarted = LongUnregistered = 0, which set the value to 0, but later it was updated to CurrentTarget = X, Ready = Unready = 0, LongNotStarted = LongUnregistered = X, so the result was negative and the check prevented the counter from updating?