cluster-autoscaler gets stuck with "Failed to fix node group sizes" error #6128
Comments
The same thing happened to one of our EKS clusters. Here's the full output from one loop iteration:
There were no obvious actions taken on our side that would have caused the problem to appear.
When this has happened to us, manually changing the number of instances in the node group seems to have helped, which then lets the cluster autoscaler correct the size again. My hypothesis so far is that this might be related to the cluster using the m7i-flex family of instances, which isn't necessarily available in all availability zones. That can leave the node group set to, say, 5 instances while AWS only creates 4, because the node group is configured to use subnets/AZs where that instance family can't be launched. I've now replaced the node group with a new one that only enables subnets/AZs where this family is available, and I hope that helps. @com6056 Any chance you might be in a similar situation?
We aren't using those instance types, so I don't think that is what is causing it (at least for us). We have ASGs with a mixed instance type policy, and there should usually be instances available (and if not, it should just fail and fall back to a different node group).
Same issue; I use AWS as the provider and Kubernetes 1.28. After downgrading cluster-autoscaler from 1.28 to 1.27.3 it started to work fine.
We are seeing this issue in our environments too, and we are doing nothing special with instance types. At one point, after a few restarts of the autoscaler, it worked for a little while and then stopped working again. Currently downgrading as suggested by @mykhailogorsky.
Nope, still hitting it, even with
We have seen no issues since our downgrade to v1.27.3 and things have been operating as we expect; it seems like this is an issue introduced in 1.28.0.
We are also seeing this in v1.28.0.
This is unrelated to scale-down logic - this is called only to reduce the target size on a node group without actually deleting any nodes. The error suggests that it would actually require deleting existing VMs to reduce the node group size. Not sure why this started in 1.28. As a data point, this may be AWS specific - I haven't seen this error on GKE.
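For context, here is a rough, paraphrased sketch of the contract being described - illustrative only, not the actual cluster-autoscaler source (the real method is the cloud provider's DecreaseTargetSize; the types and error text below are stand-ins):

```go
// Paraphrased sketch of the "reduce target size without deleting nodes" contract.
package main

import "fmt"

type asg struct {
	targetSize int
	instances  []string // instance IDs the provider believes belong to the group
}

// decreaseTargetSize lowers the desired capacity without deleting any existing
// instance. If the requested size would drop below the number of instances the
// provider can see, it must refuse - which is exactly the error reported in
// this issue when stale placeholder/zombie instances inflate that count.
func (a *asg) decreaseTargetSize(delta int) error {
	if delta >= 0 {
		return fmt.Errorf("size decrease must be negative, got %d", delta)
	}
	if a.targetSize+delta < len(a.instances) {
		return fmt.Errorf("attempt to delete existing nodes: targetSize %d, delta %d, existing %d",
			a.targetSize, delta, len(a.instances))
	}
	a.targetSize += delta
	return nil
}

func main() {
	g := &asg{targetSize: 5, instances: []string{"i-1", "i-2", "i-3", "i-4", "i-5"}}
	fmt.Println(g.decreaseTargetSize(-1)) // fails: would require deleting a tracked instance
}
```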
We've seen the same issue with 1.28.1, but only when deleting a node manually from the cluster.
Got the same issue today. I found out that we had more instances in our ASG than there were Nodes in Kubernetes. After finding the instance in EC2 that had not joined the cluster as a Node and terminating it, the cluster-autoscaler started to recover.
Seeing the same issue here. Terminating the instance that failed to join also fixes it.
Same issue on 1.28.2 on AWS. Terminated the instance that was running but not part of the cluster. (The ASG then seems to have created a new node, which did join the cluster.) cluster-autoscaler was restarting with failed health checks (apparently getting 500s) and had the same log messages as above.
The culprit is the implementation of the AWS provider:
Its size is almost always equal to len(nodes), because the AWS provider's nodes are composed of active nodes plus fake placeholder nodes, so together they always equal the target size.
Related issue: #5829
Maybe we can filter for the active running nodes, ignore the stale placeholder (fake) nodes whose status == placeholderUnfulfillableStatus, and make sure the ASG target size can converge to the number of active nodes. For example:
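(The original example code is not shown above; the following is a minimal, hypothetical sketch of that filtering idea. The struct, ID prefix, and status string are stand-ins, not the exact AWS cloud provider types.)

```go
package main

import (
	"fmt"
	"strings"
)

// Stand-in for the provider's placeholder instance ID prefix.
const placeholderPrefix = "i-placeholder"

type instance struct {
	ID     string
	Status string
}

// activeInstances drops stale placeholder entries that never became real EC2
// instances, so the ASG's effective size can converge to what actually runs.
func activeInstances(all []instance) []instance {
	active := make([]instance, 0, len(all))
	for _, inst := range all {
		if strings.HasPrefix(inst.ID, placeholderPrefix) && inst.Status == "placeholder-cannot-be-fulfilled" {
			continue // unfulfillable placeholder: ignore it when reconciling size
		}
		active = append(active, inst)
	}
	return active
}

func main() {
	all := []instance{
		{ID: "i-0abc123", Status: "running"},
		{ID: "i-placeholder-my-asg-1", Status: "placeholder-cannot-be-fulfilled"},
	}
	fmt.Println(len(activeInstances(all))) // 1: the target size could converge to this
}
```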
FWIW, we're stuck with this bug as well; downgrading to 1.27.3 isn't a great option for us:
Hello, I had exactly the same problem running Kubernetes 1.29 with the 1.29.0 cluster-autoscaler image. One node was in a bad state, and because of it our whole system could not scale up for many hours.
Seeing this as well with cluster-autoscaler on EKS. If the bootstrap script (
I observed a similar issue. An initial scale-up timeout caused the error to start appearing. The instance was eventually scaled up, but it wasn't registered as a Kubernetes node. When I deleted the node manually, the issue was resolved. I believe it was related to the low availability of a particular instance type (
I believe the fundamental issue is that cluster-autoscaler does not handle instances without node objects: it does not support deleting nodes while the instance is still running, and, similarly, it does not handle instances that fail to produce a node object in the first place (e.g., misconfigured launch templates).

Nodes are not supposed to be deleted manually. The kubelet creates the initial node object and keeps it updated, but does not react to the node object being removed. Cluster-autoscaler uses cloud APIs to scale up ASGs when needed or to shut down instances once drained, but does not intervene in the node object lifecycle. The cloud node controller is supposed to remove the node object only after the instance is shut down and relinquished, but it has no way to tell whether a still-running instance is supposed to be a node of the cluster. So manually deleting a node creates a zombie instance. (The control plane responds by redirecting service traffic and rescheduling workloads elsewhere, but the instance is left running.) Ideally kubelet would recreate the missing node object if necessary, but it currently doesn't. Cluster-autoscaler then starts getting confused by the sustained mismatch between ASG sizes and the node object inventory. This initially presents as cluster-autoscaler's "Failed to fix node group sizes" error.

To actually clean up the problem you need to manually identify the zombie instances and individually terminate them (or, more disruptively, briefly bump that ASG's desired size to zero - then manually bump it back again, in case that's the same ASG that cluster-autoscaler itself runs on). And then stop letting people manually delete nodes (or run untested configs).

A possible fix would be, for all ASGs that cluster-autoscaler is configured to manage, to have cluster-autoscaler terminate any instance beyond a certain age that doesn't correspond to a node object. (This would treat them like failed start-ups. An alternative would be for kubelet to recreate the node object it syncs to whenever necessary. Another alternative would be for kubelet to react by exiting and shutting down its host, letting the ASG clean up such instances.) Note that this proposed fix would also clean up instances that fail to connect to the cluster, which is precisely the source of the problem for some of the reports above in this thread. (If the ASG launch template is misconfigured for the cluster, it makes sense that cluster-autoscaler should be allowed to periodically retry creating the instance, effectively checking for corrections to that configuration.)
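A minimal sketch of the age-based cleanup proposed above, assuming invented types, names, and thresholds - this is not cluster-autoscaler's actual API, just an illustration of the policy:

```go
package main

import (
	"fmt"
	"time"
)

type cloudInstance struct {
	ID         string
	LaunchedAt time.Time
}

// zombieInstances returns instances that have existed longer than maxAge
// without a corresponding Node object (matched here by instance ID).
// These would be treated like failed start-ups and terminated.
func zombieInstances(instances []cloudInstance, nodeProviderIDs map[string]bool, maxAge time.Duration, now time.Time) []cloudInstance {
	var zombies []cloudInstance
	for _, inst := range instances {
		if nodeProviderIDs[inst.ID] {
			continue // a Node object exists; instance is accounted for
		}
		if now.Sub(inst.LaunchedAt) < maxAge {
			continue // still within the registration grace period
		}
		zombies = append(zombies, inst) // candidate for termination
	}
	return zombies
}

func main() {
	now := time.Now()
	instances := []cloudInstance{
		{ID: "i-0aaa", LaunchedAt: now.Add(-2 * time.Hour)},
		{ID: "i-0bbb", LaunchedAt: now.Add(-5 * time.Minute)},
	}
	nodes := map[string]bool{} // neither instance has a Node object
	fmt.Println(zombieInstances(instances, nodes, 15*time.Minute, now)) // only i-0aaa
}
```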
We noticed this with a spot instance node group on EKS. Could a preempted instance also trigger this behavior?
FYI, we are also seeing this with one of our users. It again starts with the timeout.
Just had this occur on one of our EKS 1.28 clusters, fixed it by:
Everything looks stable for this cluster now, and we can probably repeat these steps if the problem returns.
This method works fine for EKS 1.29 as well, thanks. I'd add some tips on how to spot the "cattle out of the herd". The error is: To see what CAS "thinks" about this ASG, check the ConfigMap cluster-autoscaler-status. Something like this: Then look at the instance list for this ASG - there should be one spare. The open question is what circumstances lead to such a situation requiring a manual fix. Will try 1.29.2.
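A hypothetical helper for that tip, assuming client-go and that cluster-autoscaler runs in the kube-system namespace (adjust if yours differs): it prints the cluster-autoscaler-status ConfigMap so you can compare what CAS reports per node group against the ASG's actual instance list.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	cm, err := clientset.CoreV1().ConfigMaps("kube-system").
		Get(context.Background(), "cluster-autoscaler-status", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	// The status ConfigMap typically keeps its payload under the "status" key,
	// listing per-node-group health and registered vs. target sizes - the
	// mismatch discussed in this thread.
	fmt.Println(cm.Data["status"])
}
```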
Thanks for the help, this solution works perfectly.
Thank you! This works perfectly on EKS 1.28 and CA 1.28.2. Any plans to work on a fix? Thanks in advance.
After encountering this issue on EKS 1.28 and CA 1.28.2, we reproduced the issue as described above by manually removing one of the Kubernetes nodes. When the autoscaler is able to recover from this disconnect between ASG state and Kubernetes state, you will see this in the log:
This message is never present when the issue occurs. To test this, I modified the
And sure enough, on the next run of the cluster-autoscaler:
We also encountered this problem, and I refactored part of the logic; it solves this problem very well: #6729. It works well on our large cluster.
I am using EKS 1.29 with CA 1.29.2 and tried this, but when I detached the instance, the ASG automatically added new instances that were also not present in the cluster, so I ended up with even more zombie instances. Once I unchecked the instance-replacement option when detaching, the instance that was not in Kubernetes started to disappear.
Any ETA on a fix?
Anyone who had the problem on 1.28/1.29, would you please verify whether #6528 (merged in 1.30) solved the problem for you? The solution itself is very similar to the one mentioned by @jschwartzy in #6128 (comment)
@chuyee
Hello everyone. We're running Kubernetes
To be honest, folks, I have tried almost every version of CA after 1.27 and it's a pain. It was hurting us pretty badly, since our ability to scale up or down took a serious hit. We have now moved to Karpenter and I can sleep like a baby.
"We don't do cross version testing or compatibility testing in other environments. Some user reports indicate successful use of a newer version of Cluster Autoscaler with older clusters, however, there is always a chance that it won't work as expected." So you can either upgrade to v1.30.1 (as we did) or wait for this backport PR to be released as 1.28.6 if you'd like to be more cautious and use the intended supported version.
@markshawtoronto Thanks, will wait for 1.28.6 to be released.
We are in the same spot; does anyone know when it will be released? Thanks
It'd also be great to see the 1.29 backport PR released within v1.29.4 🙏🏻
Any ETA on 1.28.6 and 1.29.4?
@cloudwitch Probably July 24th, based on https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler#schedule. This is really inconvenient though, so I'm pondering running CA 1.27 on newer clusters.
Was this issue fixed in 1.28.6? I don't see any mention of it in the release notes.
For those just coming across this bug who need to perform the above workaround as a quick fix, take note that detaching an instance from an ASG does not terminate it. In our case, I actually went to the instance's page and terminated it. The ASG then automatically created a new instance, which was able to join the cluster properly. We then just let Cluster Autoscaler deal with any excess capacity, if there was any. For completeness, this is how we performed the workaround (adapted from the original steps by @mrocheleau):
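(The original step-by-step list is not reproduced above. As a complementary illustration, here is a hypothetical Go diagnostic, assuming aws-sdk-go-v2 and client-go, that lists ASG instances with no matching Node object - the likely "zombies" to terminate. The ASG name is a placeholder.)

```go
package main

import (
	"context"
	"fmt"
	"strings"

	awsconfig "github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/autoscaling"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	ctx := context.Background()

	// Kubernetes: collect instance IDs from Node provider IDs
	// (AWS format: aws:///<az>/<instance-id>).
	kcfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(kcfg)
	if err != nil {
		panic(err)
	}
	nodes, err := clientset.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	known := map[string]bool{}
	for _, n := range nodes.Items {
		parts := strings.Split(n.Spec.ProviderID, "/")
		known[parts[len(parts)-1]] = true
	}

	// AWS: list the instances the ASG believes it has.
	acfg, err := awsconfig.LoadDefaultConfig(ctx)
	if err != nil {
		panic(err)
	}
	asgClient := autoscaling.NewFromConfig(acfg)
	out, err := asgClient.DescribeAutoScalingGroups(ctx, &autoscaling.DescribeAutoScalingGroupsInput{
		AutoScalingGroupNames: []string{"ASG_NAME"}, // placeholder: your node group's ASG
	})
	if err != nil {
		panic(err)
	}
	for _, g := range out.AutoScalingGroups {
		for _, inst := range g.Instances {
			if !known[*inst.InstanceId] {
				fmt.Println("instance with no Node object:", *inst.InstanceId)
			}
		}
	}
}
```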
Is this fix already available in 1.29.3? I see it seems to have been backported to 1.29, but I faced this problem two days ago using cluster-autoscaler 1.29.3.
Same with 1.29.4; can anyone confirm whether the bad behavior is still present?
Still facing this with the 1.30.2 image.
Which component are you using?:
cluster-autoscaler

What version of the component are you using?:
Component version: v1.28.0

What k8s version are you using (kubectl version)?:
kubectl version output

What environment is this in?:
AWS via the aws provider

What did you expect to happen?:
I expect cluster-autoscaler to be able to scale ASGs up/down without issue.

What happened instead?:
cluster-autoscaler gets stuck in a deadlock with the following error:

How to reproduce it (as minimally and precisely as possible):
Not entirely sure what causes it, unfortunately.

Anything else we need to know?: