
cluster-autoscaler gets stuck with "Failed to fix node group sizes" error #6128

Open
com6056 opened this issue Sep 22, 2023 · 50 comments
Labels: area/cluster-autoscaler, area/provider/aws, kind/bug

Comments

@com6056
Contributor

com6056 commented Sep 22, 2023

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

Component version: v1.28.0

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.1", GitCommit:"86ec240af8cbd1b60bcc4c03c20da9b98005b92e", GitTreeState:"clean", BuildDate:"2021-12-16T11:41:01Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.11", GitCommit:"8cfcba0b15c343a8dc48567a74c29ec4844e0b9e", GitTreeState:"clean", BuildDate:"2023-06-14T09:49:38Z", GoVersion:"go1.19.10", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:

AWS via the aws provider

What did you expect to happen?:

I expect cluster-autoscaler to be able to scale ASGs up/down without issue.

What happened instead?:

cluster-autoscaler gets stuck in a deadlock with the following error:

cluster-autoscaler-aws-65dccf9965-qmprl cluster-autoscaler I0922 16:10:22.995283       1 static_autoscaler.go:709] Decreasing size of build-16-32, expected=4 current=2 delta=-2
cluster-autoscaler-aws-65dccf9965-qmprl cluster-autoscaler E0922 16:10:22.995297       1 static_autoscaler.go:439] Failed to fix node group sizes: failed to decrease build-16-32: attempt to delete existing nodes targetSize:4 delta:-2 existingNodes: 4

How to reproduce it (as minimally and precisely as possible):

Not entirely sure what causes it unfortunately.

Anything else we need to know?:

@com6056 added the kind/bug label Sep 22, 2023
@Tenzer

Tenzer commented Oct 2, 2023

The same thing happened for one of our EKS clusters. Here's the full output for one loop:

I1002 14:50:40.757207       1 reflector.go:790] k8s.io/client-go/informers/factory.go:150: Watch close - *v1.ReplicaSet total 112 items received
I1002 14:50:42.157028       1 reflector.go:790] k8s.io/client-go/informers/factory.go:150: Watch close - *v1.PersistentVolumeClaim total 8 items received
I1002 14:51:09.451280       1 static_autoscaler.go:287] Starting main loop
I1002 14:51:09.451449       1 auto_scaling_groups.go:393] Regenerating instance to ASG map for ASG names: []
I1002 14:51:09.451463       1 auto_scaling_groups.go:400] Regenerating instance to ASG map for ASG tags: map[k8s.io/cluster-autoscaler/enabled: k8s.io/cluster-autoscaler/public:]
I1002 14:51:09.586582       1 auto_scaling_groups.go:142] Updating ASG eks-x86-m7i-flex-xlarge-f6c4dd96-da24-ca3f-2824-ffbc8b741746
I1002 14:51:09.586626       1 aws_wrapper.go:703] 0 launch configurations to query
I1002 14:51:09.586632       1 aws_wrapper.go:704] 0 launch templates to query
I1002 14:51:09.586638       1 aws_wrapper.go:724] Successfully queried 0 launch configurations
I1002 14:51:09.586645       1 aws_wrapper.go:735] Successfully queried 0 launch templates
I1002 14:51:09.586651       1 aws_wrapper.go:746] Successfully queried instance requirements for 0 ASGs
I1002 14:51:09.586659       1 aws_manager.go:129] Refreshed ASG list, next refresh after 2023-10-02 14:52:09.586657471 +0000 UTC m=+375178.448258144
I1002 14:51:09.588859       1 aws_manager.go:185] Found multiple availability zones for ASG "eks-x86-m7i-flex-xlarge-f6c4dd96-da24-ca3f-2824-ffbc8b741746"; using us-east-1a for failure-domain.beta.kubernetes.io/zone label
I1002 14:51:09.589066       1 static_autoscaler.go:709] Decreasing size of eks-x86-m7i-flex-xlarge-f6c4dd96-da24-ca3f-2824-ffbc8b741746, expected=12 current=11 delta=-1
E1002 14:51:09.589091       1 static_autoscaler.go:439] Failed to fix node group sizes: failed to decrease eks-x86-m7i-flex-xlarge-f6c4dd96-da24-ca3f-2824-ffbc8b741746: attempt to delete existing nodes targetSize:12 delta:-1 existingNodes: 12

There's no obvious actions taken on our side that caused the problem to appear.

@Tenzer

Tenzer commented Oct 9, 2023

When this has happened to us, manually changing the number of instances in the node group seemed to help; after that, the cluster autoscaler was able to correct the size again.
However, it seems likely to reoccur a couple of hours later.

My hypothesis so far is that this might be related to the cluster using the m7i-flex family of instances, which isn't necessarily available in all availability zones. That can cause the node group to be set to, say, 5 instances, while AWS only creates 4, because the node group is configured to use subnets/AZs where it can't launch that instance type.
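If you want to double-check which AZs actually offer a given instance family, the EC2 DescribeInstanceTypeOfferings API answers exactly that. A minimal sketch using aws-sdk-go-v2 (the region and instance type below are only example values):

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/ec2"
    "github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

func main() {
    ctx := context.Background()

    // Example region; use the region your ASGs live in.
    cfg, err := config.LoadDefaultConfig(ctx, config.WithRegion("us-east-1"))
    if err != nil {
        log.Fatal(err)
    }
    client := ec2.NewFromConfig(cfg)

    out, err := client.DescribeInstanceTypeOfferings(ctx, &ec2.DescribeInstanceTypeOfferingsInput{
        LocationType: types.LocationTypeAvailabilityZone,
        Filters: []types.Filter{
            {Name: aws.String("instance-type"), Values: []string{"m7i-flex.xlarge"}},
        },
    })
    if err != nil {
        log.Fatal(err)
    }

    // AZs that actually offer the instance type; compare this list with the
    // subnets/AZs attached to the ASG.
    for _, offering := range out.InstanceTypeOfferings {
        fmt.Println(aws.ToString(offering.Location))
    }
}

Any AZ attached to the ASG but missing from that output would support this hypothesis.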

I've now replaced the node group with a new one that only enables subnets/AZs where this instance family is available, and hope that might help.

@com6056 Any chance you might be in a similar situation?

@com6056
Contributor Author

com6056 commented Nov 7, 2023

We aren't using those instance types, so I don't think that is what is causing it (at least for us). We have ASGs with a mixed instance type policy, and there should usually be instances available (and if not, it should just fail and fall back to a different node group).

@mykhailogorsky

Same issue here; I'm using AWS as the provider and Kubernetes 1.28:
attempt to delete existing nodes targetSize:13 delta:-1 existingNodes: 13
From what I noticed, DecreaseTargetSize doesn't receive the correct actual group size from this line of code:
nodes, err := ng.awsManager.GetAsgNodes(ng.asg.AwsRef)

After downgrading cluster-autoscaler from 1.28 to 1.27.3 it started to work OK.

@stefansedich
Contributor

stefansedich commented Nov 15, 2023

We are seeing this issue in our environments too, and we are doing nothing special with instance types. At one point, after a few restarts of the autoscaler, it worked for a little while and then stopped working again.

Currently downgrading as suggested by @mykhailogorsky

@com6056
Contributor Author

com6056 commented Nov 15, 2023

@x13n could this be caused by #5976 enabling parallel drain by default? I guess I could try 1.28 again with -parallel-drain=false to confirm 👀

@com6056
Contributor Author

com6056 commented Nov 15, 2023

Nope, still hitting it, even with --parallel-drain="false":

cluster-autoscaler-aws-6d95d65-hbgdb cluster-autoscaler I1115 19:34:18.815122       1 static_autoscaler.go:709] Decreasing size of build-32-128, expected=5 current=1 delta=-4
cluster-autoscaler-aws-6d95d65-hbgdb cluster-autoscaler E1115 19:34:18.815135       1 static_autoscaler.go:439] Failed to fix node group sizes: failed to decrease build-32-128: attempt to delete existing nodes targetSize:5 delta:-4 existingNodes: 5

@stefansedich
Contributor

We have seen no issues since our downgrade to v1.27.3 and things have been operating as we expect; it seems like an issue introduced in 1.28.0.

@rajaie-sg

We are also seeing this in v1.28.0

@x13n
Member

x13n commented Nov 16, 2023

This is unrelated to the scale-down logic - this code path is called only to reduce the target size of a node group without actually deleting any nodes. The error suggests that reducing the node group size would actually require deleting existing VMs. Not sure why this started in 1.28. As a data point, this may be AWS-specific - I haven't seen this error on GKE.
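For illustration, here is a minimal, self-contained sketch of that interaction (simplified, hypothetical types - not the actual cluster-autoscaler code): the core computes delta = registered - target and asks the provider to shrink the target without deleting anything, but because the ASG still reports the instances that never registered, the provider-side guard trips on every loop:

package main

import "fmt"

// Hypothetical, simplified stand-in for a node group; the real interfaces live
// in cluster-autoscaler's cloudprovider package.
type nodeGroup struct {
    targetSize int
    nodes      int // what the provider reports as existing nodes, placeholders included
}

// Mirrors the guard in the AWS provider's DecreaseTargetSize shown later in this thread.
func (ng *nodeGroup) DecreaseTargetSize(delta int) error {
    if delta >= 0 {
        return fmt.Errorf("size decrease size must be negative")
    }
    if ng.targetSize+delta < ng.nodes {
        return fmt.Errorf("attempt to delete existing nodes targetSize:%d delta:%d existingNodes: %d",
            ng.targetSize, delta, ng.nodes)
    }
    ng.targetSize += delta
    return nil
}

func main() {
    // The situation from the logs above: target=4 but only 2 registered nodes,
    // while the ASG still reports 4 instances (2 healthy + 2 that never joined).
    ng := &nodeGroup{targetSize: 4, nodes: 4}
    registered := 2
    delta := registered - ng.targetSize // -2
    fmt.Println(ng.DecreaseTargetSize(delta))
}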

@eaterm

eaterm commented Dec 6, 2023

We've seen the same issue with 1.28.1, but only when deleting a node manually from the cluster.
After some time the autoscaler tries to change the number of nodes in the node group and then fails.

@mycrEEpy

Got the same issue today. I found out that we had more instances in our ASG than there were Nodes in Kubernetes. After finding the EC2 instance which had not joined the cluster as a Node and terminating it, the cluster-autoscaler started to recover.

@artificial-aidan

Seeing the same issue here. Terminating the instance that failed to join also fixes it.

@mohag

mohag commented Feb 2, 2024

Same issue on 1.28.2 on AWS.

Terminated the instance that was running but not part of the cluster. (ASG seems to have created a new node then, which did join the cluster)

cluster-autoscaler was restarting due to failed health checks (apparently getting 500s) and showed the same log messages as above.

@songminglong

The culprit is the AWS provider's implementation of DecreaseTargetSize:

func (ng *AwsNodeGroup) DecreaseTargetSize(delta int) error {
	if delta >= 0 {
		return fmt.Errorf("size decrease size must be negative")
	}

	size := ng.asg.curSize
	nodes, err := ng.awsManager.GetAsgNodes(ng.asg.AwsRef)
	if err != nil {
		return err
	}
	if int(size)+delta < len(nodes) {
		return fmt.Errorf("attempt to delete existing nodes targetSize:%d delta:%d existingNodes: %d",
			size, delta, len(nodes))
	}
	return ng.awsManager.SetAsgSize(ng.asg, size+delta)
}

Here size is almost always equal to len(nodes): the AWS provider's node list contains both active nodes and fake placeholder nodes, so together they always add up to the target size. For example, with a target size of 4 but only 2 registered Kubernetes nodes, GetAsgNodes still returns 4 entries, so size + delta = 4 + (-2) = 2 < 4 and the guard above rejects the decrease.

@songminglong

related issue: #5829

@songminglong

The culprit is the AWS provider's implementation of DecreaseTargetSize:

func (ng *AwsNodeGroup) DecreaseTargetSize(delta int) error {
	if delta >= 0 {
		return fmt.Errorf("size decrease size must be negative")
	}

	size := ng.asg.curSize
	nodes, err := ng.awsManager.GetAsgNodes(ng.asg.AwsRef)
	if err != nil {
		return err
	}
	if int(size)+delta < len(nodes) {
		return fmt.Errorf("attempt to delete existing nodes targetSize:%d delta:%d existingNodes: %d",
			size, delta, len(nodes))
	}
	return ng.awsManager.SetAsgSize(ng.asg, size+delta)
}

Here size is almost always equal to len(nodes): the AWS node list contains both active nodes and fake placeholder nodes, so together they always add up to the target size.

Maybe we can filter the list down to active running nodes, ignoring the stale placeholder nodes whose status == placeholderUnfulfillableStatus, so that the ASG target size can converge to the number of active nodes.

for example:

func (ng *AwsNodeGroup) DecreaseTargetSize(delta int) error {
	if delta >= 0 {
		return fmt.Errorf("size decrease size must be negative")
	}

	size := ng.asg.curSize
	nodes, err := ng.awsManager.GetAsgNodes(ng.asg.AwsRef)
	if err != nil {
		return err
	}

	// Keep only active nodes, ignoring stale placeholder nodes whose status == placeholderUnfulfillableStatus,
	// so that the ASG target size can converge to the number of active nodes.
	filteredNodes := make([]AwsInstanceRef, 0)
	for i := range nodes {
		node := nodes[i]
		instanceStatus, err := ng.awsManager.GetInstanceStatus(node)
		if err != nil {
			klog.V(4).Infof("Could not get instance status, continuing anyways: %v", err)
		} else if instanceStatus != nil && *instanceStatus == placeholderUnfulfillableStatus {
			continue
		}
		filteredNodes = append(filteredNodes, node)
	}
	nodes = filteredNodes

	if int(size)+delta < len(nodes) {
		return fmt.Errorf("attempt to delete existing nodes targetSize:%d delta:%d existingNodes: %d",
			size, delta, len(nodes))
	}
	return ng.awsManager.SetAsgSize(ng.asg, size+delta)
}

@akloss-cibo

FWIW, we're stuck with this bug as well; downgrading to 1.27.3 isn't a great option for us:

uses the unknown EC2 instance type "m7i.48xlarge"

@mhornero91

Hello, I had exactly the same problem running Kubernetes 1.29 and the 1.29.0 cluster-autoscaler image. One node got into this state and, because of it, our whole system couldn't scale for many hours.

@fahaddd-git

fahaddd-git commented Mar 1, 2024

Seeing this as well with cluster [email protected] on [email protected]. If the bootstrap script (/etc/eks/bootstrap.sh) fails, the node will just sit in the ASG and all CA scaling up/down will be stopped.

@pawelaugustyn

I observed a similar issue. An initial scale-up timeout caused the error to start appearing: the instance was eventually launched, but it never registered as a Kubernetes node. When I deleted the instance manually, the issue was resolved. I believe it was related to low availability of a particular instance type (g4dn.4xlarge). EKS 1.28 with 1.28.2 cluster-autoscaler.

I0313 12:34:34.470823       1 executor.go:147] Scale-up: setting group eks-mli_inference_g4dn4xl-02c6fc6f-788b-c2ae-e9d5-647dfa039598 size to 7
I0313 12:34:34.470926       1 auto_scaling_groups.go:265] Setting asg eks-mli_inference_g4dn4xl-02c6fc6f-788b-c2ae-e9d5-647dfa039598 size to 7
...
W0313 12:49:40.563382       1 clusterstate.go:266] Scale-up timed out for node group eks-mli_inference_g4dn4xl-02c6fc6f-788b-c2ae-e9d5-647dfa039598 after 15m5.500491908s
W0313 12:49:40.584456       1 clusterstate.go:297] Disabling scale-up for node group eks-mli_inference_g4dn4xl-02c6fc6f-788b-c2ae-e9d5-647dfa039598 until 2024-03-13 12:54:40.13918732 +0000 UTC m=+2030.419314682; errorClass=Other; errorCode=timeout
W0313 12:50:01.422358       1 orchestrator.go:582] Node group eks-mli_inference_g4dn4xl-02c6fc6f-788b-c2ae-e9d5-647dfa039598 is not ready for scaleup - backoff
W0313 12:50:22.047950       1 orchestrator.go:582] Node group eks-mli_inference_g4dn4xl-02c6fc6f-788b-c2ae-e9d5-647dfa039598 is not ready for scaleup - backoff
I0313 12:50:32.503066       1 static_autoscaler.go:709] Decreasing size of eks-mli_inference_g4dn4xl-02c6fc6f-788b-c2ae-e9d5-647dfa039598, expected=7 current=6 delta=-1
E0313 12:50:32.503102       1 static_autoscaler.go:439] Failed to fix node group sizes: failed to decrease eks-mli_inference_g4dn4xl-02c6fc6f-788b-c2ae-e9d5-647dfa039598: attempt to delete existing nodes targetSize:7 delta:-1 existingNodes: 7

...
NOW I DELETED THE NODE FROM EC2 CONSOLE
...

E0313 15:15:27.466600       1 static_autoscaler.go:439] Failed to fix node group sizes: failed to decrease eks-mli_inference_g4dn4xl-02c6fc6f-788b-c2ae-e9d5-647dfa039598: attempt to delete existing nodes targetSize:7 delta:-1 existingNodes: 7
I0313 15:15:37.796147       1 static_autoscaler.go:709] Decreasing size of eks-mli_inference_g4dn4xl-02c6fc6f-788b-c2ae-e9d5-647dfa039598, expected=7 current=6 delta=-1
E0313 15:15:37.796180       1 static_autoscaler.go:439] Failed to fix node group sizes: failed to decrease eks-mli_inference_g4dn4xl-02c6fc6f-788b-c2ae-e9d5-647dfa039598: attempt to delete existing nodes targetSize:7 delta:-1 existingNodes: 7
I0313 15:15:48.264982       1 static_autoscaler.go:709] Decreasing size of eks-mli_inference_g4dn4xl-02c6fc6f-788b-c2ae-e9d5-647dfa039598, expected=7 current=6 delta=-1
E0313 15:15:48.265018       1 static_autoscaler.go:439] Failed to fix node group sizes: failed to decrease eks-mli_inference_g4dn4xl-02c6fc6f-788b-c2ae-e9d5-647dfa039598: attempt to delete existing nodes targetSize:7 delta:-1 existingNodes: 8
I0313 15:15:58.763859       1 static_autoscaler.go:709] Decreasing size of eks-mli_inference_g4dn4xl-02c6fc6f-788b-c2ae-e9d5-647dfa039598, expected=7 current=6 delta=-1
E0313 15:15:58.763892       1 static_autoscaler.go:439] Failed to fix node group sizes: failed to decrease eks-mli_inference_g4dn4xl-02c6fc6f-788b-c2ae-e9d5-647dfa039598: attempt to delete existing nodes targetSize:7 delta:-1 existingNodes: 8

NO MORE ENTRIES RELATED TO THIS ISSUE

@towca added the area/provider/aws label Mar 21, 2024
@benjimin

benjimin commented Mar 30, 2024

I believe the fundamental issue is that cluster-autoscaler does not handle instances without node objects, that is, it does not support deleting nodes when the instance is still running, and similarly, does not handle instances that fail to produce a node object in the first place (e.g., misconfigured launch templates).

Nodes are not supposed to be deleted manually. The kubelet creates the initial node object and keeps it updated, but does not react to the node object being removed. Cluster-autoscaler uses cloud APIs to scale up ASGs when needed or shutdown instances when drained, but does not intervene in node object lifecycle. The cloud node controller is supposed to remove the node object only after the instance is shutdown and relinquished, but has no way to tell whether a still-running instance is supposed to be a node of the cluster. So, manually deleting a node creates a zombie instance. (The control plane responds by redirecting service traffic and rescheduling workloads elsewhere, but the instance is left running.)

Ideally kubelet should recreate the missing node object if necessary, but it currently doesn't. Then cluster-autoscaler starts getting confused by the sustained mismatch between ASG sizes and the node object inventory. This initially presents as cluster-autoscaler's Failed to find readiness information for <ASG> message, while tending to leave pods indefinitely in a pending state (as if it thinks a scale-up is already in progress). If later the cluster manages to satisfy capacity due to some other trigger (either by manually bumping up the ASG desired size, or waiting for a transient workload spike to scale the cluster up further, or if enough pods just happen to finish) then all workloads may be restored to health, but nonetheless (after a scaling delay) cluster-autoscaler instead starts emitting Failed to fix node group sizes: failed to decrease <ASG>... messages containing the telltale clue that targetSize == existingNodes while delta < 0.

To actually clean up the problem you need to manually identify the zombie instances and individually terminate them (or, more disruptively, briefly bump that ASG's desired size to zero - then manually bump it back up again, in case that's the same ASG that cluster-autoscaler itself runs on). And then stop letting people manually delete nodes (or run untested configs).

A possible fix here would be, for all ASGs that cluster-autoscaler is configured to manage, to have cluster-autoscaler terminate any instance beyond a certain age if it doesn't correspond to a node object. (This would treat them like failed start-ups. An alternative solution would be for kubelet to recreate the node object it syncs to whenever necessary. Another alternative would be if kubelet reacted by exiting and shutting down its host, letting the ASG clean up such instances.)
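To make the first alternative concrete, here is a rough, self-contained sketch of the idea (hypothetical types and helper names, not cluster-autoscaler code): flag any managed instance past a grace period that has no matching node object.

package main

import (
    "fmt"
    "time"
)

// Hypothetical, simplified representation of a cloud instance in a managed ASG.
type Instance struct {
    ID         string
    ProviderID string
    LaunchTime time.Time
}

// zombieInstances returns instances that have been running longer than maxAge
// but have no corresponding node object (registered is keyed by provider ID).
func zombieInstances(instances []Instance, registered map[string]bool, maxAge time.Duration, now time.Time) []string {
    var stale []string
    for _, inst := range instances {
        if registered[inst.ProviderID] {
            continue // a node object exists; leave the instance alone
        }
        if now.Sub(inst.LaunchTime) < maxAge {
            continue // still within the grace period to register
        }
        stale = append(stale, inst.ID)
    }
    return stale
}

func main() {
    now := time.Now()
    registered := map[string]bool{"aws:///us-east-1a/i-aaa": true}
    instances := []Instance{
        {ID: "i-aaa", ProviderID: "aws:///us-east-1a/i-aaa", LaunchTime: now.Add(-2 * time.Hour)},
        {ID: "i-bbb", ProviderID: "aws:///us-east-1a/i-bbb", LaunchTime: now.Add(-2 * time.Hour)}, // zombie
    }
    // With a 15-minute grace period, only i-bbb is selected for termination.
    fmt.Println(zombieInstances(instances, registered, 15*time.Minute, now))
}

The grace period would need to be at least as long as normal node provisioning, comparable to the 15-minute scale-up timeout visible in the logs above.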

Note this proposed fix will also clean up instances that fail to connect to the cluster, which is precisely the source of the problem for some of the reports above in this thread. (If the ASG launch template is misconfigured for the cluster, it makes sense that cluster-autoscaler should be allowed to periodically retry creating the instance, effectively checking for corrections to that configuration.)

@zchenyu

zchenyu commented Apr 1, 2024

We noticed this with a spot instance node group on EKS. Could a preempted instance also trigger this behavior?

@ahilden

ahilden commented Apr 17, 2024

FYI, we are also seeing this with one of our users. It likewise starts with the scale-up timeout:

05:26:42.532720       1 clusterstate.go:266] Scale-up timed out for node group eks-something-worker-80c733af6-1e9e-534e-a483-918f39b0bc05 after 15m2.467913257s

@kanupriyaraheja

We observed the same issue with cluster-autoscaler 1.28.0. It started with a timeout error: the instance got created in the ASG but was never registered as a node in Kubernetes. Recreating that exact sequence was difficult, so I used a different way of triggering the same error: manually deleting a node in Kubernetes.
For cluster-autoscaler 1.28.0:
I manually deleted one of the nodes that a worker pod was running on. The instance is still present in the ASG, but the corresponding node in Kubernetes is gone. The pod running on the deleted node crashes first and then goes into an indefinite Pending state. The autoscaler attempts to decrease the size of the node group but is unable to, which leads to the "Failed to fix node group sizes" error.

I tested the same scenario on autoscaler 1.27.3: I manually deleted one of the nodes that a worker pod was running on. The instance is still present in the ASG, but the corresponding node in Kubernetes is gone. This led to the pod crashing and then going to the Pending state. However, this time the cluster autoscaler recognized the zombie instance and printed "1 unregistered nodes present" in the logs. The autoscaler deleted the zombie instance, created a new one, and a new node appeared in Kubernetes. The autoscaler resolved the problem on its own.


In 1.27.3 the autoscaler detects unregistered nodes correctly, hence it is able to identify zombie instances:

func getNotRegisteredNodes(allNodes []*apiv1.Node, cloudProviderNodeInstances map[string][]cloudprovider.Instance, time time.Time) []UnregisteredNode {
	registered := sets.NewString()
	for _, node := range allNodes {
		registered.Insert(node.Spec.ProviderID)
	}
	notRegistered := make([]UnregisteredNode, 0)
	for _, instances := range cloudProviderNodeInstances {
		for _, instance := range instances {
			if !registered.Has(instance.Id) {
				notRegistered = append(notRegistered, UnregisteredNode{
					Node:              fakeNode(instance, cloudprovider.FakeNodeUnregistered),
					UnregisteredSince: time,
				})
			}
		}
	}
	return notRegistered
}

@mrocheleau

Just had this occur on one of our EKS 1.28 clusters, fixed it by:

  • Looked at the node group's node list to get the current instance names (and double-checked the instances in the node group within k8s directly; these should match 100%) - in our case this node group had 4 actual in-service nodes
  • Went into the ASG directly, to the Instance Management tab - in our case this showed 5 in-service instances
  • Looked at each instance's details in EC2 directly to check the names/IPs - in our case one of the instances in the ASG didn't have a corresponding node in EKS/K8s
  • Detached that instance from the ASG - there was a warning that the ASG may replace the instance, but it didn't
  • The instance counts now match between the ASG and EKS/K8s
  • Errors on the cluster-autoscaler pod stopped and normal operation resumed on its own once it had picked up the updated ASG status

Everything looks stable for this cluster now, and we can probably repeat this fix if the issue returns.

@kappa8219

kappa8219 commented Apr 23, 2024

Just had this occur on one of our EKS 1.28 clusters, fixed it by:

  • Look at the node group node list to get current instance names (double confirm instances in the node group within k8s directly, these should 100% match) - in our case this node group had 4 actual in-service nodes
  • Went into the ASG directly and to the Instance Management tab - in our case this showed 5 in service instances.
  • Look at each instance's details in EC2 directly to check the names/IP's, and in our case one of them in the ASG didn't have a corresponding node in EKS/K8s
  • Detach that instance from the ASG - warning about "ASG may replace this instance", but it didn't.
  • Instance expected counts now match between ASG and EKS/K8s
  • Errors on the cluster-autoscaler pod stopped and normal operation resumed on it's own, once it had picked up the ASG updated status

Everything looks stabilized for this cluster now and can probably repeat if it does return.

This method works fine for EKS 1.29 also, thanks.

I'd add some tips on how to track down the stray instance (the "cattle out of the herd").

Error is:
static_autoscaler.go:449] Failed to fix node group sizes: failed to decrease gtlb: attempt to delete existing nodes targetSize:2 delta:-1 existingNodes: 2

To see what CAS "thinks" about this ASG, look at the cluster-autoscaler-status ConfigMap. You'll see something like this:
ScaleUp: NoActivity (ready=1 cloudProviderTarget=2 ...

Then look at the instance list for this ASG. There should be one spare instance (in the ConfigMap above, ready=1 is lower than cloudProviderTarget=2). To work out which instance is the stray one, I used the node annotation csi.volume.kubernetes.io/nodeid={"ebs.csi.aws.com":"i-abcdabcddfv345"}, which contains the instance ID, and compared it against the ASG's instance list - any ASG instance with no matching node is the one to detach.
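If you'd rather script that comparison, here is a rough sketch (assuming aws-sdk-go-v2 and client-go; the ASG name is a placeholder) that prints every ASG instance whose instance ID doesn't appear in any node's .spec.providerID:

package main

import (
    "context"
    "fmt"
    "log"
    "os"
    "strings"

    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/autoscaling"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    ctx := context.Background()
    asgName := "your-asg-name" // placeholder: the ASG from the error message

    // Instance IDs Kubernetes knows about, taken from .spec.providerID, which
    // looks like aws:///us-east-1a/i-0123456789abcdef0.
    kubeCfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
    if err != nil {
        log.Fatal(err)
    }
    clientset, err := kubernetes.NewForConfig(kubeCfg)
    if err != nil {
        log.Fatal(err)
    }
    nodes, err := clientset.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
    if err != nil {
        log.Fatal(err)
    }
    known := map[string]bool{}
    for _, n := range nodes.Items {
        parts := strings.Split(n.Spec.ProviderID, "/")
        known[parts[len(parts)-1]] = true
    }

    // Instance IDs the ASG thinks it has.
    awsCfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
        log.Fatal(err)
    }
    asgClient := autoscaling.NewFromConfig(awsCfg)
    out, err := asgClient.DescribeAutoScalingGroups(ctx, &autoscaling.DescribeAutoScalingGroupsInput{
        AutoScalingGroupNames: []string{asgName},
    })
    if err != nil {
        log.Fatal(err)
    }

    // Anything in the ASG but not in Kubernetes is a candidate zombie.
    for _, g := range out.AutoScalingGroups {
        for _, inst := range g.Instances {
            if !known[*inst.InstanceId] {
                fmt.Println("instance with no matching node:", *inst.InstanceId)
            }
        }
    }
}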

The open question is what circumstances lead to a situation where this manual fix is needed. Will try 1.29.2.

@Kamalpreet-KK

Just had this occur on one of our EKS 1.28 clusters, fixed it by:

  • Look at the node group node list to get current instance names (double confirm instances in the node group within k8s directly, these should 100% match) - in our case this node group had 4 actual in-service nodes
  • Went into the ASG directly and to the Instance Management tab - in our case this showed 5 in service instances.
  • Look at each instance's details in EC2 directly to check the names/IP's, and in our case one of them in the ASG didn't have a corresponding node in EKS/K8s
  • Detach that instance from the ASG - warning about "ASG may replace this instance", but it didn't.
  • Instance expected counts now match between ASG and EKS/K8s
  • Errors on the cluster-autoscaler pod stopped and normal operation resumed on it's own, once it had picked up the ASG updated status

Everything looks stabilized for this cluster now and can probably repeat if it does return.

Thanks for the help, this solution works perfectly.

@ferwasy

ferwasy commented Apr 25, 2024

Just had this occur on one of our EKS 1.28 clusters, fixed it by:

* Look at the node group node list to get current instance names (double confirm instances in the node group within k8s directly, these should 100% match) - in our case this node group had 4 actual in-service nodes

* Went into the ASG directly and to the Instance Management tab - in our case this showed 5 in service instances.

* Look at each instance's details in EC2 directly to check the names/IP's, and in our case one of them in the ASG didn't have a corresponding node in EKS/K8s

* Detach that instance from the ASG - warning about "ASG may replace this instance", but it didn't.

* Instance expected counts now match between ASG and EKS/K8s

* Errors on the cluster-autoscaler pod stopped and normal operation resumed on it's own, once it had picked up the ASG updated status

Everything looks stabilized for this cluster now and can probably repeat if it does return.

Thank you! This works perfectly running EKS 1.28 and CA 1.28.2. Any plan to work on a fix? Thanks in advance.

@jschwartzy

After encountering this issue on EKS 1.28 and CA 1.28.2, we reproduced the issue as described above by manually removing one of the Kubernetes nodes.
Taking a look at the difference in behavior with 1.27.3, I think the issue is caused by this change:
e5bc070#diff-d404c1cb5f21589e42a6372990c673fcb42738f61d35bda2f5eeccdd0a7c3abeL1000-R1002

When the autoscaler is able to recover from this disconnect between ASG state and Kubernetes state, you will see this in the log:

I0504 14:22:56.465489       1 static_autoscaler.go:405] 1 unregistered nodes present

This message is never present in 1.28.x versions.

To test this, I modified the getNotRegisteredNodes method to expand the check:

if (!registered.Has(instance.Id) && expRegister) || (!registered.Has(instance.Id) && instance.Status == nil)

And sure enough, on the next run of the cluster-autoscaler:

I0504 14:22:56.465489       1 static_autoscaler.go:405] 1 unregistered nodes present
...
I0504 14:25:57.187609       1 clusterstate.go:633] Found longUnregistered Nodes [aws:///us-east-1a/i-1234ab5678901a12b]
I0504 14:25:57.187651       1 static_autoscaler.go:405] 1 unregistered nodes present
I0504 14:25:57.187662       1 static_autoscaler.go:746] Marking unregistered node aws:///us-east-1a/i-1234ab5678901a12b for removal
I0504 14:25:57.187678       1 static_autoscaler.go:755] Removing 1 unregistered nodes for node group general
I0504 14:25:57.407027       1 auto_scaling_groups.go:343] Terminating EC2 instance: i-1234ab5678901a12b
I0504 14:25:57.407055       1 aws_manager.go:162] DeleteInstances was called: scheduling an ASG list refresh for next main loop evaluation
I0504 14:25:57.407118       1 static_autoscaler.go:413] Some unregistered nodes were removed
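For anyone following along, here is a minimal standalone illustration of why the expanded check catches the zombie instance (simplified stand-in types, not the real clusterstate.go code):

package main

import "fmt"

// Simplified stand-ins for the real cloudprovider types.
type instanceStatus struct{ creating bool }

type instance struct {
    Id     string
    Status *instanceStatus
}

// Stand-in for the expectedToRegister helper added in 1.28; per the discussion
// above, it does not treat an instance with a nil Status as expected to register.
func expectedToRegister(inst instance) bool {
    return inst.Status != nil && inst.Status.creating
}

func main() {
    registered := map[string]bool{"aws:///us-east-1a/i-healthy": true}
    // A zombie: running in the ASG, no Node object, and Status == nil.
    zombie := instance{Id: "aws:///us-east-1a/i-zombie", Status: nil}

    // 1.28 condition: the zombie is never reported as unregistered.
    cond128 := !registered[zombie.Id] && expectedToRegister(zombie)
    // Expanded condition from the comment above: also accept Status == nil.
    condPatched := (!registered[zombie.Id] && expectedToRegister(zombie)) ||
        (!registered[zombie.Id] && zombie.Status == nil)

    fmt.Println("flagged as unregistered (1.28):   ", cond128)     // false
    fmt.Println("flagged as unregistered (patched):", condPatched) // true
}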

@daimaxiaxie
Contributor

daimaxiaxie commented May 6, 2024

We also encountered this problem, and I refactored part of the logic in #6729, which solves it well. It works well on our large cluster.

For AWS, an instance is only ever in one of two states (so expectedToRegister is incorrect):

  1. instance.Status = nil
  2. instance.State = InstanceCreating and instance.State.ErrorInfo = OutOfResourcesErrorClass

Therefore removeOldUnregisteredNodes becomes a no-op, and the code eventually enters fixNodeGroupSize and reports the error.

@wsaeed-tkxel

wsaeed-tkxel commented May 7, 2024

Just had this occur on one of our EKS 1.28 clusters, fixed it by:

* Look at the node group node list to get current instance names (double confirm instances in the node group within k8s directly, these should 100% match) - in our case this node group had 4 actual in-service nodes

* Went into the ASG directly and to the Instance Management tab - in our case this showed 5 in service instances.

* Look at each instance's details in EC2 directly to check the names/IP's, and in our case one of them in the ASG didn't have a corresponding node in EKS/K8s

* Detach that instance from the ASG - warning about "ASG may replace this instance", but it didn't.

* Instance expected counts now match between ASG and EKS/K8s

* Errors on the cluster-autoscaler pod stopped and normal operation resumed on it's own, once it had picked up the ASG updated status

Everything looks stabilized for this cluster now and can probably repeat if it does return.

I am using EKS 1.29 with CA 1.29.2 and tried this, but when I detached the instance the ASG automatically added new instances which were also not present in the cluster, so I ended up with even more zombie instances. Once I unchecked the instance-replacement option when detaching, the stray instance went away.

@tooptoop4

Any ETA on a fix?

@chuyee

chuyee commented Jun 3, 2024

For anyone who had the problem on 1.28/1.29: could you please verify whether #6528 (merged in 1.30) solves the problem for you? The solution itself is very similar to the one mentioned by @jschwartzy in #6128 (comment).

@markshawtoronto

Anyone who had the problem on 1.28/29 would you please verify if #6528 (merged in 1.30) solved the problem for you? The solution itself is very similar to the one mentioned by @jschwartzy in #6128 (comment)

@chuyee
Confirmed: we upgraded our autoscalers to v1.30.1, even while running Kubernetes 1.28 clusters on EKS. The problem was easily reproducible before and is now gone for us. 🥳

@ferwasy

ferwasy commented Jun 7, 2024

Hello everyone. We're running Kubernetes v1.28 on EKS and have been experiencing this issue with the cluster autoscaler. Is version v1.30.1 fully compatible with Kubernetes v1.28? Thanks in advance.

@wsaeed-tkxel

wsaeed-tkxel commented Jun 7, 2024

To be honest folks, I have tried almost every version of CA after 1.27, and it's been a pain. It was hurting us pretty badly, since our ability to scale up or down was severely affected. We have now moved to Karpenter and I can sleep like a baby.

@markshawtoronto

markshawtoronto commented Jun 7, 2024

Hello everyone. We're running Kubernetes v1.28 on EKS and have been experiencing this issue with the cluster autoscaler. Is version v1.30.1 fully compatible with Kubernetes v1.28? Thanks in advance.

@ferwasy

"We don't do cross version testing or compatibility testing in other environments. Some user reports indicate successful use of a newer version of Cluster Autoscaler with older clusters, however, there is always a chance that it won't work as expected."

So you can either upgrade to v1.30.1 (as we did) or wait for this backport PR to be released as 1.28.6 if you'd like to be more cautious and use the intended supported version.

@ferwasy

ferwasy commented Jun 10, 2024

Hello everyone. We're running Kubernetes v1.28 on EKS and have been experiencing this issue with the cluster autoscaler. Is version v1.30.1 fully compatible with Kubernetes v1.28? Thanks in advance.

@ferwasy

"We don't do cross version testing or compatibility testing in other environments. Some user reports indicate successful use of a newer version of Cluster Autoscaler with older clusters, however, there is always a chance that it won't work as expected."

So you can either upgrade to v1.30.1 (as we did) or wait for this backport PR to be released as 1.28.6 if you'd like to be more cautious and use the intended supported version.

@markshawtoronto thanks, will wait for 1.28.6 to be released.

@andresrsanchez

Hello everyone. We're running Kubernetes v1.28 on EKS and have been experiencing this issue with the cluster autoscaler. Is version v1.30.1 fully compatible with Kubernetes v1.28? Thanks in advance.

@ferwasy
"We don't do cross version testing or compatibility testing in other environments. Some user reports indicate successful use of a newer version of Cluster Autoscaler with older clusters, however, there is always a chance that it won't work as expected."
So you can either upgrade to v1.30.1 (as we did) or wait for this backport PR to be released as 1.28.6 if you'd like to be more cautious and use the intended supported version.

@markshawtoronto thanks, will wait 1.28.6 to be released.

We are in the same spot; does anyone know when it will be released?

Thanks

@pawelaugustyn

It'd also be great to see the 1.29 backport PR released as part of v1.29.4 🙏🏻

@cloudwitch

Any ETA on 1.28.6 and 1.29.4?

@maur1th

maur1th commented Jun 27, 2024

@cloudwitch Probably July 24th, based on https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler#schedule. This is really inconvenient though, so I'm pondering running CA 1.27 on newer clusters.

@samuel-esp

Was this issue fixed in 1.28.6? I don't see any mention of it in the release notes.

@semoog

semoog commented Jul 22, 2024

Was this issue fixed for 1.28.6? I don't see mentions about it on the release notes

Backport is included

@relaxdiego

relaxdiego commented Aug 15, 2024

Just had this occur on one of our EKS 1.28 clusters, fixed it by:

* Look at the node group node list to get current instance names (double confirm instances in the node group within k8s directly, these should 100% match) - in our case this node group had 4 actual in-service nodes

* Went into the ASG directly and to the Instance Management tab - in our case this showed 5 in service instances.

* Look at each instance's details in EC2 directly to check the names/IP's, and in our case one of them in the ASG didn't have a corresponding node in EKS/K8s

* Detach that instance from the ASG - warning about "ASG may replace this instance", but it didn't.

* Instance expected counts now match between ASG and EKS/K8s

* Errors on the cluster-autoscaler pod stopped and normal operation resumed on it's own, once it had picked up the ASG updated status

Everything looks stabilized for this cluster now and can probably repeat if it does return.

For those just coming across this bug who need to perform the above workaround as a quick fix, take note that detaching an instance from an ASG does not terminate it. In our case, I went to the instance's page and terminated it. The ASG then automatically created a new instance, which was able to join the cluster properly. We then let Cluster Autoscaler deal with any excess capacity. For completeness, this is how we performed the workaround (adapted from the original steps by @mrocheleau):

  1. Look at the node group's node list to get the current instance names (double-check the instances in the node group within k8s directly; these should match 100%).
  2. Go into the ASG directly, to the Instance Management tab.
  3. Look at each instance's details in EC2 directly to check the names/IPs. In our case we found one extra instance that did not show up as a node in k8s (found by using kubectl get nodes -o wide, for example).
  4. Terminate the instance. This will cause the ASG to create a new one. This is fine since cluster-autoscaler will just adjust the node group if there is any excess capacity.
  5. Errors on the cluster-autoscaler should stop and normal operation resume.

@agamez-harmonicinc

Is this fix already available in 1.29.3? It seems to have been backported to 1.29, but I faced this problem two days ago using cluster-autoscaler 1.29.3.

@samuel-esp

Is this fix already available in 1.29.3? I see it seems backported to 1.29, but I faced this problem 2 days ago using cluster autoscaler 1.29.3

Same with 1.29.4; can anyone confirm whether the bad behavior is still present?

@benmoss
Member

benmoss commented Oct 17, 2024

It was backported to 1.29.4 and 1.28.6 and released as part of 1.30.0
1.28.6: 0c97874
1.29.4: 90591b5
1.30.0 and beyond: 6ca8414

@hh-sushantkumar

Still facing this in the 1.30.2 image.
