
Cluster Autoscaler does not start new nodes when Taints and NodeSelector are used in EKS #3802

Closed
dschunack opened this issue Jan 11, 2021 · 37 comments
Labels
area/cluster-autoscaler kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@dschunack
Contributor

dschunack commented Jan 11, 2021

Hi,

We use EKS with Kubernetes 1.18 and the Cluster Autoscaler. Since Kubernetes 1.17 the "beta.kubernetes.io/instance-type" label is deprecated, so we use the new "node.kubernetes.io/instance-type" label as nodeSelector instead. This works for autoscaling groups without taints. For autoscaling groups with taints, the new "node.kubernetes.io/instance-type" selector does not work and the Cluster Autoscaler doesn't start new nodes. If we switch back to the old, deprecated "beta.kubernetes.io/instance-type" selector, the Cluster Autoscaler starts a new node. We see this behavior on all of our EKS clusters.

Events output for both test pods, one with beta.kubernetes.io and one with node.kubernetes.io as nodeSelector.
The pod with the node.kubernetes.io selector was started first.

% kubectl get pods
NAME                READY   STATUS    RESTARTS   AGE
test-4xlarge-beta   0/1     Pending   0          41s
test-4xlarge-node   0/1     Pending   0          72s

% kubectl describe pod test-4xlarge-node
Name:         test-4xlarge-node
Namespace:    default
Priority:     0
Node:         <none>
Labels:       <none>
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Pending
IP:           
IPs:          <none>
Containers:
  test-4xlarge-node:
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  default-token-lzknk:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-lzknk
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  node.kubernetes.io/instance-type=c5a.4xlarge
Tolerations:     disk=true:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age               From                Message
  ----     ------             ----              ----                -------
  Normal   NotTriggerScaleUp  88s               cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 12 node(s) didn't match node selector
  Warning  FailedScheduling   9s (x8 over 92s)  default-scheduler   0/35 nodes are available: 3 node(s) were unschedulable, 32 node(s) didn't match node selector.


% kubectl describe pod test-4xlarge-beta
Name:         test-4xlarge-beta
Namespace:    default
Priority:     0
Node:         <none>
Labels:       <none>
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Pending
Containers:
  test-4xlarge-beta:     
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  default-token-lzknk:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-lzknk
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  beta.kubernetes.io/instance-type=c5a.4xlarge
Tolerations:     disk=true:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age               From                Message
  ----     ------            ----              ----                -------
  Normal   TriggeredScaleUp  47s               cluster-autoscaler  pod triggered scale-up: [{eks-agileci-cattle-disk-asg20201117110440315400000002 0->1 (max: 100)}]
  Warning  FailedScheduling  7s (x5 over 51s)  default-scheduler   0/35 nodes are available: 3 node(s) were unschedulable, 32 node(s) didn't match node selector.

Which component are you using?: cluster-autoscaler
What version of the component are you using?: cluster-autoscaler release v1.18.3
What k8s version are you using (kubectl version)?: 1.18.9

kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.4", GitCommit:"d360454c9bcd1634cf4cc52d1867af5491dc9c5f", GitTreeState:"clean", BuildDate:"2020-11-12T01:09:16Z", GoVersion:"go1.15.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.9-eks-d1db3c", GitCommit:"d1db3c46e55f95d6a7d3e5578689371318f95ff9", GitTreeState:"clean", BuildDate:"2020-10-20T22:18:07Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

What did you expect to happen?: Cluster-Autoscaler starts new nodes.
What happened instead?: Cluster-Autoscaler doesn't start new nodes. See the following error.

Events:
  Type     Reason             Age               From                Message
  ----     ------             ----              ----                -------
  Normal   NotTriggerScaleUp  88s               cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 12 node(s) didn't match node selector
  Warning  FailedScheduling   9s (x8 over 92s)  default-scheduler   0/35 nodes are available: 3 node(s) were unschedulable, 32 node(s) didn't match node selector.

How to reproduce it (as minimally and precisely as possible):

We use the following pod manifests to test the cluster-autoscaler.

Is Working:

apiVersion: v1
kind: Pod
metadata:
  name: test-4xlarge-beta
spec:
  restartPolicy: OnFailure
  containers:
  - name: test-4xlarge-beta
    image: radial/busyboxplus
    args:
    - "sh"
  tolerations:
  - key: "disk"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  nodeSelector:
    beta.kubernetes.io/instance-type: c5a.4xlarge

Is not Working:

apiVersion: v1
kind: Pod
metadata:
  name: test-4xlarge-node
spec:
  restartPolicy: OnFailure
  containers:
  - name: test-4xlarge-node
    image: radial/busyboxplus
    args:
    - "sh"
  tolerations:
  - key: "disk"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  nodeSelector:
    node.kubernetes.io/instance-type: c5a.4xlarge

Taints and tags are configured on the ASG and also in the kubelet configuration.
See the screenshot (Xnip2021-01-11_16-50-51) showing the ASG taint and tag configuration.

@dschunack dschunack added the kind/bug Categorizes issue or PR as related to a bug. label Jan 11, 2021
@umialpha
Contributor

Hi, could you provide the labels on your nodes? I suspect the label on your nodes may only be "beta.kubernetes.io/instance-type".

@dschunack
Contributor Author

dschunack commented Jan 13, 2021

Since 1.17 both labels are present on the nodes (beta and node label).

kubectl get nodes --show-labels -l node.kubernetes.io/instance-type=c5a.4xlarge 
NAME                                             STATUS   ROLES    AGE    VERSION              LABELS
ip-10-194-24-148.eu-central-1.compute.internal   Ready    <none>   3h1m   v1.18.9-eks-d1db3c   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=c5a.4xlarge,beta.kubernetes.io/os=linux,cpu=true,failure-domain.beta.kubernetes.io/region=eu-central-1,failure-domain.beta.kubernetes.io/zone=eu-central-1a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-194-24-148.u0.ww.conti.de,kubernetes.io/os=linux,node.kubernetes.io/instance-type=c5a.4xlarge,topology.kubernetes.io/region=eu-central-1,topology.kubernetes.io/zone=eu-central-1a

@umialpha
Contributor

umialpha commented Jan 13, 2021

Thanks for the feedback. TBH I am not familiar with AWS. Could you please check whether the tag "node.kubernetes.io/instance-type" is set in your ASG tags? From my understanding, when you scale from 0, the node template used for prediction copies its labels from the ASG tags.
BTW, from the logs you provided, it seems that the nodegroup is scaled from 0, right? If you scale from 1 (or more), it may work correctly.
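
For reference, the AWS cloud provider can learn the labels of a scale-from-zero node template from ASG tags with the k8s.io/cluster-autoscaler/node-template/label/ prefix (documented in the cluster-autoscaler AWS README). A minimal illustration, with the value taken from this issue's instance type:

// Illustrative only: with a tag like this on the ASG, the autoscaler's
// simulated template node carries the stable instance-type label even
// while the group has 0 nodes, so a matching nodeSelector can trigger
// a scale-up.
var exampleASGTags = map[string]string{
    "k8s.io/cluster-autoscaler/node-template/label/node.kubernetes.io/instance-type": "c5a.4xlarge",
}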

@dschunack
Contributor Author

dschunack commented Jan 13, 2021

Scaling with node.kubernetes.io/instance-type works without taints, including a scale-up from 0.
But if you add a taint on the ASG, the cluster autoscaler doesn't scale up and reports an error.
If we switch back to the old beta.kubernetes.io/instance-type label it works, including a scale-up from 0.
I don't think it's a problem with the tags on the ASG. We don't set any tag like "beta.kubernetes.io" or "node.kubernetes.io"; this is not needed.

@dschunack
Contributor Author

Any news?

@pre

pre commented Jan 28, 2021

I'm having the same issue: cluster-autoscaler fails to start a new node when a pod requests an instance type which is not yet online.

For example, when the cluster does not have a large instance type such as c5.24xlarge, cluster-autoscaler fails to start a new node for a pod launched with the node selector node.kubernetes.io/instance-type: c5.24xlarge, even though we have this exact instance type defined in the managed node group's available instance types.

The cluster-autoscaler logs don't contain anything meaningful; the pod has:

pod didn't trigger scale-up: 2 node(s) didn't match Pod's node affinity

@dschunack
Contributor Author

dschunack commented Feb 12, 2021

Hi,

We tested it again today and it looks like the autoscaler is not working correctly with "node.kubernetes.io/instance-type".
Sometimes it works, sometimes it doesn't.

Today we started a pod with the nginx image to test the autoscaling. Only the nodeSelector is different:
"beta.kubernetes.io" works and "node.kubernetes.io" does not.
The pod with nodeSelector "node.kubernetes.io" was started first.

Autoscaler version: 1.18.4

Doesn't Work:

kubectl describe pod nginx-reg                                      
Name:         nginx-reg
Namespace:    default
Priority:     0
Node:         <none>
Labels:       env=test
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Pending
IP:           
IPs:          <none>
Containers:
  nginx:
    Image:        nginx
    Port:         <none>
    Host Port:    <none>
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-8bf7f (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  default-token-8bf7f:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-8bf7f
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  node.kubernetes.io/instance-type=m5a.xlarge
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age                  From                Message
  ----     ------             ----                 ----                -------
  Normal   NotTriggerScaleUp  4m39s                cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 12 node(s) didn't match node selector
  Warning  FailedScheduling   13s (x7 over 4m43s)  default-scheduler   0/42 nodes are available: 34 node(s) were unschedulable, 8 node(s) didn't match node selector.

Works:

kubectl describe pod nginx-reg2
Name:         nginx-reg2
Namespace:    default
Priority:     0
Node:         <none>
Labels:       env=test
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Pending
IP:           
IPs:          <none>
Containers:
  nginx:
    Image:        nginx
    Port:         <none>
    Host Port:    <none>
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-8bf7f (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  default-token-8bf7f:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-8bf7f
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  beta.kubernetes.io/instance-type=m5a.xlarge
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                From                Message
  ----     ------            ----               ----                -------
  Warning  FailedScheduling  23s (x2 over 23s)  default-scheduler   0/42 nodes are available: 34 node(s) were unschedulable, 8 node(s) didn't match node selector.
  Normal   TriggeredScaleUp  12s                cluster-autoscaler  pod triggered scale-up: [{agileci-prod-pet-system-asg20210212094053622400000002 0->1 (max: 50)}]

@umialpha
Contributor

umialpha commented Mar 17, 2021

Hi, I think I found the root cause. When scaling from 0, the AWS cloud provider generates the node info from a template (not a real node). When generating it, it forgets to add "node.kubernetes.io/instance-type" to the labels. Check the code here: aws_manager.go
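
To make that concrete, here is a simplified, self-contained Go sketch of the kind of label map built for such a template node. The names are illustrative, not the actual aws_manager.go functions; the bug described above is equivalent to the stable key never being set:

package main

import "fmt"

// Illustrative label keys; upstream uses the corresponding constants from
// k8s.io/api/core/v1. The stable key is the one reported missing here.
const (
    labelInstanceTypeBeta   = "beta.kubernetes.io/instance-type"
    labelInstanceTypeStable = "node.kubernetes.io/instance-type"
)

// buildTemplateLabels is a hypothetical stand-in for the label map built
// for a template node when an ASG is scaled from 0. If only the beta key
// is set, a pod selecting on the stable key can never match the simulated
// node, so no scale-up is triggered.
func buildTemplateLabels(instanceType, arch, os string) map[string]string {
    return map[string]string{
        "kubernetes.io/arch":    arch,
        "kubernetes.io/os":      os,
        labelInstanceTypeBeta:   instanceType,
        labelInstanceTypeStable: instanceType, // effectively the missing assignment
    }
}

func main() {
    labels := buildTemplateLabels("c5a.4xlarge", "amd64", "linux")
    // A nodeSelector is an exact-match lookup against these labels.
    fmt.Println(labels[labelInstanceTypeStable] == "c5a.4xlarge")
}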

@dschunack
Contributor Author

dschunack commented Mar 17, 2021

Hi, yes, we have the same feeling that the autoscaler forgets the "node.kubernetes.io" labels, but not immediately.
For some minutes after the shutdown of the last node in the ASG it also works with "node.kubernetes.io", but not after some hours. A fix would maybe also solve the following issue: Scale up windows

@dschunack
Contributor Author

dschunack commented Mar 17, 2021

Hi again,

found some new code that adds ARM64 support via the stable API (#3848), and similar changes for Azure.

I think some of the stable labels are missing in the AWS manager:

LabelArchStable
LabelOSStable

Hope it's possible to add all stable APIs/labels soon.

@dschunack
Contributor Author

Hi,

Some PRs were added to integrate the stable API, which is very nice, thanks.
Is it possible to also add LabelArchStable and LabelOSStable for the aws_manager?
These are currently missing in the aws_manager.

@alexmnyc

I had an issue with zero-instance ASGs and a nodeSelector not targeting the correct node labels (#4010), also on EKS.

@lsowen

lsowen commented Apr 29, 2021

Hi, yes we have the same feeling that the autoscaler forget the "node.kubernetes.io" labels, but not immediately.
Some minutes after the shutdown of the last node in the ASG it's also working with "node.kubernetes.io" but not after some hours. A fix will maybe solve also the following issue: Scale up windows

I'm seeing something similar, but I'm not using any node.kubernetes.io labels. When cluster-autoscaler (v1.20.0) is first launched, it successfully scales up from zero when needed by creating template-node-for-... template nodes. For a while it works without issue, scaling up and down (even to and from 0). However, within 24 hours it stops being able to find a match for any ASG which has been scaled down to zero. I no longer see log entries for template-node-for-..., so I suspect the "actual definitions" of the ASG expire from a cache and the logic for using the template node definition does not kick back in. After this occurs, I start to see log messages like:

Pod <POD_NAME> can't be scheduled on <ASG_NAME>, predicate checking error: node(s) didn't match Pod's node affinity; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity; debugInfo=

This is, however, the ASG which should scale up. Restarting the cluster-autoscaler "resolves" the issue (but is not a real solution, as it requires restarting the autoscaler every day at random times).

@lsowen

lsowen commented May 23, 2021

I have continued to experience this issue, and have tracked down part of the cause.

In the loop where it is checking the nodeGroups, it looks for a cached definition in the nodeInfoCache:

if nodeInfoCache != nil {
    if nodeInfo, found := nodeInfoCache[id]; found {
        if nodeInfoCopy, err := deepCopyNodeInfo(nodeInfo); err == nil {
            result[id] = nodeInfoCopy
            continue
        }
    }
}

For the groups which do have issues, the results are being returned from that cache, and nodeInfoCopy.node.ObjectMeta.Labels is missing the expected labels, so the node templates do not pass the required NodeAffinity.Filter() (https://github.com/kubernetes/kubernetes/blob/d8f9e4587ac1265efd723bce74ae6a39576f2d58/pkg/scheduler/framework/plugins/nodeaffinity/node_affinity.go#L115)

Labels from a "correct" group (which does autoscale up from 0):

                Labels: map[string]string [                                                                                                                    
                        "kubernetes.io/os": "linux",                                                                                                           
                        "kops.k8s.io/instancegroup": "workers-devstage-large-spot",                
                        "spotinstance": "yes",                                                                                                                 
                        "kubernetes.io/arch": "amd64",                         
                        "workergroup": "devstage", 
                        "topology.kubernetes.io/zone": "us-east-1a", 
                        "node-role.kubernetes.io/spot-worker": "true", 
                        "kubernetes.io/hostname": "template-node-for-workers-devstage-large-spot.cluster-01....+22 more", 
                        "node.kubernetes.io/instance-type": "r5.24xlarge", 
                        "beta.kubernetes.io/os": "linux", 
                        "beta.kubernetes.io/arch": "amd64", 
                        "nodetype": "worker", 
                        "failure-domain.beta.kubernetes.io/region": "us-east-1", 
                        "topology.kubernetes.io/region": "us-east-1", 
                        "beta.kubernetes.io/instance-type": "r5.24xlarge", 
                        "node-role.kubernetes.io/node": "", 
                        "failure-domain.beta.kubernetes.io/zone": "us-east-1a",  
                        "kubernetes.io/role": "node", 
                        "workersize": "large", 
                ],  

Labels from an "incorrect" group (which does not autoscale up from 0 since it is missing the workersize and workergroup labels we use in our pod nodeSelector):

                Labels: map[string]string [                                                                                                                                                                                                                                                                                   
                        "topology.kubernetes.io/region": "us-east-1",                                                                                                                                                                                                                                                         
                        "node.kubernetes.io/instance-type": "c5.12xlarge",                                                                                                                                                                                                                                                    
                        "topology.kubernetes.io/zone": "us-east-1a",                                                                                                                                                                                                                                                          
                        "beta.kubernetes.io/instance-type": "c5.12xlarge",                                                                                                                                                                                                                                                    
                        "kubernetes.io/os": "linux",                                                                                                                                                                                                                                                                          
                        "beta.kubernetes.io/arch": "amd64",                                                                                                                                                                                                                                                                   
                        "beta.kubernetes.io/os": "linux",                                                                                                                                                                                                                                                                     
                        "kubernetes.io/arch": "amd64",                                                                                                                                                                                                                                                                        
                        "failure-domain.beta.kubernetes.io/region": "us-east-1",                                                                                                                                                                                                                                              
                        "kubernetes.io/hostname": "template-node-for-workers-dev-normal-spot.cluster-01.-2...+18 more",                                                                                                                                                                                              
                        "failure-domain.beta.kubernetes.io/zone": "us-east-1a",                                                                                                                                                                                                                                               
                ],  

My guess is that the node is still "booting" when the info is cached, so not all labels have been added to the data that is permanently cached. Possibly IsNodeReadyAndSchedulable is triggering too early?

for _, node := range nodes {
    // Broken nodes might have some stuff missing. Skipping.
    if !kube_util.IsNodeReadyAndSchedulable(node) {
        continue
    }
    added, id, typedErr := processNode(node)
    if typedErr != nil {
        return map[string]*schedulerframework.NodeInfo{}, typedErr
    }
    if added && nodeInfoCache != nil {
        if nodeInfoCopy, err := deepCopyNodeInfo(result[id]); err == nil {
            nodeInfoCache[id] = nodeInfoCopy
        }
    }
}

Restarting the cluster-autoscaler pod allows it to refresh all data from AWS, at which point the correct node groups are scaled up for the existing pending pods. Then, at some point in the next 24 or so hours, one or more groups stop scaling properly (which of our 10 or so groups starts failing appears to be random).

@lsowen

lsowen commented Jun 2, 2021

I think I have confirmed that my hypothesis in #3802 (comment) is correct.

I've deployed a patched version with a workaround (not a fix), which has prevented the issue from re-occurring.

https://github.com/kubernetes/autoscaler/compare/cluster-autoscaler-1.21.0...lsowen:autoscaler-failure-workaround?expand=1

Basically, wait 5 minutes after the node is "ready" before caching the info about the node, which includes the labels. This prevents instance groups from being cached with missing labels.
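
To illustrate the idea behind the workaround, here is a rough Go sketch (not the actual patch): a node is only considered cacheable once its Ready condition has been true for a while. readyGracePeriod is an assumed parameter, not an existing cluster-autoscaler flag.

import (
    "time"

    apiv1 "k8s.io/api/core/v1"
)

// readyLongEnough reports whether the node's Ready condition has been true
// for at least readyGracePeriod, so that labels applied shortly after
// registration are present before the node's info is cached.
// readyGracePeriod is an assumed knob; no such flag exists upstream.
func readyLongEnough(node *apiv1.Node, readyGracePeriod time.Duration, now time.Time) bool {
    for _, cond := range node.Status.Conditions {
        if cond.Type == apiv1.NodeReady && cond.Status == apiv1.ConditionTrue {
            return now.Sub(cond.LastTransitionTime.Time) >= readyGracePeriod
        }
    }
    return false
}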

As for a fix, I'm not sure the best way. A few options:

  1. A configurable "timeout" similar to my workaround, to delay caching
  2. A different way of determining node readiness than the current IsNodeReadyAndSchedulable(node *apiv1.Node) bool implementation (though I'm not sure what that would be)
  3. A change to kubelet to not set the NodeReady condition until after all node labels are registered.

Option 3 seems the most robust, but is definitely the most complicated. I don't even know where to begin. It might also be the root of my issue, because older versions of kubernetes (and thus older kubelet) didn't seem to trigger this issue.

@dany74q

dany74q commented Aug 11, 2021

I've been experiencing similar symptoms to what's described here.

@lsowen - I think the race is a tad more specific. From what I see, at least, it seems like
the cache is populated (and existing entries overridden) from the k8s API server on every autoscaling attempt;

I believe the flow is the following:

  • We take all relevant nodes from the k8s api server - this uses a ListWatcher behind the scenes, which watches for Node changes from the k8s api server, and also resyncs the entire node list every hour;
    as long as the watch operation does not consistently fail, I believe one gets a relatively up-to-date view of the nodes in the cluster on each invocation.

  • With the k8s-supplied nodes at hand, we cache the node info of the first seen node for each cloud-provider node group on each iteration; different invocations might cache info from different nodes within the group, depending on the lister result;
    this means that if you have several nodes within your group, but one of them is off-sync with its labels - it might corrupt the autoscaler view of the entire group.

  • After caching all node infos, we iterate on all node groups from the cloud provider - and then we use the previously populated cached view if such exists; I'd guess this is due to the autoscaler preferring the use of the real-world view of your nodes vs the template generated from the cloud provider, as they may be off-sync.

If the above is correct, then what I believe needs to happen to trigger such a race condition is that the labels are off-sync at the very last time the autoscaler sees a node from the k8s API server and caches its info - only then is the state corrupted for all of the following runs.

If we're operating under the premise that all of your node group's nodes eventually carry all required labels, which are added at runtime - then, as long as there are live nodes, the autoscaler state should be eventually consistent and it should work in one of the next cycles (b/c it does override the cache entries on each cycle);

When it could indeed break, I believe, is when the group scales from 1->0 and the soon-to-be-terminated node has a partial label list - potentially because it's removing labels before termination, or because it's terminating before it's fully provisioned;
In that case, we would cache this partial view one last time before we no longer have any nodes for that node group in the cluster - and we then continue to use this corrupted view endlessly, as the cache entries aren't expiring.
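
To make the "cache entries aren't expiring" point concrete, here is a minimal sketch of what a time-bounded entry could look like (purely illustrative; the actual nodeInfoCache has no such expiry, and entries are only ever overwritten while a live node for the group is still visible):

import (
    "time"

    schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
)

// cachedNodeInfo pairs a cached template with the time it was stored.
// This type does not exist upstream; it only illustrates how a view
// cached from a half-provisioned node could age out instead of being
// reused forever once the group has scaled to 0.
type cachedNodeInfo struct {
    info     *schedulerframework.NodeInfo
    cachedAt time.Time
}

func (c cachedNodeInfo) expired(ttl time.Duration, now time.Time) bool {
    return now.Sub(c.cachedAt) > ttl
}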

Would you agree @lsowen ?

@lsowen

lsowen commented Aug 11, 2021

@dany74q I agree that the issue arises when a node group is scaled down to 0 and cannot scale back up, caused by a corruption in the cache of labels that autoscaler is using.

However, at least in my case, the cache that autoscaler holds is populated by the first node in the group as it boots up, not as it is terminating. The issue is that not all labels are applied on the node before it is marked as "ready". If I apply a delay so that autoscaler doesn't see the newly booted node for a bit (in my case I arbitrarily used a 5 minute delay), then the issue goes away. I was having the issue multiple times a day, but with my (badly) patched version, I have not seen the issue once in over 2 months.

patched version: https://github.com/kubernetes/autoscaler/compare/cluster-autoscaler-1.21.0...lsowen:autoscaler-failure-workaround?expand=1

@dany74q

dany74q commented Aug 11, 2021

@lsowen - Thanks! I've seen the patch - the thing I don't fully understand about it, though, is why the continuous overriding of the cache entries does not resolve this on its own after a period of time, if the problematic cache entry is indeed that initial one?

GetNodeInfosForGroups is called on every scale attempt, and from the code it looks like the cache is always overridden with the latest k8s-supplied node object:
https://github.com/lsowen/autoscaler/blob/5f5e0be76c99504cd20b7019c7e3694cfc5ec79d/cluster-autoscaler/core/utils/utils.go#L96-L100

What I would have expected in your case, then, is that once the node had stabilized with all the correct labels,
the cache entry would eventually have been overridden -
and as long as it wasn't cached on its way down with a partial label list, a well-formed node should have been returned from the cache, and not the first partial view.

Do you see a flow in the code in which that first invalid entry would be cached and newer entries would never override it (assuming the node is still up in the next autoscale cycle)?

@lsowen

lsowen commented Aug 11, 2021

@dany74q I believe it is because added is only true once, when the node is initially not found in the cache: https://github.com/lsowen/autoscaler/blob/5f5e0be76c99504cd20b7019c7e3694cfc5ec79d/cluster-autoscaler/core/utils/utils.go#L64-L75

@dany74q

dany74q commented Aug 11, 2021

@lsowen - I thought that might be the case, but the cache is not probed at that point at all; the result there is purely local to the function and is recalculated on every call. Correct me if I'm wrong, of course.

Thanks !

@thpang

thpang commented Oct 27, 2021

Any activity on this one? It's been open for a while and is an issue for folks who apply labels/taints to their node pools. Hoping to see some movement soon.

@draeath

draeath commented Nov 4, 2021

I don't apply taints or labels to my nodegroups and have run into this behavior with kubernetes 1.21 (via AWS EKS) and autoscaler 9.9.2 (which I believe is the right version for 1.21? this is itself still screwy, see #4054). I had to switch from kubernetes.io/arch to beta.kubernetes.io/arch (and likewise for /os) for it to scale up from 0 nodes.

I'm not sure if that's a separate issue given I am not applying any taints or labels. If it isn't a separate problem, it suggests this still is broken.

@thpang

thpang commented Nov 15, 2021

Has anyone been able to determine the root cause or a fix for this issue? We are currently having an issue where a customer using EKS does not see their nodes register correctly once they are scaled up from 0 (zero). Again, taints and labels are used.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 13, 2022
@draeath

draeath commented Feb 13, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 13, 2022
@smrutiranjantripathy

Has anyone been able to determine the root cause or a fix for this issue? We are currently having an issue where a customer using EKS does not see their nodes register correctly once they are scaled up from 0 (zero). Again, taints and labels are used.

We were able to get around this issue by using ASG tags for the labels, as described here.

@olahouze

Has anyone been able to determine the root cause or a fix for this issue? We are currently having an issue where a customer using EKS does not see their nodes register correctly once they are scaled up from 0 (zero). Again, taints and labels are used.

We were able to get around this issue by using ASG tags for the labels, as described here.

I'm not sure I understand all of the workarounds.

I have a pod with a node selector like this:

nodeSelectorTerms:
  - matchExpressions:
      - key: eks.amazonaws.com/nodegroup
        operator: In
        values:
          - nodegroup-name

So do I need to add one of the following tags to my AWS Auto Scaling group:

"k8s.io/cluster-autoscaler/node-template/label/eks.amazonaws.com/nodegroup" = nodegroup-name
or
"k8s.io/cluster-autoscaler/node-template/label/nodegroup" = nodegroup-name

Best regards

@dev-rowbot

@olahouze - to get this working I needed to add this tag to my AWS Auto Scaling Group:

k8s.io/cluster-autoscaler/node-template/label/nodegroup-type: stateless

Make sure that Tag new instances is ticked as well.

I then set the pod affinity to

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
            - key: eks.amazonaws.com/capacityType
              operator: In
              values:
                - SPOT
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nodegroup-type
              operator: In
              values:
                - stateless

The autoscaler picked up the change on the next cycle and scaled up the ASG from 0.

Hope this helps

@olahouze

Hello

Thank you for the answer

With this information, it forces me:

  • To modify all my Helm charts / pod definitions to use a nodeAffinity on the label nodegroup=nodegroup-name (and not eks.amazonaws.com/nodegroup=nodegroup-name)
  • To manually add the label nodegroup=nodegroup-name to my Auto Scaling groups and my instances

The advantage of using eks.amazonaws.com/nodegroup in the nodeAffinity is that AWS adds this label automatically...

Has anyone else already successfully tested using "k8s.io/cluster-autoscaler/node-template/label/eks.amazonaws.com/nodegroup" = nodegroup-name on the Auto Scaling group?

Sincerely

@dev-rowbot

@olahouze - I agree with your thinking; I was also going to update all my Helm charts. One point that I missed is that I also have a label in my eksctl nodegroup that matches the tag I just added. I suspect that the cluster autoscaler needs the tag and the scheduler needs the label:

  - name: ng-2-stateless-spot-1a
    spot: true
    tags:
      k8s.io/cluster-autoscaler/node-template/label/nodegroup-type: stateless
    labels:
      nodegroup-type: stateless
      instance-type: spot

There is also an advanced eksctl cluster example here which uses the cluster-autoscaler tags on nodegroups.

@pkit

pkit commented Jul 14, 2022

I'm not sure why it's never mentioned here but the whole thing seems to be fixed by #5002
And yes, it was not fixed before that.

@scravy

scravy commented Aug 21, 2022

@pkit How does #5002 fix anything? I am running a cluster autoscaler with auto config enabled and I am experiencing exactly the issues described here in this ticket. #5002 does not give a workaround which fixes this, does not patch this, does not reference a pull request... ?

@dschunack
Contributor Author

Hi,

we fixed our issue as follows, and the cluster autoscaler is now able to start new instances based on node selectors.
In our use case we use self-managed ASGs instead of managed node groups; that gives us more flexibility to manage our nodes.

We set the following tags on the ASGs, in this case including a taint.

Tag | Value | Tag new instances
k8s.io/cluster-autoscaler/enabled | true | Yes
k8s.io/cluster-autoscaler/node-template/label/kubernetes.io/arch | amd64 | Yes
k8s.io/cluster-autoscaler/node-template/label/kubernetes.io/os | linux | Yes
k8s.io/cluster-autoscaler/node-template/label/node.kubernetes.io/instance-type | g4dn.2xlarge | Yes
k8s.io/cluster-autoscaler/node-template/label/node.kubernetes.io/lifecycle | on-demand | Yes
k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone | eu-central-1b | Yes
k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone | eu-central-1b | Yes
k8s.io/cluster-autoscaler/node-template/taint/gpu | true:NoSchedule | Yes
kubernetes.io/cluster/eks-XXXXXXXXXXXXXX | owned | Yes

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 24, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 19, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned (Won't fix, can't repro, duplicate, stale) on Feb 18, 2023
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
