Cluster Autoscaler does not interpret labels specified with k8s.io/cluster-autoscaler/node-template/label/* tags on an AWS ASG unless those tags are set to propagate to the instances #4490

Closed
adamnovak opened this issue Nov 30, 2021 · 9 comments
Labels
area/cluster-autoscaler kind/bug Categorizes issue or PR as related to a bug.

Comments

@adamnovak

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

Component version: v1.17.3

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
...
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.0", GitCommit:"e19964183377d0ec2052d1f1fa930c4d7575bd50", GitTreeState:"clean", BuildDate:"2020-08-26T14:23:04Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:

We're deploying nodes on Amazon AWS with Auto Scaling groups (ASGs), using the cluster autoscaler's ability to automatically pick up ASGs tagged with certain tags.

What did you expect to happen?:

I expected the cluster autoscaler to read the tags on the ASG to determine which labels the nodes produced by that ASG will have when scaling from 0. I don't expect the value of the "Tag new instances" toggle on the tag to matter here.

In particular, I expect that if I tag an ASG with k8s.io/cluster-autoscaler/node-template/label/eks.amazonaws.com/capacityType with value SPOT, and don't set the tag to propagate to instances, then the cluster autoscaler will make a hypothetical node that will match a nodeSelector of eks.amazonaws.com/capacityType: SPOT.

(Note that I'm not using EKS here, just the label values they define, since Kubernetes itself has no standard for labeling or tainting preemptible nodes.)
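To make the expectation concrete, here is a minimal Python sketch of the mapping I have in mind. It is not the autoscaler's actual code: the function names are made up for illustration, and the tag dictionaries only mimic the Key/Value/PropagateAtLaunch shape that AWS reports for ASG tags. The point is that PropagateAtLaunch is never consulted.

```python
# Illustrative sketch only: models the behavior expected in this report,
# not the cluster autoscaler's real implementation.
LABEL_PREFIX = "k8s.io/cluster-autoscaler/node-template/label/"


def labels_from_asg_tags(tags):
    """Return the node labels implied by node-template label tags.

    PropagateAtLaunch is deliberately ignored here.
    """
    labels = {}
    for tag in tags:
        key = tag["Key"]
        if key.startswith(LABEL_PREFIX) and len(key) > len(LABEL_PREFIX):
            labels[key[len(LABEL_PREFIX):]] = tag["Value"]
    return labels


def matches_node_selector(labels, node_selector):
    """True if every nodeSelector key/value pair is present in labels."""
    return all(labels.get(k) == v for k, v in node_selector.items())


asg_tags = [
    {
        "Key": LABEL_PREFIX + "eks.amazonaws.com/capacityType",
        "Value": "SPOT",
        "PropagateAtLaunch": False,  # not propagated to instances
    },
]

template_labels = labels_from_asg_tags(asg_tags)
print(template_labels)
print(matches_node_selector(template_labels, {"eks.amazonaws.com/capacityType": "SPOT"}))
```

Under this model the hypothetical node carries eks.amazonaws.com/capacityType: SPOT and satisfies the nodeSelector even though the tag is not propagated.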

What happened instead?:

When the labeling tag was set to not propagate to nodes, I got log messages like:

I1130 18:29:45.793842       1 pod_schedulable.go:165] Pod adamnovak-spot-pi-qsd9c can't be scheduled on cg-kubernetes-r5ad.8xlarge-spot, predicate failed: GeneralPredicates predicate mismatch, reason: node(s) didn't match node selector,

When I changed the tag to propagate to new instances, then I got a different error (because I'd misspelled my ephemeral storage limit tag):

When I fixed that tag, then the autoscaler started provisioning my node.

How to reproduce it (as minimally and precisely as possible):

  1. Set up a working autoscaling group with the cluster autoscaler.
  2. Scale it down to 0.
  3. Give it a tag that starts with k8s.io/cluster-autoscaler/node-template/label/, and specifies a unique label, but set it not to propagate to nodes. (Also, optionally configure the node to really have that label when it comes up.)
  4. Launch a pod that has a nodeSelector to match that label.
  5. Note that the autoscaler won't try to scale up the ASG, because it thinks the node won't have the label that the tag says it will have.
  6. Check the "Tag new instances" checkbox on the tag on the ASG.
  7. Wait for the autoscaler to reread the ASGs.
  8. Observe the autoscaler trying to scale up the ASG to run the pod.
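The steps above can be summarized with a toy Python model. This is only a model of the symptoms, under the assumption that the autoscaler behaves as if non-propagating tags were invisible; I have not confirmed that this is what the code does. The label example.com/pool and all function names are hypothetical.

```python
# Toy model of the reproduction steps; NOT the autoscaler's real logic.
LABEL_PREFIX = "k8s.io/cluster-autoscaler/node-template/label/"


def template_labels(tags, only_propagated):
    """Build the hypothetical node's labels from ASG tags.

    only_propagated=True models the observed behavior (an assumption),
    only_propagated=False models the behavior the report expected.
    """
    labels = {}
    for tag in tags:
        if only_propagated and not tag.get("PropagateAtLaunch", False):
            continue
        if tag["Key"].startswith(LABEL_PREFIX):
            labels[tag["Key"][len(LABEL_PREFIX):]] = tag["Value"]
    return labels


def would_consider_scale_up(tags, node_selector, only_propagated):
    """True if the hypothetical node would match the pod's nodeSelector."""
    labels = template_labels(tags, only_propagated)
    return all(labels.get(k) == v for k, v in node_selector.items())


def make_tags(propagate):
    """One label tag on the ASG, with the 'Tag new instances' box on or off."""
    return [{
        "Key": LABEL_PREFIX + "example.com/pool",  # hypothetical unique label
        "Value": "batch",
        "PropagateAtLaunch": propagate,
    }]


selector = {"example.com/pool": "batch"}

# Step 5: tag not propagated, observed behavior -> no scale-up.
print(would_consider_scale_up(make_tags(False), selector, only_propagated=True))
# Step 8: tag propagated, observed behavior -> scale-up is attempted.
print(would_consider_scale_up(make_tags(True), selector, only_propagated=True))
# Expected behavior: the propagate flag should not matter.
print(would_consider_scale_up(make_tags(False), selector, only_propagated=False))
```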

Anything else we need to know?:

I suspect taints and other node properties inferred from tags also work this way.

This might be a cause of the problems people report in #4010 and #3802, although the screenshots I've seen there indicate that the main reporters do have the "Tag new instances" flag set.

@adamnovak adamnovak added the kind/bug Categorizes issue or PR as related to a bug. label Nov 30, 2021
@acalm

acalm commented Feb 18, 2022

I think this was "kind of" documented in an example before; the example was later replaced with some recommendations and moved to the FAQ. I'm not sure whether that means the behavior was fixed between those commits, or whether it's just unfortunate that the example got replaced and moved.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 22, 2022
@k8s-triage-robot

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 21, 2022
@k8s-triage-robot

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@EHJ-52n

EHJ-52n commented Aug 23, 2022

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 23, 2022
@k8s-ci-robot
Contributor

@EHJ-52n: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen


@EHJ-52n

EHJ-52n commented Aug 23, 2022

@adamnovak Is this issue solved for you?

@adamnovak
Author

I've been working around this by always setting the tags to propagate, and I'm not likely to find time to try to reproduce it again on our live system any time soon.

As for documenting that setting the tags to propagate is necessary, it looks like @acalm found where in the docs that would belong, so you could look there in the current mainline to see whether it has been documented yet.
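For anyone applying the same workaround, a small check like the following Python sketch could flag node-template tags that are not set to propagate. It is an assumption-laden illustration: the function name is made up, and the input is a plain list of dictionaries in the Key/Value/PropagateAtLaunch shape that AWS reports for ASG tags, which you would need to fetch yourself.

```python
# Illustrative audit helper for the "always propagate" workaround.
# The function name and sample data are hypothetical.
TEMPLATE_PREFIX = "k8s.io/cluster-autoscaler/node-template/"


def non_propagating_template_tags(tags):
    """Return sorted keys of node-template tags that do not propagate."""
    return sorted(
        tag["Key"]
        for tag in tags
        if tag["Key"].startswith(TEMPLATE_PREFIX)
        and not tag.get("PropagateAtLaunch", False)
    )


sample_tags = [
    {
        "Key": TEMPLATE_PREFIX + "label/eks.amazonaws.com/capacityType",
        "Value": "SPOT",
        "PropagateAtLaunch": False,  # would be flagged
    },
    {
        "Key": TEMPLATE_PREFIX + "label/example.com/pool",  # hypothetical label
        "Value": "batch",
        "PropagateAtLaunch": True,  # fine under the workaround
    },
    {
        "Key": "Name",  # unrelated tag, ignored
        "Value": "worker",
        "PropagateAtLaunch": False,
    },
]

flagged = non_propagating_template_tags(sample_tags)
for key in flagged:
    print("not propagating:", key)
```

Any key it prints is a candidate for turning on "Tag new instances" on the ASG.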

6 participants