
CA fails to schedule nodes onto spot ASG's with zero instances producing "node(s) didn't match node selector" on EKS #4010

Closed
alexmnyc opened this issue Apr 12, 2021 · 14 comments
Labels
area/cluster-autoscaler kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@alexmnyc

alexmnyc commented Apr 12, 2021

My nodes:

--kubelet-extra-args "--node-labels=node.kubernetes.io/lifecycle=spot,group-name=spot,instance=m5.xlarge" 'pie-gp-8pN1ZRZI'

My ASG of instance=m5.xlarge

desired=0 min=0 max=20

Scheduling this deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-to-scaleout
spec:
  replicas: 20
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        service: nginx
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx-to-scaleout
        resources:
          limits:
            cpu: 500m
            memory: 512Mi
          requests:
            cpu: 500m
            memory: 512Mi
      nodeSelector:
        instance: m5a.large

CA logs:

node(s) didn't match node selector

Related to #4002 ?

@alexmnyc alexmnyc added the kind/bug Categorizes issue or PR as related to a bug. label Apr 12, 2021
@alexmnyc alexmnyc changed the title CA fails to schedules nodes onto spot ASG's with zero instances producing "node(s) didn't match node selector" on EKS CA fails to schedule nodes onto spot ASG's with zero instances producing "node(s) didn't match node selector" on EKS Apr 12, 2021
@bpineau
Contributor

bpineau commented Apr 19, 2021

Hi,
Is cluster-autoscaler launched with --scale-up-from-zero?
Does the ASG have a matching label (i.e. k8s.io/cluster-autoscaler/node-template/label/instance: m5a.large, per the pod's nodeSelector, or m5.xlarge per the kubelet args above)?
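For scale-from-zero, those template tags live on the ASG itself, not on the nodes. A minimal CloudFormation sketch of what that could look like (resource name is hypothetical; sizes and values taken from this thread):

```yaml
# Hypothetical ASG definition; only the Tags section matters here.
SpotASG:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MinSize: "0"
    MaxSize: "20"
    Tags:
      # Lets cluster-autoscaler auto-discover this ASG
      # (plus the per-cluster tag configured in --node-group-auto-discovery)
      - Key: k8s.io/cluster-autoscaler/enabled
        Value: "true"
        PropagateAtLaunch: false
      # Tells the autoscaler which label nodes from this ASG will carry,
      # so it can match a pod's nodeSelector even while the ASG is at zero
      - Key: k8s.io/cluster-autoscaler/node-template/label/instance
        Value: m5a.large
        PropagateAtLaunch: false
```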

@alexmnyc
Author

alexmnyc commented Apr 20, 2021

Cluster autoscaler is deployed with Helm using the following values.yaml. I don't see that option in https://github.com/kubernetes/autoscaler/blob/master/charts/cluster-autoscaler/values.yaml

cloudProvider: aws

awsRegion: ${region}

autoDiscovery:
  clusterName: ${cluster_name}

image:
  tag: v1.19.1

nodeSelector:
  group-name: autoscaler

extraArgs:
  balance-similar-node-groups: true
  skip-nodes-with-local-storage: false
  expander: least-waste
  stderrthreshold: info

extraVolumes:
  - name: ssl-certs
    hostPath:
      path: /etc/ssl/certs/ca-bundle.crt

extraVolumeMounts:
  - name: ssl-certs
    mountPath: /etc/ssl/certs/ca-certificates.crt #/etc/ssl/certs/ca-bundle.crt for Amazon Linux Worker Nodes
    readOnly: true

resources:
  limits:
   cpu: 100m
   memory: 300Mi
  requests:
    cpu: 100m
    memory: 300Mi

rbac:
  create: true
  serviceAccount:
    name: "${service_account}"
    annotations:
      eks.amazonaws.com/role-arn: ${iam_role_arn}
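(For reference: the chart's values.yaml does not list every CA flag, but the chart renders anything under extraArgs as a command-line flag, so the option could be passed there; a sketch, noting that --scale-up-from-zero already defaults to true:)

```yaml
extraArgs:
  scale-up-from-zero: true
```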

Node labels were provided as "--node-labels=node.kubernetes.io/lifecycle=spot,group-name=spot,instance=m5.xlarge" in the spot launch template. I didn't know that the node labels had to follow any specific nomenclature to be targeted.
Are you saying that this

"--node-labels=node.kubernetes.io/lifecycle=spot,group-name=spot,instance=m5.xlarge"

must be replaced with

"--node-labels=node.kubernetes.io/lifecycle=spot,k8s.io/cluster-autoscaler/node-template/label/group-name=spot,k8s.io/cluster-autoscaler/node-template/label/instance=m5.xlarge"

and that the deployment.yaml should target the nodeSelector using a fully qualified label name prefixed with k8s...., i.e.?

kind: Deployment
metadata:
  name: nginx-to-scaleout
spec:
  replicas: 20
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        service: nginx
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx-to-scaleout
        resources:
          limits:
            cpu: 500m
            memory: 512Mi
          requests:
            cpu: 500m
            memory: 512Mi
      nodeSelector:
        k8s.io/cluster-autoscaler/node-template/label/instance: m5a.large

I see that when the nodes do come up labels are present

ip-10-247-21-5.ec2.internal     Ready      <none>   91s   v1.19.6-eks-49a6c0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1b,group-name=spot,instance=m5.xlarge,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-247-21-5.my.net,kubernetes.io/os=linux,node.kubernetes.io/instance-type=m5.xlarge,node.kubernetes.io/lifecycle=spot,topology.kubernetes.io/region=us-east-1,topology.kubernetes.io/zone=us-east-1b

@alexmnyc
Author

@bpineau I tried adding these labels to the ASG yesterday as you mentioned, and to the kubelet extra args --node-labels i.e. "--node-labels=node.kubernetes.io/lifecycle=spot,group-name=spot,instance=m5a.large"

(screenshot of the ASG tags)

I did a deployment with both selectors and still got the same results:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-to-scaleout
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        service: nginx
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx-to-scaleout
        resources:
          limits:
            cpu: 500m
            memory: 512Mi
          requests:
            cpu: 500m
            memory: 512Mi
      nodeSelector:
        instance: m5a.large
        k8s.io/cluster-autoscaler/node-template/label/instance: m5a.large

@alexmnyc
Author

--scale-up-from-zero is set to true (full flag dump below):

I0421 15:32:06.329470 1 flags.go:52] FLAG: --add-dir-header="false"
I0421 15:32:06.329537 1 flags.go:52] FLAG: --address=":8085"
I0421 15:32:06.329545 1 flags.go:52] FLAG: --alsologtostderr="false"
I0421 15:32:06.329550 1 flags.go:52] FLAG: --aws-use-static-instance-list="false"
I0421 15:32:06.329561 1 flags.go:52] FLAG: --balance-similar-node-groups="true"
I0421 15:32:06.329568 1 flags.go:52] FLAG: --balancing-ignore-label="[]"
I0421 15:32:06.329574 1 flags.go:52] FLAG: --cloud-config=""
I0421 15:32:06.329579 1 flags.go:52] FLAG: --cloud-provider="aws"
I0421 15:32:06.329585 1 flags.go:52] FLAG: --cloud-provider-gce-l7lb-src-cidrs="130.211.0.0/22,35.191.0.0/16"
I0421 15:32:06.329593 1 flags.go:52] FLAG: --cloud-provider-gce-lb-src-cidrs="130.211.0.0/22,209.85.152.0/22,209.85.204.0/22,35.191.0.0/16"
I0421 15:32:06.329601 1 flags.go:52] FLAG: --cluster-name=""
I0421 15:32:06.329606 1 flags.go:52] FLAG: --clusterapi-cloud-config-authoritative="false"
I0421 15:32:06.329612 1 flags.go:52] FLAG: --cores-total="0:320000"
I0421 15:32:06.329617 1 flags.go:52] FLAG: --estimator="binpacking"
I0421 15:32:06.329623 1 flags.go:52] FLAG: --expander="least-waste"
I0421 15:32:06.329628 1 flags.go:52] FLAG: --expendable-pods-priority-cutoff="-10"
I0421 15:32:06.329634 1 flags.go:52] FLAG: --gpu-total="[]"
I0421 15:32:06.329640 1 flags.go:52] FLAG: --ignore-daemonsets-utilization="false"
I0421 15:32:06.329645 1 flags.go:52] FLAG: --ignore-mirror-pods-utilization="false"
I0421 15:32:06.329650 1 flags.go:52] FLAG: --ignore-taint="[]"
I0421 15:32:06.329657 1 flags.go:52] FLAG: --kubeconfig=""
I0421 15:32:06.329662 1 flags.go:52] FLAG: --kubernetes=""
I0421 15:32:06.329667 1 flags.go:52] FLAG: --leader-elect="true"
I0421 15:32:06.329674 1 flags.go:52] FLAG: --leader-elect-lease-duration="15s"
I0421 15:32:06.329686 1 flags.go:52] FLAG: --leader-elect-renew-deadline="10s"
I0421 15:32:06.329692 1 flags.go:52] FLAG: --leader-elect-resource-lock="leases"
I0421 15:32:06.329699 1 flags.go:52] FLAG: --leader-elect-resource-name=""
I0421 15:32:06.329705 1 flags.go:52] FLAG: --leader-elect-resource-namespace=""
I0421 15:32:06.329743 1 flags.go:52] FLAG: --leader-elect-retry-period="2s"
I0421 15:32:06.329753 1 flags.go:52] FLAG: --log-backtrace-at=":0"
I0421 15:32:06.329762 1 flags.go:52] FLAG: --log-dir=""
I0421 15:32:06.329768 1 flags.go:52] FLAG: --log-file=""
I0421 15:32:06.329773 1 flags.go:52] FLAG: --log-file-max-size="1800"
I0421 15:32:06.329779 1 flags.go:52] FLAG: --logtostderr="true"
I0421 15:32:06.329784 1 flags.go:52] FLAG: --max-autoprovisioned-node-group-count="15"
I0421 15:32:06.329794 1 flags.go:52] FLAG: --max-bulk-soft-taint-count="10"
I0421 15:32:06.329799 1 flags.go:52] FLAG: --max-bulk-soft-taint-time="3s"
I0421 15:32:06.329805 1 flags.go:52] FLAG: --max-empty-bulk-delete="10"
I0421 15:32:06.329810 1 flags.go:52] FLAG: --max-failing-time="15m0s"
I0421 15:32:06.329816 1 flags.go:52] FLAG: --max-graceful-termination-sec="600"
I0421 15:32:06.329821 1 flags.go:52] FLAG: --max-inactivity="10m0s"
I0421 15:32:06.329827 1 flags.go:52] FLAG: --max-node-provision-time="15m0s"
I0421 15:32:06.329844 1 flags.go:52] FLAG: --max-nodes-total="0"
I0421 15:32:06.329850 1 flags.go:52] FLAG: --max-total-unready-percentage="45"
I0421 15:32:06.329857 1 flags.go:52] FLAG: --memory-total="0:6400000"
I0421 15:32:06.329863 1 flags.go:52] FLAG: --min-replica-count="0"
I0421 15:32:06.329869 1 flags.go:52] FLAG: --namespace="kube-system"
I0421 15:32:06.329874 1 flags.go:52] FLAG: --new-pod-scale-up-delay="0s"
I0421 15:32:06.329879 1 flags.go:52] FLAG: --node-autoprovisioning-enabled="false"
I0421 15:32:06.329885 1 flags.go:52] FLAG: --node-deletion-delay-timeout="2m0s"
I0421 15:32:06.329890 1 flags.go:52] FLAG: --node-group-auto-discovery="[asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/XXXXI]"
I0421 15:32:06.329903 1 flags.go:52] FLAG: --nodes="[]"
I0421 15:32:06.329909 1 flags.go:52] FLAG: --ok-total-unready-count="3"
I0421 15:32:06.329915 1 flags.go:52] FLAG: --profiling="false"
I0421 15:32:06.329920 1 flags.go:52] FLAG: --regional="false"
I0421 15:32:06.329925 1 flags.go:52] FLAG: --scale-down-candidates-pool-min-count="50"
I0421 15:32:06.329931 1 flags.go:52] FLAG: --scale-down-candidates-pool-ratio="0.1"
I0421 15:32:06.329937 1 flags.go:52] FLAG: --scale-down-delay-after-add="10m0s"
I0421 15:32:06.329943 1 flags.go:52] FLAG: --scale-down-delay-after-delete="0s"
I0421 15:32:06.329948 1 flags.go:52] FLAG: --scale-down-delay-after-failure="3m0s"
I0421 15:32:06.329953 1 flags.go:52] FLAG: --scale-down-enabled="true"
I0421 15:32:06.329959 1 flags.go:52] FLAG: --scale-down-gpu-utilization-threshold="0.5"
I0421 15:32:06.329967 1 flags.go:52] FLAG: --scale-down-non-empty-candidates-count="30"
I0421 15:32:06.329973 1 flags.go:52] FLAG: --scale-down-unneeded-time="10m0s"
I0421 15:32:06.329979 1 flags.go:52] FLAG: --scale-down-unready-time="20m0s"
I0421 15:32:06.329985 1 flags.go:52] FLAG: --scale-down-utilization-threshold="0.5"
I0421 15:32:06.329991 1 flags.go:52] FLAG: --scale-up-from-zero="true"
I0421 15:32:06.329996 1 flags.go:52] FLAG: --scan-interval="10s"
I0421 15:32:06.330002 1 flags.go:52] FLAG: --skip-headers="false"
I0421 15:32:06.330007 1 flags.go:52] FLAG: --skip-log-headers="false"
I0421 15:32:06.330013 1 flags.go:52] FLAG: --skip-nodes-with-local-storage="false"
I0421 15:32:06.330018 1 flags.go:52] FLAG: --skip-nodes-with-system-pods="true"
I0421 15:32:06.330023 1 flags.go:52] FLAG: --stderrthreshold="0"
I0421 15:32:06.330029 1 flags.go:52] FLAG: --unremovable-node-recheck-timeout="5m0s"
I0421 15:32:06.330034 1 flags.go:52] FLAG: --v="4"
I0421 15:32:06.330040 1 flags.go:52] FLAG: --vmodule=""
I0421 15:32:06.330045 1 flags.go:52] FLAG: --write-status-configmap="true"

@bpineau
Contributor

bpineau commented Apr 21, 2021

The k8s.io/cluster-autoscaler/node-template/label/instance: m5a.large entry is an ASG tag telling the autoscaler that this ASG will provide nodes labeled instance: m5a.large, and can therefore satisfy a nodeSelector: instance: ... constraint on your pods. (But do not put the k8s.io/... prefix in the nodeSelector; that prefix is meant for the ASG tag only.)
More details in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md#auto-discovery-setup
The autoscaler logs should give hints about which ASGs it discovered (via the --node-group-auto-discovery argument) and why it considered them unsuitable for a given pending pod.
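Concretely, the pairing described above would look like this with the values from this thread (a sketch): the prefixed key goes on the ASG as a tag, while the pod spec keeps only the plain label name.

```yaml
# On the ASG (as a tag, not a node label):
#   k8s.io/cluster-autoscaler/node-template/label/instance: m5a.large
# In the pod spec: only the plain label name, no k8s.io/... prefix
nodeSelector:
  instance: m5a.large
```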

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 20, 2021
@tulanian

Having the same problem. ASGs are labeled with k8s.io/cluster-autoscaler/node-template/label/size: large and the pod has a nodeSelector of size: large, but the autoscaler doesn't spin up a node.

@gitrojones

gitrojones commented Sep 3, 2021

^ Bump

Having the same issue as well. ASGs are tagged with k8s.io/cluster-autoscaler/node-template/label/group: builder and the pod has a nodeSelector of group: builder, but I'm seeing predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector;

Scale up from zero is enabled on the autoscaler deployment.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 15, 2021
@dschunack
Contributor

/remove-lifecycle rotten

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 23, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 22, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
