
CA fails to schedule nodes onto spot ASG's with zero instances producing "node(s) didn't match node selector" on EKS #4010

Closed
alexmnyc opened this issue Apr 12, 2021 · 14 comments
Labels
area/cluster-autoscaler kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@alexmnyc

alexmnyc commented Apr 12, 2021

My nodes:

--kubelet-extra-args "--node-labels=node.kubernetes.io/lifecycle=spot,group-name=spot,instance=m5.xlarge" 'pie-gp-8pN1ZRZI'

My ASG of instance=m5.xlarge

desired=0 min=0 max=20

Scheduling this deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-to-scaleout
spec:
  replicas: 20
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        service: nginx
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx-to-scaleout
        resources:
          limits:
            cpu: 500m
            memory: 512Mi
          requests:
            cpu: 500m
            memory: 512Mi
      nodeSelector:
        instance: m5a.large

CA logs:

node(s) didn't match node selector

Related to #4002 ?

@alexmnyc alexmnyc added the kind/bug Categorizes issue or PR as related to a bug. label Apr 12, 2021
@alexmnyc alexmnyc changed the title CA fails to schedules nodes onto spot ASG's with zero instances producing "node(s) didn't match node selector" on EKS CA fails to schedule nodes onto spot ASG's with zero instances producing "node(s) didn't match node selector" on EKS Apr 12, 2021
@bpineau
Contributor

bpineau commented Apr 19, 2021

Hi,
Is cluster-autoscaler launched with --scale-up-from-zero?
Does the ASG have a matching label (i.e. k8s.io/cluster-autoscaler/node-template/label/instance: m5a.large, per the pod's nodeSelector, or m5.xlarge per the kubelet args above)?
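For scale-from-zero, those template tags live on the ASG itself, not on the nodes. A minimal CloudFormation sketch of what that could look like (resource name is hypothetical; sizes and values taken from this thread):

```yaml
# Hypothetical ASG definition; only the Tags section matters here.
SpotASG:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MinSize: "0"
    MaxSize: "20"
    Tags:
      # Lets cluster-autoscaler auto-discover this ASG
      # (plus the per-cluster tag configured in --node-group-auto-discovery)
      - Key: k8s.io/cluster-autoscaler/enabled
        Value: "true"
        PropagateAtLaunch: false
      # Tells the autoscaler which label nodes from this ASG will carry,
      # so it can match a pod's nodeSelector even while the ASG is at zero
      - Key: k8s.io/cluster-autoscaler/node-template/label/instance
        Value: m5a.large
        PropagateAtLaunch: false
```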

@alexmnyc
Author

alexmnyc commented Apr 20, 2021

Cluster autoscaler is deployed with Helm using the following values.yaml. I don't see that option in https://github.com/kubernetes/autoscaler/blob/master/charts/cluster-autoscaler/values.yaml

cloudProvider: aws

awsRegion: ${region}

autoDiscovery:
  clusterName: ${cluster_name}

image:
  tag: v1.19.1

nodeSelector:
  group-name: autoscaler

extraArgs:
  balance-similar-node-groups: true
  skip-nodes-with-local-storage: false
  expander: least-waste
  stderrthreshold: info

extraVolumes:
  - name: ssl-certs
    hostPath:
      path: /etc/ssl/certs/ca-bundle.crt

extraVolumeMounts:
  - name: ssl-certs
    mountPath: /etc/ssl/certs/ca-certificates.crt #/etc/ssl/certs/ca-bundle.crt for Amazon Linux Worker Nodes
    readOnly: true

resources:
  limits:
   cpu: 100m
   memory: 300Mi
  requests:
    cpu: 100m
    memory: 300Mi

rbac:
  create: true
  serviceAccount:
    name: "${service_account}"
    annotations:
      eks.amazonaws.com/role-arn: ${iam_role_arn}
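(For reference: the chart's values.yaml does not list every CA flag, but the chart renders anything under extraArgs as a command-line flag, so the option could be passed there; a sketch, noting that --scale-up-from-zero already defaults to true:)

```yaml
extraArgs:
  scale-up-from-zero: true
```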

Node labels were provided as "--node-labels=node.kubernetes.io/lifecycle=spot,group-name=spot,instance=m5.xlarge" in the spot launch template. I didn't know that the node labels had to follow any specific nomenclature to be targeted.
Are you saying that this

"--node-labels=node.kubernetes.io/lifecycle=spot,group-name=spot,instance=m5.xlarge"

must be replaced with

"--node-labels=node.kubernetes.io/lifecycle=spot,k8s.io/cluster-autoscaler/node-template/label/group-name=spot,k8s.io/cluster-autoscaler/node-template/label/instance=m5.xlarge"

and that the deployment.yaml should target the nodeSelector using a fully qualified label name prefixed with k8s...., i.e.?

kind: Deployment
metadata:
  name: nginx-to-scaleout
spec:
  replicas: 20
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        service: nginx
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx-to-scaleout
        resources:
          limits:
            cpu: 500m
            memory: 512Mi
          requests:
            cpu: 500m
            memory: 512Mi
      nodeSelector:
        k8s.io/cluster-autoscaler/node-template/label/instance: m5a.large

I see that when the nodes do come up labels are present

ip-10-247-21-5.ec2.internal     Ready      <none>   91s   v1.19.6-eks-49a6c0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1b,group-name=spot,instance=m5.xlarge,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-247-21-5.my.net,kubernetes.io/os=linux,node.kubernetes.io/instance-type=m5.xlarge,node.kubernetes.io/lifecycle=spot,topology.kubernetes.io/region=us-east-1,topology.kubernetes.io/zone=us-east-1b

@alexmnyc
Author

@bpineau I tried adding these labels to the ASG yesterday as you mentioned, and to the kubelet extra args --node-labels i.e. "--node-labels=node.kubernetes.io/lifecycle=spot,group-name=spot,instance=m5a.large"

(screenshot of the ASG tags)

I did a deployment with both selectors and still got the same results:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-to-scaleout
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        service: nginx
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx-to-scaleout
        resources:
          limits:
            cpu: 500m
            memory: 512Mi
          requests:
            cpu: 500m
            memory: 512Mi
      nodeSelector:
        instance: m5a.large
        k8s.io/cluster-autoscaler/node-template/label/instance: m5a.large

@alexmnyc
Author

--scale-up-from-zero is set to true (full flag dump below):

I0421 15:32:06.329470 1 flags.go:52] FLAG: --add-dir-header="false"
I0421 15:32:06.329537 1 flags.go:52] FLAG: --address=":8085"
I0421 15:32:06.329545 1 flags.go:52] FLAG: --alsologtostderr="false"
I0421 15:32:06.329550 1 flags.go:52] FLAG: --aws-use-static-instance-list="false"
I0421 15:32:06.329561 1 flags.go:52] FLAG: --balance-similar-node-groups="true"
I0421 15:32:06.329568 1 flags.go:52] FLAG: --balancing-ignore-label="[]"
I0421 15:32:06.329574 1 flags.go:52] FLAG: --cloud-config=""
I0421 15:32:06.329579 1 flags.go:52] FLAG: --cloud-provider="aws"
I0421 15:32:06.329585 1 flags.go:52] FLAG: --cloud-provider-gce-l7lb-src-cidrs="130.211.0.0/22,35.191.0.0/16"
I0421 15:32:06.329593 1 flags.go:52] FLAG: --cloud-provider-gce-lb-src-cidrs="130.211.0.0/22,209.85.152.0/22,209.85.204.0/22,35.191.0.0/16"
I0421 15:32:06.329601 1 flags.go:52] FLAG: --cluster-name=""
I0421 15:32:06.329606 1 flags.go:52] FLAG: --clusterapi-cloud-config-authoritative="false"
I0421 15:32:06.329612 1 flags.go:52] FLAG: --cores-total="0:320000"
I0421 15:32:06.329617 1 flags.go:52] FLAG: --estimator="binpacking"
I0421 15:32:06.329623 1 flags.go:52] FLAG: --expander="least-waste"
I0421 15:32:06.329628 1 flags.go:52] FLAG: --expendable-pods-priority-cutoff="-10"
I0421 15:32:06.329634 1 flags.go:52] FLAG: --gpu-total="[]"
I0421 15:32:06.329640 1 flags.go:52] FLAG: --ignore-daemonsets-utilization="false"
I0421 15:32:06.329645 1 flags.go:52] FLAG: --ignore-mirror-pods-utilization="false"
I0421 15:32:06.329650 1 flags.go:52] FLAG: --ignore-taint="[]"
I0421 15:32:06.329657 1 flags.go:52] FLAG: --kubeconfig=""
I0421 15:32:06.329662 1 flags.go:52] FLAG: --kubernetes=""
I0421 15:32:06.329667 1 flags.go:52] FLAG: --leader-elect="true"
I0421 15:32:06.329674 1 flags.go:52] FLAG: --leader-elect-lease-duration="15s"
I0421 15:32:06.329686 1 flags.go:52] FLAG: --leader-elect-renew-deadline="10s"
I0421 15:32:06.329692 1 flags.go:52] FLAG: --leader-elect-resource-lock="leases"
I0421 15:32:06.329699 1 flags.go:52] FLAG: --leader-elect-resource-name=""
I0421 15:32:06.329705 1 flags.go:52] FLAG: --leader-elect-resource-namespace=""
I0421 15:32:06.329743 1 flags.go:52] FLAG: --leader-elect-retry-period="2s"
I0421 15:32:06.329753 1 flags.go:52] FLAG: --log-backtrace-at=":0"
I0421 15:32:06.329762 1 flags.go:52] FLAG: --log-dir=""
I0421 15:32:06.329768 1 flags.go:52] FLAG: --log-file=""
I0421 15:32:06.329773 1 flags.go:52] FLAG: --log-file-max-size="1800"
I0421 15:32:06.329779 1 flags.go:52] FLAG: --logtostderr="true"
I0421 15:32:06.329784 1 flags.go:52] FLAG: --max-autoprovisioned-node-group-count="15"
I0421 15:32:06.329794 1 flags.go:52] FLAG: --max-bulk-soft-taint-count="10"
I0421 15:32:06.329799 1 flags.go:52] FLAG: --max-bulk-soft-taint-time="3s"
I0421 15:32:06.329805 1 flags.go:52] FLAG: --max-empty-bulk-delete="10"
I0421 15:32:06.329810 1 flags.go:52] FLAG: --max-failing-time="15m0s"
I0421 15:32:06.329816 1 flags.go:52] FLAG: --max-graceful-termination-sec="600"
I0421 15:32:06.329821 1 flags.go:52] FLAG: --max-inactivity="10m0s"
I0421 15:32:06.329827 1 flags.go:52] FLAG: --max-node-provision-time="15m0s"
I0421 15:32:06.329844 1 flags.go:52] FLAG: --max-nodes-total="0"
I0421 15:32:06.329850 1 flags.go:52] FLAG: --max-total-unready-percentage="45"
I0421 15:32:06.329857 1 flags.go:52] FLAG: --memory-total="0:6400000"
I0421 15:32:06.329863 1 flags.go:52] FLAG: --min-replica-count="0"
I0421 15:32:06.329869 1 flags.go:52] FLAG: --namespace="kube-system"
I0421 15:32:06.329874 1 flags.go:52] FLAG: --new-pod-scale-up-delay="0s"
I0421 15:32:06.329879 1 flags.go:52] FLAG: --node-autoprovisioning-enabled="false"
I0421 15:32:06.329885 1 flags.go:52] FLAG: --node-deletion-delay-timeout="2m0s"
I0421 15:32:06.329890 1 flags.go:52] FLAG: --node-group-auto-discovery="[asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/XXXXI]"
I0421 15:32:06.329903 1 flags.go:52] FLAG: --nodes="[]"
I0421 15:32:06.329909 1 flags.go:52] FLAG: --ok-total-unready-count="3"
I0421 15:32:06.329915 1 flags.go:52] FLAG: --profiling="false"
I0421 15:32:06.329920 1 flags.go:52] FLAG: --regional="false"
I0421 15:32:06.329925 1 flags.go:52] FLAG: --scale-down-candidates-pool-min-count="50"
I0421 15:32:06.329931 1 flags.go:52] FLAG: --scale-down-candidates-pool-ratio="0.1"
I0421 15:32:06.329937 1 flags.go:52] FLAG: --scale-down-delay-after-add="10m0s"
I0421 15:32:06.329943 1 flags.go:52] FLAG: --scale-down-delay-after-delete="0s"
I0421 15:32:06.329948 1 flags.go:52] FLAG: --scale-down-delay-after-failure="3m0s"
I0421 15:32:06.329953 1 flags.go:52] FLAG: --scale-down-enabled="true"
I0421 15:32:06.329959 1 flags.go:52] FLAG: --scale-down-gpu-utilization-threshold="0.5"
I0421 15:32:06.329967 1 flags.go:52] FLAG: --scale-down-non-empty-candidates-count="30"
I0421 15:32:06.329973 1 flags.go:52] FLAG: --scale-down-unneeded-time="10m0s"
I0421 15:32:06.329979 1 flags.go:52] FLAG: --scale-down-unready-time="20m0s"
I0421 15:32:06.329985 1 flags.go:52] FLAG: --scale-down-utilization-threshold="0.5"
I0421 15:32:06.329991 1 flags.go:52] FLAG: --scale-up-from-zero="true"
I0421 15:32:06.329996 1 flags.go:52] FLAG: --scan-interval="10s"
I0421 15:32:06.330002 1 flags.go:52] FLAG: --skip-headers="false"
I0421 15:32:06.330007 1 flags.go:52] FLAG: --skip-log-headers="false"
I0421 15:32:06.330013 1 flags.go:52] FLAG: --skip-nodes-with-local-storage="false"
I0421 15:32:06.330018 1 flags.go:52] FLAG: --skip-nodes-with-system-pods="true"
I0421 15:32:06.330023 1 flags.go:52] FLAG: --stderrthreshold="0"
I0421 15:32:06.330029 1 flags.go:52] FLAG: --unremovable-node-recheck-timeout="5m0s"
I0421 15:32:06.330034 1 flags.go:52] FLAG: --v="4"
I0421 15:32:06.330040 1 flags.go:52] FLAG: --vmodule=""
I0421 15:32:06.330045 1 flags.go:52] FLAG: --write-status-configmap="true"

@bpineau
Contributor

bpineau commented Apr 21, 2021

The k8s.io/cluster-autoscaler/node-template/label/instance: m5a.large entry is an ASG tag telling the autoscaler that this ASG will provide nodes labeled instance: m5a.large, and can therefore satisfy a nodeSelector: instance: ... constraint on your pods. (But do not put the k8s.io/... prefix in the nodeSelector; that prefix is meant for the ASG tag only.)
More details in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md#auto-discovery-setup
The autoscaler logs should give hints about which ASGs it discovered (via the --node-group-auto-discovery argument) and why it considered them unsuitable for a given pending pod.
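Concretely, the pairing described above would look like this with the values from this thread (a sketch): the prefixed key goes on the ASG as a tag, while the pod spec keeps only the plain label name.

```yaml
# On the ASG (as a tag, not a node label):
#   k8s.io/cluster-autoscaler/node-template/label/instance: m5a.large
# In the pod spec: only the plain label name, no k8s.io/... prefix
nodeSelector:
  instance: m5a.large
```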

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 20, 2021
@tulanian

Having the same problem. ASGs are labeled with k8s.io/cluster-autoscaler/node-template/label/size: large and the pod has a nodeSelector of size: large, but the autoscaler doesn't spin up a node.

@gitrojones

gitrojones commented Sep 3, 2021

^ Bump

Having the same issue as well. ASGs are tagged with k8s.io/cluster-autoscaler/node-template/label/group: builder and the pod has a nodeSelector of group: builder, but I'm seeing predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector;

Scale up from zero is enabled on the autoscaler deployment.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 15, 2021
@dschunack
Contributor

/remove-lifecycle rotten

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 23, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 22, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
