cluster-autoscaler pvc bound to old node, not scaling up. #4923

Closed
NissesSenap opened this issue May 30, 2022 · 11 comments
Labels
area/cluster-autoscaler kind/bug Categorizes issue or PR as related to a bug.

Comments

@NissesSenap

NissesSenap commented May 30, 2022

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

Component version: I don't know, the AKS one... I will find out

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.0", GitCommit:"04ad1b56880418de7bd6feb9ff37a8518fbc1a0e", GitTreeState:"clean", BuildDate:"2022-05-05T22:03:08Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.9", GitCommit:"79b7a589d688b7dc8a55306c9c225ed7712df10d", GitTreeState:"clean", BuildDate:"2022-04-21T07:41:38Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:

Azure AKS using spot instances with PVC

What did you expect to happen?:

A new node scaling up

What happened instead?:

Volume node affinity conflict:

pod didn't trigger scale-up: 2 node(s) didn't match Pod's node affinity/selector, 1 node(s) had volume node affinity conflict

How to reproduce it (as minimally and precisely as possible):

  1. Set up an AKS cluster.
  2. Set up a node pool with spot instances (I don't think it strictly needs to be spot); see the sketch right after this list.
  3. Run a job that triggers the spot instance creation; the job must use a PVC and have node affinity that only allows it to run on the spot instances.
  4. When the job is done, the autoscaler removes the spot node since there is no more workload that needs it.
  5. The PVC remains bound to the deleted spot instance.
  6. The job runs again and tries to spin up, but it complains that the PVC is still bound to the old, no longer existing node.
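
A rough sketch of step 2, assuming the Azure CLI (resource group, cluster, pool name, VM size and counts are placeholders, not our exact values):

az aks nodepool add \
  --resource-group <my-rg> \
  --cluster-name <my-aks> \
  --name memoryspot1 \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price -1 \
  --enable-cluster-autoscaler \
  --min-count 0 \
  --max-count 3 \
  --node-vm-size Standard_E8s_v4 \
  --labels xkf.xenit.io/node-class=memory

Spot pools get the kubernetes.azure.com/scalesetpriority=spot label and taint automatically, which is what the job's node affinity and toleration below match on.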

Anything else we need to know?:

Hi

I have what I think is a cluster-autoscaler issue.
We currently use Thanos, and together with it we use something called a compactor. The compactor runs as a CronJob and has a cache disk, which is a PVC.
To save money we use spot instances on AKS. When the job is not running, the spot instances are shut down.

The disk type we are using is ZRS, which is zone-redundant, so the PVC shouldn't have any issue moving between zones.
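
For reference, our managed-csi-zrs storage class is roughly equivalent to this (a sketch from memory; the exact SKU may differ):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-csi-zrs
provisioner: disk.csi.azure.com
parameters:
  skuName: StandardSSD_ZRS   # or Premium_ZRS; any zone-redundant disk SKU
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer

As far as I understand, with volumeBindingMode: WaitForFirstConsumer the scheduler picks a node for the first consumer and records it in the PVC's volume.kubernetes.io/selected-node annotation, which is the annotation that later keeps pointing at the deleted node.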

I'm running into an issue where my jobs can't start because the PVC is bound to a node that doesn't exist any more (this might be an AKS issue).
So when the cluster-autoscaler comes in and evaluates the pending pod, it reports a volume node affinity conflict.

Normal NotTriggerScaleUp 2m cluster-autoscaler pod didn't trigger scale-up: 2 node(s) didn't match Pod's node affinity/selector, 1 node(s) had volume node affinity conflict

I think this is because the PVC shows as bound to a node that doesn't exist any more.
I think the cluster-autoscaler needs some logic to ignore the volume's node binding when the referenced node no longer exists, scale up a new node, and let the volume be bound to that new node.
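
A quick way to confirm the stale binding (a sketch, using the PVC and node names from the output below):

kubectl -n monitor get pvc unbox-compactor-data2 \
  -o jsonpath="{.metadata.annotations['volume\.kubernetes\.io/selected-node']}"
# prints aks-memoryspot1-11567145-vmss000003, which no longer shows up in:
kubectl get nodes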

I have added all the output that you hopefully need below.

➜ k get job unbox-compactor-manuall -o yaml

apiVersion: batch/v1
kind: Job
metadata:
  labels:
    aadpodidbinding: monitor
    app.kubernetes.io/component: compactor
    app.kubernetes.io/name: unbox
    app.kubernetes.io/part-of: unbox
    controller-uid: cc663554-1d32-430b-8975-d2cbb7395338
    job-name: unbox-compactor-manuall
  name: unbox-compactor-manuall
  namespace: monitor
spec:
  backoffLimit: 6
  completions: 1
  parallelism: 1
  selector:
    matchLabels:
      controller-uid: cc663554-1d32-430b-8975-d2cbb7395338
  template:
    metadata:
      creationTimestamp: null
      labels:
        aadpodidbinding: monitor
        app.kubernetes.io/component: compactor
        app.kubernetes.io/name: unbox
        app.kubernetes.io/part-of: unbox
        controller-uid: cc663554-1d32-430b-8975-d2cbb7395338
        job-name: unbox-compactor-manuall
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: xkf.xenit.io/node-class
                operator: In
                values:
                - memory
            - matchExpressions:
              - key: kubernetes.azure.com/scalesetpriority
                operator: In
                values:
                - spot
      containers:
      - args:
        - compact
        - --data-dir=/tmp/data
        - --objstore.config-file=/etc/config/thanos.yaml
        - --retention.resolution-raw=10368000s
        - --retention.resolution-5m=10368000s
        - --retention.resolution-1h=10368000s
        - --compact.concurrency=1
        image: quay.io/thanos/thanos:v0.26.0
        imagePullPolicy: IfNotPresent
        name: compactor
        ports:
        - containerPort: 10902
          name: http
          protocol: TCP
        resources:
          limits:
            memory: 14Gi
          requests:
            cpu: "1"
            memory: 12Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/config/
          name: objectstore-secret
          readOnly: true
        - mountPath: /tmp/data
          name: data-volume
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.azure.com/scalesetpriority: spot
        xkf.xenit.io/node-class: memory
      priorityClassName: tenant-low
      restartPolicy: OnFailure
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: kubernetes.azure.com/scalesetpriority
        operator: Equal
        value: spot
      volumes:
      - name: objectstore-secret
        secret:
          defaultMode: 420
          secretName: unbox-thanos-objstore-config
      - name: data-volume
        persistentVolumeClaim:
          claimName: unbox-compactor-data2

k get pvc unbox-compactor-data2 -o yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    meta.helm.sh/release-name: unbox
    meta.helm.sh/release-namespace: monitor
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: disk.csi.azure.com
    volume.kubernetes.io/selected-node: aks-memoryspot1-11567145-vmss000003
  creationTimestamp: "2022-05-08T13:26:58Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app.kubernetes.io/instance: unbox
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: tenant
    app.kubernetes.io/part-of: unbox
    app.kubernetes.io/version: 1.16.0
    helm.sh/chart: tenant-0.3.1
    helm.toolkit.fluxcd.io/name: unbox
    helm.toolkit.fluxcd.io/namespace: monitor
  name: unbox-compactor-data2
  namespace: monitor
  resourceVersion: "373022994"
  uid: 5fe18e67-223d-462d-a43d-b96b8bdc6ffd
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 150Gi
  storageClassName: managed-csi-zrs
  volumeMode: Filesystem
  volumeName: pvc-5fe18e67-223d-462d-a43d-b96b8bdc6ffd
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 150Gi
  phase: Bound
➜ k get nodes
NAME                                STATUS   ROLES   AGE   VERSION
aks-default-30194926-vmss000000     Ready    agent   24d   v1.21.9
aks-default-30194926-vmss000001     Ready    agent   24d   v1.21.9
aks-memory2-40909807-vmss000000     Ready    agent   22d   v1.21.9
aks-memory2-40909807-vmss000001     Ready    agent   22d   v1.21.9
aks-standard2-24926856-vmss000000   Ready    agent   24d   v1.21.9
aks-standard2-24926856-vmss000001   Ready    agent   24d   v1.21.9
aks-standard2-24926856-vmss000002   Ready    agent   24d   v1.21.9
aks-standard2-24926856-vmss000003   Ready    agent   24d   v1.21.9
aks-standard2-24926856-vmss000004   Ready    agent   24d   v1.21.9
aks-standard2-24926856-vmss000014   Ready    agent   22d   v1.21.9
aks-standard2-24926856-vmss000015   Ready    agent   22d   v1.21.9
aks-standard2-24926856-vmss000016   Ready    agent   22d   v1.21.9
aks-standard2-24926856-vmss000018   Ready    agent   22d   v1.21.9
aks-standard2-24926856-vmss000019   Ready    agent   22d   v1.21.9

describe output for the pod

➜ k describe pod  unbox-compactor-manuall-sn4m2
Name:                 unbox-compactor-manuall-sn4m2
Namespace:            monitor
Priority:             800000
Priority Class Name:  tenant-low
Node:                 <none>
Labels:               aadpodidbinding=monitor
                      app.kubernetes.io/component=compactor
                      app.kubernetes.io/name=unbox
                      app.kubernetes.io/part-of=unbox
                      controller-uid=cc663554-1d32-430b-8975-d2cbb7395338
                      job-name=unbox-compactor-manuall
Annotations:          <none>
Status:               Pending
IP:                   
IPs:                  <none>
Controlled By:        Job/unbox-compactor-manuall
Containers:
  compactor:
    Image:      quay.io/thanos/thanos:v0.26.0
    Port:       10902/TCP
    Host Port:  0/TCP
    Args:
      compact
      --data-dir=/tmp/data
      --objstore.config-file=/etc/config/thanos.yaml
      --retention.resolution-raw=10368000s
      --retention.resolution-5m=10368000s
      --retention.resolution-1h=10368000s
      --compact.concurrency=1
    Limits:
      memory:  14Gi
    Requests:
      cpu:        1
      memory:     12Gi
    Environment:  <none>
    Mounts:
      /etc/config/ from objectstore-secret (ro)
      /tmp/data from data-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kmxdk (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  objectstore-secret:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  unbox-thanos-objstore-config
    Optional:    false
  data-volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  unbox-compactor-data2
    ReadOnly:   false
  kube-api-access-kmxdk:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              kubernetes.azure.com/scalesetpriority=spot
                             xkf.xenit.io/node-class=memory
Tolerations:                 kubernetes.azure.com/scalesetpriority=spot:NoSchedule
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age                     From                Message
  ----     ------             ----                    ----                -------
  Warning  FailedScheduling   7m52s                   default-scheduler   0/14 nodes are available: 12 node(s) didn't match Pod's node affinity/selector, 2 node(s) had taint {CriticalAddonsOnly: true}, that the pod didn't tolerate.
  Warning  FailedScheduling   6m40s (x1 over 7m51s)   default-scheduler   0/14 nodes are available: 12 node(s) didn't match Pod's node affinity/selector, 2 node(s) had taint {CriticalAddonsOnly: true}, that the pod didn't tolerate.
  Normal   NotTriggerScaleUp  2m40s (x20 over 7m42s)  cluster-autoscaler  pod didn't trigger scale-up: 2 node(s) didn't match Pod's node affinity/selector, 1 node(s) had volume node affinity conflict
@andyzhangx
Member

andyzhangx commented May 31, 2022

I think this issue should be fixed by #4550 on 1.22 clusters.
The cherry-pick PR (#4795) was merged on 4/8, so it may not have rolled out to AKS yet.

@NissesSenap
Author

Okay, thanks @andyzhangx. Do you have any ETA for when it might land on AKS?
I will be upgrading to 1.22 in the coming weeks.
After doing so, do I need to do anything else to get the latest release, or can I just sit back and wait while the Azure team does its thing and rolls the latest release out to its customers?

@gandhipr
Contributor

gandhipr commented Jun 1, 2022

I see that CA 1.21 doesn't have this fix, so it is possible to hit this issue while using CA 1.21: https://github.com/kubernetes/autoscaler/blob/cluster-autoscaler-release-1.21/cluster-autoscaler/cloudprovider/azure/azure_template.go
From 1.22 onwards the change is in.

@NissesSenap
Author

That is great news. We patched our cluster to 1.22 today. I will verify that it works tomorrow.

@NissesSenap
Author

So I finally had time to verify this after patching my cluster to 1.22, and it now works as expected.
Thanks a lot for your answers @andyzhangx and @gandhipr

@andyzhangx
Member

@gandhipr can you verify whether we still have this issue on AKS? Thanks.

@andyzhangx
Member

I can confirm that when a node is deleted, volume.kubernetes.io/selected-node still points at the non-existent node with csi-provisioner 3.5.0. I think this requires a fix in csi-attacher instead of csi-provisioner, since csi-provisioner is only responsible for creating/deleting volumes, and when a node is deleted the volume detach call happens in csi-attacher.

So with this stale value of volume.kubernetes.io/selected-node, could cluster-autoscaler handle such an issue? @gandhipr
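
One way to spot PVCs that still carry the annotation for a node that no longer exists (a rough sketch, not an official check):

kubectl get pvc --all-namespaces -o jsonpath="{range .items[*]}{.metadata.namespace}{'/'}{.metadata.name}{'\t'}{.metadata.annotations['volume\.kubernetes\.io/selected-node']}{'\n'}{end}"
# any node name in the second column that is missing from `kubectl get nodes` is stale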

@gandhipr
Contributor

gandhipr commented Jan 18, 2024

Yes, cluster-autoscaler vendors the scheduler, and the scheduler uses this value.
The error reported on the CAS pod is the error returned by the scheduler. @andyzhangx

@andyzhangx
Member

> Yes, cluster-autoscaler vendors the scheduler, and the scheduler uses this value. The error reported on the CAS pod is the error returned by the scheduler. @andyzhangx

@gandhipr so this issue is not fixed then, since volume.kubernetes.io/selected-node won't change after the first volume attach.

@danielhoult

danielhoult commented Mar 3, 2024

I am facing this issue on AKS with Kubernetes v1.27. Autoscaling is enabled and all 3 of my PVCs are attached to deleted nodes; volume.kubernetes.io/selected-node shows nodes that have been removed. I can't schedule my pods due to "3 node(s) had volume node affinity conflict".
How do I resolve this and get my application back up?

@pb6

pb6 commented Apr 18, 2024

Got bitten by this today on AKS 1.27.7, when grafana-0 from a StatefulSet was pending, waiting for a volume assigned to a spot instance that was long gone. I got it up again by following these steps (maybe not all of them were required):

  1. Removed the assigned-node annotation from the PVC (commands sketched below). No change.
  2. Since that node pool was scaled to 0 and was not scaling up, deleted the node pool itself (I tried to scale it up manually to 1 with no luck, but maybe I should have scaled it to 3; read below).
  3. Noticed that deleting the pending grafana pod triggered a scale-up event on another node pool, but grafana still was not scheduled, due to the nodeAffinity constraints left on the PV.
  4. Looked at the nodes, found one in the correct failure-domain.beta.kubernetes.io/zone, freed some resources on it, and grafana scheduled and started successfully.

We will see whether this keeps working, or whether the nodeAffinity is left there forever and breaks further reschedules.
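
For reference, a sketch of the kubectl commands for steps 1 and 3 (names are placeholders, not the real ones from my cluster):

# step 1: drop the stale annotation from the PVC (the trailing '-' removes the annotation)
kubectl -n <namespace> annotate pvc <pvc-name> volume.kubernetes.io/selected-node-
# step 3: inspect the node affinity still recorded on the bound PV
kubectl get pv <pv-name> -o jsonpath='{.spec.nodeAffinity}'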
