cluster-autoscaler pvc bound to old node, not scaling up. #4923
Comments
Okay thanks @andyzhangx, do you have any ETA for when it might be implemented on AKS?
I observe that CA 1.21 doesn't have this fix, so it is possible to face this issue while using CA 1.21: https://github.com/kubernetes/autoscaler/blob/cluster-autoscaler-release-1.21/cluster-autoscaler/cloudprovider/azure/azure_template.go
That is great news. We patched our cluster to 1.22 today. I will verify that it works tomorrow.
So I finally had time to verify this after patching my cluster to 1.22, and now it works as expected.
@gandhipr can you verify whether we still have this issue on AKS? Thanks.
I could confirm that when the node is deleted, so with this wrong value of
Yes, cluster-autoscaler vendors the scheduler, and the scheduler uses this value.
@gandhipr so this issue is not fixed since
I am facing this issue on AKS with Kubernetes v1.27. Autoscaling is enabled and all 3 of my PVCs are attached to deleted nodes. volume.kubernetes.io/selected-node shows nodes that have been removed. I can't schedule my pods due to "3 node(s) had volume node affinity conflict".
Got bitten by this today on AKS 1.27.7, when grafana-0 from a stateful set was pending waiting for its volume, assigned to a spot instance that was long gone already. Got it up by following a few manual steps (maybe not all of them required).
We will see if it works nicely in the future, or whether the nodeAffinity will be left there forever and break further reschedules.
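For reference, a minimal sketch of one recovery path, assuming the stale volume.kubernetes.io/selected-node annotation on the pending PVC is what pins it to the deleted node (the PVC name and namespace below are placeholders, not taken from this thread):

# Show which (possibly deleted) node the pending PVC was scheduled to
kubectl get pvc grafana-data-grafana-0 -n monitoring \
  -o jsonpath='{.metadata.annotations.volume\.kubernetes\.io/selected-node}'

# Remove the annotation (the trailing dash deletes it) so the scheduler
# and cluster-autoscaler can pick a node that still exists
kubectl annotate pvc grafana-data-grafana-0 -n monitoring \
  volume.kubernetes.io/selected-node-

This only clears the PVC-side scheduling hint; if the PersistentVolume itself carries zonal nodeAffinity, a node in that zone (or a ZRS disk) is still needed.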
Which component are you using?:
cluster-autoscaler
What version of the component are you using?:
Component version: I don't know, the AKS one... I will find out
What k8s version are you using (kubectl version)?:
What environment is this in?:
Azure AKS using spot instances with PVC
What did you expect to happen?:
A new node scaling up
What happened instead?:
Volume node affinity conflict:
pod didn't trigger scale-up: 2 node(s) didn't match Pod's node affinity/selector, 1 node(s) had volume node affinity conflict
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Hi
I have what I think is a cluster-autoscaler issue.
We currently use Thanos, and together with it we use something called the compactor. The compactor runs via a CronJob, and it has a cache disk which is a PVC.
To save money we use spot instances on AKS. If the job is not running, the spot instances are shut down.
The disk type that we are using is ZRS, which is zone-redundant, so the PVC shouldn't have any issues moving between zones.
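For context, such a zone-redundant disk class looks roughly like the sketch below (illustrative class name and SKU, assuming the Azure Disk CSI driver; not our exact class):

# Illustrative StorageClass for a zone-redundant (ZRS) managed disk
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-csi-zrs            # example name
provisioner: disk.csi.azure.com
parameters:
  skuName: StandardSSD_ZRS         # or Premium_ZRS; ZRS disks can attach across zones
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
EOF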
I'm getting an issue where my jobs can't start because the PVC is bound to a node that doesn't exist any more (this might be an AKS issue).
So when the cluster-autoscaler comes in and checks, it tells me that a volume is in conflict.
Normal NotTriggerScaleUp 2m cluster-autoscaler pod didn't trigger scale-up: 2 node(s) didn't match Pod's node affinity/selector, 1 node(s) had volume node affinity conflict
I think this is because the PVC shows as bound to a node that doesn't exist any more.
I think the cluster-autoscaler needs some logic so that if the node a volume is bound to no longer exists, it ignores that binding, scales up a new node, and the volume then becomes bound to the new node.
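As a rough illustration of what I mean (an untested sketch, not autoscaler code), the stale bindings can be spotted by comparing each PVC's selected-node annotation against the nodes that actually exist:

# Print namespace, name, and selected-node annotation for every PVC,
# then flag the ones whose node no longer exists
kubectl get pvc -A -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{" "}{.metadata.annotations.volume\.kubernetes\.io/selected-node}{"\n"}{end}' |
while read -r ns pvc node; do
  if [ -n "$node" ] && ! kubectl get node "$node" >/dev/null 2>&1; then
    echo "stale binding: $ns/$pvc -> $node"
  fi
done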
I have added all the output that you hopefully need below.
➜ k get job unbox-compactor-manuall -o yaml
k get pvc unbox-compactor-data2 -o yaml
describe output from the pod