Bug description
After stopping and starting an Azure Kubernetes Service cluster with the Capsule helm chart installed, the nodes are unable to become Ready and the cluster is not functional.
How to reproduce
Steps to reproduce the behavior:
Create & connect to AKS
$ az group create -g capsule-shutdown-test -l westeurope
$ az aks create -n aks-capsule-shutdown-test -g capsule-shutdown-test
$ az aks get-credentials -n aks-capsule-shutdown-test -g capsule-shutdown-test
$ kubelogin convert-kubeconfig -l azurecli
Stop cluster
$ az aks stop -n aks-capsule-shutdown-test -g capsule-shutdown-test
Start cluster
$ az aks start -n aks-capsule-shutdown-test -g capsule-shutdown-test # This times out
The command above times out, the nodes remain in the NotReady state, and every node reports a problem with CIDR assignment:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 13m kube-proxy
Normal Starting 14m kubelet Starting kubelet.
Warning InvalidDiskCapacity 14m kubelet invalid capacity 0 on image filesystem
Normal NodeHasSufficientMemory 14m (x2 over 14m) kubelet Node aks-nodepool1-34984166-vmss000005 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 14m (x2 over 14m) kubelet Node aks-nodepool1-34984166-vmss000005 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 14m (x2 over 14m) kubelet Node aks-nodepool1-34984166-vmss000005 status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 14m kubelet Updated Node Allocatable limit across pods
Normal RegisteredNode 14m node-controller Node aks-nodepool1-34984166-vmss000005 event: Registered Node aks-nodepool1-34984166-vmss000005 in Controller
Normal CIDRAssignmentFailed 4m5s (x1037 over 14m) cidrAllocator Node aks-nodepool1-34984166-vmss000005 status is now: CIDRAssignmentFailed
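The node state and the events above can be checked with something along these lines (the node name is taken from the output above):
$ kubectl get nodes
$ kubectl describe node aks-nodepool1-34984166-vmss000005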
I was able to get the cluster running again after removing nodes.capsule.clastix.io from the ValidatingWebhookConfigurations.
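A minimal sketch of the kind of edit involved, assuming the default webhook configuration name capsule-validating-webhook-configuration from the Capsule chart (verify the actual name with the first command):
$ kubectl get validatingwebhookconfigurations
$ kubectl edit validatingwebhookconfiguration capsule-validating-webhook-configuration
# then delete the entry named nodes.capsule.clastix.io from the webhooks list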
After removing the nodes validator, everything is back to normal and pods are getting scheduled. So this looks like the kubelets have to pass the node webhook before a node can become Ready, while the webhook server itself needs a Ready node to run on (?).
@prometherion Thank you for the response. I did see this configuration option and I can set the failurePolicy to Ignore, but would there be any disadvantage to setting this when I use capsule-proxy without BYOD?
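As a sketch, assuming the chart exposes the policy under webhooks.nodes.failurePolicy (the exact value path, release name, and namespace may differ between chart versions; check the chart's values.yaml):
$ helm upgrade capsule clastix/capsule -n capsule-system --reuse-values \
    --set "webhooks.nodes.failurePolicy=Ignore"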
Expected behavior
I am able to stop & start an AKS cluster with Capsule installed.
Additional context
Capsule version: clastix/capsule:v0.2.1 (from latest helm chart)
Helm chart: capsule-0.3.5
Kubernetes version: v1.24.9