
Azure Kubernetes Service with Capsule chart installed is unable to start after shutdown #719

Closed
JacekLakis-TomTom opened this issue Mar 1, 2023 · 4 comments
Labels
duplicate This issue or pull request already exists

JacekLakis-TomTom commented Mar 1, 2023

Bug description

After stopping and starting an Azure Kubernetes Service cluster with the Capsule Helm chart installed, the nodes are unable to become Ready and the cluster is not functional.

How to reproduce

Steps to reproduce the behavior:

  1. Create & connect to AKS
$ az group create -g capsule-shutdown-test -l westeurope
$ az aks create -n aks-capsule-shutdown-test -g capsule-shutdown-test
$ az aks get-credentials -n aks-capsule-shutdown-test -g capsule-shutdown-test
$ kubelogin convert-kubeconfig -l azurecli
  2. Install Capsule
$ helm -n capsule-system install capsule clastix/capsule --create-namespace
  3. Stop cluster
$ az aks stop -n aks-capsule-shutdown-test -g capsule-shutdown-test
  4. Start cluster
$ az aks start -n aks-capsule-shutdown-test -g capsule-shutdown-test # This times out

The command above times out, the nodes remain in the NotReady state, and every node reports a CIDR assignment problem:

  Type     Reason                   Age                    From             Message
  ----     ------                   ----                   ----             -------
  Normal   Starting                 13m                    kube-proxy       
  Normal   Starting                 14m                    kubelet          Starting kubelet.
  Warning  InvalidDiskCapacity      14m                    kubelet          invalid capacity 0 on image filesystem
  Normal   NodeHasSufficientMemory  14m (x2 over 14m)      kubelet          Node aks-nodepool1-34984166-vmss000005 status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    14m (x2 over 14m)      kubelet          Node aks-nodepool1-34984166-vmss000005 status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     14m (x2 over 14m)      kubelet          Node aks-nodepool1-34984166-vmss000005 status is now: NodeHasSufficientPID
  Normal   NodeAllocatableEnforced  14m                    kubelet          Updated Node Allocatable limit across pods
  Normal   RegisteredNode           14m                    node-controller  Node aks-nodepool1-34984166-vmss000005 event: Registered Node aks-nodepool1-34984166-vmss000005 in Controller
  Normal   CIDRAssignmentFailed     4m5s (x1037 over 14m)  cidrAllocator    Node aks-nodepool1-34984166-vmss000005 status is now: CIDRAssignmentFailed

I was able to get the cluster running again by removing the nodes.capsule.clastix.io webhook from the ValidatingWebhookConfiguration:

$ kubectl edit validatingwebhookconfiguration capsule-validating-webhook-configuration

After removing the nodes validator, everything is back to normal and pods are being scheduled. It looks like the kubelets' node updates are being validated by a webhook whose server pod cannot be scheduled until a node is Ready, i.e. a chicken-and-egg problem (?).
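For reference, the webhook entry involved looks roughly like the fragment below. This is an illustrative sketch, not the exact object: field values (in particular the rules) may differ between Capsule versions, so inspect the live configuration with `kubectl get validatingwebhookconfiguration capsule-validating-webhook-configuration -o yaml`.

```yaml
# Fragment of capsule-validating-webhook-configuration (illustrative).
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: capsule-validating-webhook-configuration
webhooks:
  - name: nodes.capsule.clastix.io
    # With failurePolicy: Fail, node updates are rejected while the webhook
    # service is unreachable -- e.g. right after a cluster start, when the
    # Capsule pod itself has nowhere to run yet.
    failurePolicy: Fail
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["UPDATE"]
        resources: ["nodes"]
```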

Expected behavior

Stopping and starting an AKS cluster with Capsule installed works without manual intervention.

Additional context

  • Capsule version: clastix/capsule:v0.2.1 (from latest helm chart)
  • Helm Chart version: capsule-0.3.5
  • Kubernetes version: v1.24.9
@JacekLakis-TomTom added the blocked-needs-validation (Issue need triage and validation) and bug (Something isn't working) labels on Mar 1, 2023
@prometherion (Member)

Hey Jacek, thanks for reporting this.

I suspect it's a duplicate of this: #597 (comment)

The Helm chart already offers customization of the webhooks: may I ask you to give it a try, please?
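If the chart follows its usual pattern, the override would look something like the values fragment below. The `webhooks.nodes.failurePolicy` key path is my assumption from the chart's defaults; verify it against your chart version with `helm show values clastix/capsule` before applying.

```yaml
# values-override.yaml -- key path assumed, check `helm show values clastix/capsule`
webhooks:
  nodes:
    # Ignore lets node updates through while the webhook service is
    # unreachable, avoiding the startup deadlock described above.
    failurePolicy: Ignore
```

This could then be applied with `helm -n capsule-system upgrade capsule clastix/capsule -f values-override.yaml`.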

@JacekLakis-TomTom (Author)

@prometherion Thank you for the response. I did see this configuration option; I can set the failurePolicy to Ignore, but would there be any disadvantage to doing so when I use capsule-proxy without BYOD?

@prometherion (Member)

You're right, those webhooks are required mostly for the BYOD feature.

Unless you're allowing the Tenant Owners to label their own nodes, you can safely relax that failure policy.

@JacekLakis-TomTom (Author)

Works as expected with failurePolicy: Ignore, thank you for the support!

@prometherion self-assigned this on Mar 2, 2023
@prometherion added the duplicate (This issue or pull request already exists) label and removed the bug (Something isn't working) and blocked-needs-validation (Issue need triage and validation) labels on Mar 2, 2023