
Azure Kubernetes Service with Capsule chart installed is unable to start after shutdown #719

Closed
JacekLakis-TomTom opened this issue Mar 1, 2023 · 4 comments
Labels
duplicate This issue or pull request already exists

JacekLakis-TomTom commented Mar 1, 2023

Bug description

After stopping and starting an Azure Kubernetes Service cluster with the Capsule Helm chart installed, the nodes are unable to become Ready and the cluster is not functional.

How to reproduce

Steps to reproduce the behavior:

  1. Create & connect to AKS
$ az group create -g capsule-shutdown-test -l westeurope
$ az aks create -n aks-capsule-shutdown-test -g capsule-shutdown-test
$ az aks get-credentials -n aks-capsule-shutdown-test -g capsule-shutdown-test
$ kubelogin convert-kubeconfig -l azurecli
  2. Install Capsule
$ helm -n capsule-system install capsule clastix/capsule --create-namespace
  3. Stop cluster
$ az aks stop -n aks-capsule-shutdown-test -g capsule-shutdown-test
  4. Start cluster
$ az aks start -n aks-capsule-shutdown-test -g capsule-shutdown-test # This times out

The command above times out, the nodes remain in the NotReady state, and every node reports a CIDR assignment problem:

  Type     Reason                   Age                    From             Message
  ----     ------                   ----                   ----             -------
  Normal   Starting                 13m                    kube-proxy       
  Normal   Starting                 14m                    kubelet          Starting kubelet.
  Warning  InvalidDiskCapacity      14m                    kubelet          invalid capacity 0 on image filesystem
  Normal   NodeHasSufficientMemory  14m (x2 over 14m)      kubelet          Node aks-nodepool1-34984166-vmss000005 status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    14m (x2 over 14m)      kubelet          Node aks-nodepool1-34984166-vmss000005 status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     14m (x2 over 14m)      kubelet          Node aks-nodepool1-34984166-vmss000005 status is now: NodeHasSufficientPID
  Normal   NodeAllocatableEnforced  14m                    kubelet          Updated Node Allocatable limit across pods
  Normal   RegisteredNode           14m                    node-controller  Node aks-nodepool1-34984166-vmss000005 event: Registered Node aks-nodepool1-34984166-vmss000005 in Controller
  Normal   CIDRAssignmentFailed     4m5s (x1037 over 14m)  cidrAllocator    Node aks-nodepool1-34984166-vmss000005 status is now: CIDRAssignmentFailed

I was able to get the cluster running again by removing the nodes.capsule.clastix.io webhook from the ValidatingWebhookConfiguration:

$ kubectl edit validatingwebhookconfiguration capsule-validating-webhook-configuration

After removing the nodes validator, everything is back to normal and pods are being scheduled. It looks like the kubelets' node updates are being validated by a webhook whose server pod cannot be scheduled until a node is Ready, i.e. a chicken-and-egg problem (?).
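For reference, the webhook entry involved looks roughly like the fragment below. This is an illustrative sketch, not the exact object: field values (in particular the rules) may differ between Capsule versions, so inspect the live configuration with `kubectl get validatingwebhookconfiguration capsule-validating-webhook-configuration -o yaml`.

```yaml
# Fragment of capsule-validating-webhook-configuration (illustrative).
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: capsule-validating-webhook-configuration
webhooks:
  - name: nodes.capsule.clastix.io
    # With failurePolicy: Fail, node updates are rejected while the webhook
    # service is unreachable -- e.g. right after a cluster start, when the
    # Capsule pod itself has nowhere to run yet.
    failurePolicy: Fail
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["UPDATE"]
        resources: ["nodes"]
```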

Expected behavior

Stopping and starting an AKS cluster with Capsule installed works without manual intervention.

Additional context

  • Capsule version: clastix/capsule:v0.2.1 (from latest helm chart)
  • Helm Chart version: capsule-0.3.5
  • Kubernetes version: v1.24.9
@JacekLakis-TomTom added the blocked-needs-validation (Issue need triage and validation) and bug (Something isn't working) labels on Mar 1, 2023
@prometherion (Member)

Hey Jacek, thanks for reporting this.

I suspect it's a duplicate of this: #597 (comment)

The Helm chart already offers customization of the webhooks: may I ask you to give it a try, please?
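If the chart follows its usual pattern, the override would look something like the values fragment below. The `webhooks.nodes.failurePolicy` key path is my assumption from the chart's defaults; verify it against your chart version with `helm show values clastix/capsule` before applying.

```yaml
# values-override.yaml -- key path assumed, check `helm show values clastix/capsule`
webhooks:
  nodes:
    # Ignore lets node updates through while the webhook service is
    # unreachable, avoiding the startup deadlock described above.
    failurePolicy: Ignore
```

This could then be applied with `helm -n capsule-system upgrade capsule clastix/capsule -f values-override.yaml`.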

@JacekLakis-TomTom (Author)

@prometherion Thank you for the response. I did see this configuration option; I can set the failurePolicy to Ignore, but would there be any disadvantage to doing so when I use capsule-proxy without BYOD?

@prometherion (Member)

You're right, those webhooks are required mostly for the BYOD feature.

Unless you're allowing the Tenant Owners to label their own nodes, you can safely relax that failure policy.

@JacekLakis-TomTom (Author)

Works as expected with failurePolicy: Ignore, thank you for the support!

@prometherion self-assigned this on Mar 2, 2023
@prometherion added the duplicate (This issue or pull request already exists) label and removed the bug (Something isn't working) and blocked-needs-validation (Issue need triage and validation) labels on Mar 2, 2023