Unable to start aks cluster after capsule installation #758

Closed
micke-post opened this issue May 9, 2023 · 8 comments
Labels
duplicate This issue or pull request already exists question Further information is requested

Comments

@micke-post

Bug description

We are using Capsule and capsule-proxy on hosted AKS clusters and stumbled over an issue that broke a couple of them. Basically, if all nodes of a cluster are stopped and then started again, the cluster runs into a failed state that forced us to delete it.

From what we gathered, the root of the issue is how Capsule and hosted clusters (don't) work together. It is common for cloud providers to automatically set a node.cloudprovider.kubernetes.io/uninitialized taint on nodes after they are started or after scaling out (also see https://kubernetes.io/docs/reference/labels-annotations-taints/#node-cloudprovider-kubernetes-io-uninitialized). I believe the idea is basically to block new nodes from running workloads until their startup procedures have completed and they are known to work properly for the hosted service. Once that's done, the cloud provider sends a request to the Kubernetes API to remove the taint and allow workloads to run on the node.
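For illustration, a freshly started node then looks roughly like this (sketch only; the node name is just an example):

apiVersion: v1
kind: Node
metadata:
  name: aks-nodepool1-12345678-vmss000000  # example AKS node name
spec:
  taints:
    - key: node.cloudprovider.kubernetes.io/uninitialized
      value: "true"
      effect: NoSchedule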

This works fine for single nodes, e.g. after scaling out or an upgrade. But if an entire cluster has been shut down (e.g. for troubleshooting, disaster recovery, or because it's a test instance that doesn't need to be running all the time), all nodes will have that taint, so all (non-critical) workloads are blocked until the taint is removed from these nodes.

The issue is that with Capsule installed, (as far as I understand it) these calls to the management API first have to pass Capsule's node webhook, which isn't running at this point. This turns into a chicken-and-egg problem: Azure cannot clear the tainted nodes because the Capsule webhook is not available to admit the request, and Capsule cannot start because all nodes it could run on are tainted.
As a result, Azure keeps waiting for the cluster to finish its startup procedure, during which the cluster cannot be stopped or deleted since it's stuck in a starting state. After around 3-4 hours Azure finally gives up and lets you retry the startup or delete it.

Now, maybe there's a way to recover a cluster in this state - but since this happened on dev environments it was easier to just delete it and deploy the cluster from scratch, so we didn't invest a lot of time into troubleshooting it. We ultimately solved the problem by adding tolerations for the taint to capsule and capsule-proxy and (for good measure) raising the priority of the pods to ensure they always run:

tolerations: 
  - key: "CriticalAddonsOnly"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  - key: "node.cloudprovider.kubernetes.io/uninitialized"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"

priorityClassName: system-node-critical
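We apply these values through the Helm charts as usual, e.g. something like the following (assuming the charts come from the clastix Helm repository; the values file names are just examples):

helm upgrade --install capsule clastix/capsule \
  --namespace capsule-system -f capsule-values.yaml
helm upgrade --install capsule-proxy clastix/capsule-proxy \
  --namespace capsule-system -f capsule-proxy-values.yaml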

When the issue occurs, the error message looks like this:

kind: Event
apiVersion: v1
metadata:
  name: capsule-controller-manager-8558cb9f-mrmx2.175d20f1ebd0fad9
  namespace: capsule-system
  uid: 5bf63e01-92f4-4acd-af4d-af9c3f01f421
  resourceVersion: '2567'
  creationTimestamp: '2023-05-08T09:27:39Z'
  managedFields:
    - manager: kube-scheduler
      operation: Update
      apiVersion: events.k8s.io/v1
      time: '2023-05-08T09:27:39Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:action: {}
        f:eventTime: {}
        f:note: {}
        f:reason: {}
        f:regarding: {}
        f:reportingController: {}
        f:reportingInstance: {}
        f:type: {}
involvedObject:
  kind: Pod
  namespace: capsule-system
  name: capsule-controller-manager-8558cb9f-mrmx2
  uid: 7129bb2d-3e12-415f-a1c3-488e91a5460a
  apiVersion: v1
  resourceVersion: '26850'
reason: FailedScheduling
message: >-
  0/5 nodes are available: 1 node(s) had untolerated taint
  {node.cloudprovider.kubernetes.io/uninitialized: true}, 1 node(s) had
  untolerated taint {virtual-kubelet.io/provider: azure}, 3 node(s) had
  untolerated taint {CriticalAddonsOnly: true}. preemption: 0/5 nodes are
  available: 5 Preemption is not helpful for scheduling.
source: {}
firstTimestamp: null
lastTimestamp: null
type: Warning
eventTime: '2023-05-08T09:27:39.121978Z'
action: Scheduling
reportingComponent: default-scheduler
reportingInstance: default-scheduler-kube-scheduler-v2-744b659fbd-krxg9

Additional context

We are using the following versions with no changes made to the default values file for capsule or capsule-proxy when the error occurred:

  • Helm Chart version: capsule-0.4.2, capsule-proxy-0.4.3

To be clear - we mitigated the problem on our end; I just opened this issue to highlight a possible gotcha when getting started with Capsule on Azure.

@micke-post micke-post added blocked-needs-validation Issue need triage and validation bug Something isn't working labels May 9, 2023
@prometherion prometherion self-assigned this May 9, 2023
@prometherion
Member

Hey @micke-post, could this be a duplicate of this issue?

#597 (comment)

@micke-post
Author

I don't think so; one of the first things I tried was adding the following configuration to the Capsule Helm values file:

webhooks:
  nodes:
    # Set to ignore to prevent the cluster from breaking when restarting with capsule installed
    failurePolicy: Ignore

  namespaceOwnerReference:
    failurePolicy: Ignore
  cordoning:
    failurePolicy: Ignore
  ingresses:
    failurePolicy: Ignore
  namespaces:
    failurePolicy: Ignore
  networkpolicies:
    failurePolicy: Ignore
  pods:
    failurePolicy: Ignore
  persistentvolumeclaims:
    failurePolicy: Ignore
  tenants:
    failurePolicy: Ignore
  tenantResourceObjects:
    failurePolicy: Ignore
  services:
    failurePolicy: Ignore
  defaults:
    ingress:
      failurePolicy: Ignore
    pvc:
      failurePolicy: Ignore
    pods:
      failurePolicy: Ignore

But that didn't seem to make any difference to the deployment.

@MaxFedotov
Collaborator

I think it was because Capsule was unable to start and patch the webhook configuration due to the taints on the nodes. One way to fix it is to add a toleration for this taint to the Capsule deployment.

@prometherion
Member

I think it was because capsule was unable to start and patch webhook configuration due to taints on nodes.

I'm not getting the point of the issue here, also considering the code.

https://github.com/clastix/capsule/blob/5977bbd9e11fba4c5f947c83912208c291e0b642/controllers/tls/manager.go#L255-L273

Capsule patches the webhook by adding the required CA; we're not changing the failure policy of the nodes webhook.
Please, may I ask you to double-check whether the ValidatingWebhookConfiguration nodes webhook is set to Ignore as its failure policy?
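Something along these lines should show it (the exact webhook configuration name depends on the installation, so look it up first):

kubectl get validatingwebhookconfigurations
# pick the Capsule one from the list, then print each webhook's failure policy:
kubectl get validatingwebhookconfiguration <capsule-webhook-configuration-name> \
  -o jsonpath='{range .webhooks[*]}{.name}{" -> "}{.failurePolicy}{"\n"}{end}'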

Because this is what happened to @JacekLakis-TomTom with #719 and he was able to solve it as follows.

#719 (comment)

@MaxFedotov
Collaborator

MaxFedotov commented May 9, 2023

@prometherion oops, you are right there. My fault, we really aren't patching the configuration in code. Nevertheless, setting tolerations: operator: "Exists" for Capsule will help prevent a lot of possible issues with webhooks when the cluster is not in a stable state.
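In the chart values that would look roughly like this (note that it tolerates every taint, so apply it consciously):

tolerations:
  - operator: "Exists"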

@micke-post
Author

Alright, never mind. I have now retried the whole configuration on multiple newly deployed clusters and I cannot get them to fail again, tolerations or not. I'm still going to leave the tolerations in since, as @MaxFedotov mentioned, it's probably not a bad idea to have them, but it now looks like just setting the webhooks to Ignore is fine.

One more question - I'm now setting all webhooks to ignore, but I guess that's not the best idea? Is there a specific webhook I should set to ignore for the startup to work?

@prometherion
Member

One more question - I'm now setting all webhooks to ignore, but I guess that's not the best idea? Is there a specific webhook I should set to ignore for the startup to work?

With an Ignore failure policy you're getting best-effort multi-tenancy in Kubernetes, e.g.:

Say I'm a Tenant Owner with an assigned quota of Namespaces I can create. If, for any reason, Capsule is not able to serve the validation webhook for Namespace creation, the request is sent to the API Server directly without Capsule's mangling, ending up creating more Namespaces than the quota allows and bypassing it.

This applies to all the features offered by Capsule, so play it at your own risk: I would suggest running Capsule with a higher priority class, on a dedicated node pool that is not going to be shut down, and with two replicas so you can ensure HA for the webhooks.
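As a rough sketch of what that could look like in the chart values (assuming the chart exposes the usual replicaCount/nodeSelector/priorityClassName keys; the agentpool value is just an example for a dedicated AKS system node pool, so double-check against the chart's values.yaml):

replicaCount: 2
priorityClassName: system-node-critical
nodeSelector:
  agentpool: system  # example: a dedicated system node pool that stays up
tolerations:
  - key: "CriticalAddonsOnly"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"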

@prometherion
Member

Closing since it seems solved to me; feel free to reopen it, @micke-post!

@prometherion prometherion added duplicate This issue or pull request already exists question Further information is requested and removed bug Something isn't working blocked-needs-validation Issue need triage and validation labels Jun 2, 2023