Unable to start AKS cluster after Capsule installation #758
Comments
Hey @micke-post, could this be a duplicate of this issue?
I don't think so. One of the first things I tried was adding the following configuration to the capsule helm value file:
But that didn't seem to make any difference to the deployment.
I think it was because Capsule was unable to start and patch the webhook configuration due to taints on the nodes. One of the ways to fix it is to add a toleration for this taint to the Capsule deployment.
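Something along these lines in the Capsule Deployment's pod spec would do it (an illustrative sketch only, not the reporter's actual config; adapt it to however you deploy Capsule):

```yaml
# Sketch: tolerate the cloud-provider initialization taint so Capsule pods
# can be scheduled before the cloud controller has untainted the nodes.
spec:
  template:
    spec:
      tolerations:
        - key: node.cloudprovider.kubernetes.io/uninitialized
          operator: Exists
          effect: NoSchedule
```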
I'm not getting the point of the issue here, also considering the code. Capsule is patching the webhook by adding the required CA; we're not changing the failure policy of the webhooks. Because this is what happened to @JacekLakis-TomTom with #719, and he was able to solve it as follows.
@prometherion oops, you are right there. My fault, we are really not patching the configuration in code. But nevertheless, setting tolerations for this taint on the Capsule deployment still seems like a good idea.
Alright, nevermind. I now retried the whole configuration on multiple newly deployed clusters and I cannot get them to fail again, tolerations or not. I'm still going to leave them in since, as @MaxFedotov mentioned, it's probably not a bad idea to have them, but it now looks like just setting the webhooks to ignore is fine. One more question - I'm now setting all webhooks to ignore, but I guess that's not the best idea? Is there a specific webhook I should set to ignore for the startup to work?
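For context, the knob being discussed is the `failurePolicy` field on the webhook configuration. A minimal sketch (the configuration and webhook names below are placeholders and the entry is trimmed to the relevant field; check the names your install actually registers):

```yaml
# Placeholder names; only failurePolicy matters for this discussion.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: capsule-validating-webhook-configuration   # assumed name
webhooks:
  - name: namespaces.capsule.clastix.io            # placeholder webhook entry
    failurePolicy: Ignore   # the default, Fail, rejects matching requests while the webhook is unreachable
    # clientConfig, rules, sideEffects, admissionReviewVersions omitted for brevity
```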
With a `failurePolicy: Ignore`, the scenario is: I'm a Tenant Owner with an assigned quota of Namespaces I can create. If for any reason Capsule is not able to serve the validation webhook for Namespace creation, the request is sent to the API Server directly without the Capsule mangling, ending up in creating multiple Namespaces and bypassing the quota. This is applicable to all the features offered by Capsule, so play it at your own risk: I would suggest running Capsule with a higher priority class, on a specific node pool that is not going to be shut down, and with two replicas so you can ensure HA for the webhooks.
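As an illustration of that hardening (the class name and value here are arbitrary choices, not something the chart creates for you):

```yaml
# Hypothetical PriorityClass to keep Capsule scheduled ahead of ordinary workloads.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: capsule-critical        # arbitrary name
value: 1000000000               # highest value allowed for user-defined classes
globalDefault: false
description: "Keep Capsule running so its admission webhooks stay reachable."
```

You would then reference it from the Capsule Deployment via `priorityClassName`, bump `replicas` to 2, and pin the pods to a stable/system node pool with a `nodeSelector` or affinity.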
Closing since it seems solved to me, feel free to open it back, @micke-post!
Bug description
We are using Capsule and Capsule Proxy on hosted AKS clusters and stumbled over an issue that broke a couple of them. Basically, if all nodes of a cluster are stopped and then started again, the cluster ends up in a failed state that forced us to delete it.
From what we gathered, the root of the issue is how Capsule and hosted clusters (don't) work together. It is common for cloud providers to automatically set a node.cloudprovider.kubernetes.io/uninitialized taint on nodes after they are started or after scaling out (also see https://kubernetes.io/docs/reference/labels-annotations-taints/#node-cloudprovider-kubernetes-io-uninitialized). The idea is basically to block a new node from running workloads until its startup procedures have completed, to ensure it works properly for the hosted service. Once that's done, the cloud provider sends a request to the Kubernetes API to remove the taint and allow workloads to run on the node.
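On an affected node the taint shows up roughly like this (the node name is made up):

```yaml
# Illustrative only: how the initialization taint appears in a Node spec.
apiVersion: v1
kind: Node
metadata:
  name: aks-nodepool1-12345678-vmss000000   # hypothetical node name
spec:
  taints:
    - key: node.cloudprovider.kubernetes.io/uninitialized
      value: "true"
      effect: NoSchedule
```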
This works fine for single nodes after scaling out or upgrades. But if an entire cluster has been shut down (e.g. for troubleshooting, disaster recovery, or because it's a test instance that doesn't need to be running all the time), all nodes will have that taint, so all (non-critical) workloads are blocked until the taint is removed from these nodes.
The issue is that with Capsule installed, (as far as I understand it) these calls to the management API are automatically rerouted to Capsule instead, which isn't running at this point. This turns it into a chicken-and-egg problem: Azure cannot clear the tainted nodes because the Capsule API is not available to receive the request, and Capsule cannot start because all nodes it could run on are tainted.
As a result, Azure keeps waiting for the cluster to finish its startup procedure, during which the cluster cannot be stopped or deleted since it's stuck in a starting state. After around 3-4 hours Azure finally gives up and lets you retry the startup or delete the cluster.
Now, maybe there's a way to recover a cluster in this state, but since this happened on dev environments it was easier to just delete it and deploy the cluster from scratch, so we didn't invest a lot of time in troubleshooting it. We ultimately solved the problem by adding tolerations for the taint to capsule and capsule-proxy and (for good measure) raised the priority of the pods to ensure they always run:
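Roughly along these lines, per chart (an illustrative sketch, not our exact values; the value keys depend on the capsule and capsule-proxy chart versions, so check each chart's values.yaml rather than copying this verbatim):

```yaml
# Sketch of chart values, assuming the charts expose standard
# tolerations and priorityClassName settings.
tolerations:
  - key: node.cloudprovider.kubernetes.io/uninitialized
    operator: Exists
    effect: NoSchedule
priorityClassName: capsule-critical   # a pre-created PriorityClass, as sketched above
```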
When the issue occurs, the error message looks like this:
Additional context
We are using the following versions with no changes made to the default values file for capsule or capsule-proxy when the error occurred:
To be clear: we mitigated the problem on our end; I just opened this issue to highlight a possible gotcha when getting started with Capsule on Azure.