KCP Doesn't Remediate Faulty Machines During Cluster Formation #7496
Comments
This is as designed right now: KCP will not do remediation based on MHC until at least the desired number of healthy KCP machines are running. This is to ensure stability while a cluster is coming up. On your unhealthy machine you should see a log like:
If that's there, then MHC is correctly labelling the machine for remediation, but KCP is specifically deciding not to remediate until there is a stable control plane. That said, if there's a safe, stable way to do this it could be interesting. One option today is to implement externalRemediation to manage this outside of core Cluster API. It's a hard problem: when the underlying infrastructure isn't working, it's likely another control plane machine will also fail, because there's a real environment issue, in your case the network being cut off for one of the KCP nodes.
Thanks for that feedback @killianmuldoon. I certainly don't know the basis behind the design decision here (i.e., why remediating during CP formation is risky). Its downside, as demonstrated, is that the partially provisioned CP will remain stuck in that state: the new CP machine can never join the cluster, and CAPI keeps waiting for it to. Stable, yes, but not in a useful way. I'd like CAPI to be able to recover from provisioning problems occurring during cluster formation that it "knows how to" recover from after cluster formation completes. Can you or anyone shed more light on the risk of remediating during CP formation?
The major risk at this point is that the etcd cluster is knocked into a state that it can't automatically recover from, e.g. losing the leader or losing the majority. Given that this is happening at bootstrap time, it's probably easier and faster to just automatically restart if you're confident the KCP machine failure is something flaky, rather than something clearly wrong with the underlying infrastructure.
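The quorum arithmetic behind that risk is worth spelling out: etcd needs a strict majority of members to keep accepting writes, so a still-forming control plane has no failure headroom at all. A minimal sketch (these helper functions are illustrative, not part of the KCP codebase):

```go
package main

import "fmt"

// quorum returns the minimum number of healthy etcd members
// required for the cluster to keep accepting writes (strict majority).
func quorum(members int) int {
	return members/2 + 1
}

// tolerableFailures returns how many members can fail before
// the cluster loses majority and can no longer recover on its own.
func tolerableFailures(members int) int {
	return members - quorum(members)
}

func main() {
	// A forming 3-replica control plane passes through 1- and 2-member
	// etcd states, both of which tolerate zero failures.
	for _, n := range []int{1, 2, 3} {
		fmt.Printf("members=%d quorum=%d tolerable failures=%d\n",
			n, quorum(n), tolerableFailures(n))
	}
}
```

This shows why remediating (i.e. deleting a machine) while the cluster is at 1 or 2 members is where the "knocked into an unrecoverable state" risk concentrates: only at 3 members does the cluster tolerate a single failure.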
/triage accepted

I agree this is an interesting new use case to cover if we can find a safe, stable way to do it. Some context that I hope can help in shaping the discussion:
Now, as reported above, the last condition prevents remediation during cluster formation; before relaxing this check in this new iteration, IMO we should address at least the following questions:
/area control-plane
/assign

I'm working on some ideas to solve this problem; I will follow up with more details here or in a PR with an amendment to the KCP proposal.
#7855 proposes an amendment to the KCP proposal so it will be possible to remediate failures happening while provisioning the CP (both the first CP machine and additional CP machines while current replicas < desired replicas). In order to make this more robust and less aggressive on the infrastructure (e.g. avoid infinite remediation if the first machine fails consistently), I have added optional support for controlling the number of retries and the delay between each retry.
What steps did you take and what happened:
```
I1104 14:18:03.141623       1 controller.go:364] "Scaling up control plane" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" kubeadmControlPlane="default/jweite-test-control-plane" namespace="default" name="jweite-test-control-plane" reconcileID=5f541f90-9549-496e-81c0-9befe23c1994 cluster="jweite-test" Desired=3 Existing=2
I1104 14:18:03.141831       1 scale.go:212] "msg"="Waiting for control plane to pass preflight checks" "cluster-name"="jweite-test" "name"="jweite-test-control-plane" "namespace"="default" "failures"="[machine jweite-test-control-plane-zqgfk does not have APIServerPodHealthy condition, machine jweite-test-control-plane-zqgfk does not have ControllerManagerPodHealthy condition, machine jweite-test-control-plane-zqgfk does not have SchedulerPodHealthy condition, machine jweite-test-control-plane-zqgfk does not have EtcdPodHealthy condition, machine jweite-test-control-plane-zqgfk does not have EtcdMemberHealthy condition]"
```
What did you expect to happen:
The KCP to remediate the bad machine by deleting it.
Anything else you would like to add:
From my read of reconcileUnhealthyMachines() in controlPlane/kubeadm/internal/controller/remediation.go, it insists that the cluster be fully formed (provisioned machines == desired replicas) before it will act. But the cluster cannot fully form if a machine that successfully started cannot join the cluster because of an external issue such as the one I simulated. IMO remediation would be an appropriate response to this situation.
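The deadlock described above can be reduced to a single predicate. A minimal sketch, assuming a simplified model of the check (the function name is hypothetical, not the actual KCP code):

```go
package main

import "fmt"

// canRemediate mirrors the gating behavior described above: KCP only
// acts on an unhealthy machine once the control plane is fully formed,
// i.e. the number of provisioned machines has reached the desired
// replica count. (Hypothetical simplification, not the real function.)
func canRemediate(provisionedMachines, desiredReplicas int) bool {
	return provisionedMachines >= desiredReplicas
}

func main() {
	// The situation from this report: 2 machines exist, 3 are desired,
	// and one existing machine is unhealthy. Remediation is skipped,
	// but the unhealthy machine also blocks ever reaching 3 replicas,
	// so the predicate can never become true: a deadlock.
	fmt.Println(canRemediate(2, 3)) // false: cluster still forming
	fmt.Println(canRemediate(3, 3)) // true: fully formed
}
```

The circularity is the bug being reported: remediation waits for full formation, while full formation waits on a machine that only remediation can replace.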
Environment:
- Kubernetes version (use `kubectl version`): v1.20.10
- OS (e.g. from `/etc/os-release`): Darwin: MacOS 12.6.1

/kind bug