Self healing control plane: Enable MHC to watch control plane machines #2836
Comments
/area control-plane
/milestone v0.3.x
I'd like to propose that for control plane nodes managed by a controller reference (e.g. KubeadmConfig) we annotate the Machine without deleting it. To describe the flow in a little more detail:
The above introduces a new contract for control plane providers: they'll need to respect the annotation on each reconciliation loop and remediate on their own. On the other hand, it gives more power to the control plane controllers to decide when and how to remediate their machines. Thoughts, questions, concerns?
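To make the "mark instead of delete" step concrete, here is a minimal sketch under stated assumptions: the Machine type is a simplified stand-in for the real CAPI type, and the annotation key is hypothetical, not part of the actual API.

```go
package main

import (
	"fmt"
	"time"
)

// Simplified stand-in for a CAPI Machine: only the fields this sketch needs.
type Machine struct {
	Name        string
	OwnerKind   string // kind of the machine's controller owner reference
	Annotations map[string]string
}

// Hypothetical annotation key; the real name would be settled in the proposal.
const unhealthyAnnotation = "example.cluster.x-k8s.io/unhealthy"

// remediate sketches the proposed split: for machines owned by a control plane
// controller, MHC only marks the machine as unhealthy and leaves remediation to
// the owner; for everything else it keeps today's behaviour and requests deletion.
func remediate(m *Machine, isControlPlaneOwner bool, requestDelete func(*Machine)) {
	if !isControlPlaneOwner {
		requestDelete(m)
		return
	}
	if m.Annotations == nil {
		m.Annotations = map[string]string{}
	}
	m.Annotations[unhealthyAnnotation] = time.Now().UTC().Format(time.RFC3339)
}

func main() {
	m := &Machine{Name: "cp-0", OwnerKind: "KubeadmControlPlane"}
	remediate(m, m.OwnerKind == "KubeadmControlPlane", func(*Machine) {})
	fmt.Println(m.Annotations)
}
```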
/milestone v0.3.4
cc @ncdc @sedefsavas
@vincepri your comment got me thinking: what if, instead of modifying the MHC behavior based on the type of the controller reference, we modified the Machine controller to handle deletion differently in that case? That would allow blocking deletion behavior not only for MHC-initiated deletions, but for any external deletions, until the owning controller "allows" the deletion (possibly via annotation). It might be good to extend that behavior to machine infrastructure providers as well, since they own the actual resource that is probably more important to block deletion of.
@detiber Thinking through how that would work
This would be interesting to pursue. Depending on how we implement it, we could allow any generic owner-controller reference to act this way, although this would require changes to the MachineSet controller as well. We could use a special finalizer here that KCP adds on create, instead of a specific annotation, which would make the behavior opt-in. One thing to consider is that KCP would then have to patch before issuing a delete on Machines.
I do like the idea of making the feature opt-in rather than opt-out. I do worry a bit about trying to use a finalizer for this, since finalizers block the deletion of the underlying k8s resource from storage but do not block the reconciliation of the deletion by other controllers. If we follow that path, the deletion is likely going to be too far along for us to be able to attempt to maintain quorum in a recoverable way before it's too late.

We could block reconciliation of the delete in the Machine (and InfraMachine) controller(s), but that would diverge from the more common semantic use of finalizers in Kubernetes. We could keep the same behavior by applying an annotation that blocks deletion reconciliation during resource creation in KCP (and anywhere else that wants to implement the behavior) and have KCP remove the annotation to "allow" the deletion to proceed.

This wouldn't preclude the addition/removal of a finalizer, which might not be a bad thing to also add; I just worry about overloading the generally understood semantics of a finalizer for the complete implementation of the behavior we are potentially describing.
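A rough sketch of what that opt-in "hold" could look like on the Machine controller side; the annotation name below is a placeholder, not an existing CAPI annotation.

```go
package sketch

// Hypothetical opt-in annotation: the owning controller (e.g. KCP) sets it at
// creation time and removes it when it is ready to let a deletion proceed.
const holdDeletionAnnotation = "example.cluster.x-k8s.io/delete-hold"

// shouldReconcileDelete is what the Machine (and InfraMachine) controllers could
// check before running their deletion logic: while the hold is present, the
// deletion is observed but not acted upon.
func shouldReconcileDelete(annotations map[string]string) bool {
	_, held := annotations[holdDeletionAnnotation]
	return !held
}
```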
I am +1 for the idea of this being opt-in rather than opt-out.

To me, using a finalizer seems like the natural route since finalizers are intended to be used for special deletion logic. My idea was that the Machine controller could check that the finalizers list consists only of its own finalizer before proceeding with the deletion logic. That said, I have a vague feeling this could be bad and make for a fragile blocking situation in the future; there's a comment on the types which suggests to me someone else has tried something like this before (ordered finalizers), but I don't know how likely this situation would be to arise within the CAPI ecosystem.

If we use an opt-in annotation, is there a possibility that two controllers could want to set/remove that known annotation? Should it somehow be a list of annotations (prefixed name?) so that multiple controllers could block the deletion of the machine independently if required? (Again, this feels like a list of finalizers plus checking for being the last finalizer, and I'm wondering if there's much of a difference.)
In a given controller that we own, we can control whether we respect the order and/or count of the finalizers in the list. But we can't assume that any other controller is doing the same. I think we could be reasonably assured that all controllers in cluster-api plus the infrastructure providers with which we're involved could do this, but there's no way to guarantee that an arbitrary bootstrap/control plane/infrastructure provider follows the same rules.

My gut feeling is that it would be safer only to issue a delete when we truly intend to delete a thing. In other words, I think I'm in favor of Vince's idea of having MHC "mark" (annotate/label) that a Machine is unhealthy (or maybe it could add to .status.conditions, once we have that?), and having the controller responsible for that Machine be the one that is expected to issue the delete. Just my 2 cents... oh, and we also have #1514, which has some prior discussion in this area.
For v1alpha3, we should keep the scope as minimal as possible. Supporting this feature for control plane machines is a very specific use case which we can tackle, learn from, and extend once we're ready to work on vNext. That said, I'd like us to find a common solution if we want to tackle #1514 or #2846 in v0.3.x.
Mm, I'm not fully sold on why we wouldn't put a finaliser on machines to enforce that KCP cleans up the etcd membership. That would effectively solve #2818 / #2759. If we solved the above, then this is orthogonal and transparent to MHCs, and it would work consistently for any voluntary machine disruption: if I want to manually "remediate" my machine, I just delete it. Otherwise, starting with Vince's annotation workflow seems reasonable to me.
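For illustration, a simplified sketch of that finalizer flow (stand-in types, hypothetical finalizer name): KCP removes the etcd member first, and only then releases its finalizer so the deletion can complete.

```go
package sketch

import (
	"errors"
	"time"
)

// Hypothetical finalizer name; the real one would live in the KCP API package.
const etcdCleanupFinalizer = "example.kubeadm.controlplane.cluster.x-k8s.io/etcd-cleanup"

// Simplified stand-in for a control plane Machine; the real flow would use the
// CAPI Machine type and a client patch.
type Machine struct {
	Name              string
	Finalizers        []string
	DeletionTimestamp *time.Time
}

// reconcileDeletion shows the intended ordering: when a control plane Machine is
// being deleted, KCP removes the corresponding etcd member first, and only then
// removes its finalizer so the Machine (and its infrastructure) can go away.
func reconcileDeletion(m *Machine, removeEtcdMember func(machineName string) error) error {
	if m.DeletionTimestamp == nil || !hasFinalizer(m, etcdCleanupFinalizer) {
		return nil
	}
	if err := removeEtcdMember(m.Name); err != nil {
		return errors.New("etcd member removal failed, requeue: " + err.Error())
	}
	removeFinalizer(m, etcdCleanupFinalizer)
	return nil
}

func hasFinalizer(m *Machine, f string) bool {
	for _, cur := range m.Finalizers {
		if cur == f {
			return true
		}
	}
	return false
}

func removeFinalizer(m *Machine, f string) {
	kept := make([]string, 0, len(m.Finalizers))
	for _, cur := range m.Finalizers {
		if cur != f {
			kept = append(kept, cur)
		}
	}
	m.Finalizers = kept
}
```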
If we go down the path of MHC annotating unhealthy Machines, I think we should change up the current implementation as well and keep it consistent. From a user point of view, it won't be a breaking change; the delete will be issued by another controller, but that's pretty much it. I'd like to see a Google Doc proposal though, given that there are a lot of parts that need to be touched. Possibly, we need to include use cases from #1514 or #2846.
Doesn't this require a coordinated order of operations across the KCP, Machine, and InfraMachine controllers though? Or can we do it without strict ordering?
I suppose we could block the Machine and InfraMachine controllers from proceeding until this finalizer is removed. But I don't think that is even necessary, and we can probably do it without strict ordering: KCP would watch the deletion timestamp and trigger the "remove the etcd member" step immediately, while the node still needs to be drained. Worst-case scenario: if by any chance the infra goes away before "remove the etcd member", etcd would still be functional with just a degraded member for a minimal period of time.
KCP should create a new member before the Machine is deleted, though. We should give full control to the owning controller, rather than issuing a delete and then blocking it.
Not if the number of deletions issued exceeds the amount that would cause quorum to be lost, or, worse, includes the last member, which would cause complete data loss.
maxUnhealthy cannot necessarily be relied upon, since there is nothing to ensure that it is sync'd with a value that would prevent the loss of quorum.
In the issue description above I envision this MHC being owned by the KCP controller, which would enforce maxUnhealthy.
So to summarise, I think my preference would be:
For control plane nodes, I'd prefer to go down the path of the annotation approach as outlined in #2836 (comment) (and discuss doing the same for Machines in a MachineSet). This gives the owning controller the ability to apply any remediation steps it deems necessary before proceeding, and it separates the controllers' responsibilities even more. If you're all +1 on it, we should code up the changes in the MHC controller first. After that, and after the current work on KCP is merged, we can follow up with changes to KCP to handle the annotation properly.
+1 to annotation approach.
I think using annotations will make sense for the control plane, as it needs the special handling for scaling down and back up. As for annotations in general, I have some thoughts: provided the Machine is guaranteed to be deleted, they're OK; essentially they're just a proxy for the delete request. My concern with annotations in the general sense comes from scenarios where the Machine is not going to be deleted (baremetal?). It becomes hard to define when the machine has been remediated. Who's responsible for removing the annotation from the Machine? And when? And then what prevents MHC from immediately marking it unhealthy again? You kind of have to wipe the status from the Machine (reset it, if you will) to allow the MHC to treat it as a new Machine, but is wiping the status desirable in a scenario like that?
I'd expect MHC to keep checking a Machine and remove the annotation if it goes back to healthy
Nothing should prevent this imo
MHC shouldn't know/understand that a Machine has been reset; I would expect an external reconciler to do that out of band, and outside core functionality. For example, if an external controller changes the NodeRef on the Machine's status, that's not a supported behavior from a CAPI perspective, but MHC shouldn't care what's in the value; it should solely read it and do the necessary health checks. I think it'd be great to discuss this in a little more detail in tomorrow's meeting.
MHC is intentionally simple today and based on the principle that machines are "cattle". It's up to each controller (KCP, MachineSet...) to react gracefully and protectively, if they choose to, when an MHC or something else requests a machine deletion. I think consumers of the machine semantics shouldn't care about this in their designs, in the same way that they don't care about the details for pods: they just request a machine deletion and we just honour PDBs. Is there any particular flaw/concern you see with this approach (#2836 (comment)) for this scenario?

I see room for flexibility and pluggable remediation strategies in certain environments. In particular, in a given baremetal environment a reboot might be preferred over deletion. For articulating and figuring out the best mechanism to allow this pluggability (annotations, API, CRD...), I'd expect an RFE extending the original MHC one so we can flesh out the motivations, e2e details and workflow. If we want remediation behaviour to also deviate for pools of machines, e.g. control plane machines, we should probably consider it just another use case in that wider discussion.
Annotations wouldn't change that; they would really only separate the responsibilities of the health check and the remediation process.
"Add ability to KCP controller to watch a deletion timestamp and remove the etcd member" this isn't far from what we'll have today in #2841. The biggest drawback of this approach is that we cannot control when the underlying infrastructure is getting deleted, and we cannot scale up before scaling down.
See #2846
If we can reach consensus on annotations and separation of responsibilities, we can follow up and amend the proposal.
An MHC managed by the KCP, enforcing the maxUnhealthy value based on the current replicas, would protect you from shooting yourself in the foot; that's why we have the maxUnhealthy semantic.
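The arithmetic such a KCP-managed MHC would have to keep in sync with the replica count is small; a sketch:

```go
package sketch

// maxUnhealthyForQuorum returns how many control plane machines can be unhealthy
// at once without risking etcd quorum, given the current number of replicas:
// quorum needs floor(n/2)+1 healthy members, so at most (n-1)/2 may be unhealthy.
func maxUnhealthyForQuorum(replicas int) int {
	if replicas <= 1 {
		return 0
	}
	return (replicas - 1) / 2
}

// For example: 3 replicas -> 1, 5 replicas -> 2, matching the values in the issue
// description; a KCP-managed MHC could keep spec.maxUnhealthy in sync with this.
```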
Cool, I'm all good with that. MHC should be just an integration point between node problem detector tooling and remediation strategies (if there's a need for them).
KCP has to scale up first, then delete. Can you help me understand how MHC allows for this? If we separate the concerns and have MHC only mark, and some other thing(s) remediate (sweep), I think we'll be able to make everyone happy. WDYT?
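One way to picture the "sweep" side of that split, with placeholder types: the owning controller scales up first and only deletes the marked machine once a healthy replacement is ready.

```go
package sketch

// Simplified view of the control plane as the remediating ("sweep") side: it
// reacts to machines that MHC has marked, but scales up first and only deletes
// once a replacement is ready, so quorum is preserved. Fields are placeholders.
type controlPlaneView struct {
	DesiredReplicas int
	HealthyReady    int      // machines that are ready and not marked unhealthy
	MarkedUnhealthy []string // names of machines MHC has marked unhealthy
}

// nextRemediationStep decides, very roughly, what the owning controller does next.
func nextRemediationStep(cp controlPlaneView) string {
	if len(cp.MarkedUnhealthy) == 0 {
		return "nothing to do"
	}
	// Scale up first: bring the healthy count back to the desired size.
	if cp.HealthyReady < cp.DesiredReplicas {
		return "create a replacement machine and wait for it to become ready"
	}
	// A full healthy set exists alongside the marked machine; safe to delete it now.
	return "delete machine " + cp.MarkedUnhealthy[0]
}
```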
It's a bit tricky. It could be avoided by having that remediation controller add an extra annotation to differentiate between an unhealthy Machine and an unhealthy Machine that just got remediated.
My two cents: the "machine is starting to fail checks" annotation is something that might not be relevant for this issue, but it could be a super valuable source of information for implementing conditions on machines.
Seems like we need a separate state of "remediation applied" that KCP would apply, if we want to handle a machine going from unhealthy to "maybe healthy" and have the MHC be the one that decides it's fully healthy again.
This sounds more like a matter of configuration options in MHC and timeouts, rather than having a full state machine. Understanding that remediation has been applied is something that the remediating controller needs to track, not a health check.
I think the problem to solve is how we health-check / mark unhealthy / remediate in an idempotent way; as @n1r1 says, we don't want to get into a loop. Currently, MHC signals remediation by deleting the machine, which is irreversible. Adding and removing annotations as the health status changes rather changes the current behaviour.

My concern would be that MHC marks the machine unhealthy with an annotation, the remediating controller starts the remediation process (e.g. for baremetal, an IPMI reboot of the machine), and then it somehow needs to know it's already remediated so that it doesn't re-remediate. Does it remove the annotation? If it does, MHC will probably put it straight back (unless it clears the Machine status completely and re-triggers the new node timeout, nodeStartupTimeout). Does it add another annotation to say it's remediated? If so, it then needs to remove it again once the machine goes healthy. What if it never goes healthy? Perhaps MHC needs to have a timeout between remediations. I am concerned this is going to end up being a bit state-machiney.

I don't think any of this really applies to the KCP model if it's always going to remove the machines, but that won't necessarily be the case if KCP is on baremetal, will it?
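A sketch of one possible guard against the re-remediation loop described above; both annotation names and the cooldown value are placeholders, not an agreed API.

```go
package sketch

import "time"

// Hypothetical annotations and cooldown; names and value are placeholders.
const (
	unhealthyAnnotation  = "example.cluster.x-k8s.io/unhealthy"
	remediatedAnnotation = "example.cluster.x-k8s.io/last-remediation"
	remediationCooldown  = 10 * time.Minute
)

// shouldRemediate is one way a remediating controller (e.g. a baremetal controller
// doing IPMI reboots) could avoid re-remediating in a loop: only act if the machine
// is marked unhealthy and the last remediation attempt is older than the cooldown.
func shouldRemediate(annotations map[string]string, now time.Time) bool {
	if _, unhealthy := annotations[unhealthyAnnotation]; !unhealthy {
		return false
	}
	last, ok := annotations[remediatedAnnotation]
	if !ok {
		return true
	}
	t, err := time.Parse(time.RFC3339, last)
	if err != nil {
		return true // unparsable timestamp: treat as never remediated
	}
	return now.Sub(t) >= remediationCooldown
}
```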
We have published a MachineHealthCheck enhancement proposal in Google Docs. The document is now open to everyone's comments: CAPI MachineHealthCheck enhancement.
Moving this out of v0.3.4 for now; if we get the proposal merged in this cycle, we can target full support in the next version. /milestone v0.3.x
/assign @benmoss
As per the April 27th, 2020 community guidelines, this project follows the process outlined in https://github.com/kubernetes-sigs/cluster-api/blob/master/CONTRIBUTING.md#proposal-process-caep for large features or changes. Following those guidelines, I'll go ahead and close this issue for now and defer to contributors interested in pushing the proposal forward to open a collaborative document proposal instead, and follow the process as described. /close
Goals
Non-Goals
User Story
As a user/operator, I'd like my control plane machines/nodes to be self-healing, i.e. gracefully deleted and recreated in safe scenarios where etcd quorum wouldn't be violated.
Detailed Description
The Control Plane Provider should be able to reconcile by removing the etcd member for voluntary machine disruptions, i.e. deletion.
#2818
Currently MHC only remediates machines owned by a MachineSet. With the above in place, we can relax this constraint to remediate any machine owned by a controller, i.e. a controlPlaneController (#2828).
spec.maxUnhealthy should be leveraged to ensure remediation is not triggered in a quorum-disruptive manner, e.g. only 1 for a 3-machine control plane or 2 for a 5-machine control plane.

The Control Plane is a critical and sensitive component, where we want to favour control over flexibility. Therefore I propose introducing a new field EnableAutorepair for the KubeadmControlPlaneSpec, which creates and owns a MHC resource and enforces the maxUnhealthy value appropriately based on the Control Plane size.

If we have general agreement on the approach the issue describes, we can then discuss putting any extra safety in place if we wish to prevent users from shooting themselves in the foot by creating out-of-band MHCs. E.g. we could let MHC remediate master machines only if the MHC is owned by the Control Plane.
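A minimal sketch of that suggested safety gate, with a simplified owner reference type (the real check would look at the MHC's metav1.OwnerReferences):

```go
package sketch

// Simplified owner reference for illustration only.
type ownerRef struct {
	Kind       string
	Controller bool
}

// remediationAllowed sketches the gate suggested above: control plane machines may
// only be remediated by an MHC that is itself owned (as controller) by the control
// plane, e.g. a KubeadmControlPlane.
func remediationAllowed(mhcOwners []ownerRef, controlPlaneKind string) bool {
	for _, o := range mhcOwners {
		if o.Controller && o.Kind == controlPlaneKind {
			return true
		}
	}
	return false
}
```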
Contract changes [optional]
[Describe contract changes between Cluster API controllers, if applicable.]
Data model changes [optional]
[Describe data model changes between Cluster API controllers, if applicable.]
/kind proposal