External Machine remediation #2846
Comments
Given that …
@vincepri So if an infrastructure provider needs its own remediation logic, would it be better to implement a whole new provider-specific MachineHealthCheck controller and not use the Cluster API implementation at all?
It seems sub-optimal to duplicate the development and maintenance effort when there is 90-95% overlap between the current MHC and a version of it that facilitates provider-specific workflows.
Waiting for:

> We're working on an enhancement and will link to it from this issue once it's ready.

wrt #1514: because metal takes so much longer to reprovision than clouds, the workflows we're most interested in at this stage involve avoiding Machine deletion rather than delaying it, so at first glance it doesn't seem like a good fit - although the hook pattern looks like something we might be able to reuse.
Just for some more context regarding Machine remediation (pre-delete hooks): there is a solution we were brainstorming yesterday while thinking about the control plane. You can read more about it here and in some comments below that; it follows somewhat what's in #1514.
How would you recreate the Machine in this use case? As I read more of the use case you mentioned, it almost sounds like you might need a remediation strategy on MHC? Looking forward to reading the proposal.
/milestone Next

Setting the milestone to Next until we have a better idea of what the proposal entails and where it would fit.
We wouldn’t. Returning the node to a healthy state could/would be achieved by (for example) power cycling the hardware instead. No machine deletion or recreation required
That would be ideal, but mostly we've been hearing that what we want is too specific to bare metal.
@vincepri If you replace KCP with “a bare metal specific controller” and “scaling down and then back up” with “rebooting the node”, then you’ve got pretty much what we’re doing downstream at the moment. There are some challenges with that approach, which is why we’ll propose a MachineRemediationRequest CRD.
@vincepri we would like to discuss this issue at tomorrow's community meeting. Here’s an idea of what we’d like to discuss: "We would like Cluster API to […]". After the matter has been discussed at the meeting, we could make a formal proposal. I think it would be good to know first what kind of reception such a proposal might receive. We can talk through all the alternatives at the meeting, but this is how we would like to see health checking and remediation work in the future.
I think it's useful to have MHC do the marking, and something else do the sweeping. This is what @vincepri is proposing in #2836 (comment). It sounds like you are interested in having custom remediation logic, so it's not always "delete". That makes sense to me. We might be able to handle all of this with annotations on the Machine (the equivalent of something like `mhc.unhealthy=true` and `mhc.strategy=metal3`). MHC would be responsible for removing `mhc.unhealthy` when it's back to normal. Would something like that be sufficient?
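For illustration, a minimal Go sketch of how a provider-specific controller could consume the annotations suggested above. The keys `mhc.unhealthy` and `mhc.strategy` come from this comment and are examples only, not an agreed API:

```go
package main

import "fmt"

// Hypothetical annotation keys from the comment above; not an agreed-upon API.
const (
	unhealthyAnnotation = "mhc.unhealthy" // set by MHC while the Machine fails its health checks
	strategyAnnotation  = "mhc.strategy"  // names the remediation strategy, e.g. "metal3"
)

// needsExternalRemediation reports whether a Machine (represented here only by
// its annotations) has been marked unhealthy and assigned to the given strategy.
func needsExternalRemediation(annotations map[string]string, strategy string) bool {
	if _, marked := annotations[unhealthyAnnotation]; !marked {
		return false
	}
	return annotations[strategyAnnotation] == strategy
}

func main() {
	machineAnnotations := map[string]string{
		unhealthyAnnotation: "true",
		strategyAnnotation:  "metal3",
	}
	// A metal3 remediation controller would pick this Machine up; the default
	// delete-based remediation would leave it alone.
	fmt.Println(needsExternalRemediation(machineAnnotations, "metal3")) // true
}
```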
@ncdc @vincepri yes, we are looking for custom remediation. A good thing about the CRD is that providers could annotate the MRR and MHC could watch the status of the remediation. For example, a reboot or reprovision of the physical node takes a long time; if we need to reboot the node multiple times, we are talking about minutes, maybe more. In these cases annotations like […] We also need to figure out what to do when the provider is unable to remediate the node. With an MRR CR we can solve these cases as well: providers could set […] There are many corner cases when working with bare metal, and remediation of such a node is not always straightforward. But I think CAPI MHC should delete the unhealthy machine when things go wrong. I am not totally sure whether we can cover this with @vincepri's proposal.
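To make the MRR idea concrete, a rough Go sketch of what such a CRD's types might look like. Every name and field here is an assumption for illustration, not the proposed API:

```go
// Package v1alpha1 sketches a hypothetical MachineRemediationRequest (MRR) CRD.
package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// MachineRemediationRequestSpec identifies the Machine to remediate.
type MachineRemediationRequestSpec struct {
	// MachineRef points at the unhealthy Machine.
	MachineRef corev1.ObjectReference `json:"machineRef"`
}

// MachineRemediationRequestStatus is written by the provider's remediation
// controller and watched by MHC.
type MachineRemediationRequestStatus struct {
	// Phase could be e.g. "Rebooting", "Reprovisioning", "Succeeded" or "Failed".
	// On "Failed" (or a timeout), MHC could fall back to deleting the Machine.
	Phase string `json:"phase,omitempty"`
	// LastUpdated lets MHC detect remediations that appear stuck.
	LastUpdated *metav1.Time `json:"lastUpdated,omitempty"`
}

// MachineRemediationRequest ties spec and status together.
type MachineRemediationRequest struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   MachineRemediationRequestSpec   `json:"spec,omitempty"`
	Status MachineRemediationRequestStatus `json:"status,omitempty"`
}
```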
I would actually assume that the provider's MRC would do that |
We've been able to make do with an annotation downstream, but there are various issues that would be alleviated by using a CRD instead. Nir will join the community call to elaborate and discuss.
I posted some thoughts on the remediation problem on the other thread regarding KCP MHC integration; they may actually be more relevant to this thread: #2836 (comment). In particular, the idea of having some sort of delay, i.e. MHC adds two(?) annotations, one saying […]. Then, the MHC would normally […]. (Downstream, where we have done this annotation approach,) […] immediately re-add the […]
Downstream we solved this by having an extra annotation to indicate that remediation is in progress, and removing it (and the MHC annotation) only after we think that remediation was successful (see my PR downstream if this helps).
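A compile-only Go sketch of the delay idea from the two comments above, as it might look inside an MHC-style reconciler built on controller-runtime. The grace period, the in-progress marker, and the helper names are assumptions, not existing MHC code:

```go
package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// remediationGracePeriod is how long MHC would wait for an external remediator
// before falling back to its default delete-based remediation (value assumed).
const remediationGracePeriod = 10 * time.Minute

// reconcileUnhealthyMachine is called for a Machine already marked unhealthy.
// markedAt is when the unhealthy marker was added; inProgress is true when an
// external remediator has added its "remediation in progress" marker.
func reconcileUnhealthyMachine(ctx context.Context, markedAt time.Time, inProgress bool) (ctrl.Result, error) {
	if inProgress {
		// An external controller owns this Machine; check back later instead
		// of deleting it out from under the remediation.
		return ctrl.Result{RequeueAfter: remediationGracePeriod}, nil
	}
	if wait := remediationGracePeriod - time.Since(markedAt); wait > 0 {
		// Give an external remediator time to claim the Machine first.
		return ctrl.Result{RequeueAfter: wait}, nil
	}
	// Nobody claimed the Machine in time: fall back to the default strategy.
	return ctrl.Result{}, deleteMachine(ctx)
}

// deleteMachine stands in for MHC's existing delete-based remediation.
func deleteMachine(ctx context.Context) error { return nil }
```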
@vincepri pointed me at this issue as a point of collaboration. I have kubernetes/enhancements#1411 as a way of coordinating disruptive actions against nodes. In particular, disruptive actions might not be limited to just cluster-api components. For example, in OpenShift, we update on-disk configuration and reboot hosts. We obviously don't want to remediate a system while this is happening. I'm not too tied to the suggested implementation in that KEP; it could be an annotation on the node if we can get that past the API review people. Or it could be a lease object in an existing namespace. But coordinating around the node allows all projects to be on the same page, rather than having to have a specific cluster-api coordination point.
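One possible shape of the per-node coordination point mentioned above, sketched in Go with a coordination.k8s.io Lease. The namespace, naming scheme, and duration are assumptions, not anything the KEP settles on:

```go
package coordination

import (
	"context"
	"time"

	coordinationv1 "k8s.io/api/coordination/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// claimNodeForDisruption records that holder (a remediation controller, a
// config/reboot agent, etc.) is about to disrupt nodeName, so other actors
// can back off until the Lease expires or is released.
func claimNodeForDisruption(ctx context.Context, c client.Client, nodeName, holder string) error {
	duration := int32(600)
	lease := &coordinationv1.Lease{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "node-maintenance-" + nodeName, // assumed naming scheme
			Namespace: "kube-system",                  // assumed namespace
		},
		Spec: coordinationv1.LeaseSpec{
			HolderIdentity:       &holder,
			LeaseDurationSeconds: &duration,
			AcquireTime:          &metav1.MicroTime{Time: time.Now()},
		},
	}
	// Create returns AlreadyExists if a Lease for this node is already present,
	// which the caller can treat as a signal to back off (expiry handling omitted).
	return c.Create(ctx, lease)
}
```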
@michaelgugino we have written an enhancement Google document that we are going to share with the whole community very soon. We would be really grateful if you would take part in the debate as soon as that document is made public. The things that you have suggested in your enhancement proposal must definitely be considered before we can reach a final solution. We discussed external machine remediation at last week’s community meeting and came to the conclusion that these issues need to be addressed on a larger scale. It is good that @vincepri pointed you at this issue. I think we need active cooperation and also longer-term plans for Cluster API so it supports bare metal use cases better.
@michaelgugino @vincepri @beekhof @JoelSpeed We have published a MachineHealthCheck enhancement proposal in Google Docs. The document is now open to everyone's comments: CAPI MachineHealthCheck enhancement
Generally we try to remove burden from providers and enforce contracts and behaviour at the core machine API layer. This provides consistent and reproducible behaviours and a universal CAPI UX across any environment. This is exactly why we removed the ability for actuators to remediate/recreate instances themselves: there are no guarantees. Instead we put this knowledge in the MHC and we let it be handled in a provider-agnostic manner. Remediation only knows about the core machine API. Any new provider has MHC for free. This is a CAPI feature, not a provider-specific one.

As far as I can see, the only real need at the moment is to support two remediation strategies: delete and reboot. And reboot is not necessarily a baremetal/metal3 nor an MHC-specific thing. So instead of letting providers do "anything" and hoping for the best, we should focus on adding first-class, provider-agnostic support for "reboot" in the core machine CAPI and therefore in MHC. To this end, the first step should be to ensure we have a "reboot" semantic at the core machine API layer (see related #1808 and #2927). Then each actuator that chooses to implement the reboot semantic automatically has MHC support for it. A remediation reboot strategy then does not need to know about provider details; it just consumes the appropriate semantic, which can be consumed not only by MHC.

We need to ensure we keep these safe boundaries so the different building blocks interact in a consistent manner. If we give complete freedom to providers and hope for the best, I envision we will certainly degrade the CAPI UX for end users.
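A rough Go sketch of how a consumer such as MHC might request a provider-agnostic reboot semantic like the one described above. The annotation key is hypothetical (see #1808 and #2927 for the real discussion); an infrastructure provider that opts in would act on it and clear it when done:

```go
package remediation

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// rebootAnnotation is a hypothetical key for the reboot semantic; providers
// that implement it would power-cycle the backing host and remove the key.
const rebootAnnotation = "remediation.machine.cluster.x-k8s.io/reboot-requested"

// RequestReboot asks the infrastructure provider to reboot the host backing
// the given Machine, without the caller knowing any provider-specific details.
func RequestReboot(ctx context.Context, c client.Client, machine *unstructured.Unstructured) error {
	patch := client.MergeFrom(machine.DeepCopy())

	annotations := machine.GetAnnotations()
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[rebootAnnotation] = ""
	machine.SetAnnotations(annotations)

	return c.Patch(ctx, machine, patch)
}
```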
@enxebre I had a brief chat yesterday with @n1r1 and Feruz, and I posted a summary from our discussion in the linked document. While I agree that there are more use cases other than the current remediation (delete a Machine), it hasn't been in scope for Cluster API, nor MachineHealthCheck. Moreover, we mentioned multiple times that in-place remediation isn't something we as a community want to focus on for now, or add to the mix of supported combinations. That said, as we did in the past, we want to allow for these behaviors to happen outside of Cluster API core and contracts, learn from it, adapt it, and embrace it longer term. |
@enxebre I really want a reboot API too, but off and on primitives are more useful for in-place remediation. This can be seen in what we ended up with in order to support in-place recovery downstream on metal3: https://github.com/metal3-io/metal3-docs/blob/master/design/reboot-interface.md
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle frozen |
User Story
As a Cluster API infrastructure provider, I would like to implement my own machine remediation logic and use it instead of the default one in MachineHealthCheck.
Detailed Description
We expect Cluster API to enable the use of provider-specific remediation logic as part of MachineHealthCheck. The Cluster API MachineHealthCheck should mark machines as unhealthy and return them to healthy after remediation is done.
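For illustration, a rough controller-runtime skeleton of the provider-side piece this user story asks for. The API version, annotation key, and overall structure are assumptions tying together the ideas discussed in this thread, not an agreed contract:

```go
package controllers

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ExternalRemediationReconciler is a provider-specific controller that reacts
// to Machines MHC has marked unhealthy and remediates them in place.
type ExternalRemediationReconciler struct {
	client.Client
}

func (r *ExternalRemediationReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Fetch the Machine as unstructured to avoid pinning a cluster-api version.
	machine := &unstructured.Unstructured{}
	machine.SetAPIVersion("cluster.x-k8s.io/v1alpha3") // version current at the time of this issue
	machine.SetKind("Machine")
	if err := r.Get(ctx, req.NamespacedName, machine); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// "mhc.unhealthy" is the hypothetical marker discussed earlier in this thread.
	if _, unhealthy := machine.GetAnnotations()["mhc.unhealthy"]; !unhealthy {
		return ctrl.Result{}, nil // nothing to do
	}

	// Provider-specific remediation goes here: power-cycle or reprovision the
	// host instead of deleting the Machine, then report completion (via an
	// annotation or MRR status) so MHC can mark the Machine healthy again.
	return ctrl.Result{}, nil
}
```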
/kind feature