External Machine remediation #2846

Closed
jan-est opened this issue Apr 2, 2020 · 25 comments · Fixed by #3190
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@jan-est
Contributor

jan-est commented Apr 2, 2020

User Story

As a Cluster API infrastructure provider, I would like to implement my own machine remediation logic and use it instead of the default one in MachineHealthCheck.

Detailed Description

We expect Cluster API to enable the use of provider-specific remediation logic as part of MachineHealthCheck. The Cluster API MachineHealthCheck should mark machines as unhealthy and mark them healthy again once remediation is done.

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 2, 2020
@vincepri
Member

vincepri commented Apr 2, 2020

Given that MachineHealthCheck isn't enabled by default and is opt-in for v1alpha3, you could model a new controller after it and do remediation externally that way.

@jan-est
Contributor Author

jan-est commented Apr 6, 2020

@vincepri So if an infrastructure provider needs its own remediation logic, would it be better to implement a whole new provider-specific MachineHealthCheck controller and not use the Cluster API implementation at all?

@beekhof

beekhof commented Apr 6, 2020

It seems sub-optimal to duplicate the development and maintenance effort when there is 90-95% overlap between the current MHC and a version of it that facilitates provider-specific workflows.

@vincepri
Member

vincepri commented Apr 6, 2020

@jan-est @beekhof When I replied I was looking through the lens of how something like this could be implemented in v1alpha3. For v1alpha4 (and onwards), we've discussed having a hooking mechanism (#1514) to determine what to do when a Machine gets deleted. Is this a use case for that issue?

@beekhof

beekhof commented Apr 7, 2020

Waiting for v1alpha4 or later isn't a deal breaker, but it would be preferable if we could converge on a design "soon" so that the patches we're carrying for v1alpha3 aren't fundamentally incompatible.

We're working on an enhancement and will link to it from this issue once it's ready.

With regard to #1514: because bare metal takes so much longer to reprovision than cloud instances, the workflows we're most interested in at this stage involve avoiding Machine deletion rather than delaying it, so at first glance it doesn't seem like a good fit, although the hook pattern looks like something we might be able to reuse.

@jan-est
Contributor Author

jan-est commented Apr 7, 2020

@vincepri as @beekhof mentioned, we have been working on an enhancement proposal. I would likewise say that we will finish that proposal and continue the discussion after that.

@vincepri
Member

vincepri commented Apr 7, 2020

Just for some more context regarding Machine remediation (pre-delete hooks): there is a solution we were brainstorming yesterday when thinking about the control plane. You can read more about it here and in some comments below that; it follows somewhat what's in #1514.

the workflows we're interested in most at this stage involve avoiding Machine deletion rather than delaying it

How would you recreate the Machine in this use case?

As I read more of the use case you mentioned, it almost sounds like you might need a remediation strategy on MHC? Looking forward to reading the proposal.

@vincepri
Member

vincepri commented Apr 7, 2020

/milestone Next

Setting the milestone to Next until we have a better idea of what the proposal entails, and where it would fit.

@beekhof

beekhof commented Apr 12, 2020

Just for some more context regarding Machine remediation (pre-delete hooks): there is a solution we were brainstorming yesterday when thinking about the control plane. You can read more about it here and in some comments below that; it follows somewhat what's in #1514.

the workflows we're interested in most at this stage involve avoiding Machine deletion rather than delaying it

How would you recreate the Machine in this use case?

We wouldn't. Returning the node to a healthy state could/would be achieved by (for example) power cycling the hardware instead. No machine deletion or recreation required.

As I read more of the use case you mentioned, it almost sounds like you might need a remediation strategy on MHC?

That would be ideal, but mostly we've been hearing that what we want is too specific to bare metal.

@beekhof

beekhof commented Apr 12, 2020

Just for some more context regarding Machine remediation (pre-delete hooks): there is a solution we were brainstorming yesterday when thinking about the control plane. You can read more about it here and in some comments below that; it follows somewhat what's in #1514.

1. KCP receives a notification
2. KCP controller looks at all the machines that have been annotated by MHC
3. KCP remediates these machines by scaling down and then back up

@vincepri If you replace KCP with “a bare metal specific controller” and “scaling down and then back up” with “rebooting the node”, then you’ve got pretty much what we’re doing downstream at the moment.

There are some challenges with that approach, which is why we’ll propose a MachineRemediationRequest CRD.

@jan-est
Contributor Author

jan-est commented Apr 14, 2020

@vincepri we would like to discuss this issue at tomorrow's community meeting. Here's an idea of what we'd like to discuss:

"We would like the Cluster API MachineHealthCheck to split health checking and remediation from each other. We would like the CAPI MHC to check the health of Nodes in the workload clusters and create a new CR called MachineRemediationRequest (MRR) for each machine/node found unhealthy. The MRR is created from a template; it is a provider-level object and is reconciled by a provider-level remediation controller, for example a metal3MachineRemediationController. The CAPI MHC will delete the MRR when the Node in the workload cluster is found healthy again."

After the matter has been discussed at the meeting, we could make a formal proposal. I think it would be good to know first what kind of reception such a proposal might receive. We can discuss all the alternatives at the meeting, but this is how we would like to see health checking and remediation in the future.
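
As a rough illustration of the proposal quoted above, a MachineRemediationRequest type might look something like the following Go CRD sketch (kubebuilder-style). The field names, phase values, and package layout are illustrative assumptions only, not part of any agreed API.

```go
// Hypothetical sketch of a MachineRemediationRequest CRD, assuming
// controller-runtime/kubebuilder conventions. Field and phase names are
// illustrative only; the actual API would be defined in the proposal.
package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// MachineRemediationRequestSpec points at the unhealthy Machine that the
// provider-level remediation controller should act on.
type MachineRemediationRequestSpec struct {
	// TargetMachine references the Machine that MHC marked unhealthy.
	TargetMachine corev1.ObjectReference `json:"targetMachine"`
}

// MachineRemediationRequestStatus is written by the provider controller so
// that MHC (or the user) can observe remediation progress.
type MachineRemediationRequestStatus struct {
	// Phase could be e.g. "Pending", "InProgress", "Succeeded", or "Failed".
	Phase string `json:"phase,omitempty"`
	// LastUpdated records when the provider controller last made progress.
	LastUpdated *metav1.Time `json:"lastUpdated,omitempty"`
}

// MachineRemediationRequest is created by MHC from a provider-supplied
// template for each unhealthy machine, and deleted once the Node is healthy again.
type MachineRemediationRequest struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   MachineRemediationRequestSpec   `json:"spec,omitempty"`
	Status MachineRemediationRequestStatus `json:"status,omitempty"`
}
```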

@ncdc
Contributor

ncdc commented Apr 14, 2020

I think it's useful to have MHC do the marking, and something else do the sweeping. This is what @vincepri is proposing in #2836 (comment).

It sounds like you are interested in having custom remediation logic, so it's not always "delete". That makes sense to me. We might be able to handle all of this with annotations on the Machine (the equivalent of something like mhc.unhealthy=true and mhc.strategy=metal3). MHC would be responsible for removing mhc.unhealthy when it's back to normal.

Would something like that be sufficient?
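
Here is a rough sketch of how an external remediation controller could consume that annotation contract, using the example keys from this comment (mhc.unhealthy and mhc.strategy). The helper function and the use of a bare ObjectMeta in place of a real Machine are assumptions for illustration only.

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const (
	unhealthyAnnotation = "mhc.unhealthy" // example key from this discussion
	strategyAnnotation  = "mhc.strategy"  // example key from this discussion
)

// needsExternalRemediation decides, from annotations alone, whether this
// controller's strategy should act on the object.
func needsExternalRemediation(obj metav1.Object, ownStrategy string) bool {
	ann := obj.GetAnnotations()
	return ann[unhealthyAnnotation] == "true" && ann[strategyAnnotation] == ownStrategy
}

func main() {
	// A bare ObjectMeta stands in for a cluster-api Machine here; a real
	// controller would receive Machines from its watch/reconcile loop.
	m := &metav1.ObjectMeta{
		Name: "worker-0",
		Annotations: map[string]string{
			unhealthyAnnotation: "true",
			strategyAnnotation:  "metal3",
		},
	}
	fmt.Println(needsExternalRemediation(m, "metal3")) // true
}
```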

@jan-est
Contributor Author

jan-est commented Apr 15, 2020

@ncdc @vincepri yes, we are looking for custom remediation. A good thing about the CRD is that providers could annotate the MRR and MHC could watch the status of the remediation. For example, a reboot or reprovision of a physical node takes a long time; if we need to reboot the node multiple times, we are talking about minutes, maybe more. In these cases an annotation like remediation-in-progress can be used to prevent early deletion of the unhealthy machine.

We also need to figure out what to do when the provider is unable to remediate the node. With the MRR CR we can solve these cases as well: providers could set a remediation-failed annotation on the MRR, and based on that MHC could delete the Machine object.

There are many corner cases when working with bare metal, and remediation of such a node is not always straightforward. But I think CAPI MHC should delete the unhealthy machine when things go wrong. I am not totally sure whether we can cover this with @vincepri's proposal.
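
A hedged sketch of the failure path described here: an MHC-side step that waits while the provider reports remediation-in-progress and falls back to deleting the Machine when the provider sets remediation-failed. The annotation keys come from this comment; the function, types, and controller-runtime client usage are assumptions.

```go
package remediation

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

const (
	// Annotation keys from this comment; the exact names are not an agreed API.
	inProgressAnnotation = "remediation-in-progress" // provider is still working, e.g. a slow bare-metal reboot
	failedAnnotation     = "remediation-failed"      // provider gave up; fall back to deleting the Machine
)

// handleRemediationResult is what an MHC-side reconcile step could do after
// reading the annotations a provider put on the MRR for an unhealthy Machine.
func handleRemediationResult(ctx context.Context, c client.Client, mrrAnnotations map[string]string, machine client.Object) error {
	if _, inProgress := mrrAnnotations[inProgressAnnotation]; inProgress {
		// Remediation may take minutes or longer on bare metal; do not delete
		// the Machine yet, just requeue and check again later.
		return nil
	}
	if _, failed := mrrAnnotations[failedAnnotation]; failed {
		// The provider could not recover the host, so fall back to the
		// default remediation: delete the Machine object.
		return c.Delete(ctx, machine)
	}
	return nil
}
```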

@beekhof

beekhof commented Apr 15, 2020

The CAPI MHC will delete the MRR when the Node in the workload cluster is found healthy again.

I would actually assume that the provider's MRC would do that

@beekhof

beekhof commented Apr 15, 2020

I think it's useful to have MHC do the marking, and something else do the sweeping. This is what @vincepri is proposing in #2836 (comment).

It sounds like you are interested in having custom remediation logic, so it's not always "delete". That makes sense to me. We might be able to handle all of this with annotations on the Machine (the equivalent of something like mhc.unhealthy=true and mhc.strategy=metal3). MHC would be responsible for removing mhc.unhealthy when it's back to normal.

Would something like that be sufficient?

We've been able to make do with an annotation downstream, but there are various issues that would be alleviated by using a CRD instead. Nir will join the community call to elaborate and discuss.

@JoelSpeed
Contributor

JoelSpeed commented Apr 15, 2020

I posted some thoughts on the remediation problem on the other thread regarding KCP MHC integration; they may actually be more relevant to this thread: #2836 (comment)

In particular the idea of having some sort of delay.

i.e. MHC adds two(?) annotations, one saying remediation-required and one saying lastRemediated: time.Now(). The external controller can do its remediation logic and remove the remediation-required annotation, so that it has some marker to say that it has done the remediation.

Then the MHC would normally (downstream, where we have done this annotation approach) immediately re-add the remediation-required annotation, triggering a remediation loop (🙁). But if we add this lastRemediated annotation, we could build in some remediationTimeout so that the MHC doesn't re-mark the Machine for remediation within a given period. That would hopefully give the remediation some time to work, and the Machine would either go away or be healthy again by the time the timeout expires.
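
A minimal sketch of the lastRemediated/remediationTimeout check described above, assuming the annotations carry RFC3339 timestamps; the key names, timestamp format, and helper are illustrative only.

```go
package main

import (
	"fmt"
	"time"
)

const (
	remediationRequiredAnnotation = "remediation-required" // set by MHC, removed by the external controller
	lastRemediatedAnnotation      = "last-remediated"      // timestamp of the last completed remediation
)

// shouldReMark returns true only if no remediation happened within the
// configured timeout, giving the external controller time to take effect.
func shouldReMark(annotations map[string]string, remediationTimeout time.Duration, now time.Time) bool {
	last, ok := annotations[lastRemediatedAnnotation]
	if !ok {
		return true // never remediated before, safe to mark
	}
	t, err := time.Parse(time.RFC3339, last)
	if err != nil {
		return true // unparsable timestamp; fail open and re-mark
	}
	return now.Sub(t) >= remediationTimeout
}

func main() {
	ann := map[string]string{
		lastRemediatedAnnotation: time.Now().Add(-2 * time.Minute).Format(time.RFC3339),
	}
	// With a 5-minute timeout, a remediation 2 minutes ago suppresses re-marking.
	fmt.Println(shouldReMark(ann, 5*time.Minute, time.Now())) // false
}
```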

@n1r1
Contributor

n1r1 commented Apr 16, 2020

Downstream we solved this by having an extra annotation to indicate that remediation is in progress, and removing it (and the MHC annotation) only after we think that remediation was successful (see my PR downstream if this helps).

remediationTimeout sounds like an interesting idea. We thought that MHC could have a finalizer on the MRR to detect when remediation is complete and start a grace period before marking it unhealthy again.
I guess we can discuss this once we submit the full proposal.

@michaelgugino
Contributor

For example, a reboot or reprovision of a physical node takes a long time; if we need to reboot the node multiple times, we are talking about minutes, maybe more. In these cases an annotation like remediation-in-progress can be used to prevent early deletion of the unhealthy machine.

@jan-est

@vincepri pointed me at this issue as a point of collaboration.

I have kubernetes/enhancements#1411 as a source for coordinating disruptive actions against nodes. In particular, disruptive actions might not be limited to just cluster-api components. For example, in OpenShift, we update on-disk configuration and reboot hosts. We obviously don't want to remediate a system while this is happening.

I'm not too tied to the suggested implementation in that KEP, it could be an annotation on the node if we can get that past the API review people. Or it could be a lease object in an existing namespace. But coordinating around the node allows all projects to be on the same page, rather than having to have a specific cluster-api coordination point.
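
As an illustration of the Lease idea (not the KEP's actual design), a controller could try to create a per-node Lease before doing anything disruptive and back off if it already exists. The namespace, Lease naming convention, duration, and helper below are assumptions.

```go
package coordination

import (
	"context"
	"time"

	coordinationv1 "k8s.io/api/coordination/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/utils/pointer"
)

// acquireNodeLease tries to create a Lease named after the node before doing
// anything disruptive (remediation, a config-driven reboot, ...). If another
// component already holds the Lease, Create fails with an AlreadyExists error
// and the caller should back off and retry later.
func acquireNodeLease(ctx context.Context, cs kubernetes.Interface, nodeName, holder string) error {
	lease := &coordinationv1.Lease{
		ObjectMeta: metav1.ObjectMeta{
			// The name prefix and namespace are assumptions; any shared,
			// agreed-upon location would work.
			Name:      "disruption-" + nodeName,
			Namespace: "kube-node-lease",
		},
		Spec: coordinationv1.LeaseSpec{
			HolderIdentity:       pointer.StringPtr(holder),
			LeaseDurationSeconds: pointer.Int32Ptr(int32((10 * time.Minute).Seconds())),
			AcquireTime:          &metav1.MicroTime{Time: time.Now()},
		},
	}
	_, err := cs.CoordinationV1().Leases(lease.Namespace).Create(ctx, lease, metav1.CreateOptions{})
	return err
}
```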

@jan-est
Contributor Author

jan-est commented Apr 21, 2020

@michaelgugino we have written an enhancement Google document that we are going to share with the whole community very soon. We would be really grateful if you would take part in the discussion as soon as that document is made public. The things you have suggested in your enhancement proposal must definitely be considered before we can reach a final solution.

We discussed external machine remediation at last week's community meeting and came to the conclusion that these issues need to be addressed on a larger scale. It is good that @vincepri pointed you at this issue. I think we need active cooperation and also longer-term plans for Cluster API so it supports bare metal use cases better.

@jan-est
Contributor Author

jan-est commented Apr 21, 2020

@michaelgugino @vincepri @beekhof @JoelSpeed We have published a MachineHealthCheck enhancement proposal in Google Docs. The document is now open to everyone's comments: CAPI MachineHealthCheck enhancement

@enxebre
Member

enxebre commented Apr 23, 2020

Generally we try to remove burden from providers and enforce contracts and behaviour at the core machine API layer. This provides consistent and reproducible behaviours and universal CAPI UX across any environment.

This is exactly why we removed the ability for actuators to remediate/recreate instances themselves: there are no guarantees. Instead we put this knowledge in the MHC and let it be handled in a provider-agnostic manner. Remediation only knows about the core machine API. Any new provider gets MHC for free. This is a CAPI feature, not a provider-specific one.

As far as I can see, the only real need at the moment is to support two remediation strategies: delete and reboot. And reboot is not necessarily a bare metal/metal3 thing, nor an MHC-specific one.

So instead of letting providers do "anything" and hoping for the best, we should focus on adding first-class, provider-agnostic support for "reboot" on the core CAPI machine API and therefore on MHC.

To this end, the first step should be to ensure we have a "reboot" semantic at the core machine API layer (see related #1808 and #2927).

Then each actuator that chooses to implement the reboot semantic automatically gets MHC support for it.

A remediation reboot strategy then does not need to know about provider details; it just consumes the appropriate semantic, which can be consumed not only by MHC.

We need to ensure we keep these safe boundaries so that the different building blocks interact in a consistent manner. If we give providers complete freedom and hope for the best, I envision we will certainly degrade the CAPI UX for end users.
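
Purely as an illustration of consuming a semantic rather than knowing provider details, a reboot request could be expressed as an annotation on the Machine that any actuator implementing the semantic acts on. The annotation key and helper below are hypothetical and are not taken from #1808 or #2927.

```go
package main

import "fmt"

// requestRebootAnnotation is a hypothetical, illustrative key: a remediation
// strategy would set it on the Machine, and an actuator that implements the
// reboot semantic would perform the reboot and then remove the annotation.
const requestRebootAnnotation = "cluster.x-k8s.io/request-reboot"

// requestReboot is what a provider-agnostic reboot strategy would do instead
// of deleting the Machine: it only touches the core Machine object.
func requestReboot(machineAnnotations map[string]string) map[string]string {
	if machineAnnotations == nil {
		machineAnnotations = map[string]string{}
	}
	machineAnnotations[requestRebootAnnotation] = ""
	return machineAnnotations
}

func main() {
	fmt.Println(requestReboot(nil)) // map[cluster.x-k8s.io/request-reboot:]
}
```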

@vincepri
Member

@enxebre I had a brief chat yesterday with @n1r1 and Feruz, and I posted a summary from our discussion in the linked document.

While I agree that there are more use cases beyond the current remediation (deleting a Machine), they haven't been in scope for Cluster API, nor for MachineHealthCheck. Moreover, we mentioned multiple times that in-place remediation isn't something we as a community want to focus on for now, or add to the mix of supported combinations.

That said, as we did in the past, we want to allow for these behaviors to happen outside of Cluster API core and contracts, learn from it, adapt it, and embrace it longer term.

@beekhof

beekhof commented Apr 24, 2020

@enxebre I really want a reboot API too, but off/on primitives are more useful for in-place remediation. This can be seen in what we ended up with in order to support in-place recovery downstream on metal3: https://github.com/metal3-io/metal3-docs/blob/master/design/reboot-interface.md

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 23, 2020
@vincepri
Member

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 23, 2020