External Machine remediation #2846

Closed
jan-est opened this issue Apr 2, 2020 · 25 comments · Fixed by #3190
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@jan-est
Contributor

jan-est commented Apr 2, 2020

User Story

As a Cluster API infrastructure provider, I would like to implement my own machine remediation logic and use it instead of the default one in MachineHealthCheck.

Detailed Description

We expect Cluster API to enable the use of provider-specific remediation logic as part of MachineHealthCheck. The Cluster API MachineHealthCheck should mark machines as unhealthy and mark them healthy again once remediation is done.

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 2, 2020
@vincepri
Member

vincepri commented Apr 2, 2020

Given that MachineHealthCheck isn't enabled by default and is opt-in for v1alpha3, you could model a new controller after it and do remediation externally that way.

@jan-est
Contributor Author

jan-est commented Apr 6, 2020

@vincepri So if an infrastructure provider needs its own remediation logic, would it be better to implement a whole new provider-specific MachineHealthCheck controller and not use the Cluster API implementation at all?

@beekhof

beekhof commented Apr 6, 2020

It seems sub-optimal to duplicate the development and maintenance effort when there is 90-95% overlap between the current MHC and a version of it that facilitates provider-specific workflows.

@vincepri
Member

vincepri commented Apr 6, 2020

@jan-est @beekhof When I replied I was looking through the lens of how something like this could be implemented in v1alpha3. For v1alpha4 (and onwards), we've discussed having a hooking mechanism (#1514) to determine what to do when a Machine gets deleted. Is this a use case for that issue?

@beekhof

beekhof commented Apr 7, 2020

Waiting for v1alpha4 or later isn't a deal breaker, but it would be preferable if we could converge on a design "soon" so that the patches we're carrying for v1alpha3 aren't fundamentally incompatible.

We're working on an enhancement and will link to it from this issue once it's ready.

With regard to #1514: because bare metal takes so much longer to reprovision than cloud instances, the workflows we're most interested in at this stage involve avoiding Machine deletion rather than delaying it, so at first glance it doesn't seem like a good fit, although the hook pattern looks like something we might be able to reuse.

@jan-est
Contributor Author

jan-est commented Apr 7, 2020

@vincepri as @beekhof mentioned, we have been working on an enhancement proposal. I would likewise say that we will finish that proposal and continue the discussion after that.

@vincepri
Member

vincepri commented Apr 7, 2020

Just for some more context regarding Machine remediation (pre-delete hooks): there is a solution we were brainstorming yesterday when thinking about the control plane. You can read more about it here and in some comments below that; it follows somewhat what's in #1514.

the workflows we're interested in most at this stage involve avoiding Machine deletion rather than delaying it

How would you recreate the Machine in this use case?

As I read more of the use case you mentioned, it almost sounds like you might need a remediation strategy on MHC? Looking forward to reading the proposal.

@vincepri
Member

vincepri commented Apr 7, 2020

/milestone Next

Setting the milestone to Next until we have a better idea of what the proposal entails, and where it would fit.

@beekhof

beekhof commented Apr 12, 2020

Just for some more context regarding Machine remediation (pre-delete hooks): there is a solution we were brainstorming yesterday when thinking about the control plane. You can read more about it here and in some comments below that; it follows somewhat what's in #1514.

the workflows we're interested in most at this stage involve avoiding Machine deletion rather than delaying it

How would you recreate the Machine in this use case?

We wouldn't. Returning the node to a healthy state could/would be achieved by (for example) power cycling the hardware instead. No machine deletion or recreation required.

As I read more of the use case you mentioned, it almost sounds like you might need a remediation strategy on MHC?

That would be ideal, but mostly we've been hearing that what we want is too specific to bare metal.

@beekhof

beekhof commented Apr 12, 2020

Just for some more context regarding Machine remediation (pre-delete hooks): there is a solution we were brainstorming yesterday when thinking about the control plane. You can read more about it here and in some comments below that; it follows somewhat what's in #1514.

1. KCP receives a notification
2. KCP controller looks at all the machines that have been annotated by MHC
3. KCP remediates these machines by scaling down and then back up

@vincepri If you replace KCP with “a bare metal specific controller” and “scaling down and then back up” with “rebooting the node”, then you’ve got pretty much what we’re doing downstream at the moment.

There are some challenges with that approach, which is why we’ll propose a MachineRemediationRequest CRD.

@jan-est
Contributor Author

jan-est commented Apr 14, 2020

@vincepri we would like to discuss this issue at tomorrow's community meeting. Here's an idea of what we'd like to discuss:

"We would like the Cluster API MachineHealthCheck to split health checking and remediation from each other. We would like the CAPI MHC to check the health of Nodes in the workload clusters and create a new CR called MachineRemediationRequest (MRR) for each machine/node found unhealthy. The MRR is created from a template; it is a provider-level object and is reconciled by a provider-level remediation controller, for example a metal3MachineRemediationController. The CAPI MHC will delete the MRR when the Node in the workload cluster is found healthy again."

After the matter has been discussed at the meeting, we could make a formal proposal. I think it would be good to know first what kind of reception such a proposal might receive. We can discuss all the alternatives at the meeting, but this is how we would like to see health checking and remediation in the future.
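
As a rough illustration of the proposal quoted above, a MachineRemediationRequest type might look something like the following Go CRD sketch (kubebuilder-style). The field names, phase values, and package layout are illustrative assumptions only, not part of any agreed API.

```go
// Hypothetical sketch of a MachineRemediationRequest CRD, assuming
// controller-runtime/kubebuilder conventions. Field and phase names are
// illustrative only; the actual API would be defined in the proposal.
package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// MachineRemediationRequestSpec points at the unhealthy Machine that the
// provider-level remediation controller should act on.
type MachineRemediationRequestSpec struct {
	// TargetMachine references the Machine that MHC marked unhealthy.
	TargetMachine corev1.ObjectReference `json:"targetMachine"`
}

// MachineRemediationRequestStatus is written by the provider controller so
// that MHC (or the user) can observe remediation progress.
type MachineRemediationRequestStatus struct {
	// Phase could be e.g. "Pending", "InProgress", "Succeeded", or "Failed".
	Phase string `json:"phase,omitempty"`
	// LastUpdated records when the provider controller last made progress.
	LastUpdated *metav1.Time `json:"lastUpdated,omitempty"`
}

// MachineRemediationRequest is created by MHC from a provider-supplied
// template for each unhealthy machine, and deleted once the Node is healthy again.
type MachineRemediationRequest struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   MachineRemediationRequestSpec   `json:"spec,omitempty"`
	Status MachineRemediationRequestStatus `json:"status,omitempty"`
}
```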

@ncdc
Contributor

ncdc commented Apr 14, 2020

I think it's useful to have MHC do the marking, and something else do the sweeping. This is what @vincepri is proposing in #2836 (comment).

It sounds like you are interested in having custom remediation logic, so it's not always "delete". That makes sense to me. We might be able to handle all of this with annotations on the Machine (the equivalent of something like mhc.unhealthy=true and mhc.strategy=metal3). MHC would be responsible for removing mhc.unhealthy when it's back to normal.

Would something like that be sufficient?
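
Here is a rough sketch of how an external remediation controller could consume that annotation contract, using the example keys from this comment (mhc.unhealthy and mhc.strategy). The helper function and the use of a bare ObjectMeta in place of a real Machine are assumptions for illustration only.

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const (
	unhealthyAnnotation = "mhc.unhealthy" // example key from this discussion
	strategyAnnotation  = "mhc.strategy"  // example key from this discussion
)

// needsExternalRemediation decides, from annotations alone, whether this
// controller's strategy should act on the object.
func needsExternalRemediation(obj metav1.Object, ownStrategy string) bool {
	ann := obj.GetAnnotations()
	return ann[unhealthyAnnotation] == "true" && ann[strategyAnnotation] == ownStrategy
}

func main() {
	// A bare ObjectMeta stands in for a cluster-api Machine here; a real
	// controller would receive Machines from its watch/reconcile loop.
	m := &metav1.ObjectMeta{
		Name: "worker-0",
		Annotations: map[string]string{
			unhealthyAnnotation: "true",
			strategyAnnotation:  "metal3",
		},
	}
	fmt.Println(needsExternalRemediation(m, "metal3")) // true
}
```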

@jan-est
Contributor Author

jan-est commented Apr 15, 2020

@ncdc @vincepri yes, we are looking for custom remediation. A good thing about the CRD is that providers could annotate the MRR and MHC could watch the status of the remediation. For example, a reboot or reprovision of a physical node takes a long time; if we need to reboot the node multiple times, we are talking about minutes, maybe more. In these cases an annotation like remediation-in-progress can be used to prevent early deletion of the unhealthy machine.

We also need to figure out what to do when the provider is unable to remediate the node. With the MRR CR we can solve these cases as well: providers could set a remediation-failed annotation on the MRR, and based on that MHC could delete the Machine object.

There are many corner cases when working with bare metal, and remediation of such a node is not always straightforward. But I think CAPI MHC should delete the unhealthy machine when things go wrong. I am not totally sure whether we can cover this with @vincepri's proposal.
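
A hedged sketch of the failure path described here: an MHC-side step that waits while the provider reports remediation-in-progress and falls back to deleting the Machine when the provider sets remediation-failed. The annotation keys come from this comment; the function, types, and controller-runtime client usage are assumptions.

```go
package remediation

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

const (
	// Annotation keys from this comment; the exact names are not an agreed API.
	inProgressAnnotation = "remediation-in-progress" // provider is still working, e.g. a slow bare-metal reboot
	failedAnnotation     = "remediation-failed"      // provider gave up; fall back to deleting the Machine
)

// handleRemediationResult is what an MHC-side reconcile step could do after
// reading the annotations a provider put on the MRR for an unhealthy Machine.
func handleRemediationResult(ctx context.Context, c client.Client, mrrAnnotations map[string]string, machine client.Object) error {
	if _, inProgress := mrrAnnotations[inProgressAnnotation]; inProgress {
		// Remediation may take minutes or longer on bare metal; do not delete
		// the Machine yet, just requeue and check again later.
		return nil
	}
	if _, failed := mrrAnnotations[failedAnnotation]; failed {
		// The provider could not recover the host, so fall back to the
		// default remediation: delete the Machine object.
		return c.Delete(ctx, machine)
	}
	return nil
}
```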

@beekhof

beekhof commented Apr 15, 2020

The CAPI MHC will delete the MRR when the Node in the workload cluster is found healthy again.

I would actually assume that the provider's MRC would do that

@beekhof

beekhof commented Apr 15, 2020

I think it's useful to have MHC do the marking, and something else do the sweeping. This is what @vincepri is proposing in #2836 (comment).

It sounds like you are interested in having custom remediation logic, so it's not always "delete". That makes sense to me. We might be able to handle all of this with annotations on the Machine (the equivalent of something like mhc.unhealthy=true and mhc.strategy=metal3). MHC would be responsible for removing mhc.unhealthy when it's back to normal.

Would something like that be sufficient?

We've been able to make do with an annotation downstream, but there are various issues that would be alleviated by using a CRD instead. Nir will join the community call to elaborate and discuss.

@JoelSpeed
Contributor

JoelSpeed commented Apr 15, 2020

I posted some thoughts on the remediation problem on the other thread regarding KCP MHC integration; they may actually be more relevant to this thread: #2836 (comment)

In particular the idea of having some sort of delay.

i.e. MHC adds two(?) annotations, one saying remediation-required and one saying lastRemediated: time.Now(). The external controller can do its remediation logic and remove the remediation-required annotation, so that it has some marker to say that it has done the remediation.

Then the MHC would normally (downstream, where we have done this annotation approach) immediately re-add the remediation-required annotation, triggering a remediation loop (🙁). But if we add this lastRemediated annotation, we could build in some remediationTimeout so that the MHC doesn't re-mark the Machine for remediation within a given period. That would hopefully give the remediation some time to work, and the Machine would either go away or be healthy again by the time the timeout expires.
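
A minimal sketch of the lastRemediated/remediationTimeout check described above, assuming the annotations carry RFC3339 timestamps; the key names, timestamp format, and helper are illustrative only.

```go
package main

import (
	"fmt"
	"time"
)

const (
	remediationRequiredAnnotation = "remediation-required" // set by MHC, removed by the external controller
	lastRemediatedAnnotation      = "last-remediated"      // timestamp of the last completed remediation
)

// shouldReMark returns true only if no remediation happened within the
// configured timeout, giving the external controller time to take effect.
func shouldReMark(annotations map[string]string, remediationTimeout time.Duration, now time.Time) bool {
	last, ok := annotations[lastRemediatedAnnotation]
	if !ok {
		return true // never remediated before, safe to mark
	}
	t, err := time.Parse(time.RFC3339, last)
	if err != nil {
		return true // unparsable timestamp; fail open and re-mark
	}
	return now.Sub(t) >= remediationTimeout
}

func main() {
	ann := map[string]string{
		lastRemediatedAnnotation: time.Now().Add(-2 * time.Minute).Format(time.RFC3339),
	}
	// With a 5-minute timeout, a remediation 2 minutes ago suppresses re-marking.
	fmt.Println(shouldReMark(ann, 5*time.Minute, time.Now())) // false
}
```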

@n1r1
Contributor

n1r1 commented Apr 16, 2020

Downstream we solved this by having an extra annotation to indicate that remediation is in progress, and removing it (and the MHC annotation) only after we think that remediation was successful (see my PR downstream if this helps).

remediationTimeout sounds like an interesting idea. We thought that MHC could have a finalizer on the MRR to detect when remediation is complete and start a grace period before marking it unhealthy again.
I guess we can discuss this once we submit the full proposal.

@michaelgugino
Contributor

For example, a reboot or reprovision of a physical node takes a long time; if we need to reboot the node multiple times, we are talking about minutes, maybe more. In these cases an annotation like remediation-in-progress can be used to prevent early deletion of the unhealthy machine.

@jan-est

@vincepri pointed me at this issue as a point of collaboration.

I have kubernetes/enhancements#1411 as a source for coordinating disruptive actions against nodes. In particular, disruptive actions might not be limited to just cluster-api components. For example, in OpenShift, we update on-disk configuration and reboot hosts. We obviously don't want to remediate a system while this is happening.

I'm not too tied to the suggested implementation in that KEP, it could be an annotation on the node if we can get that past the API review people. Or it could be a lease object in an existing namespace. But coordinating around the node allows all projects to be on the same page, rather than having to have a specific cluster-api coordination point.
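
As an illustration of the Lease idea (not the KEP's actual design), a controller could try to create a per-node Lease before doing anything disruptive and back off if it already exists. The namespace, Lease naming convention, duration, and helper below are assumptions.

```go
package coordination

import (
	"context"
	"time"

	coordinationv1 "k8s.io/api/coordination/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/utils/pointer"
)

// acquireNodeLease tries to create a Lease named after the node before doing
// anything disruptive (remediation, a config-driven reboot, ...). If another
// component already holds the Lease, Create fails with an AlreadyExists error
// and the caller should back off and retry later.
func acquireNodeLease(ctx context.Context, cs kubernetes.Interface, nodeName, holder string) error {
	lease := &coordinationv1.Lease{
		ObjectMeta: metav1.ObjectMeta{
			// The name prefix and namespace are assumptions; any shared,
			// agreed-upon location would work.
			Name:      "disruption-" + nodeName,
			Namespace: "kube-node-lease",
		},
		Spec: coordinationv1.LeaseSpec{
			HolderIdentity:       pointer.StringPtr(holder),
			LeaseDurationSeconds: pointer.Int32Ptr(int32((10 * time.Minute).Seconds())),
			AcquireTime:          &metav1.MicroTime{Time: time.Now()},
		},
	}
	_, err := cs.CoordinationV1().Leases(lease.Namespace).Create(ctx, lease, metav1.CreateOptions{})
	return err
}
```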

@jan-est
Contributor Author

jan-est commented Apr 21, 2020

@michaelgugino we have written an enhancement Google document that we are going to share with the whole community very soon. We would be really grateful if you would take part in the discussion as soon as that document is made public. The things you have suggested in your enhancement proposal must definitely be considered before we can reach a final solution.

We discussed external machine remediation at last week's community meeting and came to the conclusion that these issues need to be addressed on a larger scale. It is good that @vincepri pointed you at this issue. I think we need active cooperation and also longer-term plans for Cluster API so it supports bare metal use cases better.

@jan-est
Contributor Author

jan-est commented Apr 21, 2020

@michaelgugino @vincepri @beekhof @JoelSpeed We have published a MachineHealthCheck enhancement proposal in Google Docs. The document is now open to everyone's comments: CAPI MachineHealthCheck enhancement

@enxebre
Member

enxebre commented Apr 23, 2020

Generally we try to remove burden from providers and enforce contracts and behaviour at the core machine API layer. This provides consistent and reproducible behaviours and universal CAPI UX across any environment.

This is exactly why we removed the ability for actuators to remediate/recreate instances themselves: there are no guarantees. Instead we put this knowledge in the MHC and let it be handled in a provider-agnostic manner. Remediation only knows about the core machine API. Any new provider gets MHC for free. This is a CAPI feature, not a provider-specific one.

As far as I can see, the only real need at the moment is to support two remediation strategies: delete and reboot. And reboot is not necessarily a bare metal/metal3 thing, nor an MHC-specific one.

So instead of letting providers do "anything" and hoping for the best, we should focus on adding first-class, provider-agnostic support for "reboot" on the core CAPI machine API and therefore on MHC.

To this end, the first step should be to ensure we have a "reboot" semantic at the core machine API layer (see related #1808 and #2927).

Then each actuator that chooses to implement the reboot semantic automatically gets MHC support for it.

A remediation reboot strategy then does not need to know about provider details; it just consumes the appropriate semantic, which can be consumed not only by MHC.

We need to ensure we keep these safe boundaries so that the different building blocks interact in a consistent manner. If we give providers complete freedom and hope for the best, I envision we will certainly degrade the CAPI UX for end users.
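
Purely as an illustration of consuming a semantic rather than knowing provider details, a reboot request could be expressed as an annotation on the Machine that any actuator implementing the semantic acts on. The annotation key and helper below are hypothetical and are not taken from #1808 or #2927.

```go
package main

import "fmt"

// requestRebootAnnotation is a hypothetical, illustrative key: a remediation
// strategy would set it on the Machine, and an actuator that implements the
// reboot semantic would perform the reboot and then remove the annotation.
const requestRebootAnnotation = "cluster.x-k8s.io/request-reboot"

// requestReboot is what a provider-agnostic reboot strategy would do instead
// of deleting the Machine: it only touches the core Machine object.
func requestReboot(machineAnnotations map[string]string) map[string]string {
	if machineAnnotations == nil {
		machineAnnotations = map[string]string{}
	}
	machineAnnotations[requestRebootAnnotation] = ""
	return machineAnnotations
}

func main() {
	fmt.Println(requestReboot(nil)) // map[cluster.x-k8s.io/request-reboot:]
}
```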

@vincepri
Member

@enxebre I had a brief chat yesterday with @n1r1 and Feruz, and I posted a summary from our discussion in the linked document.

While I agree that there are more use cases beyond the current remediation (deleting a Machine), they haven't been in scope for Cluster API, nor for MachineHealthCheck. Moreover, we mentioned multiple times that in-place remediation isn't something we as a community want to focus on for now, or add to the mix of supported combinations.

That said, as we did in the past, we want to allow for these behaviors to happen outside of Cluster API core and contracts, learn from it, adapt it, and embrace it longer term.

@beekhof

beekhof commented Apr 24, 2020

@enxebre I really want a reboot API too, but off/on primitives are more useful for in-place remediation. This can be seen in what we ended up with in order to support in-place recovery downstream on metal3: https://github.com/metal3-io/metal3-docs/blob/master/design/reboot-interface.md

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 23, 2020
@vincepri
Member

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 23, 2020