✨ Add MHC remediation to KCP #3185
Conversation
Lots to probably discuss on this one, maybe we can schedule some time to do an overview or some kind of review session.
I'm having a bit of a hard time following the new logic; I'm not sure if it's mainly due to the GitHub UI or the changes themselves.
I do see that we are short-circuiting remediation if the requested number of control plane machines is < 3, but we really only need to do that when the KCP is not configured to use external etcd. I'm also not entirely sure how these changes ensure that other operations being taken would not affect the quorum of the etcd cluster.
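To make the suggestion concrete, here is a minimal sketch of that short-circuit; the function name and exact field path are illustrative assumptions, not this PR's code:

```go
// Sketch: only enforce the 3-machine minimum when etcd is stacked on
// the control plane machines. With external etcd, deleting a control
// plane machine cannot affect etcd quorum.
func canRemediate(kcp *controlplanev1.KubeadmControlPlane, numMachines int) bool {
	cfg := kcp.Spec.KubeadmConfigSpec.ClusterConfiguration
	if cfg != nil && cfg.Etcd.External != nil {
		// External etcd: quorum is managed outside these machines.
		return true
	}
	// Stacked etcd: below 3 machines, deleting one leaves too few
	// members for the cluster to tolerate any further failure.
	return numMachines >= 3
}
```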
/hold I am making some more changes to this
/hold cancel
@benmoss thanks for this PR, removing upgrade is a good call.
Added a couple of comments.
I just took a look at the changes here. They appear to be a fairly substantial restructuring. Before doing major refactoring, we typically like to discuss proposed design changes, preferably in an issue. Would it be possible to put this on hold for the time being, and create a smaller PR that adds KCP/MHC remediation with little to no refactoring? Later, after we've had more time to go over your proposed design changes, we can revisit the larger set of changes you have here.
Sounds great, I'll start over
/milestone v0.3.7
if err := r.Client.Delete(ctx, machine); err != nil {
	return errors.Wrap(err, "failed to delete unhealthy machine")
}

patchHelper, err := patch.NewHelper(machine, r.Client)
if err != nil {
	return errors.Wrap(err, "failed to initialize patch helper")
}

conditions.MarkTrue(machine, clusterv1.MachineOwnerRemediatedCondition)
if err := patchHelper.Patch(ctx, machine); err != nil {
	return errors.Wrap(err, "failed to patch unhealthy machine")
}
Do we need to worry about possible issues if MachineOwnerRemediatedCondition is never set to True in the event that there is an error, crash, restart, etc between the deletion happening and the condition patch here?
What is the failure mode if that field is never set to True?
Is there any way we can add some type of reentrancy to ensure that we can go back and ensure it is set after a deletion is successful, but we may not have patched the condition?
I think it's maybe fine. I don't think there's even a reason we need to do this other than that it fulfills the contract we created; as far as I know there is no code planned or intended that would observe a True MachineOwnerRemediated condition and do anything with it.
It feels less than great to then ignore this potential problem, but I am loath to introduce a bunch of complexity to handle ensuring a useless bit-flip operation happens.
@benmoss I'm good with ignoring for now if there are no existing/proposed workflows that would be impacted by it. It would probably be good to add something to the MHC doc around it as a potential future concern, though.
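As an illustration of the reentrancy idea, one option would be to record the condition before issuing the delete, so a crash between the two steps is retried on the next reconcile. This is a sketch under that assumption; remediateMachine is a hypothetical name, not this PR's code:

```go
func (r *KubeadmControlPlaneReconciler) remediateMachine(ctx context.Context, machine *clusterv1.Machine) error {
	patchHelper, err := patch.NewHelper(machine, r.Client)
	if err != nil {
		return errors.Wrap(err, "failed to initialize patch helper")
	}

	// Patch the condition first: if we crash after the patch but before
	// the delete, the next reconcile still sees the unhealthy machine and
	// retries the deletion; if we crash after the delete, the condition
	// has already been recorded, so the bit flip is never lost.
	conditions.MarkTrue(machine, clusterv1.MachineOwnerRemediatedCondition)
	if err := patchHelper.Patch(ctx, machine); err != nil {
		return errors.Wrap(err, "failed to patch unhealthy machine")
	}

	return errors.Wrap(r.Client.Delete(ctx, machine), "failed to delete unhealthy machine")
}
```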
if machine.Status.NodeRef == nil {
	return defaultTolerance
}
I believe this is safe based on the ordering of operations for kubeadm join, but it may be good to add a comment to that effect, since technically a static pod deployment could be running even if there is no Node in the cluster yet.
I decided to do this because we can't determine the machine's responsiveness if it doesn't have a NodeRef. As far as I can imagine, it wouldn't make sense for a machine without a NodeRef to be running the etcd static pod and therefore be an etcd member, but I'm not certain of that.
I don't entirely know what comment to add to represent that 😸
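For what it's worth, a comment along the following lines might capture the reasoning; this is a suggested sketch, not a statement of kubeadm's guarantees:

```go
// A Machine without a NodeRef has not observably completed kubeadm join,
// so there is no way to probe the responsiveness of its etcd member, if
// one even exists. Technically the etcd static pod could already be
// running before the Node object appears, but without a NodeRef we fall
// back to the default tolerance rather than guessing.
if machine.Status.NodeRef == nil {
	return defaultTolerance
}
```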
@benmoss thanks for updating the PR, really appreciated!
I wrote up a doc explaining the design goals of this feature here: https://docs.google.com/document/d/1hJza3X-XbVV_yczB5N6vXbl_97D0bOVQ0OwGovcnth0/edit?usp=sharing
Do not remediate unless:
- we have at least 3 machines
- etcd quorum will be preserved
- we have sufficient replicas (don't need to scale up)
Co-authored-by: Jason DeTiberus <[email protected]>
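The commit message above lists three preconditions, and a later review comment suggests gathering them in RemediationAllowed. Here is a hedged sketch of what that might look like; the ControlPlane receiver and the etcdQuorumPreserved/hasSufficientReplicas helpers are assumptions, not this PR's API:

```go
// Sketch only: helper names are hypothetical.
func (c *ControlPlane) RemediationAllowed() bool {
	// Never remediate below 3 machines; with stacked etcd, deleting a
	// member of a smaller cluster risks losing quorum outright.
	if len(c.Machines) < 3 {
		return false
	}
	// Removing the unhealthy member must still leave a healthy majority
	// of etcd members.
	if !c.etcdQuorumPreserved() {
		return false
	}
	// Don't remediate while a scale-up is still needed; deleting a
	// machine would push us further below the desired replica count.
	return c.hasSufficientReplicas()
}
```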
Force-pushed from fb7a3f6 to 2590f2a
/milestone v0.4.0
Move the logic inside RemediationAllowed; it makes more sense to have all the logic in one method.
Make sure you also update the docs to reflect this change: https://cluster-api.sigs.k8s.io/tasks/healthcheck.html#limitations-and-caveats-of-a-machinehealthcheck says control plane machines are not supported.
@benmoss: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What this PR does / why we need it:
Adds MHC remediation to KCP.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2976