✨ Add MHC remediation to KCP #3185

Closed
benmoss wants to merge 7 commits from the kcp-remediation branch

Conversation

benmoss

@benmoss benmoss commented Jun 11, 2020

What this PR does / why we need it:
Adds MHC remediation to KCP.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2976

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 11, 2020
@k8s-ci-robot k8s-ci-robot requested review from ncdc and vincepri June 11, 2020 21:20
@benmoss
Author

benmoss commented Jun 11, 2020

There is probably a lot to discuss on this one. Maybe we can schedule some time for an overview or some kind of review session.

@vincepri
Member

/assign @ncdc @detiber

I won't have time to review until early next week

Member

@detiber detiber left a comment

I'm having a bit of a hard time following the new logic, not sure if it's mainly due to the github ui or the changes themselves.

I do see that we are short-circuiting remediation if the requested number of control plane replicas is less than 3; we really only need to do that when the KCP is not configured to use external etcd. I'm also not entirely sure how these changes ensure that other operations being taken would not affect the quorum of the etcd cluster.
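
To make the quorum concern concrete, here is a minimal Go sketch. The helper `canRemediate` and its parameters are hypothetical and are not taken from this PR's diff; the sketch also assumes that, with stacked etcd, every control plane machine hosts exactly one etcd member.

```go
package main

import "fmt"

// canRemediate is a hypothetical helper for illustration only; it is not the
// logic from this PR. It answers whether one unhealthy control plane machine
// may be removed without endangering etcd quorum. Other safety checks (for
// example API server availability) are out of scope here.
//
// Assumption: with stacked etcd, every control plane machine hosts exactly
// one etcd member, so totalMembers equals the number of machines.
func canRemediate(totalMembers, healthyMembers int, externalEtcd bool) bool {
	// External etcd: deleting a control plane machine does not change etcd
	// membership, so no quorum check is needed.
	if externalEtcd {
		return true
	}
	// Stacked etcd: the healthy members must hold quorum on their own.
	// Removing an unhealthy member then only shrinks the cluster, so the
	// healthy members still form a majority afterwards.
	quorum := totalMembers/2 + 1
	return healthyMembers >= quorum
}

func main() {
	fmt.Println(canRemediate(3, 2, false)) // 3 members, 1 unhealthy
	fmt.Println(canRemediate(2, 1, false)) // 2 members, 1 unhealthy
	fmt.Println(canRemediate(3, 1, true))  // external etcd
}
```

Under this rule, a stacked-etcd control plane with fewer than three replicas can never lose a member safely, which matches the "fewer than 3" short circuit, while an external-etcd control plane is not subject to it.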

controlplane/kubeadm/internal/cluster.go (outdated)
controlplane/kubeadm/controllers/upgrade.go (outdated)
controlplane/kubeadm/controllers/controller.go (outdated)
controlplane/kubeadm/internal/control_plane.go (outdated)
controlplane/kubeadm/internal/control_plane.go (outdated)
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 20, 2020
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 22, 2020
@benmoss
Author

benmoss commented Jun 22, 2020

/hold I am making some more changes to this

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 22, 2020
@benmoss
Author

benmoss commented Jun 22, 2020

/hold cancel

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 22, 2020
Member

@vincepri vincepri left a comment

/hold

After taking a quick look, it seems the amount of changes went beyond the original issue. Given that we're close to release for v0.3.7, I'd feel more comfortable if we reduce the scope to the minimum necessary to add the functionality.

How does that sound?

cc @detiber @ncdc

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 22, 2020

@sedefsavas sedefsavas left a comment

@benmoss thanks for this PR, removing upgrade is a good call.
Added a couple of comments.

controlplane/kubeadm/controllers/controller.go (outdated)
controlplane/kubeadm/internal/control_plane.go (outdated)
controlplane/kubeadm/controllers/controller.go (outdated)
controlplane/kubeadm/internal/control_plane.go (outdated)
@ncdc
Contributor

ncdc commented Jun 23, 2020

I just took a look at the changes here. They appear to be a fairly substantial restructuring. Before doing major refactoring, we typically like to discuss proposed design changes, preferably in an issue.

Would it be possible to put this on hold for the time being, and create a smaller PR that adds KCP/MHC remediation with little to no refactoring? Later, after we've had more time to go over your proposed design changes, we can revisit the larger set of changes you have here.

@benmoss
Author

benmoss commented Jun 23, 2020

Sounds great, I'll start over

@vincepri
Member

/milestone v0.3.7

@k8s-ci-robot k8s-ci-robot added this to the v0.3.7 milestone Jun 23, 2020
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 24, 2020
Comment on lines 628 to 630
	if err := r.Client.Delete(ctx, machine); err != nil {
		return errors.Wrap(err, "failed to delete unhealthy machine")
	}

	patchHelper, err := patch.NewHelper(machine, r.Client)
	if err != nil {
		return errors.Wrap(err, "failed to initialize patch helper")
	}

	conditions.MarkTrue(machine, clusterv1.MachineOwnerRemediatedCondition)
	if err := patchHelper.Patch(ctx, machine); err != nil {
		return errors.Wrap(err, "failed to patch unhealthy machine")
	}
Member

Do we need to worry about possible issues if MachineOwnerRemediatedCondition is never set to True in the event that there is an error, crash, restart, etc between the deletion happening and the condition patch here?

What is the failure mode if that field is never set to True?

Is there any way we can add some type of reentrancy to ensure that we can go back and ensure it is set after a deletion is successful, but we may not have patched the condition?
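
One possible shape for the reentrancy asked about here is sketched below in Go. It is an assumption for illustration, not what this PR does: the `machine` and `store` types and the `remediate` function are hypothetical stand-ins for the Machine object and the controller's client, and the sketch deliberately swaps the order so that the condition is recorded before the delete is issued.

```go
package main

import (
	"errors"
	"fmt"
)

// machine is a simplified, hypothetical stand-in for a Machine object.
type machine struct {
	name            string
	ownerRemediated bool // models the MachineOwnerRemediated condition
	deleted         bool
}

// store is a hypothetical stand-in for the API client used by the controller.
type store interface {
	MarkRemediated(m *machine) error
	Delete(m *machine) error
}

// remediate records the condition first and deletes second. Each step is
// skipped if it already happened, so calling remediate again after a failed
// or interrupted attempt only performs the remaining work.
func remediate(m *machine, s store) error {
	if !m.ownerRemediated {
		if err := s.MarkRemediated(m); err != nil {
			return fmt.Errorf("failed to patch unhealthy machine: %w", err)
		}
	}
	if !m.deleted {
		if err := s.Delete(m); err != nil {
			return fmt.Errorf("failed to delete unhealthy machine: %w", err)
		}
	}
	return nil
}

// flakyStore fails the first Delete call to simulate a crash or API error
// between the two steps.
type flakyStore struct {
	failNextDelete bool
	patches        int
	deletes        int
}

func (s *flakyStore) MarkRemediated(m *machine) error {
	s.patches++
	m.ownerRemediated = true
	return nil
}

func (s *flakyStore) Delete(m *machine) error {
	if s.failNextDelete {
		s.failNextDelete = false
		return errors.New("simulated transient failure")
	}
	s.deletes++
	m.deleted = true
	return nil
}

func main() {
	s := &flakyStore{failNextDelete: true}
	m := &machine{name: "control-plane-0"}

	fmt.Println("first attempt:", remediate(m, s))
	fmt.Println("second attempt:", remediate(m, s))
	fmt.Printf("remediated=%v deleted=%v patches=%d deletes=%d\n",
		m.ownerRemediated, m.deleted, s.patches, s.deletes)
}
```

With this ordering, an interruption between the two steps leaves a machine that is marked but not yet deleted, and the next reconcile can finish the job. Whether marking the condition before the deletion succeeds fits the intended contract is a separate design question.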

Author

@benmoss benmoss Jul 6, 2020

I think it's maybe fine, I don't think there's even a reason we need to do this other than it fulfills the contract we created, but as far as I know there is no code planned or intended that would observe a True MachineOwnerRemediated condition and do anything with it.

It feels less than great to then ignore this potential problem, but I am loath to introduce a bunch of complexity just to ensure that a useless bit flip happens.

Member

@benmoss I'm good with ignoring for now if there are no existing/proposed workflows that would be impacted by it. It would probably be good to add something to the MHC doc around it as a potential future concern, though.

controlplane/kubeadm/controllers/controller.go (outdated)
Comment on lines +351 to +364
	if machine.Status.NodeRef == nil {
		return defaultTolerance
	}
Member

I believe this is safe based on the ordering of operations for kubeadm join, but it may be good to add a comment to that effect, since technically a static pod deployment could be running even if there is no Node in the cluster yet.

Author

I decided to do this just because we can't determine the responsiveness of it if it doesn't have a NodeRef. As far as I can imagine it wouldn't make sense that a machine without a NodeRef could be running the etcd static pod and therefore be an etcd member, but I'm not certain of that.

I don't entirely know what comment to add to represent that 😸

controlplane/kubeadm/internal/workload_cluster_etcd.go (outdated)
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 2, 2020
Member

@fabriziopandini fabriziopandini left a comment

@benmoss thanks for updating the PR, really appreciated!

controlplane/kubeadm/controllers/controller.go (outdated)
controlplane/kubeadm/controllers/controller.go (outdated)
@benmoss benmoss force-pushed the kcp-remediation branch from dd29d09 to 62bd289 Compare July 6, 2020 18:09
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 6, 2020
@benmoss
Author

benmoss commented Jul 9, 2020

I wrote up a doc explaining the design goals of this feature here: https://docs.google.com/document/d/1hJza3X-XbVV_yczB5N6vXbl_97D0bOVQ0OwGovcnth0/edit?usp=sharing

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 15, 2020
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 24, 2020
Ben Moss and others added 6 commits August 3, 2020 12:47
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 3, 2020
@vincepri
Member

vincepri commented Aug 3, 2020

/milestone v0.4.0

@k8s-ci-robot k8s-ci-robot modified the milestones: v0.3.x, v0.4.0 Aug 3, 2020
Move the logic to inside RemediationAllowed, it makes more sense to have
all the logic be in the one method
@CecileRobertMichon
Contributor

Make sure you also update the docs to reflect this change; https://cluster-api.sigs.k8s.io/tasks/healthcheck.html#limitations-and-caveats-of-a-machinehealthcheck currently says control plane machines are not supported.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 28, 2020
@k8s-ci-robot
Contributor

@benmoss: PR needs rebase.


@benmoss benmoss closed this Aug 31, 2020
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

KCP unhealthy remediation support
8 participants