✨ Add MHC remediation to KCP #3185
Conversation
Lots to probably discuss on this one, maybe we can schedule some time to do an overview or some kind of review session.
I'm having a bit of a hard time following the new logic; I'm not sure if it's mainly due to the GitHub UI or the changes themselves.
I do see that we are short-circuiting remediation if the requested number of control plane machines is < 3, but we really only need to do that when the KCP is not configured to use external etcd. I'm also not entirely sure how these changes ensure that other operations being taken would not affect the quorum of the etcd cluster.
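To make the suggestion concrete, here is a minimal sketch of that short-circuit; the function name and exact field path are illustrative assumptions, not this PR's code:

```go
// Sketch: only enforce the 3-machine minimum when etcd is stacked on
// the control plane machines. With external etcd, deleting a control
// plane machine cannot affect etcd quorum.
func canRemediate(kcp *controlplanev1.KubeadmControlPlane, numMachines int) bool {
	cfg := kcp.Spec.KubeadmConfigSpec.ClusterConfiguration
	if cfg != nil && cfg.Etcd.External != nil {
		// External etcd: quorum is managed outside these machines.
		return true
	}
	// Stacked etcd: below 3 machines, deleting one leaves too few
	// members for the cluster to tolerate any further failure.
	return numMachines >= 3
}
```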
/hold I am making some more changes to this
/hold cancel
@benmoss thanks for this PR, removing upgrade is a good call.
Added a couple of comments.
I just took a look at the changes here. They appear to be a fairly substantial restructuring. Before doing major refactoring, we typically like to discuss proposed design changes, preferably in an issue. Would it be possible to put this on hold for the time being, and create a smaller PR that adds KCP/MHC remediation with little to no refactoring? Later, after we've had more time to go over your proposed design changes, we can revisit the larger set of changes you have here.
Sounds great, I'll start over
/milestone v0.3.7
if err := r.Client.Delete(ctx, machine); err != nil {
	return errors.Wrap(err, "failed to delete unhealthy machine")
}

patchHelper, err := patch.NewHelper(machine, r.Client)
if err != nil {
	return errors.Wrap(err, "failed to initialize patch helper")
}

conditions.MarkTrue(machine, clusterv1.MachineOwnerRemediatedCondition)
if err := patchHelper.Patch(ctx, machine); err != nil {
	return errors.Wrap(err, "failed to patch unhealthy machine")
}
Do we need to worry about possible issues if MachineOwnerRemediatedCondition is never set to True in the event that there is an error, crash, restart, etc between the deletion happening and the condition patch here?
What is the failure mode if that field is never set to True?
Is there any way we can add some type of reentrancy to ensure that we can go back and ensure it is set after a deletion is successful, but we may not have patched the condition?
I think it's maybe fine. I don't think there's even a reason we need to do this other than that it fulfills the contract we created; as far as I know there is no code planned or intended that would observe a True MachineOwnerRemediated condition and do anything with it.
It feels less than great to then ignore this potential problem, but I am loath to introduce a bunch of complexity to handle ensuring a useless bit-flip operation happens.
@benmoss I'm good with ignoring for now if there are no existing/proposed workflows that would be impacted by it. It would probably be good to add something to the MHC doc around it as a potential future concern, though.
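As an illustration of the reentrancy idea, one option would be to record the condition before issuing the delete, so a crash between the two steps is retried on the next reconcile. This is a sketch under that assumption; remediateMachine is a hypothetical name, not this PR's code:

```go
func (r *KubeadmControlPlaneReconciler) remediateMachine(ctx context.Context, machine *clusterv1.Machine) error {
	patchHelper, err := patch.NewHelper(machine, r.Client)
	if err != nil {
		return errors.Wrap(err, "failed to initialize patch helper")
	}

	// Patch the condition first: if we crash after the patch but before
	// the delete, the next reconcile still sees the unhealthy machine and
	// retries the deletion; if we crash after the delete, the condition
	// has already been recorded, so the bit flip is never lost.
	conditions.MarkTrue(machine, clusterv1.MachineOwnerRemediatedCondition)
	if err := patchHelper.Patch(ctx, machine); err != nil {
		return errors.Wrap(err, "failed to patch unhealthy machine")
	}

	return errors.Wrap(r.Client.Delete(ctx, machine), "failed to delete unhealthy machine")
}
```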
if machine.Status.NodeRef == nil {
	return defaultTolerance
}
I believe this is safe based on the ordering of operations for kubeadm join, but it may be good to add a comment to that effect, since technically a static pod deployment could be running even if there is no Node in the cluster yet.
I decided to do this because we can't determine the machine's responsiveness if it doesn't have a NodeRef. As far as I can imagine, it wouldn't make sense for a machine without a NodeRef to be running the etcd static pod and therefore be an etcd member, but I'm not certain of that.
I don't entirely know what comment to add to represent that 😸
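For what it's worth, a comment along the following lines might capture the reasoning; this is a suggested sketch, not a statement of kubeadm's guarantees:

```go
// A Machine without a NodeRef has not observably completed kubeadm join,
// so there is no way to probe the responsiveness of its etcd member, if
// one even exists. Technically the etcd static pod could already be
// running before the Node object appears, but without a NodeRef we fall
// back to the default tolerance rather than guessing.
if machine.Status.NodeRef == nil {
	return defaultTolerance
}
```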
@benmoss thanks for updating the PR, really appreciated!
I wrote up a doc explaining the design goals of this feature here: https://docs.google.com/document/d/1hJza3X-XbVV_yczB5N6vXbl_97D0bOVQ0OwGovcnth0/edit?usp=sharing
Do not remediate unless:
- we have at least 3 machines
- etcd quorum will be preserved
- we have sufficient replicas (don't need to scale up)
Co-authored-by: Jason DeTiberus <[email protected]>
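The commit message above lists three preconditions, and a later review comment suggests gathering them in RemediationAllowed. Here is a hedged sketch of what that might look like; the ControlPlane receiver and the etcdQuorumPreserved/hasSufficientReplicas helpers are assumptions, not this PR's API:

```go
// Sketch only: helper names are hypothetical.
func (c *ControlPlane) RemediationAllowed() bool {
	// Never remediate below 3 machines; with stacked etcd, deleting a
	// member of a smaller cluster risks losing quorum outright.
	if len(c.Machines) < 3 {
		return false
	}
	// Removing the unhealthy member must still leave a healthy majority
	// of etcd members.
	if !c.etcdQuorumPreserved() {
		return false
	}
	// Don't remediate while a scale-up is still needed; deleting a
	// machine would push us further below the desired replica count.
	return c.hasSufficientReplicas()
}
```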
Force-pushed from fb7a3f6 to 2590f2a
/milestone v0.4.0
Move the logic inside RemediationAllowed; it makes more sense to have all the logic in one method.
Make sure you also update the docs to reflect this change: https://cluster-api.sigs.k8s.io/tasks/healthcheck.html#limitations-and-caveats-of-a-machinehealthcheck says control plane machines are not supported.
@benmoss: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What this PR does / why we need it:
Adds MHC remediation to KCP.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2976