✨ Add nodeDeletionTimeout property to Machine #5608
Conversation
Hi @schrej. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
If this needs to be granular per pool of nodes, i.e. per MachineDeployment and per provider, it should be a field, but only if we think the use case is justified. Long term we might want to consider grouping node lifecycle related properties, e.g. nodeDrainTimeout/nodeDeletionTimeout, under the same struct field. Also, depending on the problem and scenarios, we might want to use a different angle, e.g. lifecycle hooks #5556.

@schrej could you please elaborate on the particular problem and scenarios being faced in metal3, so it's easier to understand and agree on the rationale for the best solution?
When, for whatever reason, node deletion fails with metal3, the node never gets deleted and remains in the cluster indefinitely. There is also no other component that takes care of deletion, unlike other providers where a CPI would do that. While retrying indefinitely doesn't guarantee that the node gets deleted, it's at least continuously attempted, and it is also visible in the control plane cluster that something is wrong.
Is there any particular predictable reason you've identified for this to happen? Would retrying indefinitely solve any of them? Have we considered e.g. maybe having a condition instead to address the visibility concern?
There are a number of race conditions that lead to this, like MHC kicking in on a slow provisioning and cleaning up a node that is just joining, or deletion of nodes on the backend taking a bit long, so the kubelet is still running when CAPI tries to delete the node, among others.
Thanks @MaxRink
How exactly does this make client.delete(node) fail? How can the kubelet be running when CAPI tries to delete the node? We intentionally only try to delete the Node after we ensure the underlying infra is gone [1], otherwise stateful assumptions might be broken [2].

[1] cluster-api/controllers/machine_controller.go, lines 367 to 376 at c52fdb3
[2] #2565
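(Since the embedded snippet from machine_controller.go is not reproduced here, the following is only a rough sketch of the ordering being described; the types and helper names are made up for illustration and are not the actual cluster-api code.)

```go
// Simplified sketch of the deletion ordering referenced above; not the
// actual machine_controller.go code.
package controllers

import "context"

type machineDeleter struct{}

func (d machineDeleter) infraMachineDeleted(ctx context.Context, machineName string) (bool, error) {
	// Stand-in for checking that the provider's InfraMachine object is gone.
	return true, nil
}

func (d machineDeleter) deleteNode(ctx context.Context, nodeName string) error {
	// Stand-in for deleting the Node object in the workload cluster.
	return nil
}

// reconcileDelete only removes the Node after the underlying infrastructure
// is confirmed gone, so stateful assumptions are not broken.
func (d machineDeleter) reconcileDelete(ctx context.Context, machineName, nodeName string) error {
	gone, err := d.infraMachineDeleted(ctx, machineName)
	if err != nil || !gone {
		return err // check again on the next reconcile
	}
	return d.deleteNode(ctx, nodeName)
}
```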
Like I said, it's not only an issue with Metal3 and not limited to those exact conditions above. That race condition can also arise due to the API server being unreachable for a moment as the stacked load balancer lease changes after a control plane upgrade, general network woes, implementation bugs in providers, ...
All of those seem to be temporary, exceptional cases, and I question whether they should be addressed by a runtime flag or a spec field that is meant to express user intent.

I think I see NodeDrainTimeout from a different angle, as it intersects with particular user workloads and user-defined node pools, therefore it provides value for users to customise it and it makes sense for it to be a spec field. IMO NodeDeletionTimeout is something intrinsic to the cluster lifecycle which has no intersection with user intent or workloads, so I'd expect CAPI to handle it without putting any burden on the user.

Based on the above, I wonder if we should approach the problem transparently for users and retry indefinitely by erroring, so controller-runtime would handle the rate limiting and backoff. Then #1446 would require either fixing the issue or deleting the cluster as suggested above, which seems reasonable.

Thoughts?
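(As a side note on the mechanism mentioned above: a minimal sketch, assuming a standard controller-runtime reconciler, of how simply returning the node deletion error yields rate-limited retries. The names here are illustrative, not the real machine controller.)

```go
package controllers

import (
	"context"
	"fmt"

	ctrl "sigs.k8s.io/controller-runtime"
)

// MachineReconciler is a stand-in for the real machine controller.
type MachineReconciler struct{}

// deleteNode is a hypothetical helper standing in for the node deletion step.
func (r *MachineReconciler) deleteNode(ctx context.Context, name string) error {
	// ... delete the Node object in the workload cluster ...
	return nil
}

// Reconcile: surfacing the error (instead of swallowing it) is enough for
// controller-runtime to requeue the Machine with exponential backoff, so the
// deletion is retried indefinitely without any user-facing knob.
func (r *MachineReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	if err := r.deleteNode(ctx, req.Name); err != nil {
		return ctrl.Result{}, fmt.Errorf("deleting node for machine %s: %w", req.Name, err)
	}
	return ctrl.Result{}, nil
}
```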
I initially thought the current behaviour is a bug, so I fully agree 😄. I never really understood the use case in that issue either. Maybe at that time the behaviour of the machine controller was different and it didn't consider cluster deletion yet?

A configurable timeout might make sense in addition though, not to solve this problem, but as a general feature. Imo that could also be a configurable flag, and doesn't need to be configurable on a per-cluster level, which just makes it more complex.
@schrej Are you ok proceeding with a field here instead of a flag? |
Sorry for the delay. I'll take a look next week. Currently there's a lot to do for the v1.1 release.
LGTM
/assign @sbueringer
Last nit, otherwise lgtm from my side.
@@ -152,6 +152,7 @@ func (in *KubeadmControlPlane) ValidateUpdate(old runtime.Object) error {
 	{spec, "machineTemplate", "infrastructureRef", "apiVersion"},
 	{spec, "machineTemplate", "infrastructureRef", "name"},
 	{spec, "machineTemplate", "nodeDrainTimeout"},
+	{spec, "machineTemplate", "nodeDeletionTimeout"},
Can you please also extend the unit test?
You only have to set a value in ~l.242 and then change it in ~l.367.
We just had a few cases in the past where we thought something was mutable, but it wasn't.
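(For illustration, a sketch of the kind of test change being asked for, assuming it is added next to the existing KubeadmControlPlane webhook update test; the function name and field paths are approximations, not the actual test code.)

```go
package v1beta1

import (
	"testing"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// TestNodeDeletionTimeoutIsMutable sets a value on the "before" object and
// changes it on the "after" object, then asserts the update is allowed.
func TestNodeDeletionTimeoutIsMutable(t *testing.T) {
	before := &KubeadmControlPlane{
		Spec: KubeadmControlPlaneSpec{
			MachineTemplate: KubeadmControlPlaneMachineTemplate{
				NodeDeletionTimeout: &metav1.Duration{Duration: 10 * time.Minute},
			},
		},
	}

	after := before.DeepCopy()
	after.Spec.MachineTemplate.NodeDeletionTimeout = &metav1.Duration{Duration: 20 * time.Minute}

	if err := after.ValidateUpdate(before); err != nil {
		t.Errorf("expected nodeDeletionTimeout to be mutable, got: %v", err)
	}
}
```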
@schrej EasyCLA seems to have an issue, but I think it should still be in non-blocking mode until 5th February.
Thx! lgtm from my side. I think we should probably squash the commits into one.
lgtm pending squash
/easycla
The machine controller now uses the configured timeout when deleting the Node belonging to a Machine. It will also requeue the Machine for reconciliation if deletion fails, which will lead to infinite retries.
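(A minimal sketch of the behavior described in that commit message, assuming a timeout of zero means retry indefinitely; the field and helper names are illustrative, not the merged implementation.)

```go
package controllers

import (
	"context"
	"time"
)

// machine is a stand-in carrying only the fields needed for this sketch.
type machine struct {
	DeletionTimestamp   time.Time
	NodeDeletionTimeout time.Duration // 0 is assumed to mean "retry indefinitely"
	NodeName            string
}

func deleteNode(ctx context.Context, nodeName string) error {
	// Stand-in for deleting the Node object in the workload cluster.
	return nil
}

// reconcileNodeDeletion keeps retrying Node deletion (by returning the error,
// which requeues the Machine) until it succeeds or the configured timeout
// since the Machine's deletion has elapsed.
func reconcileNodeDeletion(ctx context.Context, m machine) error {
	if m.NodeDeletionTimeout > 0 && time.Since(m.DeletionTimestamp) > m.NodeDeletionTimeout {
		// Timeout exceeded: give up on the Node and let Machine deletion proceed.
		return nil
	}
	if err := deleteNode(ctx, m.NodeName); err != nil {
		return err // requeued and retried
	}
	return nil
}
```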
Squashed. Sorry for the delay, but it's skiing season 😬
No worries. Thx!
/lgtm
@sbueringer: GitHub didn't allow me to request PR reviews from the following users: for, approval. Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Given that we were only waiting for the squash and already had lgtms before:
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
What this PR does / why we need it:
Adds a nodeDeletionTimeout property to Machines to specify the timeout when deleting the Node belonging to a Machine.

This contains the following change to Machine behavior:
Currently, errors occurring during Node deletion are ignored, which was introduced due to #1446.
With this PR, Node deletion errors are returned by the reconcile function, which causes the Machine to get re-queued for reconciliation. This leads to an infinite retry of Node deletion.
I've also added a test for machine deletion.
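(For reference, a sketch of roughly what the new field looks like on the Machine API type; the godoc text and markers here are illustrative, the authoritative version is the PR diff itself.)

```go
package v1beta1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// MachineSpec (heavily truncated sketch): only the two timeout fields are shown.
type MachineSpec struct {
	// NodeDrainTimeout is the total amount of time the controller spends
	// draining the node; shown for context, it already existed before this PR.
	// +optional
	NodeDrainTimeout *metav1.Duration `json:"nodeDrainTimeout,omitempty"`

	// NodeDeletionTimeout defines how long the controller keeps attempting to
	// delete the Node that the Machine hosts after the Machine is marked for
	// deletion.
	// +optional
	NodeDeletionTimeout *metav1.Duration `json:"nodeDeletionTimeout,omitempty"`
}
```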
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

closes #5516