
✨ Add nodeDeletionTimeout property to Machine #5608

Merged

Conversation

schrej
Member

@schrej schrej commented Nov 8, 2021

What this PR does / why we need it:
Adds a nodeDeletionTimeout property to Machines to specify the timeout when deleting the Node belonging to a Machine.
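
For illustration, the new field could look roughly like the sketch below on the Machine spec (a minimal sketch assuming a *metav1.Duration type; it shows the proposed shape, not necessarily the exact merged API):

// Sketch only: shape of the proposed field, not necessarily the exact merged API.
package v1beta1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type MachineSpec struct {
	// ... existing fields such as NodeDrainTimeout ...

	// NodeDeletionTimeout bounds how long the controller keeps retrying deletion of
	// the Node belonging to this Machine once the Machine is being deleted. In this
	// sketch, a nil or zero value means deletion is retried indefinitely.
	NodeDeletionTimeout *metav1.Duration `json:"nodeDeletionTimeout,omitempty"`
}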

This contains the following change to Machine behavior:
Currently, errors occurring during Node deletion are ignored, which was introduced due to #1446.
With this PR, Node deletion errors are returned by the reconcile function, which causes the Machine to get re-queued for reconciliation. This leads to an infinite retry of Node deletion.

I've also added a test for machine deletion.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
closes #5516

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 8, 2021
@k8s-ci-robot
Contributor

Hi @schrej. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Nov 8, 2021
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 8, 2021
@schrej schrej force-pushed the feature/node-drain-timeout-flag branch 2 times, most recently from bdfbba3 to a827644 on November 8, 2021 15:53
@MaxRink
Contributor

MaxRink commented Nov 8, 2021

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 8, 2021
@enxebre
Member

enxebre commented Nov 10, 2021

The reason behind that issue is a problem we've faced with metal3, and the only proper solution for that is waiting indefinitely.

If this needs to be granular per pool of nodes (i.e. per MachineDeployment) and per provider, it should be a field, but only if we think the use case is justified. Long term we might want to consider grouping node-lifecycle-related properties, e.g. the drain timeout and the node deletion timeout, under the same struct field. Also, depending on the problem and scenarios, we might want to use a different angle, e.g. lifecycle hooks (#5556).
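
As a purely hypothetical illustration of that grouping idea (not an existing or proposed API), such a struct could look like:

// Hypothetical sketch: grouping node lifecycle settings under a single struct field.
package v1beta1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type NodeLifecycle struct {
	// DrainTimeout is how long to wait for the node to drain before giving up.
	DrainTimeout *metav1.Duration `json:"drainTimeout,omitempty"`
	// DeletionTimeout is how long to keep retrying deletion of the Node object.
	DeletionTimeout *metav1.Duration `json:"deletionTimeout,omitempty"`
}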

@schrej could you please elaborate on the particular problem and scenarios being faced in metal3 so it's easier to understand and agree on the rationale for the best solution?

@schrej
Member Author

schrej commented Nov 10, 2021

When, for whatever reason, node deletion fails with metal3, the node will never get deleted and will remain in the cluster indefinitely. There is also no other component that takes care of deletion, unlike in other providers where a CPI would do that.
If a new machine gets created with the same IP address as the one whose node wasn't deleted, you end up with two nodes in the cluster that have the same IP address.
In our case this prevented Calico from deploying properly to that new node, as it doesn't like two nodes having the same IP.

While retrying indefinitely doesn't guarantee that the node gets deleted, it's at least continuously attempted, and it is also visible in the control plane cluster that something is wrong.

@enxebre
Member

enxebre commented Nov 10, 2021

for whatever reason

Is there any particular, predictable reason you've identified for this to happen? Would retrying indefinitely solve any of them? Have we considered, e.g., maybe having a condition instead to address the visibility concern?

@MaxRink
Contributor

MaxRink commented Nov 10, 2021

for whatever reason

Is there any particular, predictable reason you've identified for this to happen? Would retrying indefinitely solve any of them? Have we considered, e.g., maybe having a condition instead to address the visibility concern?

There are a number of race conditions that lead to this, like MHC kicking in on a slow provisioning and cleaning up a node that is just joining, deletion of nodes on the backend taking a bit long so the kubelet is still running when CAPI tries to delete the node, and others.
All of them can be resolved by retrying the deletion later on though, once the kubelet is not running anymore.

@enxebre
Member

enxebre commented Nov 10, 2021

Thanks @MaxRink

There are a number of race conditions that lead to this, like MHC kicking in on a slow provisioning and cleaning up a node that is just joining, deletion of nodes on the backend taking a bit long so the kubelet is still running when CAPI tries to delete the node, and others.

How exactly does this make client.delete(node) fail?

How can the kubelet be running when CAPI tries to delete the node? We intentionally only try to delete the Node after we ensure the underlying infra is gone [1], otherwise stateful assumptions might be broken [2].
Is this any different in the metal3 context?

[1]

if ok, err := r.reconcileDeleteInfrastructure(ctx, m); !ok || err != nil {
	return ctrl.Result{}, err
}
if ok, err := r.reconcileDeleteBootstrap(ctx, m); !ok || err != nil {
	return ctrl.Result{}, err
}
// We only delete the node after the underlying infrastructure is gone.
// https://github.com/kubernetes-sigs/cluster-api/issues/2565

[2] #2565

@MaxRink
Contributor

MaxRink commented Nov 10, 2021

Like I said, it's not only an issue with Metal3, and it's not limited to those exact conditions above. That race condition can also arise due to the API Server being unreachable for a moment as the stacked load balancer lease changes after a control plane upgrade, general network woes, implementation bugs in providers, ...

@enxebre
Member

enxebre commented Nov 12, 2021

That race condition can also arise due to the API Server being unreachable for a moment as the stacked load balancer lease changes after a control plane upgrade, general network woes, implementation bugs in providers, ...

All of those seem to be temporary, exceptional cases, and I question whether they should be addressed by a runtime flag or a spec field that is meant to express user intent.

We already have NodeDrainTimeout so it would be nicely consistent to have another field like NodeDeletionTimeout.

I think I see NodeDrainTimeout from a different angle, as it intersects with particular user workloads and user-defined node pools; therefore it provides value for users to customise it, and it makes sense for it to be a spec field. IMO NodeDeletionTimeout is something intrinsic to the cluster lifecycle which has no intersection with user intent or workloads, so I'd expect CAPI to handle it without putting any burden on the user.

Based on the above, I wonder if we should consider approaching the problem transparently for users and retrying indefinitely by returning the error, so controller-runtime would handle the rate limiting and backoff. Then #1446 would require either fixing the issue or deleting the cluster as suggested above, which seems reasonable, and isDeleteNodeAllowed already accounts for it. E.g.:

if isDeleteNodeAllowed {
	log.Info("Deleting node", "node", m.Status.NodeRef.Name)

	var deleteNodeErr error
	waitErr := wait.PollImmediate(2*time.Second, 10*time.Second, func() (bool, error) {
		if deleteNodeErr = r.deleteNode(ctx, cluster, m.Status.NodeRef.Name); deleteNodeErr != nil && !apierrors.IsNotFound(errors.Cause(deleteNodeErr)) {
			return false, nil
		}
		return true, nil
	})
	if waitErr != nil {
		conditions.MarkFalse(m, clusterv1.MachineNodeHealthyCondition, clusterv1.DeletionFailedReason, clusterv1.ConditionSeverityWarning, "")
		r.recorder.Eventf(m, corev1.EventTypeWarning, "FailedDeleteNode", "error deleting Machine's node: %v", deleteNodeErr)
		return ctrl.Result{}, deleteNodeErr
	}
}

Thoughts?

@schrej
Member Author

schrej commented Nov 15, 2021

Based on the above I wonder if we should consider to approach the problem transparently for users and retry indefinitely by erroring so controller runtime would handle the rate limiting and backoff. Then #1446 would require to either fix the issue or delete the cluster as suggested above which seems reasonable and isDeleteNodeAllowed already accounts for it.

I initially thought the current behaviour was a bug, so I fully agree 😄. I never really understood the use case in that issue either. Maybe the behaviour of the machine controller was different at that time, and it didn't consider cluster deletion yet?

A configurable timeout might make sense in addition though. Not to solve this problem, but as a general feature. IMO that could also be a configurable flag, and it doesn't need to be configurable on a per-cluster level, which just makes it more complex.

@vincepri
Member

vincepri commented Dec 1, 2021

@schrej Are you ok proceeding with a field here instead of a flag?

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 1, 2021
@schrej
Member Author

schrej commented Dec 1, 2021

@schrej Are you ok proceeding with a field here instead of a flag?

I am. But what will it be? WaitForNodeDeletion or NodeDeletionTimeout?
With the latter we would always wait for node deletion again (see previous comment), which would be a behaviour change.

@vincepri
Member

vincepri commented Dec 1, 2021

NodeDeletionTimeout sgtm

@schrej schrej force-pushed the feature/node-drain-timeout-flag branch from a827644 to 5d8e3b4 on December 6, 2021 15:57
@linux-foundation-easycla

linux-foundation-easycla bot commented Dec 6, 2021

CLA Signed

The committers are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 6, 2021
@schrej schrej force-pushed the feature/node-drain-timeout-flag branch from 5d8e3b4 to 1ea72e5 on December 7, 2021 17:12
@schrej schrej changed the title ✨ Add --wait-for-node-deletion flag ✨ Add nodeDeletionTimeout property to Machine Dec 7, 2021
@schrej schrej force-pushed the feature/node-drain-timeout-flag branch from 424ef66 to f10f591 on January 25, 2022 10:01
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 25, 2022
@schrej schrej force-pushed the feature/node-drain-timeout-flag branch from f10f591 to 38fc2c3 on January 25, 2022 17:50
@schrej schrej requested a review from sbueringer January 25, 2022 18:16
@sbueringer
Member

Sorry for the delay. I'll take a look next week. Currently a lot to do for the v1.1 release.

Member

@vincepri vincepri left a comment

LGTM
/assign @sbueringer

Member

@sbueringer sbueringer left a comment

Last nit, otherwise lgtm from my side.

@@ -152,6 +152,7 @@ func (in *KubeadmControlPlane) ValidateUpdate(old runtime.Object) error {
 		{spec, "machineTemplate", "infrastructureRef", "apiVersion"},
 		{spec, "machineTemplate", "infrastructureRef", "name"},
 		{spec, "machineTemplate", "nodeDrainTimeout"},
+		{spec, "machineTemplate", "nodeDeletionTimeout"},
Member

Can you please also extend the unit test?

You only have to set a value in ~l.242 and then change it in ~367.

We just had a few cases in the past where we thought something was mutable, but it wasn't.

@sbueringer
Member

@schrej EasyCLA seems to have an issue, but I think it should still be in non-blocking mode until 5th February.

@schrej schrej force-pushed the feature/node-drain-timeout-flag branch from e504706 to f3b520e on February 1, 2022 09:33
@sbueringer
Member

Thx! lgtm from my side. I think we should probably squash the commits into one.

@fabriziopandini
Member

lgtm pending squash
great job @schrej!

@schrej
Member Author

schrej commented Feb 10, 2022

/easycla

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 10, 2022
The machine controller now uses the configured timeout when deleting
the Node belonging to a Machine. It will also requeue the Machine for
reconciliation if deletion fails, which will lead to infinite retries.
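
To make the described behaviour concrete, the timeout check it implies might look roughly like this sketch (illustrative only, not the merged implementation; the helper name and signature are made up):

// Hypothetical sketch, not the actual controller code.
package machine

import "time"

// nodeDeletionTimeoutExceeded reports whether the configured nodeDeletionTimeout has
// elapsed since the Machine's deletionTimestamp, i.e. whether the controller should
// stop retrying Node deletion and finish removing the Machine.
func nodeDeletionTimeoutExceeded(deletionTimestamp time.Time, timeout time.Duration) bool {
	if timeout == 0 {
		// No timeout configured: keep returning the deletion error so that
		// controller-runtime requeues the Machine with backoff (infinite retries).
		return false
	}
	return time.Now().After(deletionTimestamp.Add(timeout))
}
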
@schrej schrej force-pushed the feature/node-drain-timeout-flag branch from f3b520e to 0d8d072 on February 14, 2022 09:58
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Feb 14, 2022
@schrej
Member Author

schrej commented Feb 14, 2022

Squashed.

Sorry for the delay, but it's skiing season 😬

@sbueringer
Member

No worries. Thx!

/lgtm
/cc @fabriziopandini for approval

@k8s-ci-robot
Contributor

@sbueringer: GitHub didn't allow me to request PR reviews from the following users: for, approval.

Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

No worries. Thx!

/lgtm
/cc @fabriziopandini for approval

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 14, 2022
@sbueringer
Member

Given that we were only waiting for the squash and already had lgtm's before
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 22, 2022
@k8s-ci-robot k8s-ci-robot merged commit 4c44d05 into kubernetes-sigs:main Feb 22, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.2 milestone Feb 22, 2022
@schrej schrej deleted the feature/node-drain-timeout-flag branch March 9, 2022 18:08