✨ machine: Introduce Deletion status field and add timestamps for drain and volumeDetach instead of using the condition #11166
Conversation
/test help
@chrischdi: The specified target(s) for
The following commands are available to trigger optional jobs:
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/test pull-cluster-api-e2e-main
Force-pushed "… and volumeDetach instead of using the condition" from 0451196 to b2ddb80 (Compare)
/test pull-cluster-api-e2e-main
/retest
/retest different flake
/assign sbueringer fabriziopandini
/test help
@chrischdi: The specified target(s) for
The following commands are available to trigger optional jobs:
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/test pull-cluster-api-e2e-main
Nice, also qualifies as a backport candidate IMO
Also changed the PR type to give some more visibility
/lgtm
cc @vincepri @JoelSpeed for the API changes
LGTM label has been added. Git tree hash: 5d3b081c89d004ac999478da2bc5bf7fc9f47730
I'm slightly confused by the premise of this PR given my understanding of @fabriziopandini's recent update to status.
A condition should be a perfectly suitable way to represent the drain state, but you mention there have been issues, do we have any of those written up that you can link to?
> Both are used to track the starting time and used to check if we reached the timeouts instead of relying on the clusterv1.Condition's lastTransitionTime.

LastTransitionTime should only be changed when the status changes from true to false or vice versa, does that cause an issue here? Has it been behaving in that way?

> This is to get rid of the issue that the lastTransitionTime gets updated due to changes to the condition and to prepare for the changes for v1beta2, where conditions don't have a lastTransitionTime anymore.

It sounds like yes, it sounds like any change to the condition has resulted in the transition time being updated, which is incorrect condition behaviour.
Also, lastTransitionTime is not going anywhere, see the docs for metav1.Condition here. The current implementation is compatible with the future set out in Fabrizio's plan.
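For context, a minimal sketch (not the PR's actual code) of how a timeout check can rely on a dedicated start-time field instead of a condition's lastTransitionTime; the field and type names, and Status.Deletion being a pointer, are assumptions based on this PR:

```go
import (
	"time"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// nodeDrainTimeoutExceeded is a sketch only: it reports whether the configured
// NodeDrainTimeout has elapsed, measured from the dedicated NodeDrainStartTime
// field rather than from a condition's lastTransitionTime.
func nodeDrainTimeoutExceeded(machine *clusterv1.Machine) bool {
	// A nil or zero timeout means "wait forever".
	if machine.Spec.NodeDrainTimeout == nil || machine.Spec.NodeDrainTimeout.Duration == 0 {
		return false
	}
	// Drain has not started yet, so the timeout cannot have elapsed.
	if machine.Status.Deletion == nil || machine.Status.Deletion.NodeDrainStartTime == nil {
		return false
	}
	deadline := machine.Status.Deletion.NodeDrainStartTime.Add(machine.Spec.NodeDrainTimeout.Duration)
	return time.Now().After(deadline)
}
```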
// MachineStatusDeletion is the deletion state of the Machine.
type MachineStatusDeletion struct {
	// NodeDrainStartTime is the time when the drain of the node started.
	NodeDrainStartTime *metav1.Time `json:"nodeDrainStartTime,omitempty"`
Do we also need a finish time? Same for the other field?
I personally don't think that the drain or detach finish time is important info to have in the API (users & SREs mostly care about what is going on now and eventually why it is stuck; they rarely care about what happened afterwards, and for that the logs are more exhaustive).
But no strong opinion.
If I know that these have started, how do I know that they have finished if I don't have some field to tell me? 🤔 I realise eventually the machine is going away, but what if it gets stuck terminating the instance? Will that show up somewhere, and will I know that drain and volume detach are done?
> how do I know that they have finished if I don't have some field to tell me? 🤔

From the controller's perspective we (at least currently) do not care:
- either the controller tries to drain again, which should be a no-op (happy path)
- or the drain is skipped because the timeout is reached.
From the user's perspective: the information about where the deletion is at should be part of the Deleting condition, I'd say.
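In controller terms that could look roughly like the fragment below (drainNode and the requeue interval are hypothetical names for illustration; nodeDrainTimeoutExceeded is the check sketched earlier):

```go
// Sketch only: either skip the drain because the timeout elapsed, or retry it;
// draining an already-drained node is a no-op, so retrying is safe.
if nodeDrainTimeoutExceeded(machine) {
	// Timeout reached: give up on draining and continue with the deletion flow.
} else if err := r.drainNode(ctx, machine); err != nil {
	// Drain not finished yet: requeue and try again on the next reconcile.
	return ctrl.Result{RequeueAfter: 20 * time.Second}, nil
}
```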
I think in general the new Deleting condition should make clear at which phase of the deletion workflow we are (including making clear which parts are already completed)
api/v1beta1/machine_types.go
Outdated
}

// ANCHOR_END: MachineStatus

// MachineStatusDeletion is the deletion state of the Machine.
type MachineStatusDeletion struct {
	// NodeDrainStartTime is the time when the drain of the node started.
I feel like we can probably add some more context to this: what does it mean when it's not present, for example? What does it mean if it has elapsed for some period and the Machine is still here? What hints can we give to end users?
Tried to add some more context on both keys.
api/v1beta1/machine_types.go
Outdated
}

// ANCHOR_END: MachineStatus

// MachineStatusDeletion is the deletion state of the Machine.
type MachineStatusDeletion struct {
Should we include information here related to the drain, such as the configuration for the timeout?
Since the drain fields are optional in the spec, it would be good perhaps to show the configured values here so that you can correlate between start time and expected end time just by looking at the status?
This is an interesting idea; I'm not really sure how we can represent that we are waiting forever in a clear way (without showing timeout 0).
A value of -1 potentially? But you're right, timeout 0 is awkward 🤔
I think the status that it is still waiting should then be part of the Deleting condition message? 🤔 (or, as long as that one is not around, the DrainSucceeded condition message).
Yep, having this in the conditions seems the easiest way to address this
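As a rough illustration of that direction (the message wording and the helper are made up for the example, not what this PR implements):

```go
import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// drainWaitMessage is a sketch only: it builds a condition message that surfaces
// how long we have been draining and what timeout (if any) is configured.
func drainWaitMessage(startTime *metav1.Time, timeout *metav1.Duration) string {
	if startTime == nil {
		return "Waiting for node drain to start"
	}
	waited := time.Since(startTime.Time).Round(time.Second)
	if timeout == nil || timeout.Duration == 0 {
		return fmt.Sprintf("Draining the node (since %s, no timeout configured)", waited)
	}
	return fmt.Sprintf("Draining the node (since %s, timeout %s)", waited, timeout.Duration)
}
```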
First of all, thanks for the great review!
Yes, that's the current behavior in the utils, but it will be fixed with the v1beta2 conditions.
That's true, but from the proposal: the target conditions for the final state of the Machine do not contain them. The information about the deletion status they showed should be moved / made available at the Deleted condition instead.
api/v1beta1/machine_types.go
Outdated
type MachineStatusDeletion struct {
	// NodeDrainStartTime is the time when the drain of the node started.
	// Only present when the Machine has a deletionTimestamp, is being removed from the cluster
	// and draining the node had been started.
Why would drain not have been started when there's a deletion timestamp?
Waiting for pre-drain hooks is one thing.
Note: the best pointer for an overview of the deletion process may be this: https://main.cluster-api.sigs.k8s.io/tasks/automated-machine-management/machine_deletions
I was thinking it might be useful to point a user to that, or at least give them a hint as to why node draining might not have started. Can we include a line?
Another one may be that the node is excluded from the drain entirely (there is an annotation for it).
I shortened it to:
// NodeDrainStartTime is the time when the drain of the node started and is used to determine
// if the NodeDrainTimeout is exceeded.
// Only present when the Machine has a deletionTimestamp and draining the node had been started.
Happy to get other suggestions though :-)
One way would be to say:
// NodeDrainStartTime is the time when the drain of the node started and is used to determine
// if the NodeDrainTimeout is exceeded.
// Only present when the Machine has a deletionTimestamp and draining the node had been started.
// NodeDrainStartTime may not be set because the deletion is blocked by a pre-drain hook or draining is skipped for the machine.
But doing the same for WaitForNodeVolumeDetachStartTime would get very verbose 🤔:
// WaitForNodeVolumeDetachStartTime is the time when waiting for volume detachment started
// and is used to determine if the NodeVolumeDetachTimeout is exceeded.
// Detaching volumes from nodes is usually done by CSI implementations and the current state
// is observed from the node's `.Status.VolumesAttached` field.
// Only present when the Machine has a deletionTimestamp and waiting for volume detachments had been started.
// WaitForNodeVolumeDetachStartTime may not be set because the deletion is blocked by a pre-drain hook, stuck in drain or waiting for volume detachment is skipped for the machine.
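Putting the snippets and proposed comments from this thread together, the struct under discussion would look roughly like this (assembled here for readability, not the merged code):

```go
// MachineStatusDeletion is the deletion state of the Machine.
type MachineStatusDeletion struct {
	// NodeDrainStartTime is the time when the drain of the node started and is used to determine
	// if the NodeDrainTimeout is exceeded.
	// Only present when the Machine has a deletionTimestamp and draining the node had been started.
	// +optional
	NodeDrainStartTime *metav1.Time `json:"nodeDrainStartTime,omitempty"`

	// WaitForNodeVolumeDetachStartTime is the time when waiting for volume detachment started
	// and is used to determine if the NodeVolumeDetachTimeout is exceeded.
	// Detaching volumes from nodes is usually done by CSI implementations and the current state
	// is observed from the node's `.Status.VolumesAttached` field.
	// Only present when the Machine has a deletionTimestamp and waiting for volume detachments had been started.
	// +optional
	WaitForNodeVolumeDetachStartTime *metav1.Time `json:"waitForNodeVolumeDetachStartTime,omitempty"`
}
```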
Last nit from my side, otherwise lgtm
/lgtm
LGTM label has been added. Git tree hash: 2d7929ac4d700f1cf20b236e8e815c0509393408
@JoelSpeed @fabriziopandini Fine to merge from your side as well?
Looks good to me!
/lgtm
	// +optional
	NodeDrainStartTime *metav1.Time `json:"nodeDrainStartTime,omitempty"`

	// waitForNodeVolumeDetachStartTime is the time when waiting for volume detachment started
// waitForNodeVolumeDetachStartTime is the time when waiting for volume detachment started
// WaitForNodeVolumeDetachStartTime is the time when waiting for volume detachment started |
See #11166 (comment)
xref: #11238
/lgtm
Thx everyone! /approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: sbueringer. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
…in and volumeDetach instead of using the condition (kubernetes-sigs#11166)
* machine: Introduce Deletion status field and add timestamps for drain and volumeDetach instead of using the condition
* fix tests
* make generate
* review fixes
* fix openapi gen
* review fixes
* fix
What this PR does / why we need it:
Introduces Machine.status.deletion with two new fields:
- nodeDrainStartTime
- nodeVolumeDetachStartTime
Both are used to track the starting time and to check whether we reached the timeouts, instead of relying on the clusterv1.Condition's lastTransitionTime.
This is to get rid of the issue that the lastTransitionTime gets updated due to changes to the condition, and to prepare for the changes for v1beta2, ~~where conditions don't have a lastTransitionTime anymore~~ which drop the separate conditions for drain and wait for detach.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #11126
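To make the start-time semantics concrete, a sketch (assuming the field and type names above) of how the controller might record the value exactly once:

```go
// Sketch only: set NodeDrainStartTime the first time we begin draining, so later
// reconciles measure the timeout from the first attempt rather than from a
// condition's lastTransitionTime.
if machine.Status.Deletion == nil {
	machine.Status.Deletion = &clusterv1.MachineStatusDeletion{}
}
if machine.Status.Deletion.NodeDrainStartTime == nil {
	now := metav1.Now()
	machine.Status.Deletion.NodeDrainStartTime = &now
}
```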
/area machine