✨ machine: Introduce Deletion status field and add timestamps for drain and volumeDetach instead of using the condition #11166
Conversation
/test help
@chrischdi: The specified target(s) for
The following commands are available to trigger optional jobs:
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/test pull-cluster-api-e2e-main
Force-pushed "… and volumeDetach instead of using the condition" from 0451196 to b2ddb80 (Compare)
/test pull-cluster-api-e2e-main
/retest
/retest different flake
/assign sbueringer fabriziopandini
/test help
@chrischdi: The specified target(s) for
The following commands are available to trigger optional jobs:
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/test pull-cluster-api-e2e-main
Nice, also qualifies as a backport candidate IMO
Also changed the PR type to give some more visibility
/lgtm
cc @vincepri @JoelSpeed for the API changes
LGTM label has been added. Git tree hash: 5d3b081c89d004ac999478da2bc5bf7fc9f47730
I'm slightly confused by the premise of this PR given my understanding of @fabriziopandini's recent update to status.
A condition should be a perfectly suitable way to represent the drain state, but you mention there have been issues, do we have any of those written up that you can link to?
> Both are used to track the starting time and used to check if we reached the timeouts instead of relying on the clusterv1.Condition's lastTransitionTime.

LastTransitionTime should only be changed when the status changes from true to false or vice versa, does that cause an issue here? Has it been behaving in that way?

> This is to get rid of the issue that the lastTransitionTime gets updated due to changes to the condition and to prepare for the changes for v1beta2, where conditions don't have a lastTransitionTime anymore.

It sounds like yes, it sounds like any change to the condition has resulted in the transition time being updated, which is incorrect condition behaviour.
Also, lastTransitionTime is not going anywhere, see the docs for metav1.Condition here. The current implementation is compatible with the future set out in Fabrizio's plan.
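For context, a minimal sketch (not the PR's actual code) of how a timeout check can rely on a dedicated start-time field instead of a condition's lastTransitionTime; the field and type names, and Status.Deletion being a pointer, are assumptions based on this PR:

```go
import (
	"time"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// nodeDrainTimeoutExceeded is a sketch only: it reports whether the configured
// NodeDrainTimeout has elapsed, measured from the dedicated NodeDrainStartTime
// field rather than from a condition's lastTransitionTime.
func nodeDrainTimeoutExceeded(machine *clusterv1.Machine) bool {
	// A nil or zero timeout means "wait forever".
	if machine.Spec.NodeDrainTimeout == nil || machine.Spec.NodeDrainTimeout.Duration == 0 {
		return false
	}
	// Drain has not started yet, so the timeout cannot have elapsed.
	if machine.Status.Deletion == nil || machine.Status.Deletion.NodeDrainStartTime == nil {
		return false
	}
	deadline := machine.Status.Deletion.NodeDrainStartTime.Add(machine.Spec.NodeDrainTimeout.Duration)
	return time.Now().After(deadline)
}
```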
// MachineStatusDeletion is the deletion state of the Machine.
type MachineStatusDeletion struct {
	// NodeDrainStartTime is the time when the drain of the node started.
	NodeDrainStartTime *metav1.Time `json:"nodeDrainStartTime,omitempty"`
Do we also need a finish time? Same for the other field?
I personally don't think that the drain or detach finish time is important info to have in the API (users & SREs mostly care about what is going on now and eventually why it is stuck; they rarely care about what happened afterwards, and for that the logs are more exhaustive).
But no strong opinion.
If I know that these have started, how do I know that they have finished if I don't have some field to tell me? 🤔 I realise eventually the machine is going away, but what if it gets stuck terminating the instance? Will that show up somewhere, and will I know that drain and volume detach are done?
> how do I know that they have finished if I don't have some field to tell me? 🤔

From the controller's perspective we (at least currently) do not care:
- either the controller tries to drain again, which should be a no-op (happy path)
- or the drain is skipped because the timeout is reached.
From the user's perspective: the information about where the deletion is at should be part of the Deleting condition, I'd say.
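In controller terms that could look roughly like the fragment below (drainNode and the requeue interval are hypothetical names for illustration; nodeDrainTimeoutExceeded is the check sketched earlier):

```go
// Sketch only: either skip the drain because the timeout elapsed, or retry it;
// draining an already-drained node is a no-op, so retrying is safe.
if nodeDrainTimeoutExceeded(machine) {
	// Timeout reached: give up on draining and continue with the deletion flow.
} else if err := r.drainNode(ctx, machine); err != nil {
	// Drain not finished yet: requeue and try again on the next reconcile.
	return ctrl.Result{RequeueAfter: 20 * time.Second}, nil
}
```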
I think in general the new Deleting condition should make clear at which phase of the deletion workflow we are (including making clear which parts are already completed)
api/v1beta1/machine_types.go
Outdated
}

// ANCHOR_END: MachineStatus

// MachineStatusDeletion is the deletion state of the Machine.
type MachineStatusDeletion struct {
	// NodeDrainStartTime is the time when the drain of the node started.
I feel like we can probably add some more context to this: what does it mean when it's not present, for example? What does it mean if it has elapsed for some period and the Machine is still here? What hints can we give to end users?
Tried to add some more context on both keys.
api/v1beta1/machine_types.go
Outdated
}

// ANCHOR_END: MachineStatus

// MachineStatusDeletion is the deletion state of the Machine.
type MachineStatusDeletion struct {
Should we include information here related to the drain, such as the configuration for the timeout?
Since the drain fields are optional in the spec, it would be good perhaps to show the configured values here so that you can correlate between start time and expected end time just by looking at the status?
This is an interesting idea; I'm not really sure how we can represent that we are waiting forever in a clear way (without showing timeout 0).
A value of -1 potentially? But you're right, timeout 0 is awkward 🤔
I think the status that it is still waiting should then be part of the Deleting condition message? 🤔 (or, as long as that one is not around, the DrainSucceeded condition message).
Yep, having this in the conditions seems the easiest way to address this
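As a rough illustration of that direction (the message wording and the helper are made up for the example, not what this PR implements):

```go
import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// drainWaitMessage is a sketch only: it builds a condition message that surfaces
// how long we have been draining and what timeout (if any) is configured.
func drainWaitMessage(startTime *metav1.Time, timeout *metav1.Duration) string {
	if startTime == nil {
		return "Waiting for node drain to start"
	}
	waited := time.Since(startTime.Time).Round(time.Second)
	if timeout == nil || timeout.Duration == 0 {
		return fmt.Sprintf("Draining the node (since %s, no timeout configured)", waited)
	}
	return fmt.Sprintf("Draining the node (since %s, timeout %s)", waited, timeout.Duration)
}
```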
First of all, thanks for the great review!
Yes, that's the current behavior in the utils, but it will be fixed with the v1beta2 conditions.
That's true, but from the proposal: the target conditions for the final state of the Machine do not contain them. The information about the deletion status they showed should be moved / made available at the Deleted condition instead.
api/v1beta1/machine_types.go
Outdated
type MachineStatusDeletion struct {
	// NodeDrainStartTime is the time when the drain of the node started.
	// Only present when the Machine has a deletionTimestamp, is being removed from the cluster
	// and draining the node had been started.
Why would drain not have been started when there's a deletion timestamp?
Waiting for pre-drain hooks is one thing.
Note: the best pointer for an overview of the deletion process may be this: https://main.cluster-api.sigs.k8s.io/tasks/automated-machine-management/machine_deletions
I was thinking it might be useful to point a user to that, or at least give them a hint as to why node draining might not have started. Can we include a line?
Another one may be that the node is excluded from the drain entirely (there is an annotation for it).
I shortened it to:
// NodeDrainStartTime is the time when the drain of the node started and is used to determine
// if the NodeDrainTimeout is exceeded.
// Only present when the Machine has a deletionTimestamp and draining the node had been started.
Happy to get other suggestions though :-)
One way would be to say:
// NodeDrainStartTime is the time when the drain of the node started and is used to determine
// if the NodeDrainTimeout is exceeded.
// Only present when the Machine has a deletionTimestamp and draining the node had been started.
// NodeDrainStartTime may not be set because the deletion is blocked by a pre-drain hook or draining is skipped for the machine.
But doing the same for WaitForNodeVolumeDetachStartTime would get very verbose 🤔:
// WaitForNodeVolumeDetachStartTime is the time when waiting for volume detachment started
// and is used to determine if the NodeVolumeDetachTimeout is exceeded.
// Detaching volumes from nodes is usually done by CSI implementations and the current state
// is observed from the node's `.Status.VolumesAttached` field.
// Only present when the Machine has a deletionTimestamp and waiting for volume detachments had been started.
// WaitForNodeVolumeDetachStartTime may not be set because the deletion is blocked by a pre-drain hook, stuck in drain or waiting for volume detachment is skipped for the machine.
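Putting the snippets and proposed comments from this thread together, the struct under discussion would look roughly like this (assembled here for readability, not the merged code):

```go
// MachineStatusDeletion is the deletion state of the Machine.
type MachineStatusDeletion struct {
	// NodeDrainStartTime is the time when the drain of the node started and is used to determine
	// if the NodeDrainTimeout is exceeded.
	// Only present when the Machine has a deletionTimestamp and draining the node had been started.
	// +optional
	NodeDrainStartTime *metav1.Time `json:"nodeDrainStartTime,omitempty"`

	// WaitForNodeVolumeDetachStartTime is the time when waiting for volume detachment started
	// and is used to determine if the NodeVolumeDetachTimeout is exceeded.
	// Detaching volumes from nodes is usually done by CSI implementations and the current state
	// is observed from the node's `.Status.VolumesAttached` field.
	// Only present when the Machine has a deletionTimestamp and waiting for volume detachments had been started.
	// +optional
	WaitForNodeVolumeDetachStartTime *metav1.Time `json:"waitForNodeVolumeDetachStartTime,omitempty"`
}
```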
Last nit from my side, otherwise lgtm
/lgtm
LGTM label has been added. Git tree hash: 2d7929ac4d700f1cf20b236e8e815c0509393408
@JoelSpeed @fabriziopandini Fine to merge from your side as well?
Looks good to me!
/lgtm
	// +optional
	NodeDrainStartTime *metav1.Time `json:"nodeDrainStartTime,omitempty"`

	// waitForNodeVolumeDetachStartTime is the time when waiting for volume detachment started
// waitForNodeVolumeDetachStartTime is the time when waiting for volume detachment started
// WaitForNodeVolumeDetachStartTime is the time when waiting for volume detachment started |
See #11166 (comment)
xref: #11238
/lgtm
Thx everyone! /approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: sbueringer. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
…in and volumeDetach instead of using the condition (kubernetes-sigs#11166)
* machine: Introduce Deletion status field and add timestamps for drain and volumeDetach instead of using the condition
* fix tests
* make generate
* review fixes
* fix openapi gen
* review fixes
* fix
What this PR does / why we need it:
Introduces Machine.status.deletion with two new fields:
- nodeDrainStartTime
- nodeVolumeDetachStartTime
Both are used to track the starting time and to check whether we reached the timeouts, instead of relying on the clusterv1.Condition's lastTransitionTime.
This is to get rid of the issue that the lastTransitionTime gets updated due to changes to the condition, and to prepare for the changes for v1beta2, ~~where conditions don't have a lastTransitionTime anymore~~ which drop the separate conditions for drain and wait for detach.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #11126
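To make the start-time semantics concrete, a sketch (assuming the field and type names above) of how the controller might record the value exactly once:

```go
// Sketch only: set NodeDrainStartTime the first time we begin draining, so later
// reconciles measure the timeout from the first attempt rather than from a
// condition's lastTransitionTime.
if machine.Status.Deletion == nil {
	machine.Status.Deletion = &clusterv1.MachineStatusDeletion{}
}
if machine.Status.Deletion.NodeDrainStartTime == nil {
	now := metav1.Now()
	machine.Status.Deletion.NodeDrainStartTime = &now
}
```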
/area machine