✨ Add the ability to specify a drain timeout for machines #3662
Conversation
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Welcome @namnx228!
Hi @namnx228. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Signed CLA
/assign @neolit123
/ok-to-test
Overall this looks good, thanks for working on it!
We'll need to add more tests and integration tests in the Machine controller, especially around the logic that checks if the node drain is still allowed or not
api/v1alpha3/machine_types.go (outdated)

@@ -89,6 +89,11 @@ type MachineSpec struct {
 	// Must match a key in the FailureDomains map stored on the cluster object.
 	// +optional
 	FailureDomain *string `json:"failureDomain,omitempty"`

+	// NodeDrainTimeout is the total amount of time for draining a worker node
+	// Note that this NodeDrainTimeout is different from `kubectl drain --timeout`
Suggested change:
-// Note that this NodeDrainTimeout is different from `kubectl drain --timeout`
+//
+// NOTE: NodeDrainTimeout is different from `kubectl drain --timeout`
Let's also add some color on how it's different?
Should we have a default here?
controllers/machine_controller.go (outdated)

@@ -299,14 +299,18 @@ func (r *MachineReconciler) reconcileDelete(ctx context.Context, cluster *cluste
 	conditions.MarkTrue(m, clusterv1.PreDrainDeleteHookSucceededCondition)

 	// Drain node before deletion and issue a patch in order to make this operation visible to the users.
-	if _, exists := m.ObjectMeta.Annotations[clusterv1.ExcludeNodeDrainingAnnotation]; !exists {
+	if _, exists := m.ObjectMeta.Annotations[clusterv1.ExcludeNodeDrainingAnnotation]; !exists && !isNodeDraintimeoutOver(m) {
Instead of chaining if conditions here, let's add a new if condition after entering this block that checks the timeout
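A minimal fragment sketch of the structure being suggested here, reusing the names from the diff above (illustrative only, not the code that was eventually merged):

```go
// Keep the annotation check as the outer gate, then test the drain timeout
// separately once inside the block.
if _, exists := m.ObjectMeta.Annotations[clusterv1.ExcludeNodeDrainingAnnotation]; !exists {
	if !isNodeDraintimeoutOver(m) {
		// ... drain the node and patch the Machine so the operation is visible to users ...
	}
}
```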
controllers/machine_controller.go (outdated)

func isNodeDraintimeoutOver(machine *clusterv1.Machine) bool {
	// if the start draining condition does not exist
	if conditions.Get(machine, clusterv1.DrainingSucceededCondition) == nil {
		return false
	}
	// if the NodeDrainTineout type is not set by user
	if machine.Spec.NodeDrainTimeout <= 0 {
		return false
	}
	now := time.Now()
	firstTimeDrain := conditions.GetLastTransitionTime(machine, clusterv1.DrainingSucceededCondition)
	diff := now.Sub(firstTimeDrain.Time)
	return diff.Seconds() >= float64(machine.Spec.NodeDrainTimeout)
}
Make this a method? It seems everything else is on the MachineReconciler object.
controllers/machine_controller.go (outdated)

@@ -363,6 +367,21 @@ func (r *MachineReconciler) reconcileDelete(ctx context.Context, cluster *cluste
 	return ctrl.Result{}, nil
 }

+func isNodeDraintimeoutOver(machine *clusterv1.Machine) bool {
Suggested change:
-func isNodeDraintimeoutOver(machine *clusterv1.Machine) bool {
+func isNodeDrainAllowed(machine *clusterv1.Machine) bool {
To make it a little bit more generic, in case in the future we might want to check more than just a timeout. We can also add the annotation check, `if _, exists := m.ObjectMeta.Annotations[clusterv1.ExcludeNodeDrainingAnnotation]`, to it.
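For illustration, a minimal sketch of what such a combined gate could look like, assuming the timeout check becomes a reconciler method named `isNodeDrainTimeoutOver` as it does in the later revisions below (a sketch, not a quote of the merged code):

```go
// isNodeDrainAllowed combines the exclude annotation and the drain-timeout
// check, so reconcileDelete only needs a single call to decide whether to drain.
func (r *MachineReconciler) isNodeDrainAllowed(m *clusterv1.Machine) bool {
	// Skip draining entirely if the user opted out via the annotation.
	if _, exists := m.ObjectMeta.Annotations[clusterv1.ExcludeNodeDrainingAnnotation]; exists {
		return false
	}
	// Stop draining once the configured timeout has been exceeded.
	if r.isNodeDrainTimeoutOver(m) {
		return false
	}
	return true
}
```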
/milestone v0.3.10
Thanks for your review, @vincepri. I will push a fix that responds to your comments soon.
(force-pushed from dc94fbd to eb6371f)
Same PR. Usually changes should be tested before merging them; it has happened in the past that we've merged code without tests and forgot to follow up, and since then we've decided that PRs should come with tests + docs whenever possible.
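For context, a minimal table-driven sketch of the kind of unit test being asked for here; the test name, the cases, and the zero-value `MachineReconciler` are illustrative assumptions (the usual `testing`, `metav1`, and `clusterv1` imports of the controllers package are also assumed), not the tests that were actually added:

```go
func TestIsNodeDrainAllowed(t *testing.T) {
	r := &MachineReconciler{}
	tests := []struct {
		name    string
		machine *clusterv1.Machine
		want    bool
	}{
		{
			name: "drain excluded via annotation",
			machine: &clusterv1.Machine{
				ObjectMeta: metav1.ObjectMeta{
					Annotations: map[string]string{clusterv1.ExcludeNodeDrainingAnnotation: "true"},
				},
			},
			want: false,
		},
		{
			name:    "no annotation and no timeout set, drain allowed",
			machine: &clusterv1.Machine{},
			want:    true,
		},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			if got := r.isNodeDrainAllowed(tt.machine); got != tt.want {
				t.Errorf("isNodeDrainAllowed() = %v, want %v", got, tt.want)
			}
		})
	}
}
```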
For now, yes; if they keep failing after a number of retries we might want to look into it a bit more. We've merged a bunch of fixes for the tests, can you try rebasing?
(force-pushed from 14a1185 to b22fc9d)
(force-pushed from b22fc9d to ca202ae)
(force-pushed from 8f1cbe5 to aede8b4)
/retest
Is it an issue in CI that the […]
@namnx228 You can ignore the […]
/approve
/assign @CecileRobertMichon @ncdc
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: vincepri. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files: […]
Approvers can indicate their approval by writing `/approve` in a comment.
(force-pushed from aede8b4 to 317c806)
/test pull-cluster-api-test
3 similar comments:
/test pull-cluster-api-test
/test pull-cluster-api-test
/test pull-cluster-api-test
(force-pushed from 317c806 to 89024ba)
api/v1alpha3/machine_types.go (outdated)

	// The default value is 0, meaning that the node can be drained without any time limitations.
	// NOTE: NodeDrainTimeout is different from `kubectl drain --timeout`
	// +optional
	NodeDrainTimeout int64 `json:"nodeDrainTimeout,omitempty"`
Shouldn't this be a metav1.Duration?
Yes, it can also be `metav1.Duration`. Do we have any significant advantage if we change it from `int64` to `metav1.Duration` now, given that it can lead to quite a lot of changes to this PR?
We typically use Duration for ... durations. The significant advantage is getting this right early on instead of having to make an API change later. @vincepri @CecileRobertMichon @detiber WDYT?
Yes please, that's a good point. We should try to be consistent with the rest of the codebase
I see. Fixed!
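For illustration, the field would then look roughly like the sketch below; whether it ends up as a pointer, and the exact doc comment, are assumptions here rather than something settled in this thread:

```go
// NodeDrainTimeout is the total amount of time that the controller will
// spend on draining a node.
// The default value is 0, meaning that the node can be drained without any time limitations.
// NOTE: NodeDrainTimeout is different from `kubectl drain --timeout`
// +optional
NodeDrainTimeout *metav1.Duration `json:"nodeDrainTimeout,omitempty"`
```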
controllers/machine_controller.go (outdated)

@@ -299,14 +299,20 @@ func (r *MachineReconciler) reconcileDelete(ctx context.Context, cluster *cluste
 	conditions.MarkTrue(m, clusterv1.PreDrainDeleteHookSucceededCondition)

 	// Drain node before deletion and issue a patch in order to make this operation visible to the users.
-	if _, exists := m.ObjectMeta.Annotations[clusterv1.ExcludeNodeDrainingAnnotation]; !exists {
+	if drainAllowed := r.isNodeDrainAllowed(m); drainAllowed {
Suggested change:
-if drainAllowed := r.isNodeDrainAllowed(m); drainAllowed {
+if r.isNodeDrainAllowed(m) {
Fixed!
controllers/machine_controller.go (outdated)

		return false
	}

	if timeout := r.isNodeDrainTimeoutOver(m); timeout {
Suggested change:
-if timeout := r.isNodeDrainTimeoutOver(m); timeout {
+if r.isNodeDrainTimeoutOver(m) {
Fixed! Thanks
controllers/machine_controller.go (outdated)

}

func (r *MachineReconciler) isNodeDrainTimeoutOver(machine *clusterv1.Machine) bool {
I'd probably call this something like `nodeDrainTimeoutExceeded` - @vincepri?
SGTM, was struggling to find a better name the other day
Changed! Thanks for the suggestion
controllers/machine_controller.go
Outdated
// if the NodeDrainTineout type is not set by user | ||
if machine.Spec.NodeDrainTimeout <= 0 { | ||
return false | ||
} |
I think this check is less expensive than `conditions.Get` - maybe consider making it the first check?
Fixed! Thanks
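Putting the review points together (the `metav1.Duration` type, the `nodeDrainTimeoutExceeded` name, and doing the cheapest check first), a hedged sketch of the helper might look like this; the pointer-typed field and the comments are assumptions, not a quote of the merged code:

```go
func (r *MachineReconciler) nodeDrainTimeoutExceeded(machine *clusterv1.Machine) bool {
	// Cheapest check first: an unset or zero timeout means "no time limit".
	if machine.Spec.NodeDrainTimeout == nil || machine.Spec.NodeDrainTimeout.Seconds() <= 0 {
		return false
	}

	// DrainingSucceededCondition is set to "False" when draining starts; if it
	// is missing, draining has not started yet.
	condition := conditions.Get(machine, clusterv1.DrainingSucceededCondition)
	if condition == nil {
		return false
	}

	// The condition's last transition time marks the first drain attempt.
	return time.Since(condition.LastTransitionTime.Time) >= machine.Spec.NodeDrainTimeout.Duration
}
```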
controllers/machine_controller.go (outdated)

@@ -419,7 +453,6 @@ func (r *MachineReconciler) drainNode(ctx context.Context, cluster *clusterv1.Cl
 	}
 	return errors.Errorf("unable to get node %q: %v", nodeName, err)
 }
-
Please undo
m1 := &clusterv1.Machine{
	TypeMeta: metav1.TypeMeta{
		Kind: "Machine",
	},
	ObjectMeta: metav1.ObjectMeta{
		Name:      "m1",
		Namespace: "default",
		Labels: map[string]string{
			clusterv1.ClusterLabelName: "test-cluster",
		},
	},
}
objs = append(objs, m1)
Do you need this?
Right, we don't need it. Deleted. Thanks
(force-pushed from 89024ba to e073265)
@namnx228: The following test failed, say `/retest` to rerun all failed tests: […]
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
(force-pushed from f3dfb76 to d13f118)
@ncdc all the required changes have been made. Please take a look again. Thanks!
Add an option `nodeDrainTimeout` to KCP and the MachineSpec of MachineDeployment. `nodeDrainTimeout` defines the amount of time we want to spend draining a node; the node is forcefully removed if the timeout is exceeded. Note: leaving this option unset means there is no time limit.
(force-pushed from d13f118 to d4fe42f)
/lgtm @namnx228 thanks for your patience w/ the reviews, and thanks for doing this!
Add an option `nodeDrainTimeout` to KCP and the MachineSpec of MachineDeployment. `nodeDrainTimeout` defines the amount of time we want to spend draining a node; the node is forcefully removed if the timeout is exceeded. Note: leaving this option unset means there is no time limit.

What does this PR do?
- Add `nodeDrainTimeout` to KCP and to the MachineSpec field of MachineDeployment.
- […] the `nodeDrainTimeout` option.
- `DrainingSucceededCondition` is set to "False", and its last transition time is the first time the machine is drained.

Fixes #2331