Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ignoring terminated pods in scaledown #3545

Conversation

dbenque
Copy link
Contributor

@dbenque dbenque commented Sep 28, 2020

Fix proposal for #3407

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 28, 2020
@k8s-ci-robot
Copy link
Contributor

Welcome @dbenque!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 28, 2020
Copy link
Contributor

@elmiko elmiko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this mostly makes sense to me. i think my big concern is that we would mark a node for deletion while a pod was in the middle of terminating, but it seems like the deletion timestamp logic should help us prevent that.

it seems like there are no tests in the simulator/drain_test that enable this flag, is it possible to test there as well? (or did i just miss it)

@@ -414,7 +414,7 @@ func (sd *ScaleDown) checkNodeUtilization(timestamp time.Time, node *apiv1.Node,
return simulator.ScaleDownDisabledAnnotation, nil
}

utilInfo, err := simulator.CalculateUtilization(node, nodeInfo, sd.context.IgnoreDaemonSetsUtilization, sd.context.IgnoreMirrorPodsUtilization, sd.context.CloudProvider.GPULabel())
utilInfo, err := simulator.CalculateUtilization(node, nodeInfo, sd.context.IgnoreDaemonSetsUtilization, sd.context.IgnoreMirrorPodsUtilization, sd.context.IgnorePodsThatShouldBeTerminated, sd.context.CloudProvider.GPULabel())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i do wonder if we should convert CalculateUtilization to take a context instead of continuing to grow the argument list?

this is not a blocker for me, just curious what others think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is something I was thinking about also, but I did not want to bring to many change in that PR.
Depending on what others think, I can do the modification in that PR, or refactor this in a follow-up PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think your instinct not to introduce too many changes is a good one. it just struck me as odd while reading the code.

cluster-autoscaler/main.go Outdated Show resolved Hide resolved
@dbenque
Copy link
Contributor Author

dbenque commented Oct 13, 2020

this mostly makes sense to me. i think my big concern is that we would mark a node for deletion while a pod was in the middle of terminating, but it seems like the deletion timestamp logic should help us prevent that.

@elmiko , yes the logic is preventing this because the pod should be terminating AND it must have gone over its TerminationGracePeriod AND there is yet another static time buffer of 30s that was added. In other words Terminating related actions have been given enough time. If it was a Ready node (Kubelet not dead) the Kubelet would kill that pod.

@dbenque
Copy link
Contributor Author

dbenque commented Oct 15, 2020

@elmiko , thanks for your review, here are the changes 84a2f9f

@dbenque dbenque requested a review from elmiko October 15, 2020 12:45
@dbenque dbenque force-pushed the david.benque/ignore-terminated-pod-in-scaledown branch from 84a2f9f to facb95a Compare October 15, 2020 13:08
Copy link
Collaborator

@towca towca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thanks for the contribution! I have some comments, but nothing major, the overall approach LGTM.

cluster-autoscaler/main.go Outdated Show resolved Hide resolved
cluster-autoscaler/utils/drain/drain.go Outdated Show resolved Hide resolved
cluster-autoscaler/utils/drain/drain.go Outdated Show resolved Hide resolved
cluster-autoscaler/utils/drain/drain.go Outdated Show resolved Hide resolved
cluster-autoscaler/utils/drain/drain_test.go Outdated Show resolved Hide resolved
cluster-autoscaler/utils/drain/drain.go Outdated Show resolved Hide resolved
cluster-autoscaler/utils/drain/drain.go Outdated Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Nov 26, 2020
@dbenque dbenque force-pushed the david.benque/ignore-terminated-pod-in-scaledown branch 2 times, most recently from f2a16ed to 9b53c40 Compare November 26, 2020 20:06
@dbenque
Copy link
Contributor Author

dbenque commented Nov 26, 2020

@towca thanks a lot for your review.
I have amended the PR to take into account all your comments, in the end without the new parameter the diff is smaller.

@dbenque dbenque requested a review from towca November 26, 2020 20:09
@dbenque dbenque force-pushed the david.benque/ignore-terminated-pod-in-scaledown branch from 9b53c40 to 2c3c84c Compare December 15, 2020 16:50
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Dec 15, 2020
Copy link
Collaborator

@towca towca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing my comments and sorry for the long review delay. Comments mostly test-related, sorry for the amount but I didn't focus on tests in the first pass.

cluster-autoscaler/simulator/cluster_test.go Outdated Show resolved Hide resolved
cluster-autoscaler/utils/drain/drain_test.go Outdated Show resolved Hide resolved
cluster-autoscaler/simulator/cluster_test.go Outdated Show resolved Hide resolved
cluster-autoscaler/simulator/drain_test.go Outdated Show resolved Hide resolved
cluster-autoscaler/simulator/drain_test.go Outdated Show resolved Hide resolved
cluster-autoscaler/utils/drain/drain_test.go Outdated Show resolved Hide resolved
cluster-autoscaler/utils/drain/drain_test.go Outdated Show resolved Hide resolved
cluster-autoscaler/utils/drain/drain_test.go Outdated Show resolved Hide resolved
cluster-autoscaler/utils/drain/drain_test.go Outdated Show resolved Hide resolved
cluster-autoscaler/utils/drain/drain_test.go Outdated Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 18, 2020
@dbenque dbenque force-pushed the david.benque/ignore-terminated-pod-in-scaledown branch from 4a2008d to 8eff34d Compare December 18, 2020 19:38
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 18, 2020
@dbenque dbenque requested a review from towca December 18, 2020 19:48
@dbenque
Copy link
Contributor Author

dbenque commented Dec 18, 2020

@towca thanks for your second review.
I have integrated all your comments. To have "timestamp" passed everywhere I had to modify couple of calling functions and now it is more homogeneous regarding how we deal with time.

Copy link
Collaborator

@towca towca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for this contribution and for taking the time to improve some unrelated tests!

You have some gofmt errors that you need to resolve (see Travis). Could you also squash the commits into 1 while you're at it?

@dbenque dbenque force-pushed the david.benque/ignore-terminated-pod-in-scaledown branch from 8eff34d to 2529b66 Compare December 21, 2020 15:23
@dbenque
Copy link
Contributor Author

dbenque commented Dec 21, 2020

@towca Thanks for yours reviews. I have fixed the small gofmt problem, and squashed the commits.

@towca
Copy link
Collaborator

towca commented Dec 21, 2020

Thanks!
/lgtm
/assign @MaciekPytel for approval

@k8s-ci-robot
Copy link
Contributor

@towca: GitHub didn't allow me to assign the following users: for, approval.

Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

Thanks!
/lgtm
/assign @MaciekPytel for approval

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 21, 2020
@vivekbagade
Copy link
Contributor

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dbenque, towca, vivekbagade

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 21, 2020
@k8s-ci-robot k8s-ci-robot merged commit c94ade5 into kubernetes:master Dec 21, 2020
@dbenque dbenque deleted the david.benque/ignore-terminated-pod-in-scaledown branch December 22, 2020 09:21
k8s-ci-robot added a commit that referenced this pull request May 17, 2021
[cluster-autoscaler] Backporting #3545 to release 1.18
k8s-ci-robot added a commit that referenced this pull request May 17, 2021
[cluster-autoscaler] Backporting #3545 to release 1.19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants