Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop (un)tainting nodes from unselected node groups. #6273

Merged
merged 11 commits into from
Feb 6, 2024

Conversation

fische
Copy link
Contributor

@fische fische commented Nov 13, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR fixes a bug where at the start of RunOnce and during scale down, the code cleans DeletionCandidate taints from pretty much all nodes, even from those that are not part of the selected node groups. By selected node groups, I mean the ones passed through the --nodes flag. This is obviously undesired behaviour but I'd like to give a bit more context around how we came across this issue:
We are currently running our cluster on GKE which comes with its own cluster autoscaler. It lacks loads of options, so we have deployed our own CA and disabled the managed one on all node pools, as there is no way to completely remove it. This means we have 2 CAs running at the same time, even though one is not supposed to do anything. Given the bug I described above, they both are conflicting because while one (ours) tries to scale down some nodes, the other (GKE's) will remove the DeletionCandidate taint from those as from its perspective those nodes should not be scaled down even though their node groups have not been selected. This makes scale down very slow.

To give you a better idea, with the test case I've added, it fails with the following on master:

                Error Trace:    /home/fische/local/autoscaler/cluster-autoscaler/core/static_autoscaler_test.go:372
                Error:          Not equal:
                                expected: []v1.Taint{v1.Taint{Key:"DeletionCandidateOfClusterAutoscaler", Value:"1699914797", Effect:"PreferNoSchedule", TimeAdded:<nil>}}
                                actual  : []v1.Taint{}

                                Diff:
                                --- Expected
                                +++ Actual
                                @@ -1,8 +1,2 @@
                                -([]v1.Taint) (len=1) {
                                - (v1.Taint) {
                                -  Key: (string) (len=36) "DeletionCandidateOfClusterAutoscaler",
                                -  Value: (string) (len=10) "1699914797",
                                -  Effect: (v1.TaintEffect) (len=16) "PreferNoSchedule",
                                -  TimeAdded: (*v1.Time)(<nil>)
                                - }
                                +([]v1.Taint) {
                                 }
                Test:           TestStaticAutoscalerRunOnce

Which issue(s) this PR fixes:

I haven't raised any issue for this. Shall I?

Special notes for your reviewer:

I'm just not sure what we should do if there's an error filtering the nodes during scale down. Shall we continue? I've added a comment in the code in case you don't see what I'm talking about.

Does this PR introduce a user-facing change?

This does somewhat introduce a user-facing change, yes, so I've added a release note, even though it is behaviour that users should not rely on.

Fixed undesired behaviour at the beginning of autoscaler main logic and during scale down, where it would clean taints from nodes that are not part of selected node groups (--nodes flag).

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Nov 13, 2023
@k8s-ci-robot
Copy link
Contributor

Welcome @fische!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

Copy link

linux-foundation-easycla bot commented Nov 13, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 13, 2023
@k8s-ci-robot k8s-ci-robot requested a review from x13n November 13, 2023 23:17
@Shubham82
Copy link
Contributor

@fische, you have to sign the CLA before the PR can be reviewed.
See the following document to sign the CLA: Signing Contributor License Agreements(CLA)

@Shubham82
Copy link
Contributor

To check EasyCLA

/easycla

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Nov 14, 2023
@fische
Copy link
Contributor Author

fische commented Nov 14, 2023

@fische, you have to sign the CLA before the PR can be reviewed. See the following document to sign the CLA: Signing Contributor License Agreements(CLA)

Yup, sorry, I forgot about that.

@fische
Copy link
Contributor Author

fische commented Nov 21, 2023

@Shubham82 Would you mind taking a look now that the CLA has been signed please? 🙏

cluster-autoscaler/core/static_autoscaler.go Outdated Show resolved Hide resolved
cluster-autoscaler/core/static_autoscaler_test.go Outdated Show resolved Hide resolved
cluster-autoscaler/core/static_autoscaler_test.go Outdated Show resolved Hide resolved
cluster-autoscaler/core/static_autoscaler.go Outdated Show resolved Hide resolved
cluster-autoscaler/core/static_autoscaler.go Outdated Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 12, 2023
@fische fische requested a review from BigDarkClown December 12, 2023 20:04
@x13n
Copy link
Member

x13n commented Dec 28, 2023

/assign

cluster-autoscaler/core/static_autoscaler.go Outdated Show resolved Hide resolved
cluster-autoscaler/core/static_autoscaler_test.go Outdated Show resolved Hide resolved
@fische fische requested a review from x13n January 22, 2024 11:27
@fische
Copy link
Contributor Author

fische commented Feb 2, 2024

@x13n Could you please have another look?

@x13n
Copy link
Member

x13n commented Feb 6, 2024

Apologies for slow review. LGTM now.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 6, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fische, x13n

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 6, 2024
@k8s-ci-robot k8s-ci-robot merged commit 3802594 into kubernetes:master Feb 6, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants