Stop (un)tainting nodes from unselected node groups. #6273
Conversation
Welcome @fische!
@fische, you have to sign the CLA before the PR can be reviewed.
To check EasyCLA: /easycla
Yup, sorry, I forgot about that.
@Shubham82 Would you mind taking a look now that the CLA has been signed, please? 🙏
/assign
@x13n Could you please have another look?
Apologies for the slow review. LGTM now. /lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: fische, x13n. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR fixes a bug where, at the start of `RunOnce` and during scale down, the code cleans DeletionCandidate taints from pretty much all nodes, even from those that are not part of the selected node groups, i.e. the ones passed through the `--nodes` flag. This is obviously undesired behaviour.

To give a bit more context on how we came across this issue: we are currently running our cluster on GKE, which comes with its own cluster autoscaler. It lacks a lot of options, so we have deployed our own CA and disabled the managed one on all node pools, as there is no way to remove it completely. This means we have two CAs running at the same time, even though one of them is not supposed to do anything. Because of the bug described above, they conflict: while one (ours) tries to scale down some nodes, the other (GKE's) removes the DeletionCandidate taint from them, because from its perspective those nodes should not be scaled down, even though their node groups have not been selected. This makes scale down very slow.
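To illustrate the intended behaviour, here is a minimal, hypothetical sketch of filtering nodes to the selected node groups before cleaning DeletionCandidate taints. It is not the actual cluster-autoscaler code: the taint key, `filterNodesForCleanup`, and the node-group lookup are assumptions made for the example.

```go
package main

import (
	"fmt"

	apiv1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// deletionCandidateTaintKey is assumed to match the taint the autoscaler puts
// on scale-down candidates.
const deletionCandidateTaintKey = "DeletionCandidateOfClusterAutoscaler"

// filterNodesForCleanup keeps only nodes whose node group was selected via
// --nodes. nodeGroupOf is a stand-in for the cloud provider lookup.
func filterNodesForCleanup(nodes []*apiv1.Node, selected map[string]bool, nodeGroupOf func(*apiv1.Node) string) []*apiv1.Node {
	var filtered []*apiv1.Node
	for _, node := range nodes {
		if selected[nodeGroupOf(node)] {
			filtered = append(filtered, node)
		}
	}
	return filtered
}

// hasDeletionCandidateTaint reports whether the node is currently marked as a
// scale-down candidate.
func hasDeletionCandidateTaint(node *apiv1.Node) bool {
	for _, taint := range node.Spec.Taints {
		if taint.Key == deletionCandidateTaintKey {
			return true
		}
	}
	return false
}

func main() {
	taint := apiv1.Taint{Key: deletionCandidateTaintKey, Effect: apiv1.TaintEffectPreferNoSchedule}
	nodes := []*apiv1.Node{
		{ObjectMeta: metav1.ObjectMeta{Name: "node-in-selected-group"}, Spec: apiv1.NodeSpec{Taints: []apiv1.Taint{taint}}},
		{ObjectMeta: metav1.ObjectMeta{Name: "node-in-unselected-group"}, Spec: apiv1.NodeSpec{Taints: []apiv1.Taint{taint}}},
	}

	selected := map[string]bool{"my-node-group": true}
	nodeGroupOf := func(n *apiv1.Node) string {
		if n.Name == "node-in-selected-group" {
			return "my-node-group"
		}
		return "gke-managed-group"
	}

	// Only nodes from selected node groups are ever considered for taint
	// cleanup; the node from the unselected (e.g. GKE-managed) group is left
	// untouched.
	for _, node := range filterNodesForCleanup(nodes, selected, nodeGroupOf) {
		if hasDeletionCandidateTaint(node) {
			fmt.Printf("would clean DeletionCandidate taint from %s\n", node.Name)
		}
	}
}
```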
To give you a better idea, with the test case I've added, it fails with the following on master:
Which issue(s) this PR fixes:
I haven't raised any issue for this. Shall I?
Special notes for your reviewer:
I'm just not sure what we should do if there's an error filtering the nodes during scale down. Shall we continue? I've added a comment in the code in case you don't see what I'm talking about.
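For context, the open question is roughly the following (a hypothetical, self-contained sketch, not the actual autoscaler code; `filterCandidates` and the continue-vs-abort choice are assumptions made for illustration):

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// filterCandidates stands in for the filtering step added in this PR: it keeps
// only scale-down candidates from selected node groups and may fail, e.g. if a
// node's node group cannot be looked up. Purely illustrative.
func filterCandidates(candidates []string) ([]string, error) {
	if len(candidates) == 0 {
		return nil, errors.New("no candidates to filter")
	}
	return candidates[:1], nil
}

func main() {
	candidates := []string{"node-a", "node-b"}

	filtered, err := filterCandidates(candidates)
	if err != nil {
		// The open question: abort this scale-down iteration here, or log the
		// error and continue with the unfiltered candidate list?
		log.Printf("filtering scale-down candidates failed: %v", err)
		filtered = candidates
	}
	fmt.Println("candidates considered for scale down:", filtered)
}
```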
Does this PR introduce a user-facing change?
This does somewhat introduce a user-facing change, yes, so I've added a release note, even though it is behaviour that users should not rely on.
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: