Stop (un)tainting nodes from unselected node groups. #6273
Conversation
Welcome @fische!
@fische, you have to sign the CLA before the PR can be reviewed.
To check EasyCLA: /easycla
Yup, sorry, I forgot about that.
@Shubham82 Would you mind taking a look now that the CLA has been signed, please? 🙏
/assign
@x13n Could you please have another look?
Apologies for the slow review. LGTM now. /lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: fische, x13n. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR fixes a bug where, at the start of `RunOnce` and during scale down, the code cleans DeletionCandidate taints from pretty much all nodes, even from those that are not part of the selected node groups, i.e. the ones passed through the `--nodes` flag. This is obviously undesired behaviour.

To give a bit more context on how we came across this issue: we are currently running our cluster on GKE, which comes with its own cluster autoscaler. It lacks a lot of options, so we have deployed our own CA and disabled the managed one on all node pools, as there is no way to remove it completely. This means we have two CAs running at the same time, even though one of them is not supposed to do anything. Because of the bug described above, they conflict: while one (ours) tries to scale down some nodes, the other (GKE's) removes the DeletionCandidate taint from them, because from its perspective those nodes should not be scaled down, even though their node groups have not been selected. This makes scale down very slow.
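To illustrate the intended behaviour, here is a minimal, hypothetical sketch of filtering nodes to the selected node groups before cleaning DeletionCandidate taints. It is not the actual cluster-autoscaler code: the taint key, `filterNodesForCleanup`, and the node-group lookup are assumptions made for the example.

```go
package main

import (
	"fmt"

	apiv1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// deletionCandidateTaintKey is assumed to match the taint the autoscaler puts
// on scale-down candidates.
const deletionCandidateTaintKey = "DeletionCandidateOfClusterAutoscaler"

// filterNodesForCleanup keeps only nodes whose node group was selected via
// --nodes. nodeGroupOf is a stand-in for the cloud provider lookup.
func filterNodesForCleanup(nodes []*apiv1.Node, selected map[string]bool, nodeGroupOf func(*apiv1.Node) string) []*apiv1.Node {
	var filtered []*apiv1.Node
	for _, node := range nodes {
		if selected[nodeGroupOf(node)] {
			filtered = append(filtered, node)
		}
	}
	return filtered
}

// hasDeletionCandidateTaint reports whether the node is currently marked as a
// scale-down candidate.
func hasDeletionCandidateTaint(node *apiv1.Node) bool {
	for _, taint := range node.Spec.Taints {
		if taint.Key == deletionCandidateTaintKey {
			return true
		}
	}
	return false
}

func main() {
	taint := apiv1.Taint{Key: deletionCandidateTaintKey, Effect: apiv1.TaintEffectPreferNoSchedule}
	nodes := []*apiv1.Node{
		{ObjectMeta: metav1.ObjectMeta{Name: "node-in-selected-group"}, Spec: apiv1.NodeSpec{Taints: []apiv1.Taint{taint}}},
		{ObjectMeta: metav1.ObjectMeta{Name: "node-in-unselected-group"}, Spec: apiv1.NodeSpec{Taints: []apiv1.Taint{taint}}},
	}

	selected := map[string]bool{"my-node-group": true}
	nodeGroupOf := func(n *apiv1.Node) string {
		if n.Name == "node-in-selected-group" {
			return "my-node-group"
		}
		return "gke-managed-group"
	}

	// Only nodes from selected node groups are ever considered for taint
	// cleanup; the node from the unselected (e.g. GKE-managed) group is left
	// untouched.
	for _, node := range filterNodesForCleanup(nodes, selected, nodeGroupOf) {
		if hasDeletionCandidateTaint(node) {
			fmt.Printf("would clean DeletionCandidate taint from %s\n", node.Name)
		}
	}
}
```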
To give you a better idea, with the test case I've added, it fails with the following on master:
Which issue(s) this PR fixes:
I haven't raised any issue for this. Shall I?
Special notes for your reviewer:
I'm just not sure what we should do if there's an error filtering the nodes during scale down. Shall we continue? I've added a comment in the code in case you don't see what I'm talking about.
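For context, the open question is roughly the following (a hypothetical, self-contained sketch, not the actual autoscaler code; `filterCandidates` and the continue-vs-abort choice are assumptions made for illustration):

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// filterCandidates stands in for the filtering step added in this PR: it keeps
// only scale-down candidates from selected node groups and may fail, e.g. if a
// node's node group cannot be looked up. Purely illustrative.
func filterCandidates(candidates []string) ([]string, error) {
	if len(candidates) == 0 {
		return nil, errors.New("no candidates to filter")
	}
	return candidates[:1], nil
}

func main() {
	candidates := []string{"node-a", "node-b"}

	filtered, err := filterCandidates(candidates)
	if err != nil {
		// The open question: abort this scale-down iteration here, or log the
		// error and continue with the unfiltered candidate list?
		log.Printf("filtering scale-down candidates failed: %v", err)
		filtered = candidates
	}
	fmt.Println("candidates considered for scale down:", filtered)
}
```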
Does this PR introduce a user-facing change?
This does somewhat introduce a user-facing change, yes, so I've added a release note, even though it is behaviour that users should not rely on.
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: