cluster-autoscaler does not respect CriticalAddonsOnly taint which is the only taint available to system nodes #2513

Closed
fullykubed opened this issue Aug 26, 2021 · 7 comments
@fullykubed

What happened:

On AKS, system nodes only have the option to add the CriticalAddonsOnly node taint. The cluster autoscaler ignores the CriticalAddonsOnly taint in its scheduling computations, which results in undefined and undesired behavior.
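For context, the taint in question is CriticalAddonsOnly=true:NoSchedule, and workloads intended for system nodes carry a matching toleration. A minimal sketch (the pod below is illustrative, not taken from our cluster):

```sh
# Illustrative only: a pod with the toleration that matches the
# CriticalAddonsOnly=true:NoSchedule taint AKS applies to system node pools.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: critical-addon-example   # hypothetical name
  namespace: kube-system
spec:
  tolerations:
  - key: CriticalAddonsOnly      # matches the AKS system pool taint key
    operator: Exists
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
EOF
```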

What you expected to happen:

The cluster autoscaler properly respects the CriticalAddonsOnly taint on system nodes during its scheduling calculations, OR the AKS team allows other taints to be added to system nodes.

How to reproduce it (as minimally and precisely as possible):

Run a cluster whose system nodes carry the CriticalAddonsOnly node taint and install the cluster autoscaler, as sketched below.
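A sketch of the setup (resource and pool names are placeholders):

```sh
# Placeholder resource names throughout. Add a system node pool carrying the
# only taint AKS allows on system pools.
az aks nodepool add \
  --resource-group my-rg \
  --cluster-name my-cluster \
  --name sysnp \
  --mode System \
  --node-taints CriticalAddonsOnly=true:NoSchedule

# Then deploy the open-source cluster-autoscaler the usual way (manifest name
# hypothetical) and observe it treating the tainted system nodes as
# schedulable for ordinary pods.
kubectl apply -f cluster-autoscaler-azure.yaml
```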

Anything else we need to know?:

See this issue in the cluster autoscaler repository: kubernetes/autoscaler#4097. There is some commentary there that AKS's use of taints is incorrect. I think your implementation is within spec, but it is worth this team weighing in, because until alignment is reached between the AKS team and the cluster autoscaler maintainers, the cluster autoscaler has undefined and undesired behavior in the common scenario described above.

Environment:

  • Kubernetes version (use kubectl version): 1.20
  • Cluster Autoscaler: 1.20
@ghost ghost added the triage label Aug 26, 2021
@ghost

ghost commented Aug 26, 2021

Hi jclangst, AKS bot here 👋
Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

  1. If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
  2. Please abide by the AKS repo Guidelines and Code of Conduct.
  3. If you're having an issue, could it be described in the AKS Troubleshooting guides or AKS Diagnostics?
  4. Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
  5. Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
  6. If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

@marwanad

@jclangst are you using the AKS managed autoscaler? There's a change that went in a while ago to drop that one from the sanitized list of taints. Happy to look into it further.
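One way to check whether the managed autoscaler is enabled on a pool (resource names are placeholders):

```sh
# Placeholder names; lists each node pool and whether the AKS managed
# cluster autoscaler is enabled for it.
az aks show \
  --resource-group my-rg \
  --name my-cluster \
  --query "agentPoolProfiles[].{pool:name, managedAutoscaler:enableAutoScaling}" \
  --output table
```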

@fullykubed
Author

@marwanad Good question. No, we are using 1.20.0 of the cluster-autoscaler, as we unfortunately have requirements that the built-in Azure autoscaler cannot yet support.

@marwanad

@jclangst gotcha - let me see if I can make the change upstream. If anything, the only feasible way of doing it would be via a flag to disable that sanitization.
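For illustration, such an opt-out might look like this on the cluster-autoscaler command line (the flag name is hypothetical; the real design is being discussed upstream):

```sh
# Hypothetical flag, for illustration only (see kubernetes/autoscaler#4097):
# tell the autoscaler not to strip this taint from its simulated node
# templates when making scale decisions.
cluster-autoscaler \
  --cloud-provider=azure \
  --ignore-taint=CriticalAddonsOnly
```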

I'm very curious - what are you missing in the managed autoscaler?

@ghost ghost removed the action-required label Aug 27, 2021
@fullykubed
Author

fullykubed commented Aug 28, 2021

@marwanad A few stakeholders have worked together to generate a proposed solution which sounds aligned with your proposal for a new flag in upstream: kubernetes/autoscaler#4097.

As for the missing features, I don't have the complete list, but I know that the "managed autoscaler" was making it hard to uncover the root cause of issues like this, because we didn't have complete visibility into how Azure was configuring it, so we needed to remove that layer of abstraction to make progress: kubernetes/autoscaler#4099. In other words, we had to introduce custom patches to the autoscaler until there are upstream fixes.

@marwanad

@jclangst fair enough - we're usually quick about patching and back-porting fixes for those upstream issues, but I totally understand if you're in a happy place with the unmanaged version.

The only downside is certain AKS-specific fixes/improvements that are a bit of a hassle to get upstream (for example, the use of the CriticalAddonsOnly taint differs from other providers). I'll probably resolve this issue (since it relates to the upstream CA) and will try to work on an upstream change to accommodate that, or, if you're open to it, feel free to PR it there.

What are you lacking in terms of visibility for the managed autoscaler? Are the logs not sufficient? You should be able to see most configs there.

@fullykubed
Author

@marwanad Yes, I think that you can close this issue. I wanted to bring visibility to the AKS team in case you had an opinion different from mine, which is that a fix is needed in the upstream CA (see the originally linked issue).

For visibility, the inability to modify the code was the biggest limitation, given all of the "interesting" behavior that we are uncovering, especially in re-using configuration across clouds; the plug-and-play nature of the CA cloud plugins hasn't quite lived up to its promise.

@ghost ghost locked as resolved and limited conversation to collaborators Sep 30, 2021