cluster-autoscaler does not respect CriticalAddonsOnly taint which is the only taint available to system nodes #2513

Closed
fullykubed opened this issue Aug 26, 2021 · 7 comments
@fullykubed

What happened:

On AKS, system nodes only have the option to add the CriticalAddonsOnly node taint. The cluster autoscaler ignores the CriticalAddonsOnly taint in its scheduling computations, which results in undefined and undesired behavior.
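For context, the taint in question is CriticalAddonsOnly=true:NoSchedule, and workloads intended for system nodes carry a matching toleration. A minimal sketch (the pod below is illustrative, not taken from our cluster):

```sh
# Illustrative only: a pod with the toleration that matches the
# CriticalAddonsOnly=true:NoSchedule taint AKS applies to system node pools.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: critical-addon-example   # hypothetical name
  namespace: kube-system
spec:
  tolerations:
  - key: CriticalAddonsOnly      # matches the AKS system pool taint key
    operator: Exists
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
EOF
```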

What you expected to happen:

The cluster autoscaler properly respects the CriticalAddonsOnly taint on system nodes during its scheduling calculations, OR the AKS team allows other taints to be added to system nodes.

How to reproduce it (as minimally and precisely as possible):

Run a cluster whose system nodes carry the CriticalAddonsOnly node taint and install the cluster autoscaler, as sketched below.
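A sketch of the setup (resource and pool names are placeholders):

```sh
# Placeholder resource names throughout. Add a system node pool carrying the
# only taint AKS allows on system pools.
az aks nodepool add \
  --resource-group my-rg \
  --cluster-name my-cluster \
  --name sysnp \
  --mode System \
  --node-taints CriticalAddonsOnly=true:NoSchedule

# Then deploy the open-source cluster-autoscaler the usual way (manifest name
# hypothetical) and observe it treating the tainted system nodes as
# schedulable for ordinary pods.
kubectl apply -f cluster-autoscaler-azure.yaml
```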

Anything else we need to know?:

See this issue in the cluster autoscaler repository: kubernetes/autoscaler#4097. There is some commentary there that AKS's use of taints is incorrect. I think your implementation is within spec, but it is worth this team weighing in, because until alignment is reached between the AKS team and the cluster autoscaler maintainers, the cluster autoscaler has undefined and undesired behavior in the common scenario described above.

Environment:

  • Kubernetes version (use kubectl version): 1.20
  • Cluster Autoscaler: 1.20
@ghost ghost added the triage label Aug 26, 2021
@ghost

ghost commented Aug 26, 2021

Hi jclangst, AKS bot here 👋
Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

  1. If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
  2. Please abide by the AKS repo Guidelines and Code of Conduct.
  3. If you're having an issue, could it be described in the AKS Troubleshooting guides or AKS Diagnostics?
  4. Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
  5. Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
  6. If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

@marwanad

@jclangst are you using the AKS managed autoscaler? There's a change that went in a while ago to drop that one from the sanitized list of taints. Happy to look into it further.
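One way to check whether the managed autoscaler is enabled on a pool (resource names are placeholders):

```sh
# Placeholder names; lists each node pool and whether the AKS managed
# cluster autoscaler is enabled for it.
az aks show \
  --resource-group my-rg \
  --name my-cluster \
  --query "agentPoolProfiles[].{pool:name, managedAutoscaler:enableAutoScaling}" \
  --output table
```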

@fullykubed
Author

@marwanad Good question. No, we are using 1.20.0 of the cluster-autoscaler, as we unfortunately have requirements that the built-in Azure autoscaler cannot yet support.

@marwanad

@jclangst gotcha - let me see if I can make the change upstream. If anything, the only feasible way of doing it would be via a flag to disable that sanitization.
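For illustration, such an opt-out might look like this on the cluster-autoscaler command line (the flag name is hypothetical; the real design is being discussed upstream):

```sh
# Hypothetical flag, for illustration only (see kubernetes/autoscaler#4097):
# tell the autoscaler not to strip this taint from its simulated node
# templates when making scale decisions.
cluster-autoscaler \
  --cloud-provider=azure \
  --ignore-taint=CriticalAddonsOnly
```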

I'm very curious - what are you missing in the managed autoscaler?

@ghost ghost removed the action-required label Aug 27, 2021
@fullykubed
Author

fullykubed commented Aug 28, 2021

@marwanad A few stakeholders have worked together to generate a proposed solution which sounds aligned with your proposal for a new flag in upstream: kubernetes/autoscaler#4097.

As for the missing features, I don't have the complete list, but I know that the "managed autoscaler" was making it hard to uncover the root cause of issues like this, because we didn't have complete visibility into how Azure was configuring it, so we needed to remove that layer of abstraction to make progress: kubernetes/autoscaler#4099. In other words, we had to introduce custom patches to the autoscaler until there are upstream fixes.

@marwanad

@jclangst fair enough - we're usually quick about patching and back-porting fixes for those upstream issues, but I totally understand if you're in a happy place with the unmanaged version.

The only downside is certain AKS-specific fixes/improvements that are a bit of a hassle to get upstream (for example, the use of the CriticalAddonsOnly taint differs from other providers). I'll probably resolve this issue (since it relates to the upstream CA) and will try to work on an upstream change to accommodate that, or, if you're open to it, feel free to PR it there.

What are you lacking in terms of visibility for the managed autoscaler? Are the logs not sufficient? You should be able to see most configs there.

@fullykubed
Author

@marwanad Yes, I think that you can close this issue. I wanted to bring visibility to the AKS team in case you had an opinion different from mine, which is that a fix is needed in the upstream CA (see the originally linked issue).

For visibility, the inability to modify the code was the biggest limitation, given all of the "interesting" behavior that we are uncovering, especially in re-using configuration across clouds; the plug-and-play nature of the CA cloud plugins hasn't quite lived up to its promise.

@ghost ghost locked as resolved and limited conversation to collaborators Sep 30, 2021