[cluster-autoscaler] CriticalAddonsOnly taint ignored #4097
After reading up on the use of `CriticalAddonsOnly`, this behaviour looks to be expected and desired.
I think you may have misread. The documentation you linked references a Kubernetes component called the "rescheduler" that does not exist, at least in modern Kubernetes distributions. This seems to be a result of that documentation being 5+ years out of date. To my knowledge the modern Kubernetes scheduler never ignores node taints, so it does not make sense for the autoscaler to ignore this one taint in particular. It looks to be legacy code built in a time when there were proposals to make something called the "rescheduler" a main Kubernetes component. Since that did not happen, this should be removed.
The only official documentation on critical add-ons that I could find is this: https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
Yep! There is that, but the scheduler still is not going to ignore node taints, even for critical add-on pods.
I agree with @pierluigilenoci that the only official documentation found in a Google search doesn't even mention the taint by name. There are some kubeadm docs that mention the taint and suggest how it's intended to be used. It is also present in a number of deployment patterns for core cluster components. AWS EKS uses this taint when starting new nodes, which I suspect is to allow the aws-node CNI pod to start before allowing other workloads to schedule, and then removes it after the node is ready (has a CNI configured). Azure AKS uses this taint permanently on its system node pool. I personally think that the EKS implementation is "correct", which makes the AKS implementation "incorrect"; although it would be worth getting some clarification from the core team or kubeadm team as to the intention of the `CriticalAddonsOnly` taint.
I appreciate that perspective, but there is no such thing as a "special" taint in Kubernetes. There were in the past proposals to make `CriticalAddonsOnly` special (via the rescheduler), but those were never adopted.

To me, it does not logically follow that "no current documentation for this taint exists" leads to "AKS implements this taint incorrectly" leads to "the current CA behavior is correct." While I agree that the AKS team has made some interesting decisions here, I am not sure it's fair to call their implementation incorrect. EKS does not get to dictate the Kubernetes spec, and the CA already provides flags for ignoring taints such as `--ignore-taint`.

I will continue to advocate that this legacy code path should be removed.
@jclangst I think there is some confusion around the use of this taint and its association with the "rescheduler", which I read to be an additive improvement to an existing "special" taint. I'm also not sure if the EKS implementation of `CriticalAddonsOnly` is still current.

For the specifics of the CA check on this taint, the important consideration is whether changing the code would change the behaviour. I suspect it would for any K8s cluster where the taints are calculated directly from the nodes and the cluster is using a temporary `CriticalAddonsOnly` taint.
@stevehipwell You are making the claim that there is an "existing special taint." Kubernetes has specific documentation for all of the common taints, which you can find here; the taint is not listed in the spec. I have also done the due diligence of searching the source code for ALL of the core Kubernetes components (including the scheduler) and found no references to this taint.

That said, I do appreciate the effort you have put into searching for the deployment patterns. This is a holdover from old (5+ years) versions of Kubernetes that DID use this taint (for the rescheduler proposal). I have linked to that proposal in a previous comment, and I encourage you to review the sequence of events yourself.

I definitely agree with your assessment that it's possible the CA has never worked on these AKS nodes. But with the context of all of the other evidence, I do not feel this is a bug with AKS but rather a bug with CA.
I think this is a cogent consideration. Fortunately, no documentation would need to be changed, as the behavior is completely undocumented; hence why my teams ran into unexpected behavior in the first place. Given that the current versions of Kubernetes do not use this taint (again, please feel free to review the source), it is not in the official Kubernetes spec, and it is undocumented behavior of the CA, I don't see how this could be anything other than a bug.

I do not see the issue of putting bugfixes into the current version (and even backporting the bugfix to previous versions). That said, if you can provide an example of how this may be disruptive, I definitely want to make sure your use cases are considered. It could be that I am simply not understanding how this would affect your setup. What I can say is that the current behavior is negatively impacting the setups of most AKS customers, myself included.
From what I am understanding, you are worried about the following sequence of events: (1) EKS adds a temporary `CriticalAddonsOnly` taint to a node while it bootstraps; (2) if the CA started respecting the taint, it would treat that node group as unschedulable for ordinary workloads; and (3) scale-up behaviour for those groups would break.

Is this a fair assessment of your concern?
@jclangst I'm coming at this from a similar position as you, as I'm building both EKS and AKS clusters as part of a single platform, and the lack of consistency in this area is painful; we got hit with this on both EKS & AKS due to trying to align system node pool taints. I have reviewed the Kubernetes code and, like you, can't find any references within the code, but I have been able to find a number of places such as Helm charts where it's treated as a special case.
That is my concern, and unless this can be addressed it would be a breaking change. I'd love to see the code removed and the correct documentation of `CriticalAddonsOnly` behaviour added.
@jclangst I saw the issue you opened on the AKS repository. 👍🏻 i.e. Azure/AKS#2306
@jclangst what's your AKS issue link?
I definitely understand, and it sucks that EKS and AKS have such different behavior on this front. What about this solution: (1) for future versions, remove this behavior; (2) for current versions, add a flag that opts a cluster into respecting the taint. This way it would be opt-in and shouldn't affect customers without the flag set.
@stevehipwell here: Azure/AKS#2513
@pierluigilenoci Well I had to try. Ultimately, we just want our configurations to work regardless of where our cluster is hosted!
Our team's core issue is that currently I can neither tell the CA to respect the taint nor add additional taints to system nodes in AKS.
@jclangst we're in the same situation as you. In Azure we have made the default system node group single-zone and added duplicates for the other zones so the CA should work correctly for workloads with PVCs, but until this is resolved either way we have to manually scale these node groups as CA isn't working.
@stevehipwell This is indeed the solution we aligned on as well.
We have also been trying to align the taint between EKS and AKS for months, but it is not possible to date. For cluster-autoscaler, we used a mixed approach: we install it as an Add-On in AKS, and we install it with Helm/Flux in EKS with the toleration for `CriticalAddonsOnly`.
@jclangst it's yet another area where AKS is creating extra work; they should either support a group per zone or allow the default group to be disabled.
I agree with all of the above, but I also think the CA is something that needs to change, as it appears to be making assumptions that are really only relevant for EKS. What do you think of the above proposal?
I'd also be very open to just doing (2) for both past and future, as long as we make sure it's documented appropriately.
@jclangst I'd suggest first implementing option 1 for v1.23.0 as that should be the correct behaviour moving forwards. Then if possible implementing option 2 to be backported as a patch to v1.21 & v1.22 would give a solution for Azure.
Sorry. Closed wrong issue! |
This taint is not documented to be involved in any special handling according to the official Kubernetes documentation, see:
- kubernetes/autoscaler#4097
- https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/

Gardener doesn't use this taint in any way either.
…7304)
* Remove system components node selector
* Remove `CriticalAddonsOnly` toleration

  This taint is not documented to be involved in any special handling according to the official Kubernetes documentation, see:
  - kubernetes/autoscaler#4097
  - https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/

  Gardener doesn't use this taint in any way either.
* Remove general tolerations from deployments

  Custom or user-added taints are tolerated with the help of the System-Components-Config webhook (#7204). In addition, it's not desired to tolerate any `NoExecute` taint, as this will prevent proper [taint based evictions](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions). Only `DaemonSet`s should tolerate these taints.
`CriticalAddonsOnly` taint is not documented to be involved in any special handling according to the official Kubernetes documentation, see:
- kubernetes/autoscaler#4097
- https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
@pierluigilenoci @gjtempleton @stevehipwell any news on this issue?
@ib-ak I'm sorry, but I can't help you much.
BTW, for example, nodes can come online tainted as `CriticalAddonsOnly=true:NoSchedule`. Using a private taint like this is questionable, and I think that cluster-autoscaler's special handling of this private taint represents an architecture bug.
Came to know about this issue today because of https://kubernetes.slack.com/archives/C09R1LV8S/p1685972225498489. I am not sure if I have fully grasped the details here. I have added it as an item on the agenda for the upcoming SIG meeting to decide what to do about this. Once we reach a decision, I would be happy to raise a PR (I am working on 2 other PRs, so some delay should be expected).
/assign vadasambar |
Doing the simplest thing first based on my limited understanding: #5838 (WIP)
Link to agenda for upcoming SIG meeting: https://docs.google.com/document/d/1RvhQAEIrVLHbyNnuaT99-6u9ZUMp7BfkPupT2LAZK7w/edit#bookmark=id.e9v2tiuubvsq
My 2 cents: since it's not an official taint, it shouldn't be treated differently than any other taint. Therefore I'd remove it.
...this will cause problems as described above. But frankly this shouldn't be too much of a concern for this project. Might require a breaking/major release, but EKS is in the wrong using this taint instead of an official mechanism (e.g. a taint under its own domain).
If preferred, EKS could also mint a taint of its own (for example under the `eks.amazonaws.com` prefix) rather than relying on this legacy one.
@discordianfish If you hope cloud providers will use "official" methods, you are naïve and will wait forever. 😜
@pierluigilenoci Precisely my point, this project should assume the 'official' methods and leave it to downstream to converge to that (aka break it)
@discordianfish I'm not 100% sure if EKS still uses this taint, but I would say they're more correct than AKS, who use it as their system node taint. AFAIR the EKS taint was only temporary, removed once the node was ready.
@discordianfish The project must assume the "official" methods as default behavior but could give the possibility to use "custom" methods on request (for example, through some configuration). @stevehipwell between AKS and EKS, they have an internal competition over who does things in the least orthodox way. 🤣
Brought this up in the SIG meeting yesterday (12th June 2023). Let's wait and see if we get a response from someone in the community who has historical context on this. Otherwise I will bring this up again in the upcoming SIG meeting (Mon, 19th June 2023).
Can you be more specific about what historical context you are looking for? We have already done a deep dive in the current and historical codebases here.
@jclangst I basically want someone from the CA community with better historical context (than me) to comment on why we haven't removed this taint yet and whether it's okay to just go ahead and remove it. Based on the community's response, we either remove the taint completely from the code OR preserve the current behavior while giving an option to change it based on the use-case (maybe using a flag).
@vadasambar To that end, I would also be interested in what is holding us up from choosing a path on this.
Given that the CA already has flags for ignoring certain taints, it seems to me the implicit + immutable ignore behavior should be removed. It might generate some very minor work for cloud providers that have relied on the implicit, undocumented behavior, but as long as this is coupled with a major version bump that seems completely reasonable (in my personal opinion).
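To make the contrast concrete, here is a minimal sketch of what flag-driven taint filtering looks like. The types and function names are hypothetical illustrations, not the actual cluster-autoscaler source; the point is that the operator controls the ignore list (as with repeated `--ignore-taint` flags) instead of one key being hard-coded.

```go
package main

import "fmt"

// Taint mirrors the node-taint fields relevant here.
type Taint struct {
	Key, Value, Effect string
}

// filterTaints drops any taint whose key appears in ignored. In a real
// autoscaler the ignored set would be populated from command-line flags,
// so nothing is hard-coded: the operator decides what gets ignored.
func filterTaints(taints []Taint, ignored map[string]bool) []Taint {
	kept := make([]Taint, 0, len(taints))
	for _, t := range taints {
		if ignored[t.Key] {
			continue
		}
		kept = append(kept, t)
	}
	return kept
}

func main() {
	taints := []Taint{
		{Key: "CriticalAddonsOnly", Value: "true", Effect: "NoSchedule"},
		{Key: "dedicated", Value: "gpu", Effect: "NoSchedule"},
	}
	// Equivalent of running with --ignore-taint=CriticalAddonsOnly:
	ignored := map[string]bool{"CriticalAddonsOnly": true}
	fmt.Println(filterTaints(taints, ignored)) // only the "dedicated" taint remains
}
```

With an empty ignore set, every taint survives; the opt-in flag recovers the old behaviour only for clusters that ask for it.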
Contrary to what comments above say, it's not a private extension: CriticalAddonsOnly was an official Kubernetes taint (to the extent anything was official in those days; it was used by core k8s code, which I think was how we defined "official" back in 2017 :) ). This taint was used by the 'rescheduler', which very much existed and wasn't just a proposal.

I don't remember the details after all those years, but back then pod priority didn't exist and daemonset pods were scheduled directly by the DaemonsetController, not by the scheduler. IIRC the rescheduler was basically a separate controller doing the equivalent of today's pod preemption for system pods. Since there was no pod priority back then, it had to temporarily taint the node while performing the preemption, otherwise any pod could schedule using the capacity freed by preemption. The only way to guarantee the intended pod would use the capacity was to temporarily taint a node so that non-system pods couldn't schedule there. Once the preemption was done, the taint would be removed. I'm sure I'm getting some details wrong, but I think that was the general idea.

So yeah, this is very much legacy. I don't think CA should be required to maintain support for a feature removed from Kubernetes back in 2018, and I'd be happy to approve a PR removing it.

Now, the issue brought to sig-autoscaling seems to have been triggered by the idea of using this particular taint to mark control plane nodes. I don't think that's consistent with how the CriticalAddonsOnly taint was originally intended to be used, and I guess that could be considered a private extension (using an "official" taint for a different purpose). I have no idea what platforms out there may be using it and whether they will be impacted by the change, and it will technically be a backward-incompatible change, so I'd ask anyone making this change to make sure to include a release note that clearly calls out the backward-incompatible change in behavior.
I'm guessing some users depend on the current behaviour by now. As it's been around so long, I strongly agree on the value of some kind of deprecation process.
PR #5838 is ready for review.
Which component are you using?:
cluster-autoscaler

What version of the component are you using?:
Component version: 1.20.0

What k8s version are you using (`kubectl version`)?:
1.20.5

What environment is this in?:
Azure w/ VMSS scaling configured via auto-discovery
What did you expect to happen?:
When I apply taint `CriticalAddonsOnly=true:NoSchedule` to a node group, it should be respected.

What happened instead?:
This particular taint is not respected during scaling calculations.
How to reproduce it (as minimally and precisely as possible):
Create a node group with this taint and one without, scale a deployment without a matching toleration, notice that the tainted node group will get scaled occasionally.
Anything else we need to know?:
This appears to be a result of this taint filtering logic: `autoscaler/cluster-autoscaler/utils/taints/taints.go`, line 70 (at commit d33cc1b).
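For readers who haven't opened the file, the behaviour being reported can be sketched roughly like this. This is a hypothetical reconstruction for illustration, not the actual `taints.go` code: when taints are prepared for scheduling simulation, taints with this one hard-coded key are silently dropped, so the simulator treats the tainted group as schedulable.

```go
package main

import "fmt"

// Taint mirrors the node-taint fields relevant here.
type Taint struct {
	Key, Value, Effect string
}

// Legacy rescheduler-era key that the filter special-cases.
const criticalAddonsOnlyKey = "CriticalAddonsOnly"

// sanitizeTaints approximates the reported behaviour: every taint
// survives except the one with the hard-coded CriticalAddonsOnly key,
// which is dropped before the scale-up simulation ever sees it.
func sanitizeTaints(taints []Taint) []Taint {
	kept := make([]Taint, 0, len(taints))
	for _, t := range taints {
		if t.Key == criticalAddonsOnlyKey {
			continue // silently ignored: the source of this issue
		}
		kept = append(kept, t)
	}
	return kept
}

func main() {
	taints := []Taint{
		{Key: "CriticalAddonsOnly", Value: "true", Effect: "NoSchedule"},
		{Key: "dedicated", Value: "gpu", Effect: "NoSchedule"},
	}
	fmt.Println(sanitizeTaints(taints)) // the CriticalAddonsOnly taint is gone
}
```

Because the key is a constant rather than configuration, no flag can make the simulator respect the taint, which matches the behaviour described in this report.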
I believe this is the problem in this old issue that went unresolved: #2434
After searching the repository, I could not find any references to this value in the core logic. While I may be missing something, I am not sure what the intended purpose of this filter is. The only thing I can find is a reference to a "rescheduler". I think it may be related to this KEP, but I'm not sure if it's still relevant: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/rescheduling.md
It is hindering my use case in Azure/AKS because that is the only taint that can be added to the default system node pool. Frequently the scaling is very delayed, as the autoscaler will scale this node pool incorrectly and repeatedly. My hacky workaround is to just have the autoscaler ignore this node pool, but I am afraid this unexpected behavior will continue to surprise people. Moreover, I'd like to be able to not ignore this pool in the future.
(1) Can it be removed?
(2) If not, can a warning and explanation be added to the FAQ?