[EKS] [EBS CSI addon]: Adding custom Toleration on EBS CSI Addon #1706
Comments
Thank you for raising this ticket. I'm curious about why the current tolerations for the We currently have this:
tolerations:
- key: CriticalAddonsOnly
  operator: Exists
- effect: NoExecute
  operator: Exists
  tolerationSeconds: 300
Which seems to assume that the user/operator won't want the
tolerations:
- operator: Exists
I'm sure my question is just a sign that I have some deeper misunderstanding of how the EBS CSI driver is intended to be used in conjunction with EKS, I'm more than happy to be corrected, or to find that the CSI driver add-on can be easily configured to use EKS' standard topology keys, for instance. |
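To illustrate the difference between the two toleration lists quoted above, here is a minimal sketch of the Kubernetes toleration matching rules in Python. It is not the scheduler's code. The `dedicated=gpuGroup` taint is borrowed from a later comment in this thread and its `NoSchedule` effect is an assumption. `tolerationSeconds` is left out because it does not affect matching.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key with Exists matches every taint key
    operator: str = "Equal"  # Kubernetes treats an omitted operator as Equal
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if this single toleration matches this single taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    # Equal: the API rejects an empty key here, so treat it as no match.
    return bool(tol.key) and tol.value == taint.value


def schedulable(tolerations: Iterable[Toleration], taints: Iterable[Taint]) -> bool:
    """True if every NoSchedule/NoExecute taint is matched by some toleration."""
    tolerations = list(tolerations)
    return all(
        any(tolerates(tol, taint) for tol in tolerations)
        for taint in taints
        if taint.effect in ("NoSchedule", "NoExecute")
    )


# The add-on's tolerations as quoted in the comment above.
OLD_DEFAULT = [
    Toleration(key="CriticalAddonsOnly", operator="Exists"),
    Toleration(operator="Exists", effect="NoExecute"),
]
# The "tolerate everything" form used by kube-proxy and vpc-cni.
TOLERATE_ALL = [Toleration(operator="Exists")]

# Assumed example taint: dedicated=gpuGroup:NoSchedule
gpu_taint = Taint(key="dedicated", effect="NoSchedule", value="gpuGroup")

print("old default tolerates gpu taint:", schedulable(OLD_DEFAULT, [gpu_taint]))
print("tolerate-all tolerates gpu taint:", schedulable(TOLERATE_ALL, [gpu_taint]))
```

Under these rules the older list only covers the `CriticalAddonsOnly` key and taints with the `NoExecute` effect, so a custom `NoSchedule` taint keeps the DaemonSet pod off the node.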
I've been able to extend the daemonset to tainted nodes by modifying the tolerations, which is enough to deploy pending persistent volume claims...
Unfortunately, tolerations are a fully managed field, so my changes are quickly overwritten. The only workaround I could think of is to remove
|
Have the same issue. In the "standalone" ebs driver, this apparently can be configured with a helm variable "tolerateAllTaints", too: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/charts/aws-ebs-csi-driver/templates/node.yaml#L46 I have opened a ticket with AWS support on why they chose to set this to the (IMHO) less intuitive/sensible option. |
Just to report back: AWS support recommended using a self-managed installation of the CSI driver for the time being, and asked me to watch this issue on GitHub. |
I'm in the same boat. An AddOn DaemonSet to provide a universal service that has by default an exclusion of any nodes with a Taint is like running Up the Down escalator. StatefulSets with PVs in a Production multi-AZ architecture - you're gonna have Taints on Worker Nodes so the SS Pods run in the right AZ. I see the option was added to the EBS CSI Driver 2 years ago. Haven't found any discussion around keeping the old default yet, so maybe I'm missing something important. But sure seems like an illogical choice given what EBS CSI Driver does. |
We just upgraded from an older helm chart version of the EBS CSI driver to the EKS managed add-on version and did not notice this problem until now. I don't understand, and am disappointed, that the add-on does not allow the toleration field to be customized for the daemonset. It makes absolutely no sense not to allow some configuration of it. |
Hey, facing the same issue. I want to add tolerations: |
@idanl21 Indeed, currently the only solution is to remove the AWS-managed Add-In, and install the ebs-csi-controller yourself. |
As I understand it, NodeSelector is also fully managed? |
This issue is preventing me from using AWS-managed add-ons. Any workarounds? |
I think I have solved this locally by editing ebs-csi-node daemonset by appending the following in the tolerations at line 365
The toleration after the append will be like below:
It's working, but I am almost sure that it will break again after the driver is updated. |
It's definitely not fixed in 1.10. I have 1.10 installed in my cluster and it won't deploy the daemonset to any nodes that have taints. |
@johnjeffers you are right, I checked only one of my clusters. The other one is still having the same issue. Thank you! The workaround I shared before is still working. |
Could we please get an update on why this is taking so long to fix? All that has to be done is update the toleration on the daemonset to
The other EKS managed add-ons like kube-proxy and vpc-cni already do this. In fact, I copied and pasted that code block directly from the kube-proxy managed add-on's daemonset. This is a blocker for upgrading to EKS 1.23, and the fix seems to be so simple. I can't understand why this issue has been open for so long. Is there more to this problem than there seems to be? |
We recently released a behavior change that will NOT overwrite configuration changes made to EKS managed add-ons through the Kubernetes API. Previously, a reconciliation process ran every 15 minutes that overwrote configuration changes made to EKS managed add-ons through the Kubernetes API. Example – changes you make to the CoreDNS Config Map through the Kubernetes API will no longer be overwritten during steady state. However, if a managed add-on is upgraded, then any configuration changes made will not be retained at this time. This change is a first step in ensuring configurations made to EKS add-ons are preserved. We are also working on additional changes to support advanced configuration of EKS add-ons directly through the EKS API, and the ability to preserve the configuration changes during add-on upgrades. Toleration for the EBS-CSI driver is in our product backlog and is being evaluated by the team. |
OK just to make sure I understand, the permanent fix for this is in the backlog so we won't see it for a while, but if I manually update the daemonset with the tolerations I need, nothing will overwrite my changes until the next upgrade of the add-on? |
Yes, I already tested it applying the requested toleration and it wasn't reconciled. |
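The manual workaround discussed above (setting the tolerations on the `ebs-csi-node` DaemonSet by hand) could be scripted roughly as follows. This is a sketch: it only builds a JSON Patch document and prints a `kubectl patch` command for review, it does not touch a cluster. The `add` operation is used because, per JSON Patch semantics, it sets the field whether or not it already exists. As noted in the thread, the change may not survive an add-on upgrade.

```python
import json
import shlex


def tolerations_patch(tolerations):
    """Build a JSON Patch (RFC 6902) that sets the pod template's tolerations."""
    return json.dumps([
        {
            "op": "add",  # creates the field, or replaces it if already present
            "path": "/spec/template/spec/tolerations",
            "value": tolerations,
        }
    ])


def kubectl_patch_command(patch, name="ebs-csi-node", namespace="kube-system"):
    """Return a shell-quoted kubectl command that would apply the patch."""
    return " ".join([
        "kubectl", "patch", "daemonset", name,
        "-n", namespace,
        "--type=json",
        "-p", shlex.quote(patch),
    ])


# Tolerate every taint, as kube-proxy and vpc-cni do.
patch = tolerations_patch([{"operator": "Exists"}])
command = kubectl_patch_command(patch)
print(command)
```

Printing the command instead of running it keeps the sketch side-effect free, so the patch can be inspected before it is applied.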
As more and more people move to K8s 1.23 where the ebs-csi-driver is mandatory, I imagine a fair amount will try out this addon and realise it doesn't work nicely with taints. It's a shame as we would like to use the eks-addons where possible and it seems like AWS is recommending people do that. Other Daemonsets that are eks-addons don't have this issue, so I'm surprised this one acts differently, I would've thought most people would consider EBS mounting something that they would want on all nodes by default as that's how it works pre 1.23. |
Thanks for the feedback everyone. Currently, you can modify your tolerations without them being overwritten by EKS. We will be updating the CSI driver EKS managed add-on in the next release, changing the daemonset to tolerate all taints by default. |
Just got bit by the same issue. Not having at least a warning on the docs or troubleshooting docs made things unnecessarily complicated. Having tolerations for every taint should be the default behavior for the DS. |
Amazon has just released EKS managed add-on EBS CSI "v1.11.2-eksbuild.1".
$ k get ds -n kube-system ebs-csi-node -o yaml
...
tolerations:
- operator: Exists
...
This is now similar to other add-ons like kube-proxy or AWS VPC CNI. |
I have tried at 16:00 BST from eu-west-1 region too. I didn't see any updated version, but I'll retry later |
@aleclerc-sonrai I'm on 1.23 and I don't see the new version yet either (us-west-2) |
Still not available in us-west-2. Latest from "aws eks describe-addon-versions" is still: "v1.10.0-eksbuild.1". If I hadn't been so busy with other higher priority tickets, I'd have punted on waiting for the add-on and install and manage the drivers myself. The add-on is worthless with the Taint issue. |
Thanks everyone for your patience. |
I hit the same problem and am waiting for the new version rollout to finish before I can continue my work. It is not yet available in the ap-southeast-1 region. |
@duclm2609 Upgrade the EKS add-on to v1.11.2-eksbuild.1; the issue has been resolved in that version. My environment: EKS cluster on Kubernetes 1.23. |
It seems there’s a regression: after upgrading addon from v1.11.4-eksbuild.1 to v1.13.0-eksbuild.1, tolerations are gone on my ebs-csi-node DaemonSet. |
^^ Running into the same issue. There are no |
v1.12.1-eksbuild.1 does not work either. |
same here, after addon update to v1.13.0-eksbuild.1 I was unable to provision volumes on node with taints |
How did you manage the upgrade of the managed EBS CSI add-on? Did you use the new "preserve" option to preserve your changes to toleration? (in case you have customized them)
$ k get ds -n kube-system ebs-csi-node -o yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
annotations:
deprecated.daemonset.template.generation: "1"
meta.helm.sh/release-name: aws-ebs-csi-driver
meta.helm.sh/release-namespace: kube-system
creationTimestamp: "2022-11-16T10:36:34Z"
generation: 1
labels:
app.kubernetes.io/component: csi-driver
app.kubernetes.io/instance: aws-ebs-csi-driver
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: aws-ebs-csi-driver
app.kubernetes.io/version: 1.13.0
helm.sh/chart: aws-ebs-csi-driver-2.13.0
helm.toolkit.fluxcd.io/name: aws-ebs-csi-driver
helm.toolkit.fluxcd.io/namespace: kube-system
name: ebs-csi-node
namespace: kube-system
...
tolerations:
- operator: Exists
...
$ k get ds -n kube-system ebs-csi-node -o yaml | grep tolerations
So the managed add-on has no tolerations at all, but you can apply them yourself. I will contact the AWS service team to check whether this is expected! |
I didn’t use the preserve option since I didn’t modify the tolerations because v1.11.5-eksbuild.1 deploys the right ones. |
Can confirm that the addon with a version newer than v1.11.5 doesn’t have tolerations in the manifest and therefore cannot be scheduled on nodes with taints. |
We are working on a v1.12 and v1.13 managed add-on release which fixes the toleration issue. |
We have rolled out add-on versions EBS CSI v1.13.0-eksbuild.2 (default version now for EKS v1.22, v1.23 and v1.24) and v1.12.1-eksbuild.2 in all regions with toleration for ebs-csi-node DaemonSet as expected. Please check!
tolerations:
- operator: Exists |
@youwalther65 I confirm it’s working, thanks! |
Hi everyone, we have recently rolled out support for custom tolerations for the EBS CSI Driver addon using the new Advanced Configuration feature for EKS Addons! This has come in two steps:
|
@ConnorJC3 Thanks for the update, that was very useful. Is the team considering adding tolerations support to CoreDNS? Currently, this is the only thing preventing people from creating an EKS cluster with tainted-only nodes. |
|
Amazon EKS team recently announced the general availability of advanced configuration feature for managed add-ons. You can now pass in advanced configuration for cluster add-ons, enabling you to customize add-on properties not handled by default settings. Configuration can be applied to add-ons either during cluster creation or at any time after the cluster is created. Using advanced configuration feature, you can now configure custom tolerations for Amazon EBS CSI driver addon starting from v1.14.0-eksbuild.1. Custom tolerations can be configured through controller.tolerations and node.tolerations. Note, node.tolerateAllTaints will continue to default to true. To learn more about this feature, check out this blogpost - https://aws.amazon.com/blogs/containers/amazon-eks-add-ons-advanced-configuration/ Check out the Amazon EKS documentation - https://docs.aws.amazon.com/eks/latest/userguide/managing-add-ons.html |
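As a rough illustration of the announcement above, the configuration values for the add-on might be assembled like this. Only the keys named in the announcement (`controller.tolerations`, `node.tolerations`, `node.tolerateAllTaints`) are used. The helper name, the cluster name and the example taint are hypothetical. The exact schema should be checked against the add-on's published JSON configuration schema for the version in use.

```python
import json


def ebs_csi_configuration_values(node_tolerations=None,
                                 controller_tolerations=None,
                                 tolerate_all_taints=None):
    """Build a JSON string of advanced configuration values (hypothetical helper)."""
    values = {}
    node = {}
    if tolerate_all_taints is not None:
        node["tolerateAllTaints"] = tolerate_all_taints
    if node_tolerations:
        node["tolerations"] = node_tolerations
    if node:
        values["node"] = node
    if controller_tolerations:
        values["controller"] = {"tolerations": controller_tolerations}
    return json.dumps(values, sort_keys=True)


# Assumed example: let the controller pods run on nodes tainted
# dedicated=gpuGroup:NoSchedule, and leave the node DaemonSet on its default.
config = ebs_csi_configuration_values(
    controller_tolerations=[
        {"key": "dedicated", "operator": "Equal",
         "value": "gpuGroup", "effect": "NoSchedule"},
    ],
)
print(config)

# Applying it needs AWS credentials and the boto3 package, so it is only
# shown as a comment ("my-cluster" is a placeholder):
#
#   import boto3
#   boto3.client("eks").update_addon(
#       clusterName="my-cluster",
#       addonName="aws-ebs-csi-driver",
#       configurationValues=config,
#   )
```

Leaving `node.tolerateAllTaints` unset keeps the default of true described in the announcement. Set it explicitly only when the node DaemonSet should honor a narrower `node.tolerations` list.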
@youwalther65 @ConnorJC3 and also ability to customize tolerations? Thanks |
cc @sriramranganathan should know the answer to that |
@pcebul: You can check the JSON configuration schema of the CoreDNS managed add-on to see if it supports customizing the toleration as described in the corresponding launch blog post here. |
Now, I tried using add-on versions 1.16 and 1.13, but it seems it's not able to run on nodes with taints.
kubectl describe pods ebs-csi-controller-765f496485-6xrl4 -n kube-system
Warning FailedScheduling 80s default-scheduler 0/2 nodes are available: 2 node(s) had untolerated taint {dedicated: gpuGroup}. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
And the describe output of the node with the taint:
kubectl describe node ip-192-168-22-69.ec2.internal
MemoryPressure False Wed, 08 Mar 2023 14:31:10 +0530 Wed, 08 Mar 2023 14:30:07 +0530 KubeletHasSufficientMemory kubelet has sufficient memory available
amazon-cloudwatch fluent-bit-k7sfv 500m (25%) 0 (0%) 100Mi (3%) 200Mi (6%) 5m7s
cpu 655m (33%) 300m (15%)
Normal Starting 5m kube-proxy
So, it seems that the "- operator: Exists" toleration doesn't work for all types of taint on the node. Please confirm this. |
Community Note
Tell us about your request
Adding custom toleration like "node.tolerateAllTaints = true" on EBS CSI addon so that it can tolerate taints put on nodes.
Which service(s) is this request for?
EKS
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
It seems like there is potentially a missing setting on the DaemonSet created by this Add-on that will prevent it from being put on certain nodes with taints on them. People who manage this manually have fixed this by adding:
node.tolerateAllTaints = true
to the daemonset. Currently it does not look like the AWS Add-on allows for something like this. Because of this, our move to this add-on has prevented new persistent volumes from being created on certain nodes.
I am not fully sure if there is a workaround for this (I assumed modifying the daemonset after install would not be permanent or should really be a step in setting up).
kubernetes-sigs/aws-ebs-csi-driver#848
Specifically the error we saw that led to this is:
Warning ProvisioningFailed 18s (x7 over 81s) ebs.csi.aws.com_ebs-csi-controller-848fb4bd69-lnjp8_454f66d4-704c-4164-a5a0-283cb99d5688 failed to provision volume with StorageClass "gp3-encrypted": error generating accessibility requirements: no topology key found on CSINode ip-XX-XXX-XX-XX.ec2.internal
Normal ExternalProvisioning 8s (x6 over 81s) persistentvolume-controller waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
The label of:
topology.ebs.csi.aws.com/zone=
was only on nodes that the daemonset could run on, rather than on every node (the daemonset did not schedule onto the tainted nodes).
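The missing-label symptom described above can be checked with a short script. This is a sketch that assumes its input is the parsed output of `kubectl get nodes -o json`. The node names in the sample are made up. It lists nodes without the `topology.ebs.csi.aws.com/zone` label together with their taint keys.

```python
def nodes_missing_label(node_list, label="topology.ebs.csi.aws.com/zone"):
    """Return (node name, taint keys) for every node lacking the given label."""
    missing = []
    for node in node_list.get("items", []):
        metadata = node.get("metadata", {})
        labels = metadata.get("labels") or {}
        if label not in labels:
            taints = node.get("spec", {}).get("taints") or []
            missing.append((metadata.get("name"), [t.get("key") for t in taints]))
    return missing


# Hypothetical sample in the shape of `kubectl get nodes -o json`.
sample = {
    "items": [
        {
            "metadata": {
                "name": "node-a",
                "labels": {"topology.ebs.csi.aws.com/zone": "us-east-1a"},
            },
            "spec": {},
        },
        {
            "metadata": {"name": "node-b", "labels": {}},
            "spec": {"taints": [{"key": "dedicated", "effect": "NoSchedule"}]},
        },
    ]
}

for name, taint_keys in nodes_missing_label(sample):
    print(f"{name} has no EBS topology label; taints: {taint_keys}")
```

Against a real cluster the input would come from something like `json.load(sys.stdin)` with the kubectl output piped in. A tainted node appearing in the result is consistent with the `ebs-csi-node` pod not running there.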
Are you currently working around this issue?
How are you currently solving this problem?
Additional context
Anything else we should know?
Attachments
Case ID:9858452761