
[EKS Add-On] [CoreDNS]: Patched Add-On never recovers from 'Degraded' State #1389

Open
petewilcock opened this issue May 31, 2021 · 10 comments
Labels
EKS Add-Ons EKS Amazon Elastic Kubernetes Service Proposed Community submitted issue

Comments

@petewilcock

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request
Bug

Which service(s) is this request for?
EKS Managed Add-Ons

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
In a scenario where an EKS cluster has only tainted nodes available, the CoreDNS add-on cannot deploy successfully. Terraform, for example, will time out after 20 minutes of failing to schedule, and the deployment will eventually be recorded as failed in the console.

However, if I begin to deploy the CoreDNS add-on, and then immediately patch the deployment with my tolerations, the service will schedule successfully but the EKS console will report the add-on as 'Degraded' even after the new ReplicaSet has deployed successfully.

It appears that EKS periodically attempts a rolling redeployment of a degraded add-on, but this just makes the pods unschedulable again and the cycle repeats.

The ultimate fix (and embedded feature request) here would be the ability to stipulate tolerations as part of the add-on configuration.
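
For reference, the manual patch looks roughly like this (a sketch only; the dedicated=system:NoSchedule taint and matching toleration are illustrative, and a JSON merge patch replaces the whole tolerations list, so any defaults you still need must be repeated):

# Add a toleration matching the node taint to the managed coredns Deployment (sketch)
kubectl -n kube-system patch deployment coredns --type merge \
  -p '{"spec":{"template":{"spec":{"tolerations":[{"key":"CriticalAddonsOnly","operator":"Exists"},{"key":"dedicated","value":"system","effect":"NoSchedule"}]}}}}'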

Are you currently working around this issue?
With difficulty - Terraform sees the degraded Add-on state as tainted and looks to destroy it.

One workaround is to supply an untainted node to allow the initial deployment to stabilize, and then patch it.

Another seems to be to 'catch it in the act' of the rolling redeploy and patch it again whilst it is in progress. This seems to make it stabilize but is painfully manual.
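
To catch the re-roll without staring at the console, watching the ReplicaSets works (a sketch; assumes the managed coredns pods carry the standard k8s-app=kube-dns label):

# Watch for EKS rolling out a new ReplicaSet, then re-apply the toleration patch above
kubectl -n kube-system get replicasets -l k8s-app=kube-dns --watch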

Additional context

Attachments
If you think you might have additional information that you'd like to include via an attachment, please do - we'll take a look. (Remember to remove any personally-identifiable information.)

@petewilcock petewilcock added the Proposed Community submitted issue label May 31, 2021
@mikestef9 mikestef9 added EKS Add-Ons EKS Amazon Elastic Kubernetes Service labels May 31, 2021
@pierluigilenoci

We have been hit by the same problem. 😞

Solutions proposed by AWS support:

  • remove the taints
  • change taints to PreferNoSchedule
  • do not use managed addons

Obviously, none of the solutions are really acceptable. 🤦🏻
If someone decides to use taints on the nodes it is because they need them.

@stevehipwell

I'd really like to see some documentation for the CriticalAddonsOnly taint and its use in EKS; this appears to be undocumented behaviour.

The best reference I can find to it in the Kubernetes codebase is kubernetes/kubernetes#101966 (comment), where it's described as "an artifact of the old addon manager under /cluster" and the commenter adds "I don't think such a taint is actively used anymore".

@stevehipwell

@pierluigilenoci having EKS install the addons is a bad enough experience; allowing EKS to "manage" the addons is just asking for trouble. I'm very happy that this behaviour wasn't forced on us (see AKS), but I'm desperately waiting on #923 to be able to provision a bare cluster.

@pierluigilenoci

Instead, I'm for the team "the more it is managed by the cloud provider, the better". 💪🏻
I prefer to focus on other things and let AWS / Azure engineers do their work. 😜

I agree that the documentation is missing, but I think that's tied to the fact that add-on management is (IMHO) still not production ready despite being released. So I think they still don't have a clear direction for where they want to take the "product".

In reality, both platforms (AKS / EKS) are still a bit immature for my taste. But I think that's an inherent problem with Kubernetes itself, which is still a product in the making. And obviously, for those who want to provide a product on top of it, it is not easy to keep up.
I have to give them credit because in the last 2-3 years they have improved a lot, which is why we decided to migrate from self-managed clusters to managed clusters.

P.S.: I found this comment enlightening #923 (comment)

@stevehipwell

Instead, I'm for the team "the more it is managed by the cloud provider, the better". 💪🏻
I prefer to focus on other things and let AWS / Azure engineers do their work. 😜

@pierluigilenoci you might be interested in #1559 then?

@pierluigilenoci

@stevehipwell you've had my like and I look forward to finding you a few more.
But I have little hope that AWS will consider the request.

@datadoggers

If the cluster and node group configuration seems fine, try changing the instance type in the node group (for example, if you selected t3.medium, try t2.medium or another type), then verify whether the CoreDNS pods are running with:
kubectl get pods -n kube-system | grep coredns
In my case this solved it.

@sftim

sftim commented Apr 8, 2023

BTW, CriticalAddonsOnly is not an official Kubernetes taint; if it were, you'd see it listed in [Well-Known Labels, Annotations and Taints](https://kubernetes.io/docs/reference/labels-annotations-taints/) and it would be prefixed, e.g. with kubernetes.io/.

It's an architectural bug that the cluster-autoscaler tries to use a private taint. Ideally, folks help fix the cluster autoscaler.
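
For anyone checking whether their nodes actually carry that taint, a jsonpath query along these lines works (sketch):

# Print each node name followed by its taint keys
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}'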

@AndreiAtMP

Hello, still hitting this problem with Terraform.

resource "aws_eks_addon" "coredns" {
  cluster_name                = aws_eks_cluster.name.name
  addon_name                  = "coredns"
  depends_on = [aws_eks_fargate_profile.kube-system]  
}

Stuck in the degraded state.
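
The reason EKS gives for the degraded state can be inspected with the AWS CLI (sketch; my-cluster is a placeholder):

# Show the health issues EKS records for the coredns add-on
aws eks describe-addon --cluster-name my-cluster --addon-name coredns --query 'addon.health.issues'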

Any updates, please?

Best regards,
Andrei
