
[EKS Add-On] [CoreDNS]: Patched Add-On never recovers from 'Degraded' State #1389

Open
petewilcock opened this issue May 31, 2021 · 10 comments
Labels
EKS Add-Ons EKS Amazon Elastic Kubernetes Service Proposed Community submitted issue

Comments

@petewilcock

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request
Bug

Which service(s) is this request for?
EKS Managed Add-Ons

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
In a scenario where an EKS cluster has only tainted nodes available, the CoreDNS add-on cannot deploy successfully. Terraform, for example, will time out after 20 minutes of failing to schedule, and the deployment will eventually be recorded as failed in the console.

However, if I begin to deploy the CoreDNS add-on, and then immediately patch the deployment with my tolerations, the service will schedule successfully but the EKS console will report the add-on as 'Degraded' even after the new ReplicaSet has deployed successfully.

It appears that EKS periodically attempts a rolling redeployment of a degraded add-on, but this just makes the pods unschedulable again and the cycle repeats.

The ultimate fix (and embedded feature request) here would be the ability to stipulate tolerations as part of the add-on configuration.
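
For reference, the manual patch looks roughly like this (a sketch only; the dedicated=system:NoSchedule taint and matching toleration are illustrative, and a JSON merge patch replaces the whole tolerations list, so any defaults you still need must be repeated):

# Add a toleration matching the node taint to the managed coredns Deployment (sketch)
kubectl -n kube-system patch deployment coredns --type merge \
  -p '{"spec":{"template":{"spec":{"tolerations":[{"key":"CriticalAddonsOnly","operator":"Exists"},{"key":"dedicated","value":"system","effect":"NoSchedule"}]}}}}'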

Are you currently working around this issue?
With difficulty - Terraform sees the degraded Add-on state as tainted and looks to destroy it.

One workaround is to supply an untainted node to allow the initial deployment to stabilize, and then patch it.

Another seems to be to 'catch it in the act' of the rolling redeploy and patch it again whilst it is in progress. This seems to make it stabilize but is painfully manual.
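
To catch the re-roll without staring at the console, watching the ReplicaSets works (a sketch; assumes the managed coredns pods carry the standard k8s-app=kube-dns label):

# Watch for EKS rolling out a new ReplicaSet, then re-apply the toleration patch above
kubectl -n kube-system get replicasets -l k8s-app=kube-dns --watch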

Additional context

Attachments
If you think you might have additional information that you'd like to include via an attachment, please do - we'll take a look. (Remember to remove any personally-identifiable information.)

@petewilcock petewilcock added the Proposed Community submitted issue label May 31, 2021
@mikestef9 mikestef9 added EKS Add-Ons EKS Amazon Elastic Kubernetes Service labels May 31, 2021
@pierluigilenoci

We have been hit by the same problem. 😞

Solutions proposed by AWS support:

  • remove the taints
  • change taints to PreferNoSchedule
  • do not use managed addons

Obviously, none of the solutions are really acceptable. 🤦🏻
If someone decides to use taints on the nodes it is because they need them.

@stevehipwell

I'd really like to see some documentation for the CriticalAddonsOnly taint and its use in EKS; this appears to be undocumented behaviour.

The best reference I can find to it in the Kubernetes codebase is kubernetes/kubernetes#101966 (comment), where it's described as "an artifact of the old addon manager under /cluster" and the commenter adds "I don't think such a taint is actively used anymore".

@stevehipwell

@pierluigilenoci having EKS install the addons is a bad enough experience; allowing EKS to "manage" the addons is just asking for trouble. I'm very happy that this behaviour wasn't forced on us (see AKS), but I'm desperately waiting on #923 to be able to provision a bare cluster.

@pierluigilenoci

Instead, I'm for the team "the more it is managed by the cloud provider, the better". 💪🏻
I prefer to focus on other things and let AWS / Azure engineers do their work. 😜

I agree that the documentation is missing, but I think that's tied to the fact that add-on management is (IMHO) still not production ready despite being released. So I think they still don't have a clear direction for where they want to take the "product".

In reality, both platforms (AKS / EKS) are still a bit immature for my taste. But I think that's an inherent problem with Kubernetes itself, which is still a product in the making. And obviously, for those who want to provide a product on top of it, it is not easy to keep up.
I have to give them credit because in the last 2-3 years they have improved a lot, which is why we decided to migrate from self-managed clusters to managed clusters.

P.S.: I found this comment enlightening #923 (comment)

@stevehipwell

Instead, I'm for the team "the more it is managed by the cloud provider, the better". 💪🏻
I prefer to focus on other things and let AWS / Azure engineers do their work. 😜

@pierluigilenoci you might be interested in #1559 then?

@pierluigilenoci

@stevehipwell you've had my like and I look forward to finding you a few more.
But I have little hope that AWS will consider the request.

@datadoggers

If the cluster and node group configuration seems fine, try changing the instance type in the node group (for example, if you selected t3.medium, try t2.medium or another type), then verify whether the CoreDNS pods are running with:
kubectl get pods -n kube-system | grep coredns
In my case this solved it.

@sftim

sftim commented Apr 8, 2023

BTW, CriticalAddonsOnly is not an official Kubernetes taint; if it were, you'd see it listed in [Well-Known Labels, Annotations and Taints](https://kubernetes.io/docs/reference/labels-annotations-taints/) and it would be prefixed, e.g. with kubernetes.io/.

It's an architectural bug that the cluster-autoscaler tries to use a private taint. Ideally, folks help fix the cluster autoscaler.
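
For anyone checking whether their nodes actually carry that taint, a jsonpath query along these lines works (sketch):

# Print each node name followed by its taint keys
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}'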

@AndreiAtMP

Hello, still hitting this problem with Terraform.

resource "aws_eks_addon" "coredns" {
  cluster_name                = aws_eks_cluster.name.name
  addon_name                  = "coredns"
  depends_on = [aws_eks_fargate_profile.kube-system]  
}

Stuck in the degraded state.
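
The reason EKS gives for the degraded state can be inspected with the AWS CLI (sketch; my-cluster is a placeholder):

# Show the health issues EKS records for the coredns add-on
aws eks describe-addon --cluster-name my-cluster --addon-name coredns --query 'addon.health.issues'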

Any updates, please?

Best regards,
Andrei
