[EKS Add-On] [CoreDNS]: Patched Add-On never recovers from 'Degraded' State #1389
Comments
We have been hit by the same problem. 😞 Solutions proposed by AWS support:
Obviously, none of the solutions are really acceptable. 🤦🏻 |
It appears that CoreDNS has been hit several times by taints related issues:
And not just CoreDNS: And not just on AWS EKS:
Unfortunately there is no way around this as the I've waited a long time to see this feature released, and seeing it turn out so lame leaves a bitter taste in my mouth. |
I'd really like to see some documentation for the The best reference I can find to it in the Kubernetes codebase is kubernetes/kubernetes#101966 (comment), where it's described as "an artifact of the old addon manager under /cluster" and the commenter adds "i don't think such a taint is actively used anymore". |
@pierluigilenoci having EKS install the addons is a bad enough experience; allowing EKS to "manage" the addons is just asking for trouble. I'm very happy that this behaviour wasn't forced on us (see AKS), but I'm desperately waiting on #923 to be able to provision a bare cluster. |
Instead, I'm for the team "the more it is managed by the cloud provider, the better". 💪🏻 I agree that the documentation is missing but I think it is linked to the fact that the AddOns management is (IMHO) still "not production ready" despite being released. So I think they still don't have a clear direction as to where they want to take the "product". In reality, both platforms (AKS / EKS) are still a bit immature for my taste. But I think it's an inherent problem with Kubernetes itself that it's a product in the making. And obviously, for those who want to provide a product, it is not easy to keep up. P.S.: I found this comment enlightening #923 (comment) |
@pierluigilenoci you might be interested in #1559 then? |
@stevehipwell you've had my like and I look forward to finding you a few more. |
If the cluster and node group configuration seems fine, try changing the instance type (for example, if the node group uses t3.medium, try t2.medium or another type). |
BTW, It's an architectural bug that the cluster-autoscaler tries to use a private taint. Ideally, folks help fix the cluster autoscaler. |
Hello, still hitting this problem with Terraform.
Stuck in the Degraded state. Any updates, please? Best regards, |
Community Note
Tell us about your request
Bug
Which service(s) is this request for?
EKS Managed Add-Ons
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
In a scenario where an EKS cluster has only tainted nodes available, the CoreDNS add-on cannot deploy successfully. Terraform, for example, will time out after 20 minutes of failing to schedule, and the deployment is eventually recorded as failed in the console.
However, if I begin to deploy the CoreDNS add-on, and then immediately patch the deployment with my tolerations, the service will schedule successfully but the EKS console will report the add-on as 'Degraded' even after the new ReplicaSet has deployed successfully.
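For readers who want to reproduce the patch step, here is a minimal sketch of how the patch body and the `kubectl patch` call might be built. The taint key and value (`dedicated=system`) and the helper names are hypothetical; the deployment name `coredns` in `kube-system` is the usual EKS default, so adjust it if yours differs. As far as I know the patch replaces the whole tolerations list rather than merging into it, so include any existing tolerations you want to keep.

```python
import json
import shlex

# Hypothetical taint; replace with the taint actually set on your nodes.
EXAMPLE_TOLERATION = {
    "key": "dedicated",
    "operator": "Equal",
    "value": "system",
    "effect": "NoSchedule",
}


def build_toleration_patch(tolerations):
    """Return a patch body that sets pod tolerations on a Deployment."""
    return {"spec": {"template": {"spec": {"tolerations": list(tolerations)}}}}


def kubectl_patch_command(patch, name="coredns", namespace="kube-system"):
    """Return the argv list for a `kubectl patch deployment` call."""
    return [
        "kubectl", "patch", "deployment", name,
        "-n", namespace,
        "--type", "strategic",
        "-p", json.dumps(patch),
    ]


if __name__ == "__main__":
    patch = build_toleration_patch([EXAMPLE_TOLERATION])
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(kubectl_patch_command(patch)))
```

Printing the command rather than executing it keeps the sketch side-effect free; pass the list to `subprocess.run` once it looks right.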
It appears that EKS periodically attempts a rolling redeployment of a degraded add-on, but this just makes the pods unschedulable again and the cycle repeats.
The ultimate fix (and embedded feature request) here would be the ability to stipulate tolerations as part of the add-on configuration.
Are you currently working around this issue?
With difficulty - Terraform sees the degraded Add-on state as tainted and looks to destroy it.
One workaround is to supply an untainted node to allow the initial deployment to stabilize, and then patch it.
Another seems to be to 'catch it in the act' of the rolling redeploy and patch it again whilst it is in progress. This seems to make it stabilize but is painfully manual.
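To make the "catch it in the act" workaround less manual, one option is to check whether a redeploy has dropped the tolerations before patching again. The sketch below is only an illustration: `missing_tolerations` is a hypothetical helper, and it assumes the Deployment is available as parsed JSON, for example from `kubectl get deployment coredns -n kube-system -o json`.

```python
def missing_tolerations(deployment, wanted):
    """Return the wanted tolerations that are absent from a Deployment.

    `deployment` is the parsed JSON of a Deployment object; `wanted` is a
    list of toleration dicts. Matching is by exact dict equality.
    """
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    current = pod_spec.get("tolerations") or []
    return [t for t in wanted if t not in current]


if __name__ == "__main__":
    # Hypothetical toleration and a Deployment that has lost it after a redeploy.
    wanted = [{"key": "dedicated", "operator": "Exists", "effect": "NoSchedule"}]
    reset = {"spec": {"template": {"spec": {"tolerations": []}}}}
    if missing_tolerations(reset, wanted):
        print("tolerations missing, patch again")
```

A script could run this check on a timer and reapply the patch only when the returned list is non-empty.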
Additional context
Attachments
If you think you might have additional information that you'd like to include via an attachment, please do - we'll take a look. (Remember to remove any personally-identifiable information.)