CA failed to load Instance Type list unless configured with hostNetworking #4464
/area provider/aws |
I'm getting a similar error:
Kubernetes version:
Cluster Autoscaler Image:
|
Is there a known workaround for this? It seems we're hit by the same issue. |
Just to confirm, are all 3 of you only seeing this in Osaka, with a particular image tag? Does running this with the flag make any difference? |
Our stacktrace looked the same but was caused by a permission problem. So we're luckily not affected by this exact issue. |
Hey @adaam, I don't currently have access to a cluster in Osaka (working on that) to reproduce, but here are a couple of questions I'd like answered, and things I'd like you to try out if possible, to help narrow down what's going on here:
My suspicion is currently still that this is related to a permissions issue, although we should handle it more gracefully than we currently do. |
I was seeing it with any tag I tried (1.20.0 to 1.21.1), and it wasn't in Osaka; we were trying from Sydney. Running with
Then I tried with
... Eventually what worked for us was enabling host networking for the cluster autoscaler. We found that no pods on our cluster were actually able to access resources outside the cluster by default (EKS, Amazon VPC CNI) -- we're still running with host networking until we can apply some more engineering time to looking into it further. |
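For anyone wanting to try the same workaround, here's a minimal sketch of a cluster-autoscaler Deployment with host networking enabled. The names, image tag, and flags are illustrative assumptions, not taken from this thread's manifests; `dnsPolicy: ClusterFirstWithHostNet` is the usual companion setting so in-cluster DNS still resolves:

```yaml
# Sketch only: a pared-down cluster-autoscaler Deployment showing where
# the host-networking fields go. Names, image tag, and flags are examples.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      # Run in the node's network namespace so egress behaves like the
      # node itself, bypassing the pod-network path that couldn't reach AWS.
      hostNetwork: true
      # Keep in-cluster DNS working while on the host network.
      dnsPolicy: ClusterFirstWithHostNet
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.21.1
          command:
            - ./cluster-autoscaler
            - --cloud-provider=aws
```

Note that this sidesteps the pod-network egress problem rather than fixing it, which is presumably why it's framed above as a stopgap.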
That's some great detail, thanks @dan-tw, you're reinforcing my belief that most people seeing this error are having permissions/networking errors masked poorly by this crash, and we can handle it more gracefully. |
Relatedly, it would be great to get your feedback as users who have encountered this, on the change I'm proposing in #4480, would you prefer that behaviour, with the risk I've outlined in the PR description, over the current hard crash behaviour? |
Yeah, I think that is a reasonable change, although I'm not sure it solves the specific issue: in my case, falling back to that static list still resulted in fatal crashing, as it attempted to access resources outside the cluster elsewhere. What I might propose is an explicit check (since external access is seemingly a requirement of the cluster autoscaler here, though I'm not sure if it is AWS specific) that the pod the cluster autoscaler is running in can reach resources outside the cluster (e.g. can access the internet), and if it can't, fail with an explicit message that is less cryptic than the ones noted above. E.g.
.. Hope that makes sense :) To add some more context: when I was attempting to debug the issue and seeing messages of 'timeout', I was unsure whether the context deadline was being hit as a result of latency, whether the endpoint data was so big that it was timing out, or whether the timeout was permission related and retries continued until the context deadline was exceeded. (It's not a normal expectation that your thing in the cloud can't reach the cloud :) ) |
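A cheap manual version of the check being proposed here, as a sketch (the image and the Sydney EC2 endpoint are arbitrary examples; substitute your region's endpoint):

```yaml
# Throwaway pod that tests outbound reachability of an AWS API endpoint
# from inside the pod network. If this times out, a cluster-autoscaler
# running on the pod network will time out too.
apiVersion: v1
kind: Pod
metadata:
  name: egress-check
  namespace: kube-system
spec:
  restartPolicy: Never
  containers:
    - name: curl
      image: curlimages/curl:8.7.1   # example image; any curl-capable image works
      args:                          # appended to the image's curl entrypoint
        - "-sv"
        - "--max-time"
        - "10"
        - "https://ec2.ap-southeast-2.amazonaws.com"  # example endpoint (Sydney)
```

`kubectl logs egress-check` then shows whether the TLS handshake completed or the connection simply hung until the timeout.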
We have the same issue in the Ireland eu-west-1 region.
What version of the component are you using?:
Component version:
What happened instead?:
Logs
Troubleshooting |
Thanks for the extra information everyone. This seems to me to be an AWS/EKS problem at its core rather than a CA one, though we could definitely handle this more gracefully on the CA side. Can I ask how you all provisioned your clusters to see if I can reproduce the networking issues you're seeing? |
I've also updated the issue title to capture what appears to be the common thread from all your messages so far. |
I am seeing this issue on v1.19.2
|
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close |
@k8s-triage-robot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
In my case, the cluster-autoscaler pod fails to access the public AWS sts service endpoint via its public IP:
My EKS is a private cluster, with a private VPC sts interface endpoint configured, like this:
|
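The endpoint configuration itself got trimmed from the comment above, so as a hedged illustration only (not the commenter's actual setup): with a private sts interface endpoint in place, a common companion step is to point the AWS SDK at the regional STS endpoint via `AWS_STS_REGIONAL_ENDPOINTS`, since the interface endpoint serves `sts.<region>.amazonaws.com` rather than the global `sts.amazonaws.com`:

```yaml
# Illustrative env excerpt for the cluster-autoscaler container.
env:
  - name: AWS_REGION
    value: eu-west-1              # example region; use your cluster's region
  # Make the AWS SDK call sts.<region>.amazonaws.com, which the private
  # VPC interface endpoint can resolve and serve, instead of the global
  # sts.amazonaws.com, which needs public egress.
  - name: AWS_STS_REGIONAL_ENDPOINTS
    value: regional
```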
What's the solution here? I'm facing the same issue with EKS 1.24: the cluster is public, but the CA times out trying to reach the public sts endpoint.
|
Which component are you using?:
cluster-autoscaler
What version of the component are you using?:
Helm chart 9.10.8
cluster-autoscaler v1.21.1
Component version:
What k8s version are you using (kubectl version)?:
v1.21
What environment is this in?:
AWS EKS
What did you expect to happen?:
It should load the instance type list normally and keep running.
What happened instead?:
It keeps going into CrashLoopBackOff and exits with error 255.
How to reproduce it (as minimally and precisely as possible):
Set the environment variable:
AWS_REGION: ap-northeast-3
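For concreteness, a sketch of that reproduction step as it would appear in the container spec (placement assumed, not copied from the reporter's manifest):

```yaml
# Reproduction sketch: point the autoscaler at the Osaka region.
env:
  - name: AWS_REGION
    value: ap-northeast-3
```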
Anything else we need to know?:
Part of logs: