CA failed to load Instance Type list unless configured with hostNetworking #4464

Closed
adaam opened this issue Nov 12, 2021 · 20 comments
Labels
area/cluster-autoscaler · area/provider/aws (Issues or PRs related to aws provider) · kind/bug (Categorizes issue or PR as related to a bug.) · lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed.)

Comments

@adaam

adaam commented Nov 12, 2021

Which component are you using?:
cluster-autoscaler

What version of the component are you using?:
Helm chart 9.10.8
cluster-autoscaler v1.21.1

Component version:

What k8s version are you using (kubectl version)?:
v1.21

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:38:26Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-eks-06eac09", GitCommit:"5f6d83fe4cb7febb5f4f4e39b3b2b64ebbbe3e97", GitTreeState:"clean", BuildDate:"2021-09-13T14:20:15Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:
AWS EKS

What did you expect to happen?:
It should load the instance type list normally and keep running.

What happened instead?:
It keeps entering CrashLoopBackOff and exits with error code 255.

How to reproduce it (as minimally and precisely as possible):

Set the environment variable:
AWS_REGION: ap-northeast-3
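
(For reference, a minimal sketch of how the variable is wired into the pod spec; the container name is illustrative, and the Helm chart sets this through its own values rather than a raw Deployment.)

# Sketch only: standard Kubernetes pod-spec fields
containers:
  - name: cluster-autoscaler   # illustrative name
    env:
      - name: AWS_REGION
        value: ap-northeast-3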

Anything else we need to know?:

Part of the logs:

I1112 07:23:25.974866       1 main.go:391] Cluster Autoscaler 1.21.1
I1112 07:23:25.996783       1 leaderelection.go:243] attempting to acquire leader lease kube-system/cluster-autoscaler...
I1112 07:23:26.016572       1 leaderelection.go:253] successfully acquired lease kube-system/cluster-autoscaler
I1112 07:23:26.016842       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Lease", Namespace:"kube-system", Name:"cluster-autoscaler", UID:"04f7e024-313b-4cd3-9e47-1bd8ab89d128", APIVersion:"coordination.k8s.io/v1", ResourceVersion:"14162", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' hub-c-a-aws-cluster-autoscaler-fdb7d96d4-b9rg9 became leader
I1112 07:23:26.019206       1 reflector.go:219] Starting reflector *v1.Pod (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:188
I1112 07:23:26.019328       1 reflector.go:255] Listing and watching *v1.Pod from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:188
I1112 07:23:26.020108       1 reflector.go:219] Starting reflector *v1.DaemonSet (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:320
I1112 07:23:26.020220       1 reflector.go:255] Listing and watching *v1.DaemonSet from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:320
I1112 07:23:26.020557       1 reflector.go:219] Starting reflector *v1.ReplicationController (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:329
I1112 07:23:26.020573       1 reflector.go:255] Listing and watching *v1.ReplicationController from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:329
I1112 07:23:26.020868       1 reflector.go:219] Starting reflector *v1.Job (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:338
I1112 07:23:26.020883       1 reflector.go:255] Listing and watching *v1.Job from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:338
I1112 07:23:26.021148       1 reflector.go:219] Starting reflector *v1.Pod (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:212
I1112 07:23:26.021242       1 reflector.go:255] Listing and watching *v1.Pod from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:212
I1112 07:23:26.021155       1 reflector.go:219] Starting reflector *v1.ReplicaSet (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:347
I1112 07:23:26.021494       1 reflector.go:255] Listing and watching *v1.ReplicaSet from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:347
I1112 07:23:26.021216       1 reflector.go:219] Starting reflector *v1.StatefulSet (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:356
I1112 07:23:26.021667       1 reflector.go:255] Listing and watching *v1.StatefulSet from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:356
I1112 07:23:26.021267       1 reflector.go:219] Starting reflector *v1.Node (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:246
I1112 07:23:26.021770       1 reflector.go:255] Listing and watching *v1.Node from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:246
I1112 07:23:26.021279       1 reflector.go:219] Starting reflector *v1beta1.PodDisruptionBudget (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:309
I1112 07:23:26.021938       1 reflector.go:255] Listing and watching *v1beta1.PodDisruptionBudget from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:309
I1112 07:23:26.021232       1 reflector.go:219] Starting reflector *v1.Node (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:246
I1112 07:23:26.022155       1 reflector.go:255] Listing and watching *v1.Node from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:246
W1112 07:23:26.040478       1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
W1112 07:23:26.061120       1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
I1112 07:23:26.067058       1 cloud_provider_builder.go:29] Building aws cloud provider.
F1112 07:23:26.067164       1 aws_cloud_provider.go:365] Failed to generate AWS EC2 Instance Types: unable to load EC2 Instance Type list

goroutine 61 [running]:
k8s.io/klog/v2.stacks(0xc0000c2001, 0xc0009fe000, 0x8a, 0xee)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:1021 +0xb8
k8s.io/klog/v2.(*loggingT).output(0x629d5a0, 0xc000000003, 0x0, 0x0, 0xc00004c230, 0x61ad5f1, 0x15, 0x16d, 0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:970 +0x1a3
k8s.io/klog/v2.(*loggingT).printf(0x629d5a0, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0x3e68953, 0x2d, 0xc001044900, 0x1, ...)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:751 +0x18b
k8s.io/klog/v2.Fatalf(...)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:1509
k8s.io/autoscaler/cluster-autoscaler/cloudprovider/aws.BuildAWS(0x3fe0000000000000, 0x3fe0000000000000, 0x8bb2c97000, 0x1176592e000, 0xa, 0x0, 0x4e200, 0x0, 0x186a0000000000, 0x0, ...)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/aws/aws_cloud_provider.go:365 +0x290
k8s.io/autoscaler/cluster-autoscaler/cloudprovider/builder.buildCloudProvider(0x3fe0000000000000, 0x3fe0000000000000, 0x8bb2c97000, 0x1176592e000, 0xa, 0x0, 0x4e200, 0x0, 0x186a0000000000, 0x0, ...)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/builder/builder_all.go:69 +0x18f
k8s.io/autoscaler/cluster-autoscaler/cloudprovider/builder.NewCloudProvider(0x3fe0000000000000, 0x3fe0000000000000, 0x8bb2c97000, 0x1176592e000, 0xa, 0x0, 0x4e200, 0x0, 0x186a0000000000, 0x0, ...)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/builder/cloud_provider_builder.go:45 +0x1e6
k8s.io/autoscaler/cluster-autoscaler/core.initializeDefaultOptions(0xc0010076e0, 0x4530301, 0x8)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/autoscaler.go:101 +0x2fd
k8s.io/autoscaler/cluster-autoscaler/core.NewAutoscaler(0x3fe0000000000000, 0x3fe0000000000000, 0x8bb2c97000, 0x1176592e000, 0xa, 0x0, 0x4e200, 0x0, 0x186a0000000000, 0x0, ...)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/autoscaler.go:65 +0x43
main.buildAutoscaler(0x972073, 0xc000634f50, 0x457dc20, 0xc00039d500)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:337 +0x368
main.run(0xc00007efa0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:343 +0x39
main.main.func2(0x453c8a0, 0xc0000c9b00)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:447 +0x2a
created by k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:207 +0x113

goroutine 1 [select]:
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000e77c00, 0x44cea80, 0xc000311620, 0xc0000c9b01, 0xc000056c00)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:167 +0x13f
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0008bfc00, 0x77359400, 0x0, 0xc0000c9b01, 0xc000056c00)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/leaderelection.(*LeaderElector).renew(0xc0001bf320, 0x453c8a0, 0xc0000c9b40)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:263 +0x107
k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run(0xc0001bf320, 0x453c8a0, 0xc0000c9b00)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:208 +0x13b
k8s.io/client-go/tools/leaderelection.RunOrDie(0x453c8e0, 0xc0000ae008, 0x4571bc0, 0xc00092eb40, 0x37e11d600, 0x2540be400, 0x77359400, 0xc00069d8e0, 0x3f40d28, 0x0, ...)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:222 +0x96
main.main()
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:438 +0x829

goroutine 18 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x629d5a0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:1164 +0x8b
created by k8s.io/klog/v2.init.0
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:418 +0xdd

goroutine 48 [runnable]:
sync.runtime_SemacquireMutex(0xc0000a1a44, 0xc000966c00, 0x1)
	/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc0000a1a40)
	/usr/local/go/src/sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
	/usr/local/go/src/sync/mutex.go:81
sync.(*Map).Load(0xc0000a1a40, 0x339f5a0, 0xc000966d38, 0xc000c442f8, 0x5a9fc18f48a93701, 0x5a0000000040c8f4)
	/usr/local/go/src/sync/map.go:106 +0x2c4
github.com/modern-go/reflect2.(*frozenConfig).Type2(0xc00009d180, 0x45acfa0, 0xc000e3a540, 0x3711f40, 0xc000966f00)
@adaam added the kind/bug label on Nov 12, 2021
@gjtempleton
Member

/area provider/aws

@k8s-ci-robot added the area/provider/aws label on Nov 22, 2021
@dan-tw

dan-tw commented Nov 23, 2021

I'm getting a similar error:

W1123 22:49:14.940056       1 aws_util.go:84] Error fetching https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/ap-southeast-2/index.json skipping...
Get "https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/ap-southeast-2/index.json": dial tcp 13.224.179.62:443: i/o timeout
F1123 22:49:14.940096       1 aws_cloud_provider.go:365] Failed to generate AWS EC2 Instance Types: unable to load EC2 Instance Type list

Kubernetes version:

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7", GitTreeState:"clean", BuildDate:"2021-06-16T12:59:11Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.7-eks-d88609", GitCommit:"d886092805d5cc3a47ed5cf0c43de38ce442dfcb", GitTreeState:"clean", BuildDate:"2021-07-31T00:29:12Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}

Cluster Autoscaler Image:

cluster-autoscaler v1.21.1

@languitar

Is there a known workaround for this? It seems we're hit by the same issue.

@gjtempleton
Member

Just to confirm, are all 3 of you only seeing this in Osaka, with a v1.21.1 tag?

Does running this with the flag --aws-use-static-instance-list=true still produce this behaviour?

@languitar

Just to confirm, are all 3 of you only seeing this in Osaka, with a v1.21.1 tag?

Does running this with the flag --aws-use-static-instance-list=true still produce this behaviour?

Our stacktrace looked the same but was caused by a permission problem. So we're luckily not affected by this exact issue.

@gjtempleton
Member

Hey @adaam,

I don't currently have access to a cluster in Osaka (working on that) to reproduce, but a couple of questions I'd like the answer to/things I'd like you to try out if possible to help narrow down what's going on here:

  1. What verbosity level are you running the CA with? (If anything greater than 0, I would expect to see a line of the form fetching https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/eu-west-1/index.json being logged before the CA crashes.)
  2. Could you please try running the CA with --aws-use-static-instance-list=true to see if it still crashes?

My suspicion is currently still that this is related to a permissions issue, although we should handle it more gracefully than we currently do.
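
(For anyone wanting to try this, a minimal sketch of passing the flags; the command below mirrors the style of the example manifests and is illustrative only, since the Helm chart exposes flags through its extraArgs-style values.)

# Sketch only: raising verbosity and forcing the static instance type list
containers:
  - name: cluster-autoscaler
    command:
      - ./cluster-autoscaler
      - --v=4                                # verbosity > 0 logs the pricing-endpoint fetch before any crash
      - --aws-use-static-instance-list=true  # skip fetching the EC2 instance type list from the pricing API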

@dan-tw

dan-tw commented Nov 28, 2021

Just to confirm, are all 3 of you only seeing this in Osaka, with a v1.21.1 tag?

Does running this with the flag --aws-use-static-instance-list=true still produce this behaviour?

I was seeing it with every tag I tried (1.20.0 to 1.21.1), and it wasn't in Osaka; we were seeing it from Sydney (ap-southeast-2).

Running with --aws-use-static-instance-list=true produces different results, but it still errors:

 1 aws_manager.go:265] Failed to regenerate ASG cache: cannot autodiscover ASGs: RequestError: send request failed
caused by: Post "https://autoscaling.ap-southeast-2.amazonaws.com/": dial tcp 99.82.184.205:443: i/o timeout
F1124 01:30:04.718587       1 aws_cloud_provider.go:382] Failed to create AWS Manager: cannot autodiscover ASGs: RequestError: send request failed
caused by: Post "https://autoscaling.ap-southeast-2.amazonaws.com/": dial tcp 99.82.184.205:443: i/o timeout

Then I tried with --aws-use-static-instance-list=true and autodiscovery off:

1 aws_manager.go:265] Failed to regenerate ASG cache: WebIdentityErr: failed to retrieve credentials
caused by: RequestError: send request failed
caused by: Post "https://sts.amazonaws.com/": dial tcp 52.46.149.173:443: i/o timeout
F1124 01:43:25.536439       1 aws_cloud_provider.go:382] Failed to create AWS Manager: WebIdentityErr: failed to retrieve credentials
caused by: RequestError: send request failed
caused by: Post "https://sts.amazonaws.com/": dial tcp 52.46.149.173:443: i/o timeout

... Eventually what worked for us was enabling host networking for the cluster autoscaler. We found that no pods on our cluster were actually able to access resources outside the cluster by default (EKS, Amazon VPC CNI). We're still running with host networking until we can apply some more engineering time to looking into it further.
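
(For anyone reaching for the same workaround, a minimal sketch of what it looks like on the Deployment's pod spec; hostNetwork and dnsPolicy are standard Kubernetes fields, and the Helm chart may expose them through its own values.)

# Sketch only: run the autoscaler pod on the node's network so egress uses the node's ENI
spec:
  template:
    spec:
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet   # commonly paired with hostNetwork so in-cluster DNS keeps working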

@gjtempleton
Member

That's some great detail, thanks @dan-tw. You're reinforcing my belief that most people seeing this error are hitting permissions or networking problems that this crash masks poorly, and that we could handle it more gracefully.

@gjtempleton
Member

Relatedly, it would be great to get feedback from you, as users who have encountered this, on the change I'm proposing in #4480: would you prefer that behaviour, with the risk I've outlined in the PR description, over the current hard-crash behaviour?

@dan-tw

dan-tw commented Nov 29, 2021

Relatedly, it would be great to get feedback from you, as users who have encountered this, on the change I'm proposing in #4480: would you prefer that behaviour, with the risk I've outlined in the PR description, over the current hard-crash behaviour?

Yeah, I think that is a reasonable change, although I'm not sure it solves this specific issue: in my case, falling back to the static list still resulted in a fatal crash, because the autoscaler attempted to access resources outside the cluster elsewhere.

What I might propose is an explicit check (since external access is seemingly a requirement of the cluster autoscaler here, though I'm not sure whether that's AWS-specific) that the pod the cluster autoscaler runs in can reach resources outside the cluster (e.g. can access the internet), and if it can't, fail with an explicit message that is less cryptic than the ones noted above.

E.g.

// check whether we can reach amazon.com / google.com / some well-known resource or DNS name
// if the check succeeds, the PR proposed above should handle the specific cases where permissions might be a concern
// if the check fails, error out gracefully with a specific message telling the user that the cluster autoscaler cannot reach the internet to retrieve the resources it needs

... Hope that makes sense :)

To add some more context: when I was trying to debug the issue, seeing 'timeout' messages left me unsure whether the context deadline was being hit because of latency, because the endpoint data was so large that the request timed out, or because the timeout was permission-related and the client kept retrying until the context deadline was exceeded. (It's not a normal assumption that your thing in the cloud can't reach the cloud :) )
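
(The proposal above is about a check inside the cluster-autoscaler binary itself. Purely as an operational stopgap while debugging, a similar egress check could be bolted onto the Deployment with an initContainer; the snippet below is a hypothetical sketch, not part of the chart, and the image is just any small image that ships curl.)

# Hypothetical sketch: fail fast with a clear message if the pod has no egress,
# instead of letting the autoscaler crash later with a cryptic i/o timeout.
initContainers:
  - name: check-egress
    image: curlimages/curl:8.5.0   # assumption: any small image with curl would do
    command:
      - sh
      - -c
      - |
        if ! curl -sS --max-time 10 -o /dev/null https://pricing.us-east-1.amazonaws.com/; then
          echo "cluster-autoscaler pod cannot reach AWS endpoints; check VPC routing, NAT, security groups, or hostNetwork" >&2
          exit 1
        fi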

@Vadim-Zenin

Vadim-Zenin commented Nov 29, 2021

We have the same issue in the Ireland (eu-west-1) region.
Which component are you using?:
cluster-autoscaler

What version of the component are you using?:
cluster-autoscaler v1.21.1

Component version:
What k8s version are you using (kubectl version)?:
v1.21
kubectl version
Client Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-13+d2965f0db10712", GitCommit:"d2965f0db1071203c6f5bc662c2827c71fc8b20d", GitTreeState:"clean", BuildDate:"2021-06-26T01:02:11Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-eks-0389ca3", GitCommit:"8a4e27b9d88142bbdd21b997b532eb6d493df6d2", GitTreeState:"clean", BuildDate:"2021-07-31T01:34:46Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}

What happened instead?:
It keeps entering CrashLoopBackOff:
kube-system pod/cluster-autoscaler-79475c6789-tnljd 0/1 CrashLoopBackOff 9

Logs
W1129 1 aws_util.go:84] Error fetching https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/eu-west-1/index.json skipping...
Get "https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/eu-west-1/index.json": dial tcp: i/o timeout
F1129 aws_cloud_provider.go:365] Failed to generate AWS EC2 Instance Types: unable to load EC2 Instance Type list
goroutine 32

Troubleshooting
If I add --aws-use-static-instance-list=true to the CA, it runs for some time,
kube-system pod/cluster-autoscaler-cc975695c-rwlzv 1/1 Running 2 5m3s
but then crashes again with this log:
E1129 17:59:44.241301 1 aws_manager.go:265] Failed to regenerate ASG cache: cannot autodiscover ASGs: RequestError: send request failed
caused by: Post "https://autoscaling.eu-west-1.amazonaws.com/": dial tcp: i/o timeout
F1129 17:59:44.241348 1 aws_cloud_provider.go:389] Failed to create AWS Manager: cannot autodiscover ASGs: RequestError: send request failed
caused by: Post "https://autoscaling.eu-west-1.amazonaws.com/": dial tcp: i/o timeout
goroutine 71 [running]:

@gjtempleton
Member

Thanks for the extra information everyone.

This seems to me to be an AWS/EKS problem at its core rather than a CA one, though we could definitely handle this more gracefully on the CA side.

Can I ask how you all provisioned your clusters to see if I can reproduce the networking issues you're seeing?

@gjtempleton changed the title from "CA failed to load Instance Type list at AWS ap-northeast-3 (Osaka) region" to "CA failed to load Instance Type list unless configured with hostNetworking" on Dec 1, 2021
@gjtempleton
Member

I've also updated the issue title to capture what appears to be the common thread from all your messages so far.

@mohsen0

mohsen0 commented Feb 2, 2022

I am seeing this issue on v1.19.2

W0202 09:30:54.619667       1 aws_util.go:84] Error fetching https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/eu-west-1/index.json skipping...
F0202 09:30:54.619721       1 aws_cloud_provider.go:358] Failed to generate AWS EC2 Instance Types: unable to load EC2 Instance Type list
goroutine 62 [running]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2.stacks(0xc0000c2001, 0xc0000fff00, 0x8a, 0xfa)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:996 +0xb8
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2.(*loggingT).output(0x58cd800, 0xc000000003, 0x0, 0x0, 0xc0001aab60, 0x57f057d, 0x15, 0x166, 0x0)

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on May 3, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jul 1, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@bogdando

bogdando commented Sep 6, 2022

In my case, the cluster-autoscaler pod fails to access the public AWS sts service endpoint via its public IP:

F0906 08:47:57.077390       1 aws_cloud_provider.go:386] Failed to generate AWS EC2 Instance Types: WebIdentityErr: failed to retrieve credentials
caused by: RequestError: send request failed
caused by: Post "https://sts.amazonaws.com/": dial tcp 54.xxx.xxx..25:443: i/o timeout

My EKS cluster is private, with a private VPC sts interface endpoint configured, like this:

  "sts" = {
    "dns_name" = "sts.eu-west-1.amazonaws.com"
    "hosted_zone_id" = "ZXXXXX"
  },

I believe that once I have everything fixed, it should resolve sts.amazonaws.com via its regional CNAME sts.eu-west-1.amazonaws.com to a private subnet IP and reach it through the worker host's ENI...
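
(A related knob, assuming credentials come from IRSA via the EKS pod identity webhook: the SDK can be told to use the regional STS endpoint instead of the global sts.amazonaws.com. The annotation and variable names below are my best understanding of the relevant settings, so double-check them against the AWS/EKS docs.)

# Hedged sketch: prefer sts.eu-west-1.amazonaws.com over the global sts.amazonaws.com
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cluster-autoscaler          # assumption: the service account used by the CA
  namespace: kube-system
  annotations:
    eks.amazonaws.com/sts-regional-endpoints: "true"

# or, equivalently, on the container itself:
# env:
#   - name: AWS_STS_REGIONAL_ENDPOINTS
#     value: regional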

@MageshSrinivasulu

What's the solution here? I'm facing the same issue with EKS 1.24. The cluster is public, but the CA times out trying to reach the public sts endpoint:

F0208 15:16:19.159661       1 aws_cloud_provider.go:386] Failed to generate AWS EC2 Instance Types: WebIdentityErr: failed to retrieve credentials
caused by: RequestError: send request failed
caused by: Post "https://sts.us-west-1.amazonaws.com/": dial tcp 176.32.114.104:443: i/o timeout
