CA failed to load Instance Type list unless configured with hostNetworking #4464
/area provider/aws |
I'm getting a similar error:
Kubernetes version:
Cluster Autoscaler Image:
|
Is there a known workaround for this? It seems we're hit by the same issue. |
Just to confirm, are all 3 of you only seeing this in Osaka, with a particular image tag? Does running this with the flag make any difference? |
Our stacktrace looked the same but was caused by a permission problem. So we're luckily not affected by this exact issue. |
Hey @adaam, I don't currently have access to a cluster in Osaka (working on that) to reproduce, but here are a couple of questions I'd like answered, and things I'd like you to try out if possible, to help narrow down what's going on here:
My suspicion is currently still that this is related to a permissions issue, although we should handle it more gracefully than we currently do. |
I was seeing it with any tag I tried (1.20.0 to 1.21.1), and it wasn't in Osaka; we were trying from Sydney. Running with
Then I tried with
... Eventually what worked for us was enabling host networking for the cluster autoscaler. We found that no pods on our cluster were actually able to access resources outside the cluster by default (EKS, Amazon VPC CNI) -- we're still running with host networking until we can apply some more engineering time to looking into it further. |
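For anyone wanting to try the same workaround, here's a minimal sketch of a cluster-autoscaler Deployment with host networking enabled. The names, image tag, and flags are illustrative assumptions, not taken from this thread's manifests; `dnsPolicy: ClusterFirstWithHostNet` is the usual companion setting so in-cluster DNS still resolves:

```yaml
# Sketch only: a pared-down cluster-autoscaler Deployment showing where
# the host-networking fields go. Names, image tag, and flags are examples.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      # Run in the node's network namespace so egress behaves like the
      # node itself, bypassing the pod-network path that couldn't reach AWS.
      hostNetwork: true
      # Keep in-cluster DNS working while on the host network.
      dnsPolicy: ClusterFirstWithHostNet
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.21.1
          command:
            - ./cluster-autoscaler
            - --cloud-provider=aws
```

Note that this sidesteps the pod-network egress problem rather than fixing it, which is presumably why it's framed above as a stopgap.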
That's some great detail, thanks @dan-tw, you're reinforcing my belief that most people seeing this error are having permissions/networking errors masked poorly by this crash, and we can handle it more gracefully. |
Relatedly, it would be great to get your feedback as users who have encountered this, on the change I'm proposing in #4480, would you prefer that behaviour, with the risk I've outlined in the PR description, over the current hard crash behaviour? |
Yeah, I think that is a reasonable change, although I'm not sure it solves the specific issue: in my case, falling back to that static list still resulted in fatal crashing, as it attempted to access resources outside the cluster elsewhere. What I might propose is an explicit check (since external access is seemingly a requirement of the cluster autoscaler here, though I'm not sure if it is AWS specific) that the pod the cluster autoscaler is running in can reach resources outside the cluster (e.g. can access the internet), and if it can't, fail with an explicit message that is less cryptic than the ones noted above. E.g.
.. Hope that makes sense :) To add some more context: when I was attempting to debug the issue and seeing messages of 'timeout', I was unsure whether the context deadline was being hit as a result of latency, whether the endpoint data was so big that it was timing out, or whether the timeout was permission related and retries continued until the context deadline was exceeded. (It's not a normal expectation that your thing in the cloud can't reach the cloud :) ) |
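A cheap manual version of the check being proposed here, as a sketch (the image and the Sydney EC2 endpoint are arbitrary examples; substitute your region's endpoint):

```yaml
# Throwaway pod that tests outbound reachability of an AWS API endpoint
# from inside the pod network. If this times out, a cluster-autoscaler
# running on the pod network will time out too.
apiVersion: v1
kind: Pod
metadata:
  name: egress-check
  namespace: kube-system
spec:
  restartPolicy: Never
  containers:
    - name: curl
      image: curlimages/curl:8.7.1   # example image; any curl-capable image works
      args:                          # appended to the image's curl entrypoint
        - "-sv"
        - "--max-time"
        - "10"
        - "https://ec2.ap-southeast-2.amazonaws.com"  # example endpoint (Sydney)
```

`kubectl logs egress-check` then shows whether the TLS handshake completed or the connection simply hung until the timeout.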
We have the same issue in the Ireland eu-west-1 region.
What version of the component are you using?:
Component version:
What happened instead?:
Logs
Troubleshooting |
Thanks for the extra information everyone. This seems to me to be an AWS/EKS problem at its core rather than a CA one, though we could definitely handle this more gracefully on the CA side. Can I ask how you all provisioned your clusters to see if I can reproduce the networking issues you're seeing? |
I've also updated the issue title to capture what appears to be the common thread from all your messages so far. |
I am seeing this issue on v1.19.2
|
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close |
@k8s-triage-robot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
In my case, the cluster-autoscaler pod fails to access the public AWS sts service endpoint via its public IP:
My EKS is a private cluster, with a private VPC sts interface endpoint configured, like this:
|
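The endpoint configuration itself got trimmed from the comment above, so as a hedged illustration only (not the commenter's actual setup): with a private sts interface endpoint in place, a common companion step is to point the AWS SDK at the regional STS endpoint via `AWS_STS_REGIONAL_ENDPOINTS`, since the interface endpoint serves `sts.<region>.amazonaws.com` rather than the global `sts.amazonaws.com`:

```yaml
# Illustrative env excerpt for the cluster-autoscaler container.
env:
  - name: AWS_REGION
    value: eu-west-1              # example region; use your cluster's region
  # Make the AWS SDK call sts.<region>.amazonaws.com, which the private
  # VPC interface endpoint can resolve and serve, instead of the global
  # sts.amazonaws.com, which needs public egress.
  - name: AWS_STS_REGIONAL_ENDPOINTS
    value: regional
```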
What's the solution here? I'm facing the same issue with EKS 1.24: the cluster is public, but the CA times out trying to reach the public sts endpoint.
|
Which component are you using?:
cluster-autoscaler
What version of the component are you using?:
Helm chart 9.10.8
cluster-autoscaler v1.21.1
Component version:
What k8s version are you using (kubectl version)?:
v1.21
What environment is this in?:
AWS EKS
What did you expect to happen?:
It should load the instance type list normally and keep running.
What happened instead?:
It keeps going into CrashLoopBackOff and exits with error 255.
How to reproduce it (as minimally and precisely as possible):
Set the environment variable:
AWS_REGION: ap-northeast-3
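For concreteness, a sketch of that reproduction step as it would appear in the container spec (placement assumed, not copied from the reporter's manifest):

```yaml
# Reproduction sketch: point the autoscaler at the Osaka region.
env:
  - name: AWS_REGION
    value: ap-northeast-3
```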
Anything else we need to know?:
Part of logs: