
My pod is in the CrashLoopBackOff state after configuring cluster-autoscaler #4220

Closed
ibata opened this issue Jul 23, 2021 · 16 comments
Labels: area/cluster-autoscaler, kind/bug, lifecycle/rotten

@ibata

ibata commented Jul 23, 2021

Which component are you using?:
cluster-autoscaler

What version of the component are you using?:
v1.17.3

Component version:

What k8s version are you using (kubectl version)?:

kubectl -n kube-system version --short
Client Version: v1.21.1
Server Version: v1.17.17-eks-c5067d
WARNING: version difference between client (1.21) and server (1.17) exceeds the supported minor version skew of +/-1

What environment is this in?:

AWS EKS

What did you expect to happen?:

I expected my first cluster-autoscaler deployment to be set up and working, i.e. taking over scaling of my ASG.

What happened instead?:

I am getting exactly the error described at https://aws.amazon.com/premiumsupport/knowledge-center/eks-pod-status-troubleshooting/:

$ kubectl describe po crash-app-6847947bf8-28rq6

Name: crash-app-6847947bf8-28rq6
Namespace: default
Priority: 0
PriorityClassName:
Node: ip-192-168-6-51.us-east-2.compute.internal/192.168.6.51
Start Time: Wed, 22 Jan 2020 08:42:20 +0200
Labels: pod-template-hash=6847947bf8
run=crash-app
Annotations: kubernetes.io/psp: eks.privileged
Status: Running
IP: 192.168.29.73
Controlled By: ReplicaSet/crash-app-6847947bf8
Containers:
main:
Container ID: docker://6aecdce22adf08de2dbcd48f5d3d8d4f00f8e86bddca03384e482e71b3c20442
Image: alpine
Image ID: docker-pullable://alpine@sha256:ab00606a42621fb68f2ed6ad3c88be54397f981a7b70a79db3d1172b11c4367d
Port: 80/TCP
Host Port: 0/TCP
Command:
/bin/sleep
1
State: Waiting
Reason: CrashLoopBackOff
...
Events:
Type Reason Age From Message


Normal Scheduled 47s default-scheduler Successfully assigned default/crash-app-6847947bf8-28rq6 to ip-192-168-6-51.us-east-2.compute.internal
Normal Pulling 28s (x3 over 46s) kubelet, ip-192-168-6-51.us-east-2.compute.internal Pulling image "alpine"
Normal Pulled 28s (x3 over 46s) kubelet, ip-192-168-6-51.us-east-2.compute.internal Successfully pulled image "alpine"
Normal Created 28s (x3 over 45s) kubelet, ip-192-168-6-51.us-east-2.compute.internal Created container main
Normal Started 28s (x3 over 45s) kubelet, ip-192-168-6-51.us-east-2.compute.internal Started container main
Warning BackOff 12s (x4 over 42s) kubelet, ip-192-168-6-51.us-east-2.compute.internal Back-off restarting failed container
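
A useful next step for a pod stuck in CrashLoopBackOff is to read the logs of the previously crashed container; the pod name below is a placeholder for whatever the cluster-autoscaler pod is actually named:

# Find the cluster-autoscaler pod, then read the logs of the crashed container.
kubectl -n kube-system get pods | grep -i autoscaler
kubectl -n kube-system logs <autoscaler-pod-name> --previous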

How to reproduce it (as minimally and precisely as possible):
You need the same EKS, kubectl, and cluster-autoscaler versions:

kubectl -n kube-system version --short
Client Version: v1.21.1
Server Version: v1.17.17-eks-c5067d
WARNING: version difference between client (1.21) and server (1.17) exceeds the supported minor version skew of +/-1

kubectl -n kube-system get node -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion

NAME VERSION
ip-10-44-17-206.us-west-2.compute.internal v1.17.12-eks-7684af
ip-10-44-20-171.us-west-2.compute.internal v1.17.12-eks-7684af

@ibata ibata added the kind/bug Categorizes issue or PR as related to a bug. label Jul 23, 2021
@EKami

EKami commented Jul 26, 2021

Same issue here with version v1.20. Everything was working fine a week ago, until I redeployed a new cluster and cluster-autoscaler started crashing for no apparent reason (even the logs on the deployment/pod don't give any clue about what's happening).

@rubroboletus

Hi, I was in the same situation. Do you use requests/limits settings for cluster-autoscaler? About two weeks ago a 300M RAM limit was enough; now 500M is required. If you use a lower number, OOM kills appear.
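
A minimal sketch of that change, assuming the deployment and container are both named cluster-autoscaler in kube-system (adjust to match your manifest or Helm chart):

# Raise the memory request/limit for the cluster-autoscaler container.
kubectl -n kube-system set resources deployment/cluster-autoscaler \
  --containers=cluster-autoscaler \
  --requests=memory=500Mi --limits=memory=500Mi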

@EKami

EKami commented Jul 26, 2021

I do use a 300Mi limit, yeah. It's weird that it changed just like that, since I didn't change the package version; it's exactly the same config as what I had a week prior 🤔.
I'll try removing the limit, thanks for the pointer :)

@EKami

EKami commented Jul 26, 2021

@rubroboletus It worked, thanks! :)

@mercuriete

Confirmed: changing the limit from 300MiB to 500MiB fixes the problem.

I am running a 1.20 cluster with a node group using the 22-07-2021 Amazon Linux AMI. It started to fail this Monday after a node group update.

Thanks.

Young-ook pushed a commit to Young-ook/terraform-aws-eks that referenced this issue Jul 29, 2021
viatcheslavmogilevsky added a commit to provectus/sak-scaling that referenced this issue Aug 3, 2021
Update default values for memory consumption to fix this issue: kubernetes/autoscaler#4220
@seunggs

seunggs commented Nov 6, 2021

Still seeing this problem here with no limit set (also tried with a 600MiB limit as suggested by the AWS docs).

Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  10m                  default-scheduler  Successfully assigned kube-system/cluster-autoscaler-release-aws-cluster-autoscaler-6d58fb855g84x to ip-10-0-16-212.us-west-1.compute.internal
  Normal   Pulled     10m                  kubelet            Successfully pulled image "k8s.gcr.io/autoscaling/cluster-autoscaler:v1.21.1" in 368.005529ms
  Normal   Pulled     10m                  kubelet            Successfully pulled image "k8s.gcr.io/autoscaling/cluster-autoscaler:v1.21.1" in 350.720347ms
  Normal   Pulled     9m43s                kubelet            Successfully pulled image "k8s.gcr.io/autoscaling/cluster-autoscaler:v1.21.1" in 368.7982ms
  Normal   Created    9m (x4 over 10m)     kubelet            Created container aws-cluster-autoscaler
  Normal   Pulling    9m (x4 over 10m)     kubelet            Pulling image "k8s.gcr.io/autoscaling/cluster-autoscaler:v1.21.1"
  Normal   Pulled     9m                   kubelet            Successfully pulled image "k8s.gcr.io/autoscaling/cluster-autoscaler:v1.21.1" in 359.88954ms
  Normal   Started    8m59s (x4 over 10m)  kubelet            Started container aws-cluster-autoscaler
  Warning  BackOff    31s (x44 over 10m)   kubelet            Back-off restarting failed container

Container logs show this repeated goroutine:

goroutine 301 [sync.Cond.Wait]:
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:310
sync.runtime_notifyListWait(0xc000908b40, 0x0)
        /usr/local/go/src/runtime/sema.go:513 +0xf8
sync.(*Cond).Wait(0xc000908b30)
        /usr/local/go/src/sync/cond.go:56 +0x9d
golang.org/x/net/http2.(*pipe).Read(0xc000908b28, 0xc000266400, 0x200, 0x200, 0x0, 0x0, 0x0)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/pipe.go:65 +0x8f
golang.org/x/net/http2.transportResponseBody.Read(0xc000908b00, 0xc000266400, 0x200, 0x200, 0x0, 0x0, 0x0)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:2108 +0xaf
encoding/json.(*Decoder).refill(0xc000908f20, 0xc00103e9e0, 0x0)
        /usr/local/go/src/encoding/json/stream.go:165 +0xeb
encoding/json.(*Decoder).readValue(0xc000908f20, 0x0, 0x0, 0x36423c0)
        /usr/local/go/src/encoding/json/stream.go:140 +0x1e8
encoding/json.(*Decoder).Decode(0xc000908f20, 0x36e3ec0, 0xc00103e9e0, 0x3c0c180, 0x0)
        /usr/local/go/src/encoding/json/stream.go:63 +0x79
k8s.io/apimachinery/pkg/util/framer.(*jsonFrameReader).Read(0xc00105daa0, 0xc00095a400, 0x400, 0x400, 0x10, 0x37d1a20, 0x38)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/framer/framer.go:152 +0x1a1
k8s.io/apimachinery/pkg/runtime/serializer/streaming.(*decoder).Decode(0xc001056730, 0x0, 0x44ebba0, 0xc0008279c0, 0xc000ed1f98, 0x15b916f, 0xc000ed1e60, 0x453c8e0, 0xc000058018)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/runtime/serializer/streaming/streaming.go:77 +0x89
k8s.io/client-go/rest/watch.(*Decoder).Decode(0xc00103e9c0, 0x3deeb27, 0x1, 0x0, 0x0, 0x0, 0x1f4)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/rest/watch/decoder.go:49 +0x6e
k8s.io/apimachinery/pkg/watch.(*StreamWatcher).receive(0xc000827980)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:105 +0xe5
created by k8s.io/apimachinery/pkg/watch.NewStreamWatcher
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:76 +0xea

Does anyone have any idea what's going on here?

@seunggs

seunggs commented Nov 6, 2021

This was caused by an incorrect AWS role trust policy. It would have been a bit easier to debug if there were helpful error messages, but it was my fault for not following the AWS instructions carefully.

@nehatomar12

@seunggs did you find a solution?
I am getting a similar error:

I1115 09:45:17.144398       1 reflector.go:255] Listing and watching *v1.ReplicationController from k8s.io/client-go/informers/factory.go:134
W1115 09:45:17.440224       1 aws_util.go:84] Error fetching https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/us-west-1/index.json skipping...
F1115 09:45:17.440266       1 aws_cloud_provider.go:358] Failed to generate AWS EC2 Instance Types: unable to load EC2 Instance Type list
goroutine 71 [running]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2.stacks(0xc000182001, 0xc0001be460, 0x8a, 0xe0)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:1026 +0xb8
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2.(*loggingT).output(0x617fd80, 0xc000000003, 0x0, 0x0, 0xc000957490, 0x609f526, 0x15, 0x166, 0x0)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:975 +0x1a3
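
The fatal line above means the autoscaler could not download the EC2 instance type list from the public pricing endpoint, which can be a network egress problem from the nodes (NAT, proxy, or security groups). One way to check reachability from inside the cluster (the pod name and image below are arbitrary choices, not from this thread):

# Run a throwaway pod and try to fetch the same pricing index the autoscaler uses.
kubectl run pricing-check --rm -it --restart=Never --image=curlimages/curl -- \
  curl -sSI https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/us-west-1/index.json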

@seunggs

seunggs commented Nov 15, 2021

My issue was due to an incorrect IAM role trust policy (I was using Pulumi to generate it and discovered that my Pulumi code was not generating the policy correctly). Your issue also seems related to permissions; see this issue that seems related: #3216

@marcelofabricanti

Same issue here.

Fixing the role ARN on the cluster-autoscaler service account solved it.
When using EKS, the role's trust policy must have the cluster's OIDC provider as a trusted entity.
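
A quick way to verify that, sketched with a placeholder role name (use the role referenced by the service account's eks.amazonaws.com/role-arn annotation):

# Print the role's trust (assume-role) policy and check that the Principal is
# the cluster's OIDC provider and that the ":sub" condition matches
# system:serviceaccount:kube-system:cluster-autoscaler (adjust namespace/name).
aws iam get-role --role-name <AUTOSCALER_ROLE_NAME> \
  --query 'Role.AssumeRolePolicyDocument' --output json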

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 21, 2022
@neogeogre

neogeogre commented Mar 17, 2022

For people with the same issue using Terraform and AWS, this module can help auto-generate the correct IAM role and policy for the autoscaler's service account:

https://github.com/terraform-aws-modules/terraform-aws-iam/tree/master/modules/iam-role-for-service-accounts-eks
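
For those not using Terraform, eksctl can do the equivalent; a sketch with placeholder names (the cluster name and policy ARN below are not from this thread):

# Create an IAM role bound to the cluster's OIDC provider and annotate the
# cluster-autoscaler service account with it (IRSA). Replace the placeholders.
eksctl create iamserviceaccount \
  --cluster <CLUSTER_NAME> \
  --namespace kube-system \
  --name cluster-autoscaler \
  --attach-policy-arn <AUTOSCALER_POLICY_ARN> \
  --approve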

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 16, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@GabeOpo

GabeOpo commented Jul 23, 2022

Hi, I am facing the same issue with EKS Kubernetes version 1.21.

cluster-autoscaler-5dd6459897-mpqf8 0/1 CrashLoopBackOff 7 13m

This is the log I see. I applied the memory change from 300Mi to 500Mi but am still getting the same error, and the OIDC provider is in the trusted relationships.

goroutine 285 [sync.Cond.Wait]:
runtime.goparkunlock(...)
/usr/local/go/src/runtime/proc.go:310
sync.runtime_notifyListWait(0xc000b34b40, 0xc000000000)
/usr/local/go/src/runtime/sema.go:513 +0xf8
sync.(*Cond).Wait(0xc000b34b30)
/usr/local/go/src/sync/cond.go:56 +0x9d
golang.org/x/net/http2.(*pipe).Read(0xc000b34b28, 0xc00134b200, 0x200, 0x200, 0x0, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/pipe.go:65 +0x8f
golang.org/x/net/http2.transportResponseBody.Read(0xc000b34b00, 0xc00134b200, 0x200, 0x200, 0x0, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:2108 +0xaf
encoding/json.(*Decoder).refill(0xc0007f9760, 0xc0009ec2a0, 0x0)
/usr/local/go/src/encoding/json/stream.go:165 +0xeb
encoding/json.(*Decoder).readValue(0xc0007f9760, 0x0, 0x0, 0x3652400)
/usr/local/go/src/encoding/json/stream.go:140 +0x1e8
encoding/json.(*Decoder).Decode(0xc0007f9760, 0x36f3f00, 0xc0009ec2a0, 0x437aa1, 0x3cd95e0)
/usr/local/go/src/encoding/json/stream.go:63 +0x79
k8s.io/apimachinery/pkg/util/framer.(*jsonFrameReader).Read(0xc000c2de30, 0xc00067e400, 0x400, 0x400, 0xc000061e10, 0xc000061000, 0x38)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/framer/framer.go:152 +0x1a1
k8s.io/apimachinery/pkg/runtime/serializer/streaming.(*decoder).Decode(0xc000357720, 0x0, 0x44fd580, 0xc0014600c0, 0xc0010f6dc8, 0x41e0d8, 0xc0010f6db0, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/runtime/serializer/streaming/streaming.go:77 +0x89
k8s.io/client-go/rest/watch.(*Decoder).Decode(0xc0009ec260, 0xc001226d80, 0x0, 0x0, 0x0, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/rest/watch/decoder.go:49 +0x6e
k8s.io/apimachinery/pkg/watch.(*StreamWatcher).receive(0xc001460080)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:105 +0xe5
created by k8s.io/apimachinery/pkg/watch.NewStreamWatcher
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:76 +0xea
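
If both the memory limit and the trust relationship look right, it is worth confirming that the IRSA wiring actually reached the pod; a sketch assuming the usual kube-system service account name:

# The service account should carry an eks.amazonaws.com/role-arn annotation.
kubectl -n kube-system get serviceaccount cluster-autoscaler -o yaml
# The crashing pod should show AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE
# in its Environment section (injected by the EKS pod identity webhook).
kubectl -n kube-system describe pod cluster-autoscaler-5dd6459897-mpqf8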

asprin107 added a commit to asprin107/k8s-sandbox that referenced this issue Nov 16, 2022