
My pod is in the CrashLoopBackOff state after configuring cluster-autoscaler #4220

Closed
ibata opened this issue Jul 23, 2021 · 16 comments
Labels: area/cluster-autoscaler, kind/bug, lifecycle/rotten

@ibata

ibata commented Jul 23, 2021

Which component are you using?:
cluster-autoscaler

What version of the component are you using?:
v1.17.3

Component version:

What k8s version are you using (kubectl version)?:

kubectl -n kube-system version --short
Client Version: v1.21.1
Server Version: v1.17.17-eks-c5067d
WARNING: version difference between client (1.21) and server (1.17) exceeds the supported minor version skew of +/-1

What environment is this in?:

AWS EKS

What did you expect to happen?:

I expected my first cluster-autoscaler deployment to be set up and working, i.e. taking over scaling of my ASG.

What happened instead?:

I am getting exactly the error described at https://aws.amazon.com/premiumsupport/knowledge-center/eks-pod-status-troubleshooting/:

$ kubectl describe po crash-app-6847947bf8-28rq6

Name: crash-app-6847947bf8-28rq6
Namespace: default
Priority: 0
PriorityClassName:
Node: ip-192-168-6-51.us-east-2.compute.internal/192.168.6.51
Start Time: Wed, 22 Jan 2020 08:42:20 +0200
Labels: pod-template-hash=6847947bf8
run=crash-app
Annotations: kubernetes.io/psp: eks.privileged
Status: Running
IP: 192.168.29.73
Controlled By: ReplicaSet/crash-app-6847947bf8
Containers:
main:
Container ID: docker://6aecdce22adf08de2dbcd48f5d3d8d4f00f8e86bddca03384e482e71b3c20442
Image: alpine
Image ID: docker-pullable://alpine@sha256:ab00606a42621fb68f2ed6ad3c88be54397f981a7b70a79db3d1172b11c4367d
Port: 80/TCP
Host Port: 0/TCP
Command:
/bin/sleep
1
State: Waiting
Reason: CrashLoopBackOff
...
Events:
Type Reason Age From Message


Normal Scheduled 47s default-scheduler Successfully assigned default/crash-app-6847947bf8-28rq6 to ip-192-168-6-51.us-east-2.compute.internal
Normal Pulling 28s (x3 over 46s) kubelet, ip-192-168-6-51.us-east-2.compute.internal Pulling image "alpine"
Normal Pulled 28s (x3 over 46s) kubelet, ip-192-168-6-51.us-east-2.compute.internal Successfully pulled image "alpine"
Normal Created 28s (x3 over 45s) kubelet, ip-192-168-6-51.us-east-2.compute.internal Created container main
Normal Started 28s (x3 over 45s) kubelet, ip-192-168-6-51.us-east-2.compute.internal Started container main
Warning BackOff 12s (x4 over 42s) kubelet, ip-192-168-6-51.us-east-2.compute.internal Back-off restarting failed container
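
A useful next step for a pod stuck in CrashLoopBackOff is to read the logs of the previously crashed container; the pod name below is a placeholder for whatever the cluster-autoscaler pod is actually named:

# Find the cluster-autoscaler pod, then read the logs of the crashed container.
kubectl -n kube-system get pods | grep -i autoscaler
kubectl -n kube-system logs <autoscaler-pod-name> --previous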

How to reproduce it (as minimally and precisely as possible):
You need the same EKS, kubectl, and cluster-autoscaler versions:

kubectl -n kube-system version --short
Client Version: v1.21.1
Server Version: v1.17.17-eks-c5067d
WARNING: version difference between client (1.21) and server (1.17) exceeds the supported minor version skew of +/-1

kubectl -n kube-system get node -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion

NAME VERSION
ip-10-44-17-206.us-west-2.compute.internal v1.17.12-eks-7684af
ip-10-44-20-171.us-west-2.compute.internal v1.17.12-eks-7684af

@ibata ibata added the kind/bug Categorizes issue or PR as related to a bug. label Jul 23, 2021
@EKami

EKami commented Jul 26, 2021

Same issue here with version v1.20. Everything was working fine a week ago, until I redeployed a new cluster and cluster-autoscaler started crashing for no apparent reason (even the logs on the deployment/pod don't give any clue about what's happening).

@rubroboletus

Hi, I was in the same situation. Do you use requests/limits settings for cluster-autoscaler? About two weeks ago a 300M RAM limit was enough; now 500M is required. If you use a lower number, OOM kills appear.
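
A minimal sketch of that change, assuming the deployment and container are both named cluster-autoscaler in kube-system (adjust to match your manifest or Helm chart):

# Raise the memory request/limit for the cluster-autoscaler container.
kubectl -n kube-system set resources deployment/cluster-autoscaler \
  --containers=cluster-autoscaler \
  --requests=memory=500Mi --limits=memory=500Mi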

@EKami

EKami commented Jul 26, 2021

I do use a 300Mi limit, yeah. It's weird that it changed just like that, since I didn't change the package version; it's exactly the same config as what I had a week prior 🤔.
I'll try removing the limit, thanks for the pointer :)

@EKami

EKami commented Jul 26, 2021

@rubroboletus It worked, thanks! :)

@mercuriete

Confirmed: changing the limit from 300MiB to 500MiB fixes the problem.

I am running a 1.20 cluster with a node group using the 22-07-2021 Amazon Linux AMI. It started to fail this Monday after a node group update.

Thanks.

Young-ook pushed a commit to Young-ook/terraform-aws-eks that referenced this issue Jul 29, 2021
viatcheslavmogilevsky added a commit to provectus/sak-scaling that referenced this issue Aug 3, 2021
Update default values for memory consumption to fix this issue: kubernetes/autoscaler#4220
@seunggs

seunggs commented Nov 6, 2021

Still seeing this problem here with no limit set (also tried with a 600MiB limit as suggested by the AWS docs).

Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  10m                  default-scheduler  Successfully assigned kube-system/cluster-autoscaler-release-aws-cluster-autoscaler-6d58fb855g84x to ip-10-0-16-212.us-west-1.compute.internal
  Normal   Pulled     10m                  kubelet            Successfully pulled image "k8s.gcr.io/autoscaling/cluster-autoscaler:v1.21.1" in 368.005529ms
  Normal   Pulled     10m                  kubelet            Successfully pulled image "k8s.gcr.io/autoscaling/cluster-autoscaler:v1.21.1" in 350.720347ms
  Normal   Pulled     9m43s                kubelet            Successfully pulled image "k8s.gcr.io/autoscaling/cluster-autoscaler:v1.21.1" in 368.7982ms
  Normal   Created    9m (x4 over 10m)     kubelet            Created container aws-cluster-autoscaler
  Normal   Pulling    9m (x4 over 10m)     kubelet            Pulling image "k8s.gcr.io/autoscaling/cluster-autoscaler:v1.21.1"
  Normal   Pulled     9m                   kubelet            Successfully pulled image "k8s.gcr.io/autoscaling/cluster-autoscaler:v1.21.1" in 359.88954ms
  Normal   Started    8m59s (x4 over 10m)  kubelet            Started container aws-cluster-autoscaler
  Warning  BackOff    31s (x44 over 10m)   kubelet            Back-off restarting failed container

Container logs show this repeated goroutine:

goroutine 301 [sync.Cond.Wait]:
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:310
sync.runtime_notifyListWait(0xc000908b40, 0x0)
        /usr/local/go/src/runtime/sema.go:513 +0xf8
sync.(*Cond).Wait(0xc000908b30)
        /usr/local/go/src/sync/cond.go:56 +0x9d
golang.org/x/net/http2.(*pipe).Read(0xc000908b28, 0xc000266400, 0x200, 0x200, 0x0, 0x0, 0x0)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/pipe.go:65 +0x8f
golang.org/x/net/http2.transportResponseBody.Read(0xc000908b00, 0xc000266400, 0x200, 0x200, 0x0, 0x0, 0x0)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:2108 +0xaf
encoding/json.(*Decoder).refill(0xc000908f20, 0xc00103e9e0, 0x0)
        /usr/local/go/src/encoding/json/stream.go:165 +0xeb
encoding/json.(*Decoder).readValue(0xc000908f20, 0x0, 0x0, 0x36423c0)
        /usr/local/go/src/encoding/json/stream.go:140 +0x1e8
encoding/json.(*Decoder).Decode(0xc000908f20, 0x36e3ec0, 0xc00103e9e0, 0x3c0c180, 0x0)
        /usr/local/go/src/encoding/json/stream.go:63 +0x79
k8s.io/apimachinery/pkg/util/framer.(*jsonFrameReader).Read(0xc00105daa0, 0xc00095a400, 0x400, 0x400, 0x10, 0x37d1a20, 0x38)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/framer/framer.go:152 +0x1a1
k8s.io/apimachinery/pkg/runtime/serializer/streaming.(*decoder).Decode(0xc001056730, 0x0, 0x44ebba0, 0xc0008279c0, 0xc000ed1f98, 0x15b916f, 0xc000ed1e60, 0x453c8e0, 0xc000058018)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/runtime/serializer/streaming/streaming.go:77 +0x89
k8s.io/client-go/rest/watch.(*Decoder).Decode(0xc00103e9c0, 0x3deeb27, 0x1, 0x0, 0x0, 0x0, 0x1f4)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/rest/watch/decoder.go:49 +0x6e
k8s.io/apimachinery/pkg/watch.(*StreamWatcher).receive(0xc000827980)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:105 +0xe5
created by k8s.io/apimachinery/pkg/watch.NewStreamWatcher
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:76 +0xea

Does anyone have any idea what's going on here?

@seunggs

seunggs commented Nov 6, 2021

This was caused by an incorrect AWS role trust policy. It would have been a bit easier to debug if there were helpful error messages, but it was my fault for not following the AWS instructions carefully.

@nehatomar12

@seunggs did you find a solution?
I am getting a similar error:

I1115 09:45:17.144398       1 reflector.go:255] Listing and watching *v1.ReplicationController from k8s.io/client-go/informers/factory.go:134
W1115 09:45:17.440224       1 aws_util.go:84] Error fetching https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/us-west-1/index.json skipping...
F1115 09:45:17.440266       1 aws_cloud_provider.go:358] Failed to generate AWS EC2 Instance Types: unable to load EC2 Instance Type list
goroutine 71 [running]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2.stacks(0xc000182001, 0xc0001be460, 0x8a, 0xe0)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:1026 +0xb8
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2.(*loggingT).output(0x617fd80, 0xc000000003, 0x0, 0x0, 0xc000957490, 0x609f526, 0x15, 0x166, 0x0)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:975 +0x1a3
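
The fatal line above means the autoscaler could not download the EC2 instance type list from the public pricing endpoint, which can be a network egress problem from the nodes (NAT, proxy, or security groups). One way to check reachability from inside the cluster (the pod name and image below are arbitrary choices, not from this thread):

# Run a throwaway pod and try to fetch the same pricing index the autoscaler uses.
kubectl run pricing-check --rm -it --restart=Never --image=curlimages/curl -- \
  curl -sSI https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/us-west-1/index.json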

@seunggs

seunggs commented Nov 15, 2021

My issue was due to an incorrect IAM role trust policy (I was using Pulumi to generate it and discovered that my Pulumi code was not generating the policy correctly). Your issue also seems related to permissions; see this issue that seems related: #3216

@marcelofabricanti

Same issue here.

Fixing the role ARN on the cluster-autoscaler service account solved it.
When using EKS, the role's trust policy must have the cluster's OIDC provider as a trusted entity.
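
A quick way to verify that, sketched with a placeholder role name (use the role referenced by the service account's eks.amazonaws.com/role-arn annotation):

# Print the role's trust (assume-role) policy and check that the Principal is
# the cluster's OIDC provider and that the ":sub" condition matches
# system:serviceaccount:kube-system:cluster-autoscaler (adjust namespace/name).
aws iam get-role --role-name <AUTOSCALER_ROLE_NAME> \
  --query 'Role.AssumeRolePolicyDocument' --output json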

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 21, 2022
@neogeogre

neogeogre commented Mar 17, 2022

For people with the same issue using Terraform and AWS, this module can help auto-generate the correct IAM role and policy for the autoscaler's service account:

https://github.com/terraform-aws-modules/terraform-aws-iam/tree/master/modules/iam-role-for-service-accounts-eks
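
For those not using Terraform, eksctl can do the equivalent; a sketch with placeholder names (the cluster name and policy ARN below are not from this thread):

# Create an IAM role bound to the cluster's OIDC provider and annotate the
# cluster-autoscaler service account with it (IRSA). Replace the placeholders.
eksctl create iamserviceaccount \
  --cluster <CLUSTER_NAME> \
  --namespace kube-system \
  --name cluster-autoscaler \
  --attach-policy-arn <AUTOSCALER_POLICY_ARN> \
  --approve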

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 16, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@GabeOpo

GabeOpo commented Jul 23, 2022

Hi, I am facing the same issue with EKS Kubernetes version 1.21.

cluster-autoscaler-5dd6459897-mpqf8 0/1 CrashLoopBackOff 7 13m

This is the log I see. I applied the memory change from 300Mi to 500Mi but am still getting the same error, and the OIDC provider is in the trusted relationships.

goroutine 285 [sync.Cond.Wait]:
runtime.goparkunlock(...)
/usr/local/go/src/runtime/proc.go:310
sync.runtime_notifyListWait(0xc000b34b40, 0xc000000000)
/usr/local/go/src/runtime/sema.go:513 +0xf8
sync.(*Cond).Wait(0xc000b34b30)
/usr/local/go/src/sync/cond.go:56 +0x9d
golang.org/x/net/http2.(*pipe).Read(0xc000b34b28, 0xc00134b200, 0x200, 0x200, 0x0, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/pipe.go:65 +0x8f
golang.org/x/net/http2.transportResponseBody.Read(0xc000b34b00, 0xc00134b200, 0x200, 0x200, 0x0, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:2108 +0xaf
encoding/json.(*Decoder).refill(0xc0007f9760, 0xc0009ec2a0, 0x0)
/usr/local/go/src/encoding/json/stream.go:165 +0xeb
encoding/json.(*Decoder).readValue(0xc0007f9760, 0x0, 0x0, 0x3652400)
/usr/local/go/src/encoding/json/stream.go:140 +0x1e8
encoding/json.(*Decoder).Decode(0xc0007f9760, 0x36f3f00, 0xc0009ec2a0, 0x437aa1, 0x3cd95e0)
/usr/local/go/src/encoding/json/stream.go:63 +0x79
k8s.io/apimachinery/pkg/util/framer.(*jsonFrameReader).Read(0xc000c2de30, 0xc00067e400, 0x400, 0x400, 0xc000061e10, 0xc000061000, 0x38)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/framer/framer.go:152 +0x1a1
k8s.io/apimachinery/pkg/runtime/serializer/streaming.(*decoder).Decode(0xc000357720, 0x0, 0x44fd580, 0xc0014600c0, 0xc0010f6dc8, 0x41e0d8, 0xc0010f6db0, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/runtime/serializer/streaming/streaming.go:77 +0x89
k8s.io/client-go/rest/watch.(*Decoder).Decode(0xc0009ec260, 0xc001226d80, 0x0, 0x0, 0x0, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/rest/watch/decoder.go:49 +0x6e
k8s.io/apimachinery/pkg/watch.(*StreamWatcher).receive(0xc001460080)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:105 +0xe5
created by k8s.io/apimachinery/pkg/watch.NewStreamWatcher
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:76 +0xea
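
If both the memory limit and the trust relationship look right, it is worth confirming that the IRSA wiring actually reached the pod; a sketch assuming the usual kube-system service account name:

# The service account should carry an eks.amazonaws.com/role-arn annotation.
kubectl -n kube-system get serviceaccount cluster-autoscaler -o yaml
# The crashing pod should show AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE
# in its Environment section (injected by the EKS pod identity webhook).
kubectl -n kube-system describe pod cluster-autoscaler-5dd6459897-mpqf8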

asprin107 added a commit to asprin107/k8s-sandbox that referenced this issue Nov 16, 2022