cluster autoscaler crashes when master api is unavailable #4776

Closed
matti opened this issue Mar 30, 2022 · 5 comments
Labels: area/cluster-autoscaler, kind/bug, lifecycle/rotten

Comments


matti commented Mar 30, 2022

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

Component version: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.23.0

What k8s version are you using (kubectl version)?:

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.9", GitCommit:"b631974d68ac5045e076c86a5c66fba6f128dc72", GitTreeState:"clean", BuildDate:"2022-01-19T17:51:12Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.5-eks-bc4871b", GitCommit:"5236faf39f1b7a7dabea8df12726f25608131aa9", GitTreeState:"clean", BuildDate:"2021-10-29T23:32:16Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:

aws eks 1.21

What did you expect to happen?:

cluster autoscaler not to crash

What happened instead?:

I0330 09:57:35.062929       1 flags.go:57] FLAG: --write-status-configmap="true"
I0330 09:57:35.062935       1 main.go:401] Cluster Autoscaler 1.23.0
F0330 09:58:05.064761       1 main.go:430] Failed to get nodes from apiserver: Get "https://[fd9f:3e81:ae5b::1]:443/api/v1/nodes": dial tcp [fd9f:3e81:ae5b::1]:443: i/o timeout
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0x1)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:1038 +0x8a
k8s.io/klog/v2.(*loggingT).output(0x611e4e0, 0x3, 0x0, 0xc0005bb570, 0x0, {0x4d57ec2, 0x1}, 0xc0005f2350, 0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:987 +0x5fd
k8s.io/klog/v2.(*loggingT).printf(0x0, 0x0, 0x0, {0x0, 0x0}, {0x3cb5678, 0x26}, {0xc0005f2350, 0x1, 0x1})
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:753 +0x1c5
k8s.io/klog/v2.Fatalf(...)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:1532
main.main()
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:430 +0x4ce

goroutine 18 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:1181 +0x6a
created by k8s.io/klog/v2.init.0
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:420 +0xfb

goroutine 29 [chan receive]:
k8s.io/autoscaler/cluster-autoscaler/cloudprovider/exoscale/internal/k8s.io/klog.(*loggingT).flushDaemon(0xc00007f3e0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/exoscale/internal/k8s.io/klog/klog.go:1026 +0x6a
created by k8s.io/autoscaler/cluster-autoscaler/cloudprovider/exoscale/internal/k8s.io/klog.init.0
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/exoscale/internal/k8s.io/klog/klog.go:427 +0xf4

goroutine 30 [select]:
go.opencensus.io/stats/view.(*worker).start(0xc0001c2000)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/go.opencensus.io/stats/view/worker.go:276 +0xb9
created by go.opencensus.io/stats/view.init.0
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/go.opencensus.io/stats/view/worker.go:34 +0x92

goroutine 56 [IO wait]:
internal/poll.runtime_pollWait(0x7f91522bb4f0, 0x72)
	/usr/local/go/src/runtime/netpoll.go:234 +0x89
internal/poll.(*pollDesc).wait(0xc000309c00, 0xc00005e000, 0x0)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc000309c00)
	/usr/local/go/src/internal/poll/fd_unix.go:402 +0x22c
net.(*netFD).accept(0xc000309c00)
	/usr/local/go/src/net/fd_unix.go:173 +0x35
net.(*TCPListener).accept(0xc0004a9f20)
	/usr/local/go/src/net/tcpsock_posix.go:140 +0x28
net.(*TCPListener).Accept(0xc0004a9f20)
	/usr/local/go/src/net/tcpsock.go:262 +0x3d
net/http.(*Server).Serve(0xc0001c8620, {0x421e1c0, 0xc0004a9f20})
	/usr/local/go/src/net/http/server.go:3002 +0x394
net/http.(*Server).ListenAndServe(0xc0001c8620)
	/usr/local/go/src/net/http/server.go:2931 +0x7d
net/http.ListenAndServe(...)
	/usr/local/go/src/net/http/server.go:3185
main.main.func1()
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:413 +0x1fa
created by main.main
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:403 +0x2d7
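
For context, the trace shows main() calling klog.Fatalf as soon as a single node listing against the apiserver fails. A hedged reconstruction of that startup check, inferred only from the stack trace above (the scaffolding around the List call is an assumption, not the actual main.go source):

```go
// Hedged reconstruction of the failing startup check, inferred from the
// stack trace above; not the actual cluster-autoscaler main.go source.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	klog "k8s.io/klog/v2"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatalf("Failed to build kubeconfig: %v", err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// A single List call with no retry: if the apiserver is unreachable
	// (here the dial timed out after ~30s), Fatalf kills the process and
	// kubelet puts the pod into CrashLoopBackOff.
	if _, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{}); err != nil {
		klog.Fatalf("Failed to get nodes from apiserver: %v", err)
	}
}
```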

How to reproduce it (as minimally and precisely as possible):

Related to #2556, #4464 and #518

@matti added the kind/bug label Mar 30, 2022

matti commented Mar 30, 2022

The problem is that the restarting pod goes into CrashLoopBackOff, so when the master API becomes available again the cluster keeps running without an autoscaler (which would otherwise work) until the backoff has elapsed.
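
One possible mitigation (a sketch only, not the upstream fix): retry the initial listing with capped exponential backoff inside the process instead of exiting on the first failure. waitForNodes and the backoff values here are illustrative assumptions:

```go
// Sketch: poll the apiserver with capped exponential backoff instead of
// calling klog.Fatalf on the first failure. Illustrative only.
package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	klog "k8s.io/klog/v2"
)

func waitForNodes(ctx context.Context, client kubernetes.Interface) error {
	delay := 5 * time.Second
	for {
		_, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
		if err == nil {
			return nil
		}
		klog.Warningf("apiserver unreachable, retrying in %v: %v", delay, err)
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
		}
		if delay < 80*time.Second {
			delay *= 2 // cap the backoff so recovery stays quick
		}
	}
}
```

With a loop like this the pod keeps running through a transient apiserver outage, so kubelet's CrashLoopBackOff never enters the picture and scaling resumes as soon as the master API is back.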

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jul 3, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Aug 2, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot

@k8s-triage-robot: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
