
cluster-autoscaler: crashes when k8s API is updated #2556

Closed
max-rocket-internet opened this issue Nov 25, 2019 · 23 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@max-rocket-internet

We are using AWS EKS, and when AWS periodically updates the EKS service we see cluster-autoscaler crash. For example, last week the service was updated from v1.13.11 to v1.13.12 and this caused the pod to crash. Here's the last state of the pod:

    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Tue, 19 Nov 2019 02:03:57 +0100
      Finished:     Tue, 19 Nov 2019 02:04:27 +0100

There's nothing really interesting in the logs at this time, just this:

I1119 01:03:57.820185       1 main.go:333] Cluster Autoscaler 1.13.1
F1119 01:04:27.821536       1 main.go:355] Failed to get nodes from apiserver: Get https://172.20.0.1:443/api/v1/nodes: dial tcp 172.20.0.1:443: i/o timeout

The metrics-server also crashed at the same time, so perhaps it's an issue in one of the Go dependencies?

@max-rocket-internet
Author

Version: k8s.gcr.io/cluster-autoscaler:v1.13.1

@max-rocket-internet
Author

Related: kubernetes-sigs/metrics-server#372

@losipiuk
Contributor

Hi @max-rocket-internet ,

The exit is intentional: during initialization, before CA is actually running, we exit if we cannot reach the API server.

klog.Fatalf("Failed to get nodes from apiserver: %v", err)

Why do you see it as a problem? The kubelet (if CA is deployed as a static pod) or the deployment controller (otherwise) will be restarting CA on a regular basis anyway.
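
For reference, a minimal sketch of the kind of startup check being discussed, assuming an in-cluster client-go setup; the structure and names are illustrative, not the exact cluster-autoscaler code. klog's Fatal helpers log the message and then exit with status 255, which matches the exit code in the pod's last state above:

    package main

    import (
        "context"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
        "k8s.io/klog/v2"
    )

    func main() {
        // Build an in-cluster client, as the autoscaler pod would.
        config, err := rest.InClusterConfig()
        if err != nil {
            klog.Fatalf("Failed to build config: %v", err)
        }
        client, err := kubernetes.NewForConfig(config)
        if err != nil {
            klog.Fatalf("Failed to build client: %v", err)
        }

        // Startup check: a single List call. If the API server is unreachable
        // (e.g. during a control-plane upgrade), klog.Fatalf logs and exits
        // with status 255, so the pod shows up as crashed.
        if _, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{}); err != nil {
            klog.Fatalf("Failed to get nodes from apiserver: %v", err)
        }
    }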

@max-rocket-internet
Author

Why do you see it as a problem?

It's definitely a problem. It's an error, check the reason and exit code. Cluster updates are happening every month or so and nothing else crashes in this process. We have monitoring and alerts for these events.

The CA should recover from this without exiting with non-zero status IMO 🙂

The kubelet or deployment controller will be restarting CA on a regular basis anyway.

Why? We don't see any restarts of the pod outside of crashes and updates?

@max-rocket-internet changed the title from "Crashes when k8s API is updated" to "cluster-autoscaler: crashes when k8s API is updated" on Nov 25, 2019
@losipiuk
Contributor

Why do you see it as a problem?

It's definitely a problem. It's an error, check the reason and exit code. Cluster updates are happening every month or so and nothing else crashes in this process. We have monitoring and alerts for these events.

The CA should recover from this without exiting with non-zero status IMO

I agree it would be cleaner. I was just pointing out that a crash is not the end of the world, as the CA pod will be restarted after the crash anyway.
And it would not work without access to the API server anyway.

Actually, crashing on a fatal error in main() like we do (e.g. on a lost leader election token) is common for other k8s controllers.
E.g. here: https://github.com/kubernetes/kubernetes/blob/46a29a0cc30c0e601febd93a5851fcce615c2964/cmd/cloud-controller-manager/app/controllermanager.go#L118
I assume it does not manifest as a crash for you, because controller-manager runs on the master and is restarted together with the API server during an upgrade.
Are you running CA on the master or on standard cluster nodes?

Also, are you running a single k8s master? A regional setup with multiple masters would also help, as your CA would not lose connectivity to the control plane.

The kubelet or deployment controller will be restarting CA on a regular basis anyway.

Why? We don't see any restarts of the pod outside of crashes and updates?

I meant restart after crash :)

@max-rocket-internet
Author

I agree it would be cleaner.

Cool 😃

I was just pointing out that a crash is not the end of the world, as the CA pod will be restarted after the crash anyway.

Agreed. We just have a low tolerance for misbehaving containers.

And it would not work without access to the API server anyway.

Yes, but it could perhaps retry in a loop for a while before exiting with an error?

because controller-manager runs on the master and is restarted together with the API server during an upgrade.
Also, are you running a single k8s master? A regional setup with multiple masters

AWS EKS. It's a managed service; there are no masters we can see.

Are you running CA on the master or on standard cluster nodes?

On the standard cluster nodes.

Actually, crashing on a fatal error in main() like we do is common for other k8s controllers.

OK, but we have many other apps in our cluster that use the k8s API and do not crash at this time 🙂 e.g. ingress controllers, datadog, kube-proxy, external-dns, node-problem-detector, aws-vpc-cni, prometheus, k8s-event-logger, etc.

@losipiuk
Contributor

Yes, but it could perhaps retry in a loop for a while before exiting with an error?

Makes sense. Happy to accept a PR :)
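
A minimal sketch of what such a retry loop might look like, using wait.PollImmediate from k8s.io/apimachinery; the 10-second interval, 5-minute budget, and overall structure are illustrative assumptions, not what the project eventually merged:

    package main

    import (
        "context"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
        "k8s.io/klog/v2"
    )

    func main() {
        config, err := rest.InClusterConfig()
        if err != nil {
            klog.Fatalf("Failed to build config: %v", err)
        }
        client, err := kubernetes.NewForConfig(config)
        if err != nil {
            klog.Fatalf("Failed to build client: %v", err)
        }

        // Retry the startup check for a bounded time instead of exiting on the
        // first failure, so a brief API-server outage during a control-plane
        // upgrade does not surface as a crash with a non-zero exit code.
        err = wait.PollImmediate(10*time.Second, 5*time.Minute, func() (bool, error) {
            if _, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{}); err != nil {
                klog.Warningf("Failed to get nodes from apiserver, will retry: %v", err)
                return false, nil // keep polling on transient errors
            }
            return true, nil
        })
        if err != nil {
            klog.Fatalf("Failed to get nodes from apiserver after retries: %v", err)
        }
    }

Returning false, nil from the condition keeps polling until the timeout, while returning a non-nil error aborts immediately, so only genuinely unrecoverable errors would short-circuit the loop.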

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 26, 2020
@max-rocket-internet
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 27, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 27, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 26, 2020
@ltagliamonte-dd

Would love to see this fixed as well; this behaviour triggers our alert system during rolling updates of our cluster.

@max-rocket-internet
Author

this behaviour triggers our alert system during rolling updates of our cluster.

That's exactly our problem also.

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 6, 2020
@max-rocket-internet
Author

/remove-lifecycle stale

@Jeffwan
Contributor

Jeffwan commented Aug 5, 2020

Hmm. An EKS rolling upgrade terminates the masters. The load balancer times out if in-flight requests are not finished. In some cases it's possible that a terminated master is not removed and is left as a dead backend. My teammate is working on making the upgrade smoother.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 3, 2020
@gillbee

gillbee commented Nov 4, 2020

Would like to see this fixed. Also seeing issues with EKS.

@max-rocket-internet
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 5, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 3, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 5, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@gavintg

gavintg commented Oct 7, 2022

Why does it keep crashing on EKS v1.22? I really do not get why this issue was closed.
