
cluster-autoscaler: crashes when k8s API is updated #2556

Closed
max-rocket-internet opened this issue Nov 25, 2019 · 23 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@max-rocket-internet

We are using AWS EKS, and when AWS periodically updates the EKS service we see cluster-autoscaler crash. For example, last week the service was updated from v1.13.11 to v1.13.12 and this caused the pod to crash. Here's the last state of the pod:

    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Tue, 19 Nov 2019 02:03:57 +0100
      Finished:     Tue, 19 Nov 2019 02:04:27 +0100

There's nothing really interesting in the logs at this time, just this:

I1119 01:03:57.820185       1 main.go:333] Cluster Autoscaler 1.13.1
F1119 01:04:27.821536       1 main.go:355] Failed to get nodes from apiserver: Get https://172.20.0.1:443/api/v1/nodes: dial tcp 172.20.0.1:443: i/o timeout

The metrics-server also crashed at the same time, so perhaps it's an issue in one of the Go dependencies?

@max-rocket-internet
Author

Version: k8s.gcr.io/cluster-autoscaler:v1.13.1

@max-rocket-internet
Author

Related: kubernetes-sigs/metrics-server#372

@losipiuk
Contributor

Hi @max-rocket-internet ,

The exit is intentional: during initialization, before CA is actually running, we exit if we cannot reach the API server.

klog.Fatalf("Failed to get nodes from apiserver: %v", err)

Why do you see it as a problem? The kubelet (if CA is deployed as a static pod) or the deployment controller (otherwise) will be restarting CA on a regular basis anyway.
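
For reference, a minimal sketch of the kind of startup check being discussed, assuming an in-cluster client-go setup; the structure and names are illustrative, not the exact cluster-autoscaler code. klog's Fatal helpers log the message and then exit with status 255, which matches the exit code in the pod's last state above:

    package main

    import (
        "context"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
        "k8s.io/klog/v2"
    )

    func main() {
        // Build an in-cluster client, as the autoscaler pod would.
        config, err := rest.InClusterConfig()
        if err != nil {
            klog.Fatalf("Failed to build config: %v", err)
        }
        client, err := kubernetes.NewForConfig(config)
        if err != nil {
            klog.Fatalf("Failed to build client: %v", err)
        }

        // Startup check: a single List call. If the API server is unreachable
        // (e.g. during a control-plane upgrade), klog.Fatalf logs and exits
        // with status 255, so the pod shows up as crashed.
        if _, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{}); err != nil {
            klog.Fatalf("Failed to get nodes from apiserver: %v", err)
        }
    }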

@max-rocket-internet
Author

Why do you see it as a problem?

It's definitely a problem. It's an error, check the reason and exit code. Cluster updates are happening every month or so and nothing else crashes in this process. We have monitoring and alerts for these events.

The CA should recover from this without exiting with non-zero status IMO 🙂

The kubelet or deployment controller will be restarting CA on a regular basis anyway.

Why? We don't see any restarts of the pod outside of crashes and updates?

@max-rocket-internet changed the title from "Crashes when k8s API is updated" to "cluster-autoscaler: crashes when k8s API is updated" on Nov 25, 2019
@losipiuk
Contributor

Why do you see it as a problem?

It's definitely a problem. It's an error, check the reason and exit code. Cluster updates are happening every month or so and nothing else crashes in this process. We have monitoring and alerts for these events.

The CA should recover from this without exiting with non-zero status IMO

I agree it would be cleaner. I was just pointing out that a crash is not the end of the world, as the CA pod will be restarted after the crash anyway.
And it would not work without access to the API server anyway.

Actually, crashing on a fatal error in main() like we do (e.g. on a lost leader election token) is common for other k8s controllers.
E.g. here: https://github.com/kubernetes/kubernetes/blob/46a29a0cc30c0e601febd93a5851fcce615c2964/cmd/cloud-controller-manager/app/controllermanager.go#L118
I assume it does not manifest as a crash for you, because controller-manager runs on the master and is restarted together with the API server during an upgrade.
Are you running CA on the master or on standard cluster nodes?

Also, are you running a single k8s master? A regional setup with multiple masters would also help, as your CA would not lose connectivity to the control plane.

The kubelet or deployment controller will be restarting CA on a regular basis anyway.

Why? We don't see any restarts of the pod outside of crashes and updates?

I meant restart after crash :)

@max-rocket-internet
Author

I agree it would be cleaner.

Cool 😃

I was just pointing out that a crash is not the end of the world, as the CA pod will be restarted after the crash anyway.

Agreed. We just have a low tolerance for misbehaving containers.

And it would not work without access to the API server anyway.

Yes, but it could perhaps retry in a loop for a while before exiting with an error?

because controller-manager runs on the master and is restarted together with the API server during an upgrade.
Also, are you running a single k8s master? A regional setup with multiple masters

AWS EKS. It's a managed service; there are no masters we can see.

Are you running CA on the master or on standard cluster nodes?

On the standard cluster nodes.

Actually, crashing on a fatal error in main() like we do is common for other k8s controllers.

OK, but we have many other apps in our cluster that use the k8s API and do not crash at this time 🙂 e.g. ingress controllers, datadog, kube-proxy, external-dns, node-problem-detector, aws-vpc-cni, prometheus, k8s-event-logger, etc.

@losipiuk
Contributor

Yes, but it could perhaps retry in a loop for a while before exiting with an error?

Makes sense. Happy to accept a PR :)
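
A minimal sketch of what such a retry loop might look like, using wait.PollImmediate from k8s.io/apimachinery; the 10-second interval, 5-minute budget, and overall structure are illustrative assumptions, not what the project eventually merged:

    package main

    import (
        "context"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
        "k8s.io/klog/v2"
    )

    func main() {
        config, err := rest.InClusterConfig()
        if err != nil {
            klog.Fatalf("Failed to build config: %v", err)
        }
        client, err := kubernetes.NewForConfig(config)
        if err != nil {
            klog.Fatalf("Failed to build client: %v", err)
        }

        // Retry the startup check for a bounded time instead of exiting on the
        // first failure, so a brief API-server outage during a control-plane
        // upgrade does not surface as a crash with a non-zero exit code.
        err = wait.PollImmediate(10*time.Second, 5*time.Minute, func() (bool, error) {
            if _, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{}); err != nil {
                klog.Warningf("Failed to get nodes from apiserver, will retry: %v", err)
                return false, nil // keep polling on transient errors
            }
            return true, nil
        })
        if err != nil {
            klog.Fatalf("Failed to get nodes from apiserver after retries: %v", err)
        }
    }

Returning false, nil from the condition keeps polling until the timeout, while returning a non-nil error aborts immediately, so only genuinely unrecoverable errors would short-circuit the loop.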

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 26, 2020
@max-rocket-internet
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 27, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 27, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 26, 2020
@ltagliamonte-dd

Would love to see this fixed as well; this behaviour triggers our alert system during rolling updates of our cluster.

@max-rocket-internet
Author

this behaviour triggers our alert system during rolling updates of our cluster.

That's exactly our problem also.

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 6, 2020
@max-rocket-internet
Author

/remove-lifecycle stale

@Jeffwan
Contributor

Jeffwan commented Aug 5, 2020

Hmm. An EKS rolling upgrade terminates the masters. The load balancer times out if in-flight requests are not finished. In some cases it's possible that a terminated master is not removed and is left as a dead backend. My teammate is working on making the upgrade smoother.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 3, 2020
@gillbee

gillbee commented Nov 4, 2020

Would like to see this fixed. Also seeing issues with EKS.

@max-rocket-internet
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 5, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 3, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 5, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@gavintg

gavintg commented Oct 7, 2022

Why does it keep crashing on EKS v1.22? I really do not get why this issue was closed.
