cluster-autoscaler: crashes when k8s API is updated #2556
Version:
Related: kubernetes-sigs/metrics-server#372
Hi @max-rocket-internet, it is an intentional exit: during initialization, while we are not yet running, we exit if we cannot reach the API server (autoscaler/cluster-autoscaler/main.go, line 390 at 3413247).
Why do you see it as a problem? The kubelet (if CA is deployed as a static pod) or the Deployment controller (otherwise) will be restarting CA on a regular basis anyway.
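For readers unfamiliar with the behaviour being described, here is a minimal sketch of an initialization-time reachability probe that exits fatally instead of retrying. This is illustrative only, not the actual cluster-autoscaler main.go; the function name mustConnectToAPIServer is made up for the example.

```go
package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/klog/v2"
)

// mustConnectToAPIServer illustrates an init-time check that treats any
// failure to reach the API server as fatal, so the pod exits non-zero and
// relies on the kubelet/Deployment controller to restart it.
func mustConnectToAPIServer() kubernetes.Interface {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatalf("Failed to build in-cluster config: %v", err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		klog.Fatalf("Failed to create Kubernetes client: %v", err)
	}
	// A single discovery call as the reachability probe; any error during
	// startup exits the process immediately instead of being retried.
	if _, err := client.Discovery().ServerVersion(); err != nil {
		klog.Fatalf("Failed to reach the API server: %v", err)
	}
	return client
}

func main() {
	client := mustConnectToAPIServer()
	klog.Infof("API server reachable, client ready: %T", client)
}
```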
It's definitely a problem. It's an error; check the reason and the exit code. Cluster updates happen every month or so, and nothing else crashes during this process. We have monitoring and alerts for these events. The CA should recover from this without exiting with a non-zero status, IMO 🙂
Why? We don't see any restarts of the pod outside of crashes and updates.
I agree it would be cleaner. I was just pointing out that a crash is not the end of the world, as the CA pod will be restarted after the crash anyway. Actually, crashing on a fatal error in main() like we do (e.g. on a lost leader election token) is common for other k8s controllers. Also, are you running a single k8s master? A regional setup with multiple masters would also help, as your CA would not lose connectivity to the control plane.
I meant restart after crash :)
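To illustrate the "crash on lost leader election" pattern mentioned above, here is a rough sketch using client-go's leaderelection package. The lock name, namespace, identity, and timings are invented for the example and are not cluster-autoscaler's actual configuration.

```go
package main

import (
	"context"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatalf("Failed to build in-cluster config: %v", err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Hypothetical lock name and namespace, chosen only for this example.
	lock, err := resourcelock.New(
		resourcelock.LeasesResourceLock,
		"kube-system", "example-controller",
		client.CoreV1(), client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	)
	if err != nil {
		klog.Fatalf("Failed to create resource lock: %v", err)
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// The controller's main loop would run here.
				<-ctx.Done()
			},
			// Exiting fatally on lost leadership is the common controller
			// pattern referred to above; the pod is expected to be restarted.
			OnStoppedLeading: func() {
				klog.Fatalf("Lost leader election, exiting")
			},
		},
	})
}
```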
Cool 😃
Agreed. We just have a low tolerance for misbehaving containers.
Yes, but it could perhaps retry in a loop for a while before exiting with an error? (See the sketch a few replies below.)
AWS EKS. It's a service. No masters we can see.
On the standard cluster nodes
OK, but we have many other apps in our cluster that use the k8s API and do not crash at this time 🙂 e.g. ingress-controllers, datadog, kube-proxy, external-dns, node-problem-detector, aws-vpc-cni, prometheus, k8s-event-logger, etc.
Makes sense. Happy to accept a PR :)
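As a starting point for such a PR, the retry-before-exit idea suggested above could look roughly like the following, using apimachinery's wait helpers. This is a sketch with assumed interval and timeout values, not the project's actual code.

```go
package main

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/klog/v2"
)

// waitForAPIServer polls the API server for up to two minutes before giving
// up, instead of exiting on the first failed request. The 5s interval and
// 2m timeout are illustrative values.
func waitForAPIServer(client kubernetes.Interface) error {
	return wait.PollImmediate(5*time.Second, 2*time.Minute, func() (bool, error) {
		if _, err := client.Discovery().ServerVersion(); err != nil {
			klog.Warningf("API server not reachable yet, retrying: %v", err)
			return false, nil // keep polling
		}
		return true, nil
	})
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatalf("Failed to build in-cluster config: %v", err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := waitForAPIServer(client); err != nil {
		// Only after the retry window has elapsed do we exit non-zero.
		klog.Fatalf("Giving up waiting for the API server: %v", err)
	}
	klog.Info("API server reachable, continuing startup")
}
```

Placing such a bounded retry before the existing initialization would keep the crash-on-fatal behaviour for truly unrecoverable errors while riding out brief API-server blips during control-plane upgrades.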
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Would love to see this fixed as well; this behaviour triggers our alert system during rolling updates of our cluster.
That's exactly our problem also. /remove-lifecycle rotten
/remove-lifecycle stale
Hmm. An EKS rolling upgrade will terminate the master nodes, and the load balancer times out if in-flight requests are not finished. In some additional cases it's possible that a master node is not deregistered from the load balancer, leaving a dead backend. My teammate is working on making the upgrade smoother.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Would like to see this fixed. Also seeing issues with EKS.
/remove-lifecycle stale
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
Rotten issues close after 30d of inactivity. Send feedback to sig-contributor-experience at kubernetes/community.
@fejta-bot: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Why does it keep crashing when using EKS v1.22? I really do not get why this issue is closed.
We are using AWS EKS, and when AWS periodically updates the EKS service, we see the cluster-autoscaler pod crash. For example, last week the service was updated from v1.13.11 to v1.13.12 and this caused the pod to crash. Here's the last state of the pod:
There's nothing really interesting in the logs at this time, just this:
The metrics-server also crashed at the same time, so perhaps it's an issue in one of the golang dependencies?