[observability] fire alert when we have an excessive node count #10049
Comments
Hm. I am not sure how we can make this alert flexible enough. Over time Gitpod will grow, and what we currently consider an excessive number of nodes will become the norm. So I have doubts about how useful this alert will be.
This PR may alleviate this problem.
@sagor999 this happened when the autoscaler misbehaved on a cordoned cluster: we had 10+ nodes, many of which needed to be removed because they were hosting zero workspaces. The problem was that the autoscaler would not remove a node while one of its pods was stuck in Terminating due to a finalizer on a workspace pod. In other words, many regular nodes had zero workspaces but weren't being removed by the autoscaler. Perhaps it would make more sense to build an alert for that condition: four or more nodes exist for a workspace type (regular, prebuild), but they've had zero corresponding workspaces for at least 1h.
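A minimal sketch of what that per-type condition could look like as a Prometheus alerting rule. The `label_workspace_type` label on `kube_node_labels` / `kube_pod_labels` (exposed via kube-state-metrics label allow-listing) and the `severity: warning` routing are assumptions for illustration, not actual Gitpod metrics or labels:

```yaml
groups:
  - name: workspace-node-count
    rules:
      - alert: NodesWithoutWorkspaces
        # Assumed: kube-state-metrics exposes a workspace-type label on both
        # nodes and pods as `label_workspace_type`.
        # Fire when 4 or more nodes of a workspace type exist but no pod of
        # that type has been running for at least one hour.
        expr: |
          (sum by (label_workspace_type) (kube_node_labels{label_workspace_type!=""}) >= 4)
          unless
          (sum by (label_workspace_type) (kube_pod_labels{label_workspace_type!=""}) > 0)
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.label_workspace_type }} nodes have had zero workspaces for 1h"
```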
@sagor999 I am going to close this for now, it's a fringe scenario that's only happened once.
Is your feature request related to a problem? Please describe
If many nodes are added and many workspaces start and then begin to stop, but one workspace pod gets stuck in Terminating because its finalizer is not removed, the autoscaler can get "stuck" on that pod's node and will not scale down until you remove the finalizer on that pod.
Describe the behaviour you'd like
Look at Grafana to assess what ratio of nodes (regular or headless) makes sense for an alert. In general, we should always have fewer nodes than workspaces, or both counts should be zero. That should be good enough to catch this condition in the future; a rough sketch of such a rule follows.
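As a hedged sketch, that ratio could be expressed as another rule in the same group, again using standard kube-state-metrics series; the `ws-.*` pod-name pattern for workspace pods is an assumption, not a confirmed Gitpod naming convention:

```yaml
      - alert: MoreNodesThanWorkspaces
        # Fire when there are more nodes than running workspace pods.
        # If both counts are zero the comparison is false, so the alert
        # stays silent, matching the "or both are zero" case above.
        expr: |
          sum(kube_node_info)
            > count(kube_pod_info{pod=~"ws-.*"})
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Cluster has had more nodes than running workspaces for 1h"
```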