
[observability] fire alert when we have an excessive node count #10049

Closed
kylos101 opened this issue May 16, 2022 · 5 comments
Labels
team: workspace Issue belongs to the Workspace team

Comments

@kylos101
Contributor

Is your feature request related to a problem? Please describe

If many nodes are added, many workspaces start and then begin to stop, but one pod gets stuck in Terminating because its finalizer is not removed, the autoscaler can get "stuck" on that pod's node and will not scale down until you remove the finalizer from that pod.

Describe the behaviour you'd like

Look at Grafana to assess what ratio of nodes (regular or headless) to workspaces makes sense for an alert. In general, we should always have fewer nodes than workspaces, or both counts should be zero. That should be good enough to catch this condition in the future.
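
As a rough illustration, a Prometheus alerting rule for this invariant might look like the sketch below. It assumes kube-state-metrics is scraped (`kube_node_info`) and uses `gitpod_workspace_phase` as a placeholder for whichever metric actually exposes running workspaces; the `for` duration and severity are placeholders as well.

```yaml
# Sketch only: the workspace metric name, duration, and severity are assumptions.
groups:
  - name: workspace-node-count
    rules:
      - alert: ExcessiveWorkspaceNodeCount
        # Fires when we have at least as many nodes as running workspaces
        # and the node count is non-zero (an empty node count yields no alert,
        # which covers the "both are zero" case).
        expr: |
          count(kube_node_info)
            >=
          (count(gitpod_workspace_phase{phase="running"}) or vector(0))
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "At least as many workspace nodes as running workspaces for 30m"
```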

@kylos101 kylos101 added the team: workspace Issue belongs to the Workspace team label May 16, 2022
@kylos101 kylos101 moved this to Scheduled in 🌌 Workspace Team May 16, 2022
@sagor999
Contributor

Hm. I am not sure how we can make this alert flexible enough. Over time Gitpod will grow, and what we currently consider an excessive number of nodes will become the norm.
Also, we have a limit on how many nodes can be in one instance group.

So I have doubts about how useful this alert will be.

@utam0k
Contributor

utam0k commented May 24, 2022

This PR may alleviate this problem.
#10085

@kylos101
Contributor Author

@sagor999 this happened when the autoscaler misbehaved on a cordoned cluster: we had 10+ nodes, many of which needed to be removed because they were hosting zero workspaces. The problem was that the autoscaler would not remove nodes while one of their pods was stuck in Terminating due to a finalizer on a workspace pod.

In other words, many regular nodes had zero workspaces but weren't being removed by the autoscaler. Perhaps it would make more sense to build an alert for that condition: 4 or more nodes exist for a workspace type (regular, prebuild), but they have had zero corresponding workspaces for at least 1h.
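
A minimal sketch of that per-type rule, shown for regular nodes (a parallel rule could cover prebuild/headless nodes). The `label_gitpod_io_workload` node label and the `gitpod_workspace_phase` metric are assumptions for illustration; the real label and metric names may differ.

```yaml
# Sketch only: the node label and workspace metric are assumptions.
- alert: IdleRegularWorkspaceNodesNotScaledDown
  # Fires when 4+ regular workspace nodes exist while zero regular
  # workspaces have been running, sustained for at least 1h.
  expr: |
    count(kube_node_labels{label_gitpod_io_workload="workspace-regular"}) >= 4
      and
    (count(gitpod_workspace_phase{type="regular", phase="running"}) or vector(0)) == 0
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "4+ regular workspace nodes but zero running workspaces for 1h"
```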

@kylos101
Contributor Author

> This PR may alleviate this problem. #10085

@utam0k I do not think so; this is more a problem with the autoscaler not scaling down nodes while a single pod is stuck in Terminating status.

@kylos101
Contributor Author

@sagor999 I am going to close this for now; it's a fringe scenario that has only happened once.

Repository owner moved this from Scheduled to Done in 🌌 Workspace Team Jun 28, 2022