
[observability] fire alert when we have an excessive node count #10049

Closed
kylos101 opened this issue May 16, 2022 · 5 comments
Labels
team: workspace Issue belongs to the Workspace team

Comments

@kylos101
Contributor

Is your feature request related to a problem? Please describe

If many nodes are added, many workspaces start and then begin to stop, but one pod gets stuck in Terminating because its finalizer is not removed, the autoscaler can get "stuck" on that pod's node and will not scale down until you remove the finalizer from that pod.

Describe the behaviour you'd like

Look at Grafana to assess what ratio of nodes (regular or headless) to workspaces makes sense for an alert. In general, we should always have fewer nodes than workspaces, or both counts should be zero. That should be good enough to catch this condition in the future.
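
As a rough illustration, a Prometheus alerting rule for this invariant might look like the sketch below. It assumes kube-state-metrics is scraped (`kube_node_info`) and uses `gitpod_workspace_phase` as a placeholder for whichever metric actually exposes running workspaces; the `for` duration and severity are placeholders as well.

```yaml
# Sketch only: the workspace metric name, duration, and severity are assumptions.
groups:
  - name: workspace-node-count
    rules:
      - alert: ExcessiveWorkspaceNodeCount
        # Fires when we have at least as many nodes as running workspaces
        # and the node count is non-zero (an empty node count yields no alert,
        # which covers the "both are zero" case).
        expr: |
          count(kube_node_info)
            >=
          (count(gitpod_workspace_phase{phase="running"}) or vector(0))
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "At least as many workspace nodes as running workspaces for 30m"
```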

@kylos101 kylos101 added the team: workspace Issue belongs to the Workspace team label May 16, 2022
@kylos101 kylos101 moved this to Scheduled in 🌌 Workspace Team May 16, 2022
@sagor999
Contributor

Hm. I am not sure how we can make this alert flexible enough. Over time Gitpod will grow, and what we currently consider an excessive number of nodes will become the norm.
Also, we have a limit on how many nodes can be in one instance group.

So I have doubts about how useful this alert will be.

@utam0k
Contributor

utam0k commented May 24, 2022

This PR may alleviate this problem.
#10085

@kylos101
Contributor Author

@sagor999 this happened when the autoscaler misbehaved on a cordoned cluster: we had 10+ nodes, many of which needed to be removed because they were hosting zero workspaces. The problem was that the autoscaler would not remove nodes while one of their pods was stuck in Terminating due to a finalizer on a workspace pod.

In other words, many regular nodes had zero workspaces but weren't being removed by the autoscaler. Perhaps it would make more sense to build an alert for that condition: 4 or more nodes exist for a workspace type (regular, prebuild), but they have had zero corresponding workspaces for at least 1h.
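
A minimal sketch of that per-type rule, shown for regular nodes (a parallel rule could cover prebuild/headless nodes). The `label_gitpod_io_workload` node label and the `gitpod_workspace_phase` metric are assumptions for illustration; the real label and metric names may differ.

```yaml
# Sketch only: the node label and workspace metric are assumptions.
- alert: IdleRegularWorkspaceNodesNotScaledDown
  # Fires when 4+ regular workspace nodes exist while zero regular
  # workspaces have been running, sustained for at least 1h.
  expr: |
    count(kube_node_labels{label_gitpod_io_workload="workspace-regular"}) >= 4
      and
    (count(gitpod_workspace_phase{type="regular", phase="running"}) or vector(0)) == 0
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "4+ regular workspace nodes but zero running workspaces for 1h"
```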

@kylos101
Contributor Author

> This PR may alleviate this problem. #10085

@utam0k I do not think so; this is more a problem with the autoscaler not scaling down nodes while a single pod is stuck in Terminating status.

@kylos101
Contributor Author

@sagor999 I am going to close this for now; it's a fringe scenario that has only happened once.

Repository owner moved this from Scheduled to Done in 🌌 Workspace Team Jun 28, 2022