
[observability] create an alert when file descriptors exhausted #12935

Closed
jenting opened this issue Sep 14, 2022 · 17 comments
Labels: team: workspace (Issue belongs to the Workspace team)

Comments

@jenting
Contributor

jenting commented Sep 14, 2022

Is your feature request related to a problem? Please describe

Create an alert when file descriptors are exhausted.

Describe the behaviour you'd like

Have an alert fire when file descriptors are exhausted.

Describe alternatives you've considered

None

Additional context

https://gitpod.slack.com/archives/C04245JPHKL/p1663083593170859

@jenting jenting added the team: workspace label Sep 14, 2022
@jenting jenting self-assigned this Sep 14, 2022
@jenting jenting moved this to In Progress in 🌌 Workspace Team Sep 14, 2022
@jenting
Contributor Author

jenting commented Sep 16, 2022

I tried different views of the metric process_open_fds, to send the alert as early as possible.

[image]

However, in the view that sums process_open_fds, the value is still low at 09/13 05:30 UTC.

We could consider alerting when the sum of process_open_fds exceeds 3,000,000, but with that threshold the alert would only fire after 09/13 10:00 UTC, which I think is too late.

[image]

Any suggestions for making the alert fire sooner? @kylos101
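
For reference, a minimal sketch of the cluster-wide rule discussed here, written as a Prometheus alerting rule (the group name, alert name, for duration, and severity label are illustrative, not Gitpod's actual configuration):

```yaml
groups:
  - name: file-descriptor-alerts
    rules:
      # Fire when the summed open-fd count across all scraped processes in a
      # cluster crosses the 3,000,000 mark discussed above.
      - alert: ClusterOpenFileDescriptorsHigh
        expr: sum by (cluster) (process_open_fds) > 3000000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cluster {{ $labels.cluster }} has more than 3M open file descriptors"
```

As the graphs above show, with this threshold the alert would only have fired well after the incident started.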

@kylos101
Contributor

@jenting can you share the query you found most promising?

@kylos101
Contributor

👋 hey @jenting , have you tried an alert like this? https://www.robustperception.io/alerting-on-approaching-open-file-limits/
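
That article's approach alerts per process as it approaches its own limit, using the standard process_open_fds and process_max_fds client metrics. A rough sketch in that spirit (names and thresholds are illustrative):

```yaml
groups:
  - name: file-descriptor-alerts
    rules:
      # Fire when any single scraped process has used more than 80% of its
      # own file-descriptor limit.
      - alert: ProcessNearFDLimit
        expr: process_open_fds / process_max_fds > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.job }} on {{ $labels.instance }} is using more than 80% of its fd limit"
```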

@kylos101
Contributor

Hey @jenting, I recommend trying to write an alert that is node- or workspace-based, rather than cluster-based.

@jenting
Contributor Author

jenting commented Sep 22, 2022

For the node-based metric node_filefd_allocated, see the Grafana query.

If we write a node-based alert whose criterion is current file descriptors / total file descriptors, we can see that the current count is far below the total (see the Grafana query).

Therefore, we can't use node_filefd_allocated{cluster="<cluster-name>"} / node_filefd_maximum{cluster="<cluster-name>"} as the alert rule.
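
For context, the node-level ratio referenced here is roughly the following PromQL; node_filefd_allocated and node_filefd_maximum are standard node_exporter metrics, and the cluster label filter matches the expression above:

```promql
# Fraction of the node-wide file-descriptor table currently allocated.
# As noted above, on these clusters it stays far below any useful threshold.
node_filefd_allocated{cluster="<cluster-name>"}
  /
node_filefd_maximum{cluster="<cluster-name>"}
```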

@kylos101
Contributor

kylos101 commented Oct 5, 2022

@jenting I recommend handing this off to @utam0k, as he is on-call this week, to see if he can finish it.

@utam0k perhaps you could look later this week?

@utam0k utam0k assigned utam0k and unassigned jenting Oct 5, 2022
@utam0k
Contributor

utam0k commented Oct 14, 2022

@jenting This incident was caused by ws-manager with PVC after all, right? So I don't think the alert really needs to fire until the node's file descriptors are almost depleted, say at around 80% usage. What do you think?

@jenting
Contributor Author

jenting commented Oct 14, 2022

> This incident was caused by ws-manager with PVC after all, right? So I don't think the alert really needs to fire until the node's file descriptors are almost depleted, say at around 80% usage. What do you think?

I agree with you. We could set the threshold at 80%. I checked our overall fd usage, and we are far below 80% (if I remember correctly, under 10%).
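
A minimal sketch of a node-level rule with that 80% threshold, assuming Prometheus alerting rules over the node_exporter metrics mentioned earlier (names and the for duration are illustrative):

```yaml
groups:
  - name: file-descriptor-alerts
    rules:
      # Fire when a node has allocated more than 80% of its file-descriptor table.
      - alert: NodeFileDescriptorsNearExhaustion
        expr: node_filefd_allocated / node_filefd_maximum > 0.8
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} has allocated more than 80% of available file descriptors"
```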

@utam0k
Contributor

utam0k commented Oct 14, 2022

The supervisor's fd problem should have been a side effect, not the root cause.

@jenting
Contributor Author

jenting commented Oct 14, 2022

> The supervisor's fd problem should have been a side effect, not the root cause.

Yes, I think ws-manager failed to handle any pod event. As a result, every component that interacts with ws-manager could be impacted; for example, a component might not handle its connection to ws-manager correctly.

@kylos101
Contributor

@utam0k does this mean we no longer need an alert? If so, what else is needed before we close this issue? To recap, the intent of this issue was to create an alert.

@kylos101
Contributor

@utam0k if we no longer need an alert, please close this issue as not planned?

@jenting is there a separate issue that needs to be created to solve "ws-manager failed to handle any pod event"? If yes, can you share whether this is related to PVC or is more general? I ask to limit scope, so we can focus on closing this issue (either by creating an alert, or by closing it because we don't need an alert and opening a separate issue to track the other problem if needed).

@jenting
Contributor Author

jenting commented Oct 17, 2022

> @jenting is there a separate issue that needs to be created to solve "ws-manager failed to handle any pod event"? If yes, can you share whether this is related to PVC or is more general? I ask to limit scope, so we can focus on closing this issue (either by creating an alert, or by closing it because we don't need an alert and opening a separate issue to track the other problem if needed).

No, we don't need to create a new issue for "ws-manager failed to handle any pod event".

Let's link to the culprit issue #13007 and close this one.

@kylos101
Contributor

Okay, thanks! I will close this issue as won't fix.

@kylos101 kylos101 closed this as not planned Oct 18, 2022
Repository owner moved this from In Progress to Awaiting Deployment in 🌌 Workspace Team Oct 18, 2022
@utam0k
Contributor

utam0k commented Oct 18, 2022

Thanks a lot @jenting and @kylos101
