
[observability] create an alert when file descriptors exhausted #12935

Closed
jenting opened this issue Sep 14, 2022 · 17 comments
Labels: team: workspace (Issue belongs to the Workspace team)

Comments

@jenting
Contributor

jenting commented Sep 14, 2022

Is your feature request related to a problem? Please describe

Create an alert when file descriptors are exhausted.

Describe the behaviour you'd like

Have an alert fire when file descriptors are exhausted.

Describe alternatives you've considered

None

Additional context

https://gitpod.slack.com/archives/C04245JPHKL/p1663083593170859

@jenting jenting added the team: workspace label Sep 14, 2022
@jenting jenting self-assigned this Sep 14, 2022
@jenting jenting moved this to In Progress in 🌌 Workspace Team Sep 14, 2022
@jenting
Contributor Author

jenting commented Sep 16, 2022

I tried different views of the metric process_open_fds, to send the alert as early as possible.

[image]

However, in the view that sums process_open_fds, the value is still low at 09/13 05:30 UTC.

We could consider alerting when the sum of process_open_fds exceeds 3,000,000, but with that threshold the alert would only fire after 09/13 10:00 UTC, which I think is too late.

[image]

Any suggestions for making the alert fire sooner? @kylos101
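
For reference, a minimal sketch of the cluster-wide rule discussed here, written as a Prometheus alerting rule (the group name, alert name, for duration, and severity label are illustrative, not Gitpod's actual configuration):

```yaml
groups:
  - name: file-descriptor-alerts
    rules:
      # Fire when the summed open-fd count across all scraped processes in a
      # cluster crosses the 3,000,000 mark discussed above.
      - alert: ClusterOpenFileDescriptorsHigh
        expr: sum by (cluster) (process_open_fds) > 3000000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cluster {{ $labels.cluster }} has more than 3M open file descriptors"
```

As the graphs above show, with this threshold the alert would only have fired well after the incident started.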

@kylos101
Contributor

@jenting can you share the query you found most promising?

@kylos101
Contributor

👋 hey @jenting , have you tried an alert like this? https://www.robustperception.io/alerting-on-approaching-open-file-limits/
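
That article's approach alerts per process as it approaches its own limit, using the standard process_open_fds and process_max_fds client metrics. A rough sketch in that spirit (names and thresholds are illustrative):

```yaml
groups:
  - name: file-descriptor-alerts
    rules:
      # Fire when any single scraped process has used more than 80% of its
      # own file-descriptor limit.
      - alert: ProcessNearFDLimit
        expr: process_open_fds / process_max_fds > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.job }} on {{ $labels.instance }} is using more than 80% of its fd limit"
```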

@kylos101
Contributor

Hey @jenting, I recommend trying to write an alert that is node- or workspace-based, rather than cluster-based.

@jenting
Contributor Author

jenting commented Sep 22, 2022

For the node-based metric node_filefd_allocated, see the Grafana query.

If we write a node-based alert whose criterion is current file descriptors / total file descriptors, we can see that the current count is far below the total (see the Grafana query).

Therefore, we can't use node_filefd_allocated{cluster="<cluster-name>"} / node_filefd_maximum{cluster="<cluster-name>"} as the alert rule.
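
For context, the node-level ratio referenced here is roughly the following PromQL; node_filefd_allocated and node_filefd_maximum are standard node_exporter metrics, and the cluster label filter matches the expression above:

```promql
# Fraction of the node-wide file-descriptor table currently allocated.
# As noted above, on these clusters it stays far below any useful threshold.
node_filefd_allocated{cluster="<cluster-name>"}
  /
node_filefd_maximum{cluster="<cluster-name>"}
```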

@kylos101
Contributor

kylos101 commented Oct 5, 2022

@jenting I recommend handing this off to @utam0k, as he is on-call this week, to see if he can finish it.

@utam0k perhaps you could look later this week?

@utam0k utam0k assigned utam0k and unassigned jenting Oct 5, 2022
@utam0k
Contributor

utam0k commented Oct 14, 2022

@jenting This incident was caused by ws-manager with PVC after all, right? So I don't think the alert really needs to fire until the node's file descriptors are almost depleted, say at around 80% usage. What do you think?

@jenting
Contributor Author

jenting commented Oct 14, 2022

> This incident was caused by ws-manager with PVC after all, right? So I don't think the alert really needs to fire until the node's file descriptors are almost depleted, say at around 80% usage. What do you think?

I agree with you. We could set the threshold at 80%. I checked our overall fd usage, and we are far below 80% (if I remember correctly, under 10%).
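
A minimal sketch of a node-level rule with that 80% threshold, assuming Prometheus alerting rules over the node_exporter metrics mentioned earlier (names and the for duration are illustrative):

```yaml
groups:
  - name: file-descriptor-alerts
    rules:
      # Fire when a node has allocated more than 80% of its file-descriptor table.
      - alert: NodeFileDescriptorsNearExhaustion
        expr: node_filefd_allocated / node_filefd_maximum > 0.8
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} has allocated more than 80% of available file descriptors"
```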

@utam0k
Contributor

utam0k commented Oct 14, 2022

The supervisor's fd problem should have been a side effect, not the root cause.

@jenting
Contributor Author

jenting commented Oct 14, 2022

> The supervisor's fd problem should have been a side effect, not the root cause.

Yes, I think ws-manager failed to handle any pod event. As a result, every component that interacts with ws-manager could be impacted; for example, a component might not handle its connection to ws-manager correctly.

@kylos101
Contributor

@utam0k does this mean we no longer need an alert? If so, what else is needed before we close this issue? To recap, the intent of this issue was to create an alert.

@kylos101
Contributor

@utam0k if we no longer need an alert, please close this issue as not planned?

@jenting is there a separate issue that needs to be created to solve "ws-manager failed to handle any pod event"? If yes, can you share whether this is related to PVC or is more general? I ask to limit scope, so we can focus on closing this issue (either by creating an alert, or by closing it because we don't need an alert and opening a separate issue to track the other problem if needed).

@jenting
Contributor Author

jenting commented Oct 17, 2022

> @jenting is there a separate issue that needs to be created to solve "ws-manager failed to handle any pod event"? If yes, can you share whether this is related to PVC or is more general? I ask to limit scope, so we can focus on closing this issue (either by creating an alert, or by closing it because we don't need an alert and opening a separate issue to track the other problem if needed).

No, we don't need to create a new issue for "ws-manager failed to handle any pod event".

Let's link to the culprit issue #13007 and close this one.

@kylos101
Contributor

Okay, thanks! I will close this issue as won't fix.

@kylos101 kylos101 closed this as not planned Oct 18, 2022
Repository owner moved this from In Progress to Awaiting Deployment in 🌌 Workspace Team Oct 18, 2022
@utam0k
Contributor

utam0k commented Oct 18, 2022

Thanks a lot @jenting and @kylos101
