[Alerting] [o11y] Gain insight into task manager health apis when a problem occurs #101505
Comments
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
Some more thoughts on option 1:
If we were to index this data, we could try adding it to the event log directly. I don't believe we have any …
If we end up going with option 2, I think we'd want a couple of config knobs / dials, and we already have some existing ones we could make use of.
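As a rough illustration of the kind of knobs in play, here is a hypothetical TypeScript shape for such a config. The first two field names mirror existing `xpack.task_manager.*` settings; the threshold field and the interface itself are assumptions, not an existing design:

```ts
// Hypothetical config shape, for illustration only.
interface TaskManagerHealthLoggingConfig {
  /** Existing setting: xpack.task_manager.monitored_aggregated_stats_refresh_rate
   *  (how often the aggregated monitored stats are refreshed, in milliseconds). */
  monitored_aggregated_stats_refresh_rate: number;

  /** Existing setting: xpack.task_manager.monitored_stats_required_freshness
   *  (how fresh monitored stats must be to be considered usable, in milliseconds). */
  monitored_stats_required_freshness: number;

  /** Hypothetical: drift (in milliseconds) above which health metrics would be logged. */
  monitored_stats_warn_drift_threshold?: number;
}
```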
Gidi posted a note elsewhere regarding the TM stats being printed on an interval when debug logging is on.
It's an odd name for an interval, but it probably makes sense to align on this value. Note that it's not currently doc'd in https://www.elastic.co/guide/en/kibana/current/task-manager-settings-kb.html, so we would need to add that.
Agreed. Should we look at worst-case drift, or something closer to the average? I'm assuming we want worst case, and if so, should we consider some kind of rate limit? From looking at the code, the debug message is fired quite often (I have a hard time reading some of this code, so I could be wrong), so I worry that if the user is in a "slow state" where drift stays above, for example, 1m, the log will be super noisy. The monitored metrics report over a time window rather than as a running counter, so if the pressure resolved itself and the drift went back down, the logs would theoretically stop then, right?
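One way to keep that from getting noisy is a threshold plus a throttle window: only warn when drift crosses the threshold, and at most once per window while it stays there. A minimal sketch of that idea; every name and value here is hypothetical and not existing task manager code:

```ts
// Minimal sketch: warn on high drift, but throttle so a sustained "slow state"
// doesn't flood the logs. Names, thresholds, and the window are all assumptions.
interface MinimalLogger {
  warn(message: string): void;
}

const DRIFT_WARN_THRESHOLD_MS = 60_000; // e.g. 1m; would likely come from config
const WARN_LOG_THROTTLE_MS = 5 * 60_000; // at most one warning every 5m while degraded

let lastWarnedAt = 0;

// Called from wherever drift is currently observed (today it is only debug-logged).
export function maybeWarnOnDrift(logger: MinimalLogger, driftMs: number, now = Date.now()) {
  if (driftMs < DRIFT_WARN_THRESHOLD_MS) {
    // Below the threshold there is nothing to report, so a resolved "slow state"
    // naturally stops producing log lines.
    return;
  }
  if (now - lastWarnedAt < WARN_LOG_THROTTLE_MS) {
    return; // still inside the throttle window; stay quiet even though drift is high
  }
  lastWarnedAt = now;
  logger.warn(
    `Task Manager drift is ${driftMs}ms (threshold ${DRIFT_WARN_THRESHOLD_MS}ms); ` +
      `capturing monitored health stats for investigation`
  );
}
```

With something like this, the logging stops on its own once drift drops back under the threshold, which matches the expectation above that the logs would quiet down when the pressure resolves.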
Relates to #98902 (comment)
We have the ability to see health metrics, but not necessarily at the moment we need to see them (when the problem occurs). This is most painful when the issue is intermittent and doesn't surface right away.
To combat this, we have a couple of options:
1. Persist health metrics over time so we are able to query for metrics at certain time periods
This option involves persisting, at a regular interval, the results of the task manager health API to an index, which can then be queried with a range filter to recover the metrics from around the time a problem occurred. After some initial thinking, two solutions seem the most obvious: writing to a dedicated index we manage ourselves, or writing to the existing event log.
The first option is ideal as it gives us complete control over the index, including how often we index. A rough sketch of what the periodic persistence and range query could look like follows the list below.
2. Log health metrics when we detect a problem (the current task manager health API contains buckets of data, and each bucket features a "status" field) so users can go back and see what was logged when they experienced the issue
This option involves the task manager self-detecting that it's in a problem state and writing to the event log or the Kibana server log. This gives us the necessary insight, but it is also somewhat reliant on the task manager properly self-reporting its status so that we end up with the right logs. A sketch of the detection side also follows below.
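For option 1, here is a minimal sketch of the periodic persistence and the range query, assuming a dedicated index and the standard Elasticsearch JS client. The index name, interval, document shape, and the `fetchHealth` helper are all assumptions, not a committed design:

```ts
// Sketch for option 1: periodically persist the task manager health API output
// into a dedicated index so it can later be queried with a range filter.
import { Client } from '@elastic/elasticsearch';

const HEALTH_INDEX = '.kibana-task-manager-health'; // hypothetical index name
const PERSIST_INTERVAL_MS = 60_000; // could plausibly reuse the monitored stats refresh rate

async function persistHealthSnapshot(es: Client, fetchHealth: () => Promise<unknown>) {
  const stats = await fetchHealth(); // e.g. the payload served by /api/task_manager/_health
  await es.index({
    index: HEALTH_INDEX,
    document: { '@timestamp': new Date().toISOString(), stats },
  });
}

// A range filter then recovers the metrics from around the time a problem occurred.
export async function healthAround(es: Client, from: string, to: string) {
  return es.search({
    index: HEALTH_INDEX,
    query: { range: { '@timestamp': { gte: from, lte: to } } },
    sort: [{ '@timestamp': 'asc' }],
  });
}

export function startPersisting(es: Client, fetchHealth: () => Promise<unknown>) {
  return setInterval(() => {
    persistHealthSnapshot(es, fetchHealth).catch(() => {
      // Swallow errors so a failed snapshot never interferes with task manager itself.
    });
  }, PERSIST_INTERVAL_MS);
}
```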
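For option 2, here is a sketch of the self-detection side: walk the health stats the task manager already computes and log any section whose status isn't OK. The payload shape below is simplified and assumed; the real health API response nests more detail per bucket:

```ts
// Sketch for option 2: log monitored health sections that report a non-OK status.
// The types here are illustrative, not the actual health API payload.
type HealthStatus = 'OK' | 'warn' | 'error';

interface MonitoredStatsSection {
  status: HealthStatus;
  value: unknown;
}

interface MonitoredHealth {
  status: HealthStatus;
  stats: Record<string, MonitoredStatsSection>;
}

interface MinimalLogger {
  warn(message: string): void;
}

export function logIfDegraded(logger: MinimalLogger, health: MonitoredHealth) {
  const degraded = Object.entries(health.stats).filter(
    ([, section]) => section.status !== 'OK'
  );
  if (degraded.length === 0) {
    return; // everything healthy; stay quiet
  }
  for (const [name, section] of degraded) {
    logger.warn(
      `Task Manager health section "${name}" is ${section.status}: ${JSON.stringify(section.value)}`
    );
  }
}
```

This is also where the "reliant on the task manager properly self-reporting status" caveat bites: if a section reports OK while the system is actually struggling, nothing gets logged.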