
[Alerting] [o11y] Gain insight into task manager health apis when a problem occurs #101505

Closed
chrisronline opened this issue Jun 7, 2021 · 6 comments · Fixed by #101751
Labels: Feature:Actions, Feature:Alerting, Feature:Task Manager, Team:ResponseOps

chrisronline (Contributor) commented Jun 7, 2021

Relates to #98902 (comment)

We have the ability to see health metrics, but not necessarily when we need to see them (when the problem occurs). This is especially true when the issue is intermittent and goes unnoticed at first.

To combat this, we have a couple of options:

1. Persist health metrics over time so we can query for metrics at specific time periods

This option involves persisting, at a regular interval, the results of the task manager health API in an index, which can be queried using a range filter to determine the metrics at the time the problems occurred. After some initial thinking, two solutions seem the most obvious:

  1. Create/manage our own index and persist the data there (like we do for the event log)
  2. Integrate with the Stack Monitoring Kibana monitoring indices

The first option is ideal as it gives us complete control over the index, including how often we index.

2. Log health metrics when we detect a problem (the current task manager health API contains buckets of data, and each bucket features a "status" field) so users can go back and see what was logged when they experienced the issue

This option involves the task manager self-detecting that it's in a problem state and writing to the event log or the Kibana server log. This gives us the necessary insight, but it is somewhat reliant on the task manager properly self-reporting its status so that we have the right logs.
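
For illustration, a rough sketch of what option 2 could look like, assuming a simplified health payload shape (the types and logger interface below are placeholders, not the actual task manager code):

```ts
// Sketch only: inspect each monitored stats section and log the full health
// payload when any section reports a non-OK status, so the data is available
// in the server log at the time the problem occurred.
type HealthStatus = 'OK' | 'Warning' | 'Error';

interface SimpleLogger {
  warn(message: string): void;
}

interface MonitoredHealth {
  status: HealthStatus;
  stats: Record<string, { status: HealthStatus; value: unknown }>;
}

export function logHealthIfDegraded(logger: SimpleLogger, health: MonitoredHealth) {
  const degradedSections = Object.entries(health.stats)
    .filter(([, section]) => section.status !== 'OK')
    .map(([name]) => name);

  if (degradedSections.length > 0) {
    logger.warn(
      `Task Manager detected a degraded status in: ${degradedSections.join(', ')}. ` +
        `Latest monitored stats: ${JSON.stringify(health)}`
    );
  }
}
```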

chrisronline added the Feature:Alerting, Feature:Task Manager, Feature:Actions, and Team:ResponseOps labels on Jun 7, 2021
elasticmachine (Contributor) commented:
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

chrisronline (Contributor, Author) commented:
Some more thoughts on option 1:

  • The index should probably be managed by the task_manager plugin
  • We should copy/paste the logic from the event log to manage the index, and in the future it'd be nice to consolidate this into a single location for both to use
  • From the start, we should index the entire task manager health API response
  • We can perhaps default collection to every 1 minute and let the user configure it
  • We can look to avoid nested fields by indexing multiple documents per collection cycle, but this will increase the number of documents stored. If we go down this route, we should index a document per rule type (which is somewhat bounded, as we can calculate the full list for most scenarios, though we can't account for any custom plugins users may install that add rule types)
  • We don't need the data for that long, so we can probably set the delete phase's min_age (in the ILM policy) to a few days (a sketch of such a policy follows this list)
  • From the data, we can build some in-house visualizations and dashboards that we can share with users in need
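
For the ILM point above, a sketch of what the policy could look like; the policy shape follows standard ILM syntax, but the rollover and retention values are assumptions, not finalized settings:

```ts
// Sketch only: an ILM policy that rolls the health index over daily and deletes
// old indices a few days later, since the data is only needed for short-term
// troubleshooting. This object could be PUT to _ilm/policy/<policy-name>.
export const TASK_MANAGER_HEALTH_ILM_POLICY = {
  policy: {
    phases: {
      hot: {
        actions: {
          rollover: { max_age: '1d', max_size: '5gb' },
        },
      },
      delete: {
        min_age: '3d', // "a few days" of retention after rollover
        actions: { delete: {} },
      },
    },
  },
};
```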

chrisronline self-assigned this on Jun 8, 2021
pmuellr (Member) commented Jun 8, 2021

If we were to index this data, we could try adding it to the event log directly. I don't believe we have any enabled: false fields (or flattened fields) in its mappings, but either of those could be the type of a new field for storing arbitrary data. I'd guess we will someday have a need for a field like that. I believe we are currently leaning toward NOT storing the data, though, so I didn't do a ton of thinking on this aspect.
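
To make that idea concrete, a sketch of what such a field could look like in the event log mappings (the field name is hypothetical):

```ts
// Sketch only: a single catch-all field on the event log for arbitrary data.
// `flattened` indexes leaf values as keywords (searchable but loosely typed),
// while `enabled: false` stores the object without indexing it at all.
export const hypotheticalEventLogField = {
  properties: {
    kibana: {
      properties: {
        task_manager_health: { type: 'flattened' },
        // alternative: store the object without indexing it
        // task_manager_health: { type: 'object', enabled: false },
      },
    },
  },
};
```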

pmuellr (Member) commented Jun 8, 2021

If we end up going with option 2, I think we'd want a couple of config knobs/dials, and we already have some existing ones we could make use of:

  • A config setting to indicate the triggers that determine when to log the health stats; there could be multiple triggers. An example would be a threshold on a drift value (any value, averages, etc.; not quite sure). I think one trigger we should look at is drift, but I'm not sure what the default threshold value should be. 1 minute? It probably doesn't make sense to make the default less than 1 minute, and I'm wondering if even 1 minute is too low.

  • A config setting for how often to check the triggers

Gidi posted the following elsewhere, regarding the TM stats being printed on an interval when debug logging is on:

  We log these at the rate of config.monitored_stats_required_freshness, which is the measurement we use for "how fresh does the monitoring data need to be?".
  If the data is older than that, then we consider it stale and return an Error state so you can fire off a notification in CloudWatch or whatnot.
  By default this is the poll interval + 1s, but that wasn't based on much (shrug).
  We can totally change that.

It's an odd name for an interval, but it probably makes sense to align on this value. Note that it's not currently documented in https://www.elastic.co/guide/en/kibana/current/task-manager-settings-kb.html, so we would need to add that.
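
A rough sketch of the config shape this could take, using Kibana's @kbn/config-schema. The verbose-log setting names and defaults are assumptions for illustration; monitored_stats_required_freshness already exists, per the quote above:

```ts
import { schema } from '@kbn/config-schema';

// Sketch only: two hypothetical settings for option 2 alongside the existing
// freshness setting, all in milliseconds.
export const configSchema = schema.object({
  // Existing: how fresh the monitored stats must be before the health API
  // reports an Error. The quote above says the default is poll interval + 1s;
  // 4000 assumes a 3s poll interval.
  monitored_stats_required_freshness: schema.number({ defaultValue: 4000 }),
  // Hypothetical: whether to log health stats when a trigger fires.
  monitored_stats_health_verbose_log_enabled: schema.boolean({ defaultValue: false }),
  // Hypothetical: drift threshold that triggers a health stats log entry (1 minute).
  monitored_stats_health_verbose_log_drift_threshold: schema.number({ defaultValue: 60000 }),
});
```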

chrisronline (Contributor, Author) commented:
Thanks @pmuellr. I think our first attempt will be option 2, and I've started a WIP PR for it: #101751

chrisronline (Contributor, Author) commented:
  A config setting to indicate the triggers that determine when to log the health stats; there could be multiple triggers. An example would be a threshold on a drift value (any value, averages, etc.; not quite sure). I think one trigger we should look at is drift, but I'm not sure what the default threshold value should be. 1 minute? It probably doesn't make sense to make the default less than 1 minute, and I'm wondering if even 1 minute is too low.

Agreed. Should we look at worst-case drift, or something closer to the average? I'm assuming we want to look at the worst case, and if so, should we consider some kind of rate limit? From looking at the code, the debug log is fired quite often (I have a hard time reading some of this code, so I could be wrong), so I worry that if the user is in a "slow state" where drift stays above (for example) 1m, the log will be super noisy. The monitored metrics report over a time period rather than as a counter, so if the pressure resolved itself and the drift went back down, the logs would theoretically stop then, right?
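
For the rate-limit question, a minimal sketch of what that could look like; the threshold, cooldown, and drift input are illustrative, not the actual implementation:

```ts
// Sketch only: emit the "drift exceeded" warning at most once per cooldown
// window, even if the check fires on every monitoring cycle, and stay quiet
// once drift drops back below the threshold.
interface SimpleLogger {
  warn(message: string): void;
}

const DRIFT_THRESHOLD_MS = 60_000; // e.g. 1 minute
const LOG_COOLDOWN_MS = 5 * 60_000; // at most one warning every 5 minutes

let lastLoggedAt = 0;

export function maybeLogDrift(logger: SimpleLogger, driftP99Ms: number, now = Date.now()) {
  if (driftP99Ms < DRIFT_THRESHOLD_MS) {
    return; // drift recovered; the logs naturally stop
  }
  if (now - lastLoggedAt < LOG_COOLDOWN_MS) {
    return; // still within the cooldown window, skip this cycle
  }
  lastLoggedAt = now;
  logger.warn(
    `Task Manager drift p99 (${driftP99Ms}ms) exceeded ${DRIFT_THRESHOLD_MS}ms; ` +
      `see the health API for details.`
  );
}
```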
