
[Discuss] Resurrect Zombie Alert tasks #53603

Closed
gmmorris opened this issue Dec 19, 2019 · 7 comments · Fixed by #53688
Assignees: gmmorris
Labels: bug (Fixes for quality problems that affect the customer experience) · Feature:Alerting · Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams) · v7.6.0 · v8.0.0

Comments

@gmmorris (Contributor) commented Dec 19, 2019

We have identified an issue where an Alert can force itself into a Zombie state: its underlying task aborts its retry logic and remains forever in a failed state.

The underlying reason is that Alerting doesn't currently use Task Manager's schedule (we have an issue for it #46001) and instead uses its own internal schedule, meaning Task Manager doesn't classify the Alerting Task as a recurring task. As things stand, only recurring tasks are allowed to retry indefinitely in Task Manager.

As I see it, there are two options for addressing this:

  1. Address issue Convert alerts to use task manager intervals #46001, which is now somewhat more feasible thanks to the work we did to support runNow. We could allow Task Manager to claim a task for the sole purpose of updating its schedule, which would sidestep the issue we previously encountered, but could result in long update requests, as Task Manager might have to wait for tasks to become free for claiming. This is still not an ideal solution (I have toyed with other ideas, such as spawning a Task whose job is to update another task once it becomes free, but that isn't simple, as there's potential for multiples of these... a taskpocalypse waiting to happen 🤣), but it is a feasible one.

  2. We could allow a TaskType to have an infinite number of retry attempts, meaning that when the task fails it will keep retrying until it succeeds again. The nice thing is that Task Manager already has a mechanism in place to space these attempts out further and further as the task keeps failing (it adds 5 minutes per failure), and this spacing is reset once the task succeeds, so this failure mode is handled relatively gracefully (a rough sketch of that backoff follows below).
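
To make the backoff concrete, here is a minimal sketch of the "5 minutes per failure, reset on success" behaviour described above. This is only an illustration; the function names and the exact counter handling are assumptions, not Task Manager's actual implementation.

```ts
// Illustration only: names and counter handling are assumptions, not
// Task Manager's real code.
const BACKOFF_PER_ATTEMPT_MS = 5 * 60 * 1000; // 5 minutes per failed attempt

// Each consecutive failure pushes the next attempt out by another 5 minutes.
function nextRetryAt(attempts: number, now: Date = new Date()): Date {
  return new Date(now.getTime() + attempts * BACKOFF_PER_ATTEMPT_MS);
}

// On success the attempt counter resets, so a later failure starts the
// backoff from 5 minutes again.
function onTaskSuccess(task: { attempts: number }): void {
  task.attempts = 0;
}
```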

My instinct, given the looming 7.6 release, is to go with Option 2 and then prioritise Option 1 for 7.7.
The reason is that I'm not sure we can figure out Option 1 in time for 7.6. That said, as things stand we have two different implementations of scheduling for things run by TM (one in TM itself and another in Alerting), which makes things complicated and harder to maintain, so we would still want to address this at some point.

Any thoughts?

@gmmorris gmmorris added the Feature:Alerting, Team:ResponseOps, v7.6.0 and v8.0.0 labels Dec 19, 2019
@gmmorris gmmorris self-assigned this Dec 19, 2019
@mikecote mikecote added the bug label Dec 19, 2019
@mikecote (Contributor)

Moved the issue to 7.6 as this is a fix we should get in. I will do a bit of research before sharing my thoughts.

@mikecote (Contributor)

The problem I can see with option 2 is: if a task runs every 10s and hits a flaky error, it would then wait 5 minutes before trying again.

I think a better solution for now would be to move the execution logic [0] entirely into a new function and call it from a try/catch. This would let us handle the error internally without letting task manager know (we would just return the original state plus a new runAt). We can replicate what task manager does now, which I believe is simply logging to the console. This could become a good hook for the event logger as well!

The one challenge you will have with this is that you need the alert's interval to schedule the next run. We could move the following code [1] outside the try/catch, include the interval in the result, and pass it down to the sub-function. That lookup would then be the only point of failure that task manager would hear about.
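
A rough sketch of the shape being described, to make the idea concrete. The helper names (loadAlert, executeAlert, nextRunAt) and the return shape are illustrative assumptions, not the actual code in task_runner_factory.ts.

```ts
// Illustrative sketch only: helper names and the runner's return shape are
// assumptions, not the real Alerting task runner.
type AlertState = Record<string, unknown>;

interface RunResult {
  state: AlertState;
  runAt: Date;
}

// Hypothetical helper: turn an interval string like "10s" or "5m" into the next runAt.
function nextRunAt(interval: string): Date {
  const value = parseInt(interval, 10);
  const unitMs = interval.endsWith('m') ? 60_000 : 1_000;
  return new Date(Date.now() + value * unitMs);
}

async function run(
  previousState: AlertState,
  loadAlert: () => Promise<{ interval: string; params: unknown }>,
  executeAlert: (params: unknown, state: AlertState) => Promise<AlertState>
): Promise<RunResult> {
  // Resolve the alert's interval outside the try/catch, so this lookup stays
  // the only failure Task Manager ever hears about.
  const { interval, params } = await loadAlert();

  try {
    // The entire execution logic now lives in its own function.
    const state = await executeAlert(params, previousState);
    return { state, runAt: nextRunAt(interval) };
  } catch (err) {
    // Handle the error internally: log it (roughly what Task Manager would have
    // done) and return the original state with a new runAt on the alert's interval.
    console.error(`Alert execution failed: ${(err as Error).message}`);
    return { state: previousState, runAt: nextRunAt(interval) };
  }
}
```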

This could fail if an error is encountered from Elasticsearch / Kibana (e.g. unavailable) or if it fails to decrypt the object, but we have bigger troubles if that is the case. For that scenario we can rely on task manager's retry logic, but we would have to define a getRetry function that always returns true here [2] to make it retry endlessly.
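
For illustration, a getRetry that never gives up could look like the sketch below. The TaskDefinition shape here is a simplified assumption modelled loosely on the linked registry code [2], not a verified copy of the real task manager types.

```ts
// Simplified, assumed task-definition shape; only getRetry matters here.
interface TaskDefinition {
  title: string;
  type: string;
  getRetry: (attempts: number, error: object) => boolean | Date;
}

const alertingTaskDefinition: TaskDefinition = {
  title: 'My Alert Type',
  type: 'alerting:my-alert-type',
  // Always retry, regardless of how many attempts have failed or why.
  getRetry: (_attempts: number, _error: object) => true,
};
```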

Thoughts?

[0] https://github.com/elastic/kibana/blob/master/x-pack/legacy/plugins/alerting/server/lib/task_runner_factory.ts#L62-L187
[1] https://github.com/elastic/kibana/blob/master/x-pack/legacy/plugins/alerting/server/lib/task_runner_factory.ts#L66-L74
[2] https://github.com/elastic/kibana/blob/master/x-pack/legacy/plugins/alerting/server/alert_type_registry.ts#L47-L50

@mikecote (Contributor)

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@gmmorris (Contributor, Author)

Yeah, that makes sense. It still keeps a slight danger of Zombie Alerts, but then it'll be down to our own implementation of Alerting rather than our users, which makes it feel a little safer.

I'll explore in that direction.

@mikecote (Contributor)

it still keeps a slight danger of Zombie Alerts

Exactly, except it should come back alive at some point with infinite retry logic 😉

@gmmorris (Contributor, Author)

This is now handled in #53688

There is still one open issue we need to address:
Param validation is not currently handled by the new fallback we've added, which means that if parameter validation begins to fail for the task runner (i.e. the validation was changed for an alertType that already has alerts in the system), the underlying task could end up in a zombie state.

My thinking is that we don't want the fallback in this case, as this isn't something that can fix itself over time: it requires an update to the alert before it can be fixed.
This means we need to resurrect these zombie tasks when the alert is updated, to address the broken state.
I'm trying to figure out how to test for this, as reaching this broken state isn't easy to orchestrate, but I thought it was worth raising here while I work on it, in case anyone has thoughts on the matter.

@gmmorris (Contributor, Author)

We now wrap the validation in the fallback as well (see the sketch below).
I decided this was a safer step, as it means we will continue to retry on the alert's interval until the alert is fixed in an update; it's more easily recoverable, and it will keep logging that something is wrong with the alert.
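
Reusing the types and the hypothetical nextRunAt helper from the earlier sketch, the change described above amounts to something like the following. Again, validateParams and the surrounding shape are illustrative assumptions, not the actual task runner code.

```ts
// Hypothetical extension of the earlier sketch: param validation now runs
// inside the same try/catch, so a validation failure falls back to retrying
// on the alert's interval instead of leaving the task in a zombie state.
async function runWithValidation(
  previousState: AlertState,
  loadAlert: () => Promise<{ interval: string; rawParams: unknown }>,
  validateParams: (rawParams: unknown) => unknown, // throws if params are invalid
  executeAlert: (params: unknown, state: AlertState) => Promise<AlertState>
): Promise<RunResult> {
  const { interval, rawParams } = await loadAlert();

  try {
    // Validation failures are now caught below, like any other execution error...
    const params = validateParams(rawParams);
    const state = await executeAlert(params, previousState);
    return { state, runAt: nextRunAt(interval) };
  } catch (err) {
    // ...so we keep logging and retrying on the alert's interval until the
    // alert is updated with valid params.
    console.error(`Alert execution failed: ${(err as Error).message}`);
    return { state: previousState, runAt: nextRunAt(interval) };
  }
}
```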

The only step that is not recoverable now is fetching the SavedObject. I feel we can keep it this way: falling back from that failure would require us to retry at some made-up interval, and we wouldn't be able to provide the state we want to pass along, which could result in invalid state if any assumptions change in TM.
It feels safe to fail in such a case.

@kobelb kobelb added the needs-team label Jan 31, 2022
@botelastic botelastic bot removed the needs-team label Jan 31, 2022