
[Discuss] Resurrect Zombie Alert tasks #53603

Closed
gmmorris opened this issue Dec 19, 2019 · 7 comments · Fixed by #53688
Assignees: gmmorris
Labels: bug (Fixes for quality problems that affect the customer experience) · Feature:Alerting · Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams) · v7.6.0 · v8.0.0

Comments

@gmmorris (Contributor) commented Dec 19, 2019

We have identified an issue where an Alert can force itself into a Zombie state: its underlying task aborts its retry logic and remains forever in a failed state.

The underlying reason is that Alerting doesn't currently use Task Manager's schedule (we have an issue for it #46001) and instead uses its own internal schedule, meaning Task Manager doesn't classify the Alerting Task as a recurring task. As things stand, only recurring tasks are allowed to retry indefinitely in Task Manager.

As I see it, there are two options for addressing this:

  1. Address issue Convert alerts to use task manager intervals #46001, which is now somewhat more feasible thanks to the work we did to support runNow. We could allow Task Manager to claim a task for the sole purpose of updating its schedule, which would sidestep the issue we previously encountered, but could result in long update requests, as Task Manager might have to wait for tasks to become free for claiming. This is still not an ideal solution (I have toyed with other ideas, such as spawning a Task whose job is to update another task once it becomes free, but that isn't simple, as there's potential for multiples of these... a taskpocalypse waiting to happen 🤣), but it is a feasible one.

  2. We could allow a TaskType to have an infinite number of retry attempts, meaning that when the task fails it will keep retrying until it succeeds again. The nice thing is that Task Manager already has a mechanism in place to space these attempts out further and further as the task keeps failing (it adds 5 minutes per failure), and this spacing is reset once the task succeeds, so this failure mode is handled relatively gracefully (a rough sketch of that backoff follows below).
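
To make the backoff concrete, here is a minimal sketch of the "5 minutes per failure, reset on success" behaviour described above. This is only an illustration; the function names and the exact counter handling are assumptions, not Task Manager's actual implementation.

```ts
// Illustration only: names and counter handling are assumptions, not
// Task Manager's real code.
const BACKOFF_PER_ATTEMPT_MS = 5 * 60 * 1000; // 5 minutes per failed attempt

// Each consecutive failure pushes the next attempt out by another 5 minutes.
function nextRetryAt(attempts: number, now: Date = new Date()): Date {
  return new Date(now.getTime() + attempts * BACKOFF_PER_ATTEMPT_MS);
}

// On success the attempt counter resets, so a later failure starts the
// backoff from 5 minutes again.
function onTaskSuccess(task: { attempts: number }): void {
  task.attempts = 0;
}
```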

My instinct, given the looming 7.6 release, is to go with Option 2 and then prioritise Option 1 for 7.7.
The reason is that I'm not sure we can figure out Option 1 in time for 7.6. That said, as things stand we have two different implementations of scheduling for things run by TM (one in TM itself and another in Alerting), which makes things complicated and harder to maintain, so we would still want to address this at some point.

Any thoughts?

@gmmorris gmmorris added the Feature:Alerting, Team:ResponseOps, v7.6.0 and v8.0.0 labels Dec 19, 2019
@gmmorris gmmorris self-assigned this Dec 19, 2019
@mikecote mikecote added the bug label Dec 19, 2019
@mikecote (Contributor)

Moved the issue to 7.6 as this is a fix we should get in. I will do a bit of research before sharing my thoughts.

@mikecote (Contributor)

The problem I can see with option 2 is: if a task runs every 10s and hits a flaky error, it would then wait 5 minutes before trying again.

I think a better solution for now would be to move the execution logic [0] entirely into a new function and call it from a try/catch. This would let us handle the error internally without letting task manager know (we would just return the original state plus a new runAt). We can replicate what task manager does now, which I believe is simply logging to the console. This could become a good hook for the event logger as well!

The one challenge you will have with this is that you need the alert's interval to schedule the next run. We could move the following code [1] outside the try/catch, include the interval in the result, and pass it down to the sub-function. That lookup would then be the only point of failure that task manager would hear about.
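
A rough sketch of the shape being described, to make the idea concrete. The helper names (loadAlert, executeAlert, nextRunAt) and the return shape are illustrative assumptions, not the actual code in task_runner_factory.ts.

```ts
// Illustrative sketch only: helper names and the runner's return shape are
// assumptions, not the real Alerting task runner.
type AlertState = Record<string, unknown>;

interface RunResult {
  state: AlertState;
  runAt: Date;
}

// Hypothetical helper: turn an interval string like "10s" or "5m" into the next runAt.
function nextRunAt(interval: string): Date {
  const value = parseInt(interval, 10);
  const unitMs = interval.endsWith('m') ? 60_000 : 1_000;
  return new Date(Date.now() + value * unitMs);
}

async function run(
  previousState: AlertState,
  loadAlert: () => Promise<{ interval: string; params: unknown }>,
  executeAlert: (params: unknown, state: AlertState) => Promise<AlertState>
): Promise<RunResult> {
  // Resolve the alert's interval outside the try/catch, so this lookup stays
  // the only failure Task Manager ever hears about.
  const { interval, params } = await loadAlert();

  try {
    // The entire execution logic now lives in its own function.
    const state = await executeAlert(params, previousState);
    return { state, runAt: nextRunAt(interval) };
  } catch (err) {
    // Handle the error internally: log it (roughly what Task Manager would have
    // done) and return the original state with a new runAt on the alert's interval.
    console.error(`Alert execution failed: ${(err as Error).message}`);
    return { state: previousState, runAt: nextRunAt(interval) };
  }
}
```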

This could fail if an error is encountered from Elasticsearch / Kibana (e.g. unavailable) or if it fails to decrypt the object, but we have bigger troubles if that is the case. For that scenario we can rely on task manager's retry logic, but we would have to define a getRetry function that always returns true here [2] to make it retry endlessly.
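
For illustration, a getRetry that never gives up could look like the sketch below. The TaskDefinition shape here is a simplified assumption modelled loosely on the linked registry code [2], not a verified copy of the real task manager types.

```ts
// Simplified, assumed task-definition shape; only getRetry matters here.
interface TaskDefinition {
  title: string;
  type: string;
  getRetry: (attempts: number, error: object) => boolean | Date;
}

const alertingTaskDefinition: TaskDefinition = {
  title: 'My Alert Type',
  type: 'alerting:my-alert-type',
  // Always retry, regardless of how many attempts have failed or why.
  getRetry: (_attempts: number, _error: object) => true,
};
```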

Thoughts?

[0] https://github.com/elastic/kibana/blob/master/x-pack/legacy/plugins/alerting/server/lib/task_runner_factory.ts#L62-L187
[1] https://github.com/elastic/kibana/blob/master/x-pack/legacy/plugins/alerting/server/lib/task_runner_factory.ts#L66-L74
[2] https://github.com/elastic/kibana/blob/master/x-pack/legacy/plugins/alerting/server/alert_type_registry.ts#L47-L50

@mikecote (Contributor)

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@gmmorris (Contributor, Author)

Yeah, that makes sense. It still keeps a slight danger of Zombie Alerts, but then it'll be down to our own implementation of Alerting rather than our users, which makes it feel a little safer.

I'll explore in that direction.

@mikecote (Contributor)

it still keeps a slight danger of Zombie Alerts

Exactly, except it should come back alive at some point with infinite retry logic 😉

@gmmorris (Contributor, Author)

This is now handled in #53688

There is still one open issue we need to address:
Param validation is not currently handled by the new fallback we've added, which means that if parameter validation begins to fail for the task runner (i.e. the validation was changed for an alertType that already has alerts in the system), the underlying task could end up in a zombie state.

My thinking is that we don't want the fallback in this case, as this isn't something that can fix itself over time: it requires an update to the alert before it can be fixed.
This means we need to resurrect these zombie tasks when the alert is updated, to address the broken state.
I'm trying to figure out how to test for this, as reaching this broken state isn't easy to orchestrate, but I thought it was worth raising here while I work on it, in case anyone has thoughts on the matter.

@gmmorris (Contributor, Author)

We now wrap the validation in the fallback as well (see the sketch below).
I decided this was a safer step, as it means we will continue to retry on the alert's interval until the alert is fixed in an update; it's more easily recoverable, and it will keep logging that something is wrong with the alert.
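
Reusing the types and the hypothetical nextRunAt helper from the earlier sketch, the change described above amounts to something like the following. Again, validateParams and the surrounding shape are illustrative assumptions, not the actual task runner code.

```ts
// Hypothetical extension of the earlier sketch: param validation now runs
// inside the same try/catch, so a validation failure falls back to retrying
// on the alert's interval instead of leaving the task in a zombie state.
async function runWithValidation(
  previousState: AlertState,
  loadAlert: () => Promise<{ interval: string; rawParams: unknown }>,
  validateParams: (rawParams: unknown) => unknown, // throws if params are invalid
  executeAlert: (params: unknown, state: AlertState) => Promise<AlertState>
): Promise<RunResult> {
  const { interval, rawParams } = await loadAlert();

  try {
    // Validation failures are now caught below, like any other execution error...
    const params = validateParams(rawParams);
    const state = await executeAlert(params, previousState);
    return { state, runAt: nextRunAt(interval) };
  } catch (err) {
    // ...so we keep logging and retrying on the alert's interval until the
    // alert is updated with valid params.
    console.error(`Alert execution failed: ${(err as Error).message}`);
    return { state: previousState, runAt: nextRunAt(interval) };
  }
}
```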

The only step that is not recoverable now is fetching the SavedObject. I feel we can keep it this way: falling back from that failure would require us to retry at some made-up interval, and we wouldn't be able to provide the state we want to pass along, which could result in invalid state if any assumptions change in TM.
It feels safe to fail in such a case.

@kobelb kobelb added the needs-team label Jan 31, 2022
@botelastic botelastic bot removed the needs-team label Jan 31, 2022