[Discuss] Resurrect Zombie Alert tasks #53603
Comments
Moved the issue to 7.6 as this is a fix we should get in. I will do a bit of research before putting down my thoughts.
The problem I can see with option 2 is: what if the task runs every …

I think a better solution for now would be to move the execution logic [0] entirely into a new function and call it from a try/catch. This would let us handle the error internally without letting task manager know (we would just return the original state + a new runAt). We can replicate what task manager does now, which I believe is simply logging to the console. This could also become a good hook for the event logger as well!

The one challenge you will have with this is that you need the alert's interval to schedule the next run. We could move the following code [1] outside the try/catch, include …

This could fail if an error is encountered from Elasticsearch / Kibana (ex: unavailable) or if it fails to decrypt the object, but we have bigger troubles if that is the case. For this case, we can rely on task manager's retry logic, but we would have to define a …

Thoughts?

[0] https://github.com/elastic/kibana/blob/master/x-pack/legacy/plugins/alerting/server/lib/task_runner_factory.ts#L62-L187
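For illustration, here is a minimal TypeScript sketch of the approach described above: the execution logic lives in its own function and is invoked from a try/catch, so a failure is logged (roughly what Task Manager does today) and the task still returns its previous state plus a new runAt derived from the alert's interval. The names (run, executeAlertInstances, intervalToMs) and shapes are assumptions for the sketch, not the actual Kibana code.

```ts
// Hypothetical shapes -- the real types live in the alerting plugin's task runner.
interface AlertTaskState {
  [key: string]: unknown;
}

interface RunResult {
  state: AlertTaskState;
  runAt: Date;
}

// Assumed helper: converts an interval string like "5m" to milliseconds.
function intervalToMs(interval: string): number {
  const value = parseInt(interval, 10);
  const multipliers: Record<string, number> = { s: 1_000, m: 60_000, h: 3_600_000 };
  return value * (multipliers[interval.slice(-1)] ?? 60_000);
}

async function run(
  previousState: AlertTaskState,
  interval: string,
  executeAlertInstances: () => Promise<AlertTaskState>,
  logger: { error: (msg: string) => void }
): Promise<RunResult> {
  // The next run is scheduled from the alert's own interval either way.
  const runAt = new Date(Date.now() + intervalToMs(interval));
  try {
    const state = await executeAlertInstances();
    return { state, runAt };
  } catch (err) {
    // Replicate what Task Manager would do on failure (log it), but keep the
    // previous state and a fresh runAt so the task is never marked as failed.
    logger.error(`Alert execution failed: ${(err as Error).message}`);
    return { state: previousState, runAt };
  }
}
```

Because the error never propagates out of the task runner, Task Manager sees every run as a success and the alert keeps getting rescheduled instead of drifting into the failed state.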
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
Yeah, that makes sense. It still keeps a slight danger of Zombie Alerts, but then it'll be down to our own implementation of Alerting and not our users, which makes it feel a little safer. I'll explore in that direction.
Exactly, except it should come back alive at some point with infinite retry logic 😉
This is now handled in #53688. There is still one open issue we need to address: …

My thinking is that we don't want the fallback in this case, as this isn't something that can fix itself over time; it requires an update to the alert before it can be fixed.
We now wrap the validation in the fallback as well. The only step that is not recoverable now is when fetching the SavedObject fails, and I feel we can keep it this way: falling back from that would require us to retry at some made-up interval and would mean we can't provide the state we wish to pass along, so it could result in invalid state if any assumptions are changed in TM.
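Purely as an illustration of that split, the sketch below keeps the SavedObject fetch outside the fallback (so a failure there surfaces to Task Manager and its retry logic) while validation and execution sit inside it. getDecryptedAlert, validateAlertParams, executeAlert and errorAsRunResult are hypothetical names, not the real alerting internals.

```ts
// Hypothetical helpers; the real implementations live in the alerting task runner.
declare function getDecryptedAlert(alertId: string): Promise<{ params: unknown; interval: string }>;
declare function validateAlertParams(params: unknown): Record<string, unknown>;
declare function executeAlert(params: Record<string, unknown>): Promise<Record<string, unknown>>;
declare function errorAsRunResult(err: Error, interval: string): { state: object; runAt: Date };

function nextRunAt(interval: string): Date {
  // Interval parsing kept trivial for the sketch; assumes minutes, e.g. "5m".
  return new Date(Date.now() + parseInt(interval, 10) * 60_000);
}

async function runAlertTask(alertId: string) {
  // Not covered by the fallback: if fetching/decrypting the alert fails,
  // let Task Manager's own retry logic take over instead of inventing an interval.
  const alert = await getDecryptedAlert(alertId);

  try {
    // Validation failures now hit the same fallback as execution failures,
    // so a misconfigured alert no longer kills the underlying task.
    const params = validateAlertParams(alert.params);
    const state = await executeAlert(params);
    return { state, runAt: nextRunAt(alert.interval) };
  } catch (err) {
    return errorAsRunResult(err as Error, alert.interval);
  }
}
```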
We have identified an issue where an Alert can force itself into a Zombie state: its underlying task aborts its retry logic and remains forever in a failed state.
The underlying reason is that Alerting doesn't currently use Task Manager's `schedule` (we have an issue for it: #46001) and instead uses its own internal `schedule`, meaning Task Manager doesn't classify the Alerting Task as a recurring task. As things stand, only recurring tasks are allowed to retry indefinitely in Task Manager.

As I see it, there are two options for addressing this:
1. Address issue "Convert alerts to use task manager intervals" (#46001), which is now somewhat more feasible thanks to the work we did to support `runNow`. We could allow Task Manager to claim a task for the sole purpose of updating its `schedule`, which would sidestep the issue we had previously encountered, but could result in long update requests, as Task Manager might have to wait for tasks to become free for claiming. This is still not an ideal solution (I have toyed with other ideas, such as spawning a Task whose job it is to update another task once it becomes free [but this is not simple, as there's potential for multiples of these... taskpocalypse waiting to happen 🤣]), but a feasible one.

2. Allow a TaskType to have an infinite number of tries, meaning that when the task fails it will continue to retry forever until it succeeds again. The nice thing is that we already have the mechanism in place in Task Manager to space these attempts out further and further as it continues to fail (it adds 5 minutes per failure), and this spacing is reset once the task succeeds, so it is a relatively graceful way of handling this failure mode (see the sketch after this list).
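To make Option 2's retry spacing concrete (as referenced above), here is a minimal sketch of the backoff arithmetic: each consecutive failure pushes the next attempt out by a further 5 minutes, and the counter resets on success. The function and task shape are illustrative, not Task Manager's actual API.

```ts
const RETRY_INCREMENT_MS = 5 * 60 * 1000; // Task Manager adds 5 minutes per failure.

interface TaskAttemptInfo {
  attempts: number; // consecutive failures so far; reset to 0 on success
}

// Next retry time under the "infinite tries" option: the delay grows linearly
// with the number of consecutive failures.
function nextRetryAt(task: TaskAttemptInfo, now: Date = new Date()): Date {
  return new Date(now.getTime() + task.attempts * RETRY_INCREMENT_MS);
}

// Example: after 3 consecutive failures, the next attempt is scheduled 15 minutes out.
```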
My instinct, due to the looming 7.6 release, is to go with Option 2 and then prioritise Option 1 for 7.7.
The reason being that I'm not sure we can figure out Option 1 in time for 7.6. But as things stand we have two different implementations of scheduling for things being run by TM (one in TM itself and another in Alerting), which makes things complicated and harder to maintain, so we would still want to address this at some point.
Any thoughts?