[alerting] log warning when alert tasks are disabled due to saved object not found #101227
Labels
Feature:Alerting
Team:ResponseOps
Issue #100764 is currently open to figure out if we should "disable" alert tasks when the alert saved object is not found. We're not sure. We'll need to evaluate why we did this in the first place, since if we decide to NOT disable the alert task, presumably other bad things will happen.
But we believe we are seeing these disabled alert tasks today, in the field, due to transient networking issues. For situations like that, we don't really want to disable the alert task; we would like to retry. But even figuring out "when" to retry seems non-trivial.
One thing we can do today is log a warning when we disable these alerts. It appears to be this code:
kibana/x-pack/plugins/alerting/server/task_runner/task_runner.ts
Lines 580 to 585 in 8e48d48
So, interestingly, it's not that the alert is "disabled" or the task is deleted; it's just that it's not scheduled to run again. Presumably it's in an idle state at that point? We were wondering if we could collect metrics on these kinda zombified alerts; perhaps there's enough unique state here that we can.
In any case, to help with diagnosing cases where this DOES happen, seems like we should be logging a message. And we do! Except it's a debug log message (line 573 below) - presumably to mask cases where the alert is deleted after the task is claimed but before it's finished completely running:
kibana/x-pack/plugins/alerting/server/task_runner/task_runner.ts
Lines 568 to 578 in 8e48d48
So, it seems like it shouldn't be a debug message; perhaps a warning would be slightly better than an error? Maybe the message could be a little clearer about what's going on, and say that the task will not be rescheduled? Perhaps it would be better to log this message in the code where the scheduling is actually done, rather than where it is now, in the `state` calculation?
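To make the proposal concrete, here is a minimal sketch of logging the warning at the point where the schedule is decided. It is not the actual `task_runner.ts` code: the `Logger` shape, `SavedObjectNotFoundError`, `resolveSchedule`, and the `FALLBACK_RETRY_INTERVAL` value are all hypothetical names made up for illustration, and it assumes that returning no schedule is what keeps the task from running again.

```typescript
// Hypothetical minimal logger shape; NOT Kibana's actual Logger type.
interface Logger {
  debug(msg: string): void;
  warn(msg: string): void;
  error(msg: string): void;
}

// Hypothetical stand-in for a "saved object not found" error.
class SavedObjectNotFoundError extends Error {
  readonly savedObjectId: string;

  constructor(savedObjectId: string) {
    super(`Saved object [alert/${savedObjectId}] not found`);
    this.name = 'SavedObjectNotFoundError';
    this.savedObjectId = savedObjectId;
  }
}

interface Schedule {
  interval: string;
}

// Assumed value, for illustration only.
const FALLBACK_RETRY_INTERVAL = '5m';

// Decide the next schedule for an alert task after a run.
// Returning undefined means "do not schedule this task again".
function resolveSchedule(
  alertId: string,
  error: Error | undefined,
  current: Schedule | undefined,
  logger: Logger
): Schedule | undefined {
  if (error === undefined) {
    return current;
  }

  if (error instanceof SavedObjectNotFoundError && error.savedObjectId === alertId) {
    // Warn here, where the scheduling decision is made, and say
    // explicitly that the task will not run again.
    logger.warn(
      `Alert "${alertId}" task will not be rescheduled because its saved object was not found: ${error.message}`
    );
    return undefined;
  }

  // Any other error: log it and retry on the existing or fallback interval.
  logger.error(`Executing Alert "${alertId}" has resulted in Error: ${error.message}`);
  return { interval: current?.interval ?? FALLBACK_RETRY_INTERVAL };
}
```

Putting the warning next to the scheduling decision means the log line and the "will not be rescheduled" outcome cannot drift apart, which is harder to guarantee when the message is emitted from a separate code path.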