[alerting] log warning when alert tasks are disabled due to saved object not found #101227

pmuellr · 2021-06-02T21:37:17Z

Issue #100764 is currently open to figure out if we should "disable" alert tasks when the alert saved object is not found. We're not sure. We'll need to evaluate why we did this in the first place, since if we decide to NOT disable the alert task, presumably other bad things will happen.

But we believe we are seeing these disabled alert tasks today, in the field, due to transient networking issues. For situations like that, we don't really want to disable the alert task, we would like to retry - but even figuring out "when" to retry seems non-trivial.

One thing we can do today is log a warning when we disable these alerts. The disabling appears to happen in this code:

kibana/x-pack/plugins/alerting/server/task_runner/task_runner.ts

Lines 580 to 585 in 8e48d48

    
    schedule: resolveErr<IntervalSchedule | undefined, Error>(schedule, (error) => {
      if (isAlertSavedObjectNotFoundError(error, alertId)) {
        throwUnrecoverableError(error);
      }
      return { interval: taskSchedule?.interval ?? FALLBACK_RETRY_INTERVAL };
    }),

So, interestingly, it's not that the alert is "disabled" or the task is deleted; it's just that it's not scheduled to run again. Presumably it's in an idle state at that point? We were wondering if we could collect metrics on these kinda zombified alerts; perhaps there's enough unique state here that we can.
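To make that behavior concrete, here is a minimal, self-contained sketch of the pattern, not the actual Kibana code. The `Result` type, `resolveSchedule`, `throwUnrecoverable`, `isNotFound`, and the `'5m'` fallback value are all stand-ins assumed for illustration; the real `resolveErr` and `throwUnrecoverableError` helpers may differ in shape.

```typescript
// Illustrative stand-ins only; none of these are the real Kibana helpers.
type Result<T, E> = { tag: 'ok'; value: T } | { tag: 'err'; error: E };

interface IntervalSchedule {
  interval: string;
}

interface TaggedError extends Error {
  unrecoverable?: boolean;
}

const FALLBACK_RETRY_INTERVAL = '5m'; // assumed value for this sketch

// Hypothetical check; the real code inspects the saved objects client error.
function isNotFound(error: Error): boolean {
  return error.message.includes('not found');
}

// Tag the error and rethrow it, signalling "do not run this task again".
function throwUnrecoverable(error: Error): never {
  (error as TaggedError).unrecoverable = true;
  throw error;
}

// On success, keep the schedule. On a "not found" error, throw so that no
// next run is scheduled. On any other error, retry at the task's interval
// or at the fallback interval.
function resolveSchedule(
  schedule: Result<IntervalSchedule | undefined, Error>,
  taskInterval?: string
): IntervalSchedule | undefined {
  if (schedule.tag === 'ok') {
    return schedule.value;
  }
  if (isNotFound(schedule.error)) {
    throwUnrecoverable(schedule.error);
  }
  return { interval: taskInterval ?? FALLBACK_RETRY_INTERVAL };
}
```

In this sketch a transient error still yields an interval, so the task runs again, while a "not found" error escapes as an exception and no interval is ever returned.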

In any case, to help with diagnosing cases where this DOES happen, seems like we should be logging a message. And we do! Except it's a debug log message (line 573 below) - presumably to mask cases where the alert is deleted after the task is claimed but before it's finished completely running:

kibana/x-pack/plugins/alerting/server/task_runner/task_runner.ts

Lines 568 to 578 in 8e48d48

    
    (err: ElasticsearchError) => {
      const message = `Executing Alert "${alertId}" has resulted in Error: ${getEsErrorMessage(
        err
      )}`;
      if (isAlertSavedObjectNotFoundError(err, alertId)) {
        this.logger.debug(message);
      } else {
        this.logger.error(message);
      }
      return originalState;
    }

So, seems like it shouldn't be a debug message, perhaps a warning would be slightly better than error? Maybe the message could be a little clearer about what's going on, and that the task will not be rescheduled? Perhaps it would be better to log this message in the code where the scheduling is actually done, compared to where it is now in the state calculation?
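As a rough illustration of the idea, something along these lines could work. This is only a sketch under assumptions: `Logger` is a minimal stand-in interface, `logExecutionError` is a hypothetical helper that does not exist in Kibana, and the exact message wording is made up here.

```typescript
// Minimal stand-in for a logger; not the real Kibana Logger interface.
interface Logger {
  debug(message: string): void;
  warn(message: string): void;
  error(message: string): void;
}

// Hypothetical helper: log at warn level, with an explicit note that the
// task will not be rescheduled, when the alert saved object was not found.
// Every other failure is still logged at error level.
function logExecutionError(
  logger: Logger,
  alertId: string,
  err: Error,
  savedObjectNotFound: boolean
): void {
  const message = `Executing Alert "${alertId}" has resulted in Error: ${err.message}`;
  if (savedObjectNotFound) {
    logger.warn(`${message}. The alert saved object was not found, so the task will not be rescheduled.`);
  } else {
    logger.error(message);
  }
}
```

A helper like this would presumably be called from the branch that decides not to reschedule, rather than from the state calculation, so the log line sits next to the decision it describes.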


elasticmachine · 2021-06-02T21:37:19Z

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

mikecote · 2021-06-03T11:24:37Z

Added to To-Do near the top.

pmuellr added the Feature:Alerting and Team:ResponseOps labels Jun 2, 2021

pmuellr mentioned this issue Jun 2, 2021

[Alerting] Should we retry alerting tasks that fail with Saved object not found errors #100764

Closed

ymao1 self-assigned this Jun 4, 2021

This was referenced Jun 8, 2021

[Alerting] Log warning when rules are not rescheduled due to Saved Object not found error #101589

Closed

[Alerting] Log warning when rules are not rescheduled due to Saved Object not found error #101591

Merged

ymao1 closed this as completed in #101591 Jun 9, 2021

gmmorris mentioned this issue Jun 15, 2021

[RAC][Epic] Observability of the alerting framework phase 1 #98902

Closed

kobelb added the needs-team label Jan 31, 2022

botelastic bot removed the needs-team label Jan 31, 2022

kobelb added the needs-team label Jan 31, 2022

botelastic bot removed the needs-team label Jan 31, 2022
