[alerting] log warning when alert tasks are disabled due to saved object not found #101227

Closed
pmuellr opened this issue Jun 2, 2021 · 2 comments · Fixed by #101591
Labels
Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

pmuellr commented Jun 2, 2021

Issue #100764 is currently open to figure out if we should "disable" alert tasks when the alert saved object is not found. We're not sure. We'll need to evaluate why we did this in the first place, since if we decide to NOT disable the alert task, presumably other bad things will happen.

But we believe we are seeing these disabled alert tasks today, in the field, due to transient networking issues. For situations like that, we don't really want to disable the alert task, we would like to retry - but even figuring out "when" to retry seems non-trivial.

One thing we can do today is log a warning when we disable these alerts. The relevant code appears to be:

schedule: resolveErr<IntervalSchedule | undefined, Error>(schedule, (error) => {
  if (isAlertSavedObjectNotFoundError(error, alertId)) {
    throwUnrecoverableError(error);
  }
  return { interval: taskSchedule?.interval ?? FALLBACK_RETRY_INTERVAL };
}),

So, interestingly, it's not that the alert is "disabled" or the task is deleted; it's just that the task is not scheduled to run again. Presumably it's in an idle state at that point? We were wondering if we could collect metrics on these zombified alerts; perhaps there's enough unique state here that we can.
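To make the "not scheduled to run again" behavior concrete, here is a minimal, self-contained sketch of the pattern in the snippet above. Everything in it is a simplified stand-in written for illustration: the `Result` type, `resolveErr`, both error classes, `nextSchedule`, and the `'5m'` fallback value are assumptions, not the actual Kibana implementations.

```typescript
// Simplified stand-ins for illustration only; NOT the Kibana implementations.
type Ok<T> = { tag: 'ok'; value: T };
type Err<E> = { tag: 'err'; error: E };
type Result<T, E> = Ok<T> | Err<E>;

interface IntervalSchedule {
  interval: string;
}

const FALLBACK_RETRY_INTERVAL = '5m'; // assumed value

// Hypothetical marker error: a task runner seeing this would stop rescheduling.
class UnrecoverableError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'UnrecoverableError';
  }
}

// Hypothetical not-found error carrying the id of the missing saved object.
class SavedObjectNotFoundError extends Error {
  readonly savedObjectId: string;
  constructor(savedObjectId: string) {
    super(`Saved object [alert/${savedObjectId}] not found`);
    this.name = 'SavedObjectNotFoundError';
    this.savedObjectId = savedObjectId;
  }
}

// Returns the ok value, or maps the error through onErr.
function resolveErr<T, E>(result: Result<T, E>, onErr: (error: E) => T): T {
  return result.tag === 'ok' ? result.value : onErr(result.error);
}

function isAlertSavedObjectNotFoundError(error: Error, alertId: string): boolean {
  return (error as { savedObjectId?: string }).savedObjectId === alertId;
}

// Hypothetical wrapper around the schedule calculation from the snippet above.
function nextSchedule(
  schedule: Result<IntervalSchedule | undefined, Error>,
  alertId: string,
  taskSchedule?: IntervalSchedule
): IntervalSchedule | undefined {
  return resolveErr<IntervalSchedule | undefined, Error>(schedule, (error) => {
    if (isAlertSavedObjectNotFoundError(error, alertId)) {
      // No schedule is returned on this path, so the task never runs again.
      throw new UnrecoverableError(error.message);
    }
    // Any other error: retry on the task's own interval, or the fallback.
    return { interval: taskSchedule?.interval ?? FALLBACK_RETRY_INTERVAL };
  });
}
```

In this sketch a transient error still yields a retry interval, while a not-found error for the same alert id throws before any interval is produced, which is the idle state described above.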

In any case, to help with diagnosing cases where this DOES happen, it seems like we should be logging a message. And we do! Except it's a debug log message (the logger.debug call below), presumably to mask cases where the alert is deleted after the task is claimed but before it has finished running:

(err: ElasticsearchError) => {
  const message = `Executing Alert "${alertId}" has resulted in Error: ${getEsErrorMessage(
    err
  )}`;
  if (isAlertSavedObjectNotFoundError(err, alertId)) {
    this.logger.debug(message);
  } else {
    this.logger.error(message);
  }
  return originalState;
}

So it seems like this shouldn't be a debug message; perhaps a warning would be slightly better than an error? Maybe the message could be a little clearer about what's going on, including that the task will not be rescheduled? It might also be better to log this message in the code where the scheduling is actually done, rather than where it is now, in the state calculation.
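As a rough sketch of that suggestion, the warning could be emitted at the point where the task gives up on rescheduling. The `Logger` interface, `UnrecoverableError`, `scheduleAfterError`, the message wording, and the `'5m'` fallback are all hypothetical names and values chosen for illustration; this is not the change that was eventually merged.

```typescript
// Minimal logger shape, assumed for illustration.
interface Logger {
  warn(message: string): void;
}

interface IntervalSchedule {
  interval: string;
}

const FALLBACK_RETRY_INTERVAL = '5m'; // assumed value

// Hypothetical marker error telling the task runner not to reschedule.
class UnrecoverableError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'UnrecoverableError';
  }
}

// Hypothetical helper: decides the next schedule after a failed run and
// logs a warning when the task is about to stop being rescheduled.
function scheduleAfterError(
  logger: Logger,
  alertId: string,
  error: Error,
  savedObjectNotFound: boolean,
  taskSchedule?: IntervalSchedule
): IntervalSchedule {
  if (savedObjectNotFound) {
    logger.warn(
      `Unable to execute alert "${alertId}" because its saved object was not found; ` +
        `the task will not be rescheduled: ${error.message}`
    );
    throw new UnrecoverableError(error.message);
  }
  // Other errors keep the retry behavior and are assumed to be logged elsewhere.
  return { interval: taskSchedule?.interval ?? FALLBACK_RETRY_INTERVAL };
}
```

Logging here rather than in the state calculation ties the message to the decision it describes, so the log line appears exactly when the task stops being rescheduled.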

@elasticmachine
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mikecote
mikecote commented Jun 3, 2021

Added to To-Do near the top.
