Test alerting and action error handling to ensure it works as designed #53650

mikecote · 2019-12-19T22:19:31Z

Test alerting and action error handling to ensure it works as discussed in #39349. We found some gaps and want to make sure there aren't any others.

mikecote · 2019-12-19T22:32:23Z

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

mikecote · 2020-09-30T17:58:39Z

I've finished my testing and encountered some issues.

1. Alert dies after 3 consecutive timeouts

If an alert times out 3 times in a row (it doesn't finish running within 5 minutes), it stops running completely because the task has reached its maxAttempts of 3 (the task manager default).
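To make the failure mode concrete, here is a minimal sketch of a task that is abandoned once it reaches maxAttempts. The `SimulatedTask` type and `recordRun` function are hypothetical and are not Task Manager's API; resetting the counter on success is an assumption drawn from the "in a row" wording above.

```typescript
// Hypothetical model, not Task Manager's real API.
interface SimulatedTask {
  attempts: number;
  status: 'idle' | 'failed';
}

const MAX_ATTEMPTS = 3; // task manager default, per the finding above

// Records the outcome of one run of the task.
// Assumption: a successful run resets the attempt counter.
function recordRun(task: SimulatedTask, timedOut: boolean): SimulatedTask {
  if (task.status === 'failed') {
    return task; // an abandoned task is never picked up again
  }
  if (!timedOut) {
    return { attempts: 0, status: 'idle' };
  }
  const attempts = task.attempts + 1;
  return { attempts, status: attempts >= MAX_ATTEMPTS ? 'failed' : 'idle' };
}
```

In this model, two timeouts followed by a success leave the task healthy, while three consecutive timeouts leave it in a state that no later run can recover from.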

2. Task remains in "running" state when last attempt timed out

This is outlined in the documentation under limitations, but it will look very strange once we have a Task Manager UI.

3. Alert executions that time out will try again 10 minutes later

The 10 minutes comes from the default 5 minute timeout plus the 5 minute backoff that is multiplied by the attempt number. This doesn't work well for alerts that run every 10 seconds or every minute. The expectation would be to simply run at the next interval. (This is the reason why #46001 is higher in priority, but it needs discussion.)
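The arithmetic can be sketched as follows. The constant and function names are hypothetical, and scaling the backoff by the attempt number is my reading of "backoff attempt multiple", not something confirmed against the Task Manager source.

```typescript
const MINUTE_MS = 60 * 1000;
const TIMEOUT_MS = 5 * MINUTE_MS; // default timeout, per the finding above
const BACKOFF_PER_ATTEMPT_MS = 5 * MINUTE_MS; // assumed to scale with the attempt number

// Time between the start of a timed-out run and its retry.
function delayUntilRetryMs(attempt: number): number {
  return TIMEOUT_MS + attempt * BACKOFF_PER_ATTEMPT_MS;
}

// Number of whole alert intervals that elapse before the retry happens.
function intervalsElapsedBeforeRetry(intervalMs: number, attempt: number): number {
  return Math.floor(delayUntilRetryMs(attempt) / intervalMs);
}
```

Under these assumptions the first retry happens 10 minutes after the run started, which spans 60 intervals for an alert scheduled every 10 seconds and 10 intervals for an alert scheduled every minute.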

4. Alert executors are not provided the date and time of the last successful execution

previousStartedAt keeps changing even when the previous execution failed. The documentation states that it should be the date and time of the last successful execution.
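A small sketch of the difference between the observed and the documented behavior. `RunState`, `afterRun`, and `lastSuccessfulStartedAt` are hypothetical names used only for illustration; only `previousStartedAt` comes from the alerting framework.

```typescript
// Hypothetical state kept between alert executions.
interface RunState {
  previousStartedAt: Date | null; // observed: updated after every run
  lastSuccessfulStartedAt: Date | null; // documented expectation: updated only on success
}

// Returns the state to carry into the next execution.
function afterRun(state: RunState, startedAt: Date, succeeded: boolean): RunState {
  return {
    previousStartedAt: startedAt,
    lastSuccessfulStartedAt: succeeded ? startedAt : state.lastSuccessfulStartedAt,
  };
}
```

After a successful run followed by a failed one, the two fields diverge: the first points at the failed run, the second still points at the successful one. An executor that queries "everything since my last run" would skip data if it were handed the first value.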

5. Actions are not able to configure a max number of attempts

The framework allows such configuration, but it is disregarded because custom getRetry logic indicates that the action should stop trying after the first attempt.
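As an illustration of how a custom retry hook can shadow a configured limit, here is a hedged sketch. The `RunnerDefinition` shape, the simplified boolean `getRetry` signature, and `shouldRetry` are assumptions made for this example and do not reproduce the actual Task Manager or actions code.

```typescript
// Hypothetical, simplified runner definition.
interface RunnerDefinition {
  maxAttempts: number;
  // When present, this hook takes precedence over maxAttempts.
  getRetry?: (attempts: number, error: Error) => boolean;
}

// Decides whether a failed run should be attempted again.
function shouldRetry(def: RunnerDefinition, attempts: number, error: Error): boolean {
  if (def.getRetry) {
    return def.getRetry(attempts, error);
  }
  return attempts < def.maxAttempts;
}

// An action-like definition whose hook always says "stop", so maxAttempts never applies.
const actionLikeDefinition: RunnerDefinition = {
  maxAttempts: 5,
  getRetry: () => false,
};
```

With this definition, `shouldRetry(actionLikeDefinition, 1, new Error('boom'))` returns `false` even though 5 attempts were configured; removing the hook lets the configured limit take effect.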

mikecote · 2020-10-01T17:00:24Z

Some issues have been created for the above:

Task remains in "running" state when last attempt timed out #79165
Alert executors are not provided the date and time of the last successful execution #79166
Actions are not able to configure a max number of attempts #79169

Issues 1 and 3 are the reason why "Convert alerts to use task manager intervals" (#46001) is already in the To-Do column.

I'm closing the issue now that testing is complete.

mikecote added the Feature:Alerting and Team:ResponseOps labels Dec 19, 2019

mikecote mentioned this issue Apr 23, 2020

Alert types to have customizable execution timeouts / retries #63188

Closed

mikecote mentioned this issue Aug 11, 2020

Alerting GA #74788

Closed

36 tasks

mikecote self-assigned this Sep 30, 2020

mikecote closed this as completed Oct 1, 2020

mikecote mentioned this issue Oct 1, 2020

Convert alerts to use task manager intervals #46001

Closed

gmmorris mentioned this issue Oct 30, 2020

[Task Manager] Changed alerts schedule logic to use Task Manager internals #80149

Merged

3 tasks

kobelb added the needs-team label Jan 31, 2022

botelastic bot removed the needs-team label Jan 31, 2022
