Test alerting and action error handling to ensure it works as designed #53650

Closed
mikecote opened this issue Dec 19, 2019 · 3 comments
Labels
Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)


@mikecote
Contributor

Test alerting and action error handling to ensure it works as discussed in #39349. We found some gaps and want to make sure there aren't any others.

@mikecote mikecote added Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Dec 19, 2019
@mikecote
Contributor Author

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mikecote
Contributor Author

mikecote commented Sep 30, 2020

I've finished my testing and encountered some issues.

1. Alert dies after 3 consecutive timeouts

If an alert times out 3 times in a row (that is, it doesn't finish running within 5 minutes), it stops running completely because the task has reached its maxAttempts of 3 (the task manager default).
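
A minimal sketch of the failure mode, for illustration only. It is a toy model, not Task Manager code: `TaskSketch`, `applyRunOutcome`, and `simulate` are hypothetical names, and the model assumes a successful run resets the attempt counter (which is what "in a row" implies).

```typescript
// Toy model of the attempt counter. All names here are hypothetical.
type Outcome = 'success' | 'timeout';

interface TaskSketch {
  attempts: number;   // consecutive failed attempts so far
  exhausted: boolean; // true once the task will never be picked up again
}

const MAX_ATTEMPTS = 3; // task manager default, per the comment above

// Applies the outcome of one run and returns the new state.
function applyRunOutcome(task: TaskSketch, outcome: Outcome): TaskSketch {
  if (outcome === 'success') {
    // Assumption: a successful run resets the counter.
    return { attempts: 0, exhausted: false };
  }
  const attempts = task.attempts + 1;
  return { attempts, exhausted: attempts >= MAX_ATTEMPTS };
}

// Replays a sequence of outcomes, stopping once the task is exhausted.
function simulate(outcomes: Outcome[]): TaskSketch {
  let task: TaskSketch = { attempts: 0, exhausted: false };
  for (const outcome of outcomes) {
    if (task.exhausted) break;
    task = applyRunOutcome(task, outcome);
  }
  return task;
}
```

In this model, `simulate(['timeout', 'timeout', 'timeout'])` ends exhausted, while a success anywhere in the sequence keeps the alert alive.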

2. Task remains in "running" state when last attempt timed out

This is outlined in the documentation under limitations, but it will look very strange once we have a Task Manager UI.

3. An alert execution that times out is retried 10 minutes later

The 10 minutes comes from the default 5 minute timeout plus the 5 minute backoff applied per attempt. This doesn't work well for alerts that run every 10 seconds or every minute. The expectation would be to simply run at the next interval. (This is why #46001 is higher in priority, though it still needs discussion.)
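
The arithmetic behind the 10 minutes, as a rough sketch. It assumes the backoff is the attempt number multiplied by 5 minutes, which is one reading of "backoff attempt multiple"; the constants and function names are illustrative and are not taken from Task Manager.

```typescript
const MINUTE_MS = 60 * 1000;
const DEFAULT_TIMEOUT_MS = 5 * MINUTE_MS;     // default timeout, per the comment above
const BACKOFF_PER_ATTEMPT_MS = 5 * MINUTE_MS; // assumed: backoff grows by 5 minutes per attempt

// Time from the start of a timed-out run until its retry, in this model.
function retryDelayMs(attempt: number): number {
  return DEFAULT_TIMEOUT_MS + attempt * BACKOFF_PER_ATTEMPT_MS;
}

// How many alert intervals fit into that delay, i.e. how many
// scheduled checks the user would have expected in the meantime.
function intervalsElapsedBeforeRetry(intervalMs: number, attempt: number): number {
  return Math.floor(retryDelayMs(attempt) / intervalMs);
}
```

Under these assumptions the first retry comes 10 minutes after the run started, which spans 60 intervals for a 10 second alert and 10 intervals for a 1 minute alert.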

4. Alert executors are not provided the date and time of the last successful execution

The previousStartedAt value is updated even when the previous execution failed. The documentation states it should be the start time of the last successful execution.
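
To make the difference concrete, here is a small sketch that contrasts the observed behavior with the documented one. `RunRecord` and both functions are hypothetical and only model the two interpretations; they are not the alerting framework's code.

```typescript
// Hypothetical record of one alert execution.
interface RunRecord {
  startedAt: Date;
  succeeded: boolean;
}

// Observed: the start time of the previous run, whatever its outcome.
function observedPreviousStartedAt(history: RunRecord[]): Date | null {
  const last = history[history.length - 1];
  return last ? last.startedAt : null;
}

// Documented: the start time of the most recent successful run.
function documentedPreviousStartedAt(history: RunRecord[]): Date | null {
  for (const run of [...history].reverse()) {
    if (run.succeeded) return run.startedAt;
  }
  return null;
}
```

With a history of one success followed by one failure, the first function returns the failed run's start time and the second returns the successful run's start time. An executor that queries "everything since previousStartedAt" would skip the window covered by the failed run in the observed case.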

5. Actions are not able to configure a max number of attempts

The framework allows such configuration, but it is disregarded because custom getRetry logic indicates that no retry should happen after the first attempt.
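
A sketch of how a retry callback can shadow a configured attempt limit. This is a simplified model written for this comment: `TaskDefinitionSketch` and `shouldRetry` are hypothetical, and the precedence shown (the callback wins over maxAttempts) is an assumption based on the behavior described above.

```typescript
// Simplified, hypothetical shape of a task definition.
interface TaskDefinitionSketch {
  maxAttempts?: number;
  // Returns false to stop retrying, true or a Date to retry.
  getRetry?: (attempts: number, error: unknown) => boolean | Date;
}

const DEFAULT_MAX_ATTEMPTS = 3;

// Assumed precedence: when getRetry is defined, it alone decides.
function shouldRetry(def: TaskDefinitionSketch, attempts: number, error: unknown): boolean {
  if (def.getRetry) {
    return def.getRetry(attempts, error) !== false;
  }
  return attempts < (def.maxAttempts ?? DEFAULT_MAX_ATTEMPTS);
}

// An action-like definition: maxAttempts is set, but getRetry always says stop.
const actionLike: TaskDefinitionSketch = {
  maxAttempts: 5,
  getRetry: () => false,
};
```

Here `shouldRetry(actionLike, 1, new Error('boom'))` is false even though maxAttempts is 5, while the same definition without getRetry would retry.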

@mikecote
Contributor Author

mikecote commented Oct 1, 2020

Some issues have been created for the above:

Issues 1 and 3 are the reason why "Convert alerts to use task manager intervals" (#46001) is already in the To-Do column.

I'm closing the issue now that testing is complete.

2 participants