
Actions are not able to configure a max number of attempts #79169

Closed
mikecote opened this issue Oct 1, 2020 · 6 comments · Fixed by #138845
Labels: bug · estimate:needs-research · Feature:Actions/Framework · Feature:Actions · Team:ResponseOps

Comments

mikecote (Contributor) commented Oct 1, 2020

The framework allows such configuration, but it is disregarded because custom getRetry logic tells task manager to stop after the first attempt.

The framework allows action type authors to specify custom retry logic in case they do want the action to be retried when it fails. One way is for the action type to throw an ExecutorError and specify the retry behavior there (though task manager will still skip the next run once max attempts is reached). The other way is to set maxAttempts in the action type definition, which indicates how many attempts task manager should make. The getRetry function returns false without looking at attempts, and maybe that is the fix?
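For reference, a minimal sketch of what an attempts-aware getRetry could look like, assuming the getRetry(attempts: number, error: unknown) signature used when connector tasks are registered (the registration code is quoted further down); maxAttempts stands in for whatever value the action type definition provides, and this is not the current implementation:

getRetry(attempts: number, error: unknown) {
  if (error instanceof ExecutorError) {
    // Let the connector-provided retry value win for executor errors
    return error.retry == null ? false : error.retry;
  }
  // Hypothetical change: honor the configured ceiling instead of always returning false
  return attempts < maxAttempts;
}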

Steps to reproduce (see the sketch after this list):

  • Add `maxAttempts: 3,` here
  • Add `throw new Error('fail');` here
  • Create an alert that finds an instance and add a server log action
  • Wait for the alert to run and notice the log stating the failure
  • Look in the .kibana_task_manager index and notice the task has a status of failed and attempts of 1, but it should have tried two more times
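A rough sketch of the first two modifications, applied to a hypothetical action type definition (the id, field names, and executor shape here are illustrative; the real code lives at the locations linked above):

const serverLogActionType = {
  id: '.server-log',
  name: 'Server log',
  maxAttempts: 3, // step 1: ask task manager for up to three attempts
  async executor() {
    throw new Error('fail'); // step 2: force every execution to fail
  },
};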
mikecote added the bug, Feature:Actions, and Team:ResponseOps labels on Oct 1, 2020
elasticmachine (Contributor) commented

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

mikecote (Contributor · Author) commented Feb 4, 2021

Moving from 7.x - Candidates to 8.x - Candidates (Backlog) after the latest 7.x planning session.

gmmorris added the Feature:Actions/Framework label on Jul 1, 2021
gmmorris added the loe:needs-research label on Jul 14, 2021
gmmorris added the estimate:needs-research label on Aug 18, 2021
gmmorris removed the loe:needs-research label on Sep 2, 2021
kobelb added the needs-team label on Jan 31, 2022
botelastic bot removed the needs-team label on Jan 31, 2022
doakalexi self-assigned this on Aug 10, 2022
doakalexi moved this from Todo to In Progress in AppEx: ResponseOps - Execution & Connectors on Aug 10, 2022
doakalexi (Contributor) commented Aug 10, 2022

Should we always be retrying for executor errors instead of relying on the connector type to specify?

pmuellr (Member) commented Aug 10, 2022

Should we always be retrying for executor errors instead of relying on the connector type to specify?

Is that in reference to the getRetry() function used when connector tasks are registered, shown below?

Feels like we should leave this in - I guess the code might get a little less complicated if we had some other fixed behavior. Leaving it the way it is would allow us to eventually expose this in the connector type registration, which we don't need right now - and not clear we ever would.

It's actually somewhat curious that the getRetry() function ignores attempts!

So, not sure. I guess I'd just as soon leave it in, unless it does actually clean up a bunch of other code, improve performance, ... something. Doesn't seem like it would though.

this.taskManager.registerTaskDefinitions({
  [`actions:${actionType.id}`]: {
    title: actionType.name,
    maxAttempts: actionType.maxAttempts || 1,
    getRetry(attempts: number, error: unknown) {
      if (error instanceof ExecutorError) {
        return error.retry == null ? false : error.retry;
      }
      // Don't retry other kinds of errors
      return false;
    },
    createTaskRunner: (context: RunContext) =>
      this.taskRunnerFactory.create(context, actionType.maxAttempts),
  },
});

ymao1 (Contributor) commented Aug 10, 2022

I think we should definitely be checking for attempts < maxAttempts in getRetry for non-executor errors. For executor errors, we are currently letting each connector type decide whether to retry by whether or not it sets retry: true in the result that the connector execution returns, defined in ActionTypeExecutorResult:

export interface ActionTypeExecutorResult<Data> {
  actionId: string;
  status: ActionTypeExecutorResultStatus;
  message?: string;
  serviceMessage?: string;
  data?: Data;
  retry?: null | boolean | Date;
}

Do we want to keep that behavior or always retry on these errors?

pmuellr (Member) commented Aug 11, 2022

Do we want to keep that behavior or always retry on these errors?

Looks like Slack is maybe the only connector that makes use of this - and it uses the Date form of retry (for when it gets 429 response codes). Even though I think this particular retry may be broken right now, it seems like we wouldn't be able to support it if we just always retried in these cases. And I suspect there are other cases where we would want to use retry: false - 4xx errors are probably a case for that - which we aren't handling right now, but should.

So I think we should continue to allow connector types to pass back the retry info and make use of it.
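Along those lines, a sketch of how a connector could translate HTTP responses into the retry field - a Date for 429 (rate limiting), false for other 4xx client errors, true otherwise (illustrative only; not what the Slack connector does today):

function retryFromStatusCode(statusCode: number, retryAfterSeconds?: number): null | boolean | Date {
  if (statusCode === 429) {
    // Rate limited: retry once the Retry-After window has passed (default to 60s when absent)
    return new Date(Date.now() + (retryAfterSeconds ?? 60) * 1000);
  }
  if (statusCode >= 400 && statusCode < 500) {
    // Other client errors are unlikely to succeed on a retry
    return false;
  }
  // Server errors and transient failures are worth retrying
  return true;
}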
