
Actions are not able to configure a max number of attempts #79169

Closed
mikecote opened this issue Oct 1, 2020 · 6 comments · Fixed by #138845
Labels: bug · estimate:needs-research · Feature:Actions/Framework · Feature:Actions · Team:ResponseOps

Comments

mikecote (Contributor) commented Oct 1, 2020

The framework allows such configuration, but it is disregarded because custom getRetry logic tells task manager to stop after the first attempt.

The framework allows action type authors to specify custom retry logic in case they do want the action to be retried when it fails. One way is for the action type to throw an ExecutorError and specify the retry behavior there (though task manager will still skip the next run once max attempts is reached). The other way is to set maxAttempts in the action type definition, which indicates how many attempts task manager should make. The getRetry function returns false without looking at attempts, and maybe that is the fix?
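For reference, a minimal sketch of what an attempts-aware getRetry could look like, assuming the getRetry(attempts: number, error: unknown) signature used when connector tasks are registered (the registration code is quoted further down); maxAttempts stands in for whatever value the action type definition provides, and this is not the current implementation:

getRetry(attempts: number, error: unknown) {
  if (error instanceof ExecutorError) {
    // Let the connector-provided retry value win for executor errors
    return error.retry == null ? false : error.retry;
  }
  // Hypothetical change: honor the configured ceiling instead of always returning false
  return attempts < maxAttempts;
}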

Steps to reproduce (see the sketch after this list):

  • Add `maxAttempts: 3,` here
  • Add `throw new Error('fail');` here
  • Create an alert that finds an instance and add a server log action
  • Wait for the alert to run and notice the log stating the failure
  • Look in the .kibana_task_manager index and notice the task has a status of failed and attempts of 1, but it should have tried two more times
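A rough sketch of the first two modifications, applied to a hypothetical action type definition (the id, field names, and executor shape here are illustrative; the real code lives at the locations linked above):

const serverLogActionType = {
  id: '.server-log',
  name: 'Server log',
  maxAttempts: 3, // step 1: ask task manager for up to three attempts
  async executor() {
    throw new Error('fail'); // step 2: force every execution to fail
  },
};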
mikecote added the bug, Feature:Actions, and Team:ResponseOps labels on Oct 1, 2020
elasticmachine (Contributor) commented

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

mikecote (Contributor · Author) commented Feb 4, 2021

Moving from 7.x - Candidates to 8.x - Candidates (Backlog) after the latest 7.x planning session.

gmmorris added the Feature:Actions/Framework label on Jul 1, 2021
gmmorris added the loe:needs-research label on Jul 14, 2021
gmmorris added the estimate:needs-research label on Aug 18, 2021
gmmorris removed the loe:needs-research label on Sep 2, 2021
kobelb added the needs-team label on Jan 31, 2022
botelastic bot removed the needs-team label on Jan 31, 2022
doakalexi self-assigned this on Aug 10, 2022
doakalexi moved this from Todo to In Progress in AppEx: ResponseOps - Execution & Connectors on Aug 10, 2022
doakalexi (Contributor) commented Aug 10, 2022

Should we always be retrying for executor errors instead of relying on the connector type to specify?

pmuellr (Member) commented Aug 10, 2022

Should we always be retrying for executor errors instead of relying on the connector type to specify?

Is that in reference to the getRetry() function used when connector tasks are registered, shown below?

Feels like we should leave this in - I guess the code might get a little less complicated if we had some other fixed behavior. Leaving it the way it is would allow us to eventually expose this in the connector type registration, which we don't need right now - and not clear we ever would.

It's actually somewhat curious that the getRetry() function ignores attempts!

So, not sure. I guess I'd just as soon leave it in, unless it does actually clean up a bunch of other code, improve performance, ... something. Doesn't seem like it would though.

this.taskManager.registerTaskDefinitions({
  [`actions:${actionType.id}`]: {
    title: actionType.name,
    maxAttempts: actionType.maxAttempts || 1,
    getRetry(attempts: number, error: unknown) {
      if (error instanceof ExecutorError) {
        return error.retry == null ? false : error.retry;
      }
      // Don't retry other kinds of errors
      return false;
    },
    createTaskRunner: (context: RunContext) =>
      this.taskRunnerFactory.create(context, actionType.maxAttempts),
  },
});

ymao1 (Contributor) commented Aug 10, 2022

I think we should definitely be checking for attempts < maxAttempts in getRetry for non-executor errors. For executor errors, we are currently letting each connector type decide whether to retry by whether or not it sets retry: true in the result that the connector execution returns, defined in ActionTypeExecutorResult:

export interface ActionTypeExecutorResult<Data> {
  actionId: string;
  status: ActionTypeExecutorResultStatus;
  message?: string;
  serviceMessage?: string;
  data?: Data;
  retry?: null | boolean | Date;
}

Do we want to keep that behavior or always retry on these errors?

pmuellr (Member) commented Aug 11, 2022

Do we want to keep that behavior or always retry on these errors?

Looks like Slack is maybe the only connector that makes use of this - and it uses the Date form of retry (for when it gets 429 response codes). Even though I think this particular retry may be broken right now, it seems like we wouldn't be able to support it if we just always retried in these cases. And I suspect there are other cases where we would want to use retry: false - 4xx errors are probably a case for that - which we aren't handling right now, but should.

So I think we should continue to allow connector types to pass back the retry info and make use of it.
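Along those lines, a sketch of how a connector could translate HTTP responses into the retry field - a Date for 429 (rate limiting), false for other 4xx client errors, true otherwise (illustrative only; not what the Slack connector does today):

function retryFromStatusCode(statusCode: number, retryAfterSeconds?: number): null | boolean | Date {
  if (statusCode === 429) {
    // Rate limited: retry once the Retry-After window has passed (default to 60s when absent)
    return new Date(Date.now() + (retryAfterSeconds ?? 60) * 1000);
  }
  if (statusCode >= 400 && statusCode < 500) {
    // Other client errors are unlikely to succeed on a retry
    return false;
  }
  // Server errors and transient failures are worth retrying
  return true;
}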
