
Failed Task Manager task documents are never cleaned up bloating the index #79977

Closed · 3 tasks done
gmmorris opened this issue Oct 8, 2020 · 8 comments · Fixed by #151873
Labels
  • discuss
  • estimate:needs-research (Estimated as too large and requires research to break down into workable issues)
  • Feature:Task Manager
  • impact:medium (Addressing this issue will have a medium level of impact on the quality/strength of our product.)
  • resilience (Issues related to Platform resilience in terms of scale, performance & backwards compatibility)
  • Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

gmmorris (Contributor) commented Oct 8, 2020

As we prepare to make Task Manager replaceable with a queue and scheduler service, we should limit the internals of Task Manager that are exposed to downstream plugins. We've made numerous changes over time to clean up failed task documents (#109655 & #96971), and now we should change Task Manager to handle this itself instead of leaving it to the downstream plugins' cleanup code.

As part of this effort, we should:

NOTE: We also need a way to still delete action_task_params when the task gets deleted, without having to put that determination logic in the actions plugin. Some options include (a rough sketch of the first option follows this list):

  • A cleanup hook that the actions plugin can use to delete downstream SOs
  • Functionality off SO references that can cascade delete
  • Make the task SO capable of storing the API key encrypted
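
As a rough illustration of the first option (hypothetical names, context shape, and import path; this is not the actual Task Manager API), a cleanup hook registered by the actions plugin might look something like this:

```ts
// Hypothetical sketch only: the hook name, context shape, and import path are
// assumptions, not the actual Task Manager API.
import type { SavedObjectsClientContract } from '@kbn/core/server';

interface TaskCleanupContext {
  // Illustrative: id of the action_task_params SO referenced by the task.
  actionTaskParamsId: string;
  savedObjectsClient: SavedObjectsClientContract;
}

// A hook the actions plugin could register so the downstream SO is deleted
// together with the task, instead of the actions plugin scanning for orphans.
async function cleanupActionTaskParams({
  actionTaskParamsId,
  savedObjectsClient,
}: TaskCleanupContext): Promise<void> {
  try {
    await savedObjectsClient.delete('action_task_params', actionTaskParamsId);
  } catch (e: any) {
    // Tolerate "already deleted"; surface anything else.
    if (e?.output?.statusCode !== 404) {
      throw e;
    }
  }
}
```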
Original description

Should we consider cleaning up the index after a while?
Should we rely on the Event Log to keep track of failed Tasks, which would allow us to then purge them from the index?

With the upcoming TM observability story (#77456) which executes scheduled queries against the whole index, it might be worth considering the potential size of this index.

@gmmorris added the Feature:Task Manager and Team:ResponseOps labels on Oct 8, 2020
elasticmachine (Contributor) commented

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

pmuellr (Member) commented Oct 8, 2020

Initial feelz: we should have a recurring task that deletes old failed tasks. Is that the main source of "garbage" in the index? I'd think we could do something like delete them if > 1 week old (maybe 2, that seems kinda standard). Maybe run that once/day?
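
For illustration only, a recurring cleanup task along those lines could boil down to a delete-by-query; the index name, field names, and 7-day cutoff below are assumptions, not an actual implementation:

```ts
import type { ElasticsearchClient } from '@kbn/core/server';

// Sketch: delete task documents that failed and haven't run for over a week.
// The index name and field names are assumptions about how task docs are shaped.
async function deleteOldFailedTasks(esClient: ElasticsearchClient): Promise<number> {
  const result = await esClient.deleteByQuery({
    index: '.kibana_task_manager',
    conflicts: 'proceed', // don't abort if a doc changes underneath us
    query: {
      bool: {
        filter: [
          { term: { 'task.status': 'failed' } },
          { range: { 'task.runAt': { lte: 'now-7d' } } },
        ],
      },
    },
  });
  return result.deleted ?? 0;
}
```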

gmmorris (Contributor, Author) commented Oct 9, 2020

> Initial feelz: we should have a recurring task that deletes old failed tasks. Is that the main source of "garbage" in the index? I'd think we could do something like delete them if > 1 week old (maybe 2, that seems kinda standard). Maybe run that once/day?

Yeah that was my initial gut feeling as well but we should do some research...

mikecote (Contributor) commented

With #90888 opened, this problem won't be as common for action execution tasks.

@gmmorris added the loe:needs-research and resilience labels on Jul 14, 2021
@gmmorris added the estimate:needs-research label on Aug 18, 2021
@gmmorris removed the loe:needs-research label on Sep 2, 2021
@gmmorris added the impact:medium label on Sep 16, 2021
@kobelb added the needs-team label on Jan 31, 2022
@botelastic removed the needs-team label on Jan 31, 2022
mikecote (Contributor) commented

Linking with #138344

rudolf (Contributor) commented Jan 30, 2023

We have the following cleanup logic for tasks and action_task_params:

https://github.com/elastic/kibana/blob/main/x-pack/plugins/task_manager/server/saved_objects/index.ts#L29-L72
https://github.com/elastic/kibana/blob/main/x-pack/plugins/actions/server/saved_objects/index.ts#L82-L94

I'd imagine we'd use very similar, if not the same, logic to periodically clean up outside of migrations as well.
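
For context, those hooks boil down to returning a query describing which documents the migration may drop; the same query could drive a periodic delete-by-query. A simplified sketch of that shape (the status condition here is a placeholder, not the actual filters linked above):

```ts
import type { SavedObjectsType } from '@kbn/core/server';

// Simplified sketch: excludeOnUpgrade returns a query describing which
// documents the migration may drop. A periodic job could reuse the same
// query with delete_by_query to clean up outside of migrations.
export const taskSavedObjectType: Partial<SavedObjectsType> = {
  name: 'task',
  excludeOnUpgrade: async () => ({
    bool: {
      must: [
        { term: { type: 'task' } },
        // Placeholder condition; the real hooks linked above decide which
        // task / action_task_params documents are safe to discard.
        { term: { 'task.status': 'failed' } },
      ],
    },
  }),
};
```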

rudolf (Contributor) commented Jan 30, 2023

As part of #147237 we would no longer perform a reindex during upgrade migrations. This means that excludeOnUpgrade would no longer be "free": instead of simply not copying the SOs we want to exclude, we now have to actively delete them. So cleaning up action_task_params with excludeOnUpgrade can actually increase the upgrade downtime instead of decreasing it.

mikecote (Contributor) commented Mar 2, 2023

From #152223 (comment), let's make sure to clean up `isRetryableBasedOnAttempts`.

mikecote added a commit that referenced this issue Mar 27, 2023
…52841)

Part of #79977 (step 1 and 3).

In this PR, I'm making Task Manager remove tasks instead of updating
them with `status: failed` whenever a task is out of attempts. I've also
added an optional `cleanup` hook to the task runner that can be defined
if additional cleanup is necessary whenever a task has been deleted (ex:
delete `action_task_params`).
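
A much simplified sketch of that decision (the interfaces and helper names are invented for illustration, not the actual task runner code):

```ts
// Sketch of the behaviour described above: when an ad-hoc task is out of
// attempts, run the optional cleanup hook and remove the task document
// instead of persisting it with `status: failed`.
interface ConcreteTask {
  id: string;
  schedule?: { interval: string }; // only recurring tasks have a schedule
  attempts: number;
  maxAttempts: number;
}

interface TaskStoreLike {
  remove(id: string): Promise<void>;
  update(id: string, attrs: Record<string, unknown>): Promise<void>;
}

async function handleTaskOutOfAttempts(
  task: ConcreteTask,
  store: TaskStoreLike,
  cleanup?: () => Promise<void>
): Promise<void> {
  const isRecurring = task.schedule != null;
  if (!isRecurring && task.attempts >= task.maxAttempts) {
    await cleanup?.(); // e.g. delete the related action_task_params SO
    await store.remove(task.id);
  } else {
    // Recurring tasks are rescheduled rather than deleted.
    await store.update(task.id, { status: 'idle' });
  }
}
```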

## To verify an ad-hoc task that always fails

1. With this PR codebase, modify an action to always throw an error
2. Create an alerting rule that will invoke the action once
3. See the action fail three times
4. Observe the task SO is deleted (search by task type / action type)
alongside the action_task_params SO

## To verify Kibana crashing on the last ad-hoc task attempt

1. With this PR codebase, modify an action to always throw an error
(similar to the scenario above), but also add a delay of 10s before the error
is thrown (`await new Promise((resolve) => setTimeout(resolve, 10000));`)
and a log message before the delay begins
2. Create an alerting rule that will invoke the action once
3. See the action fail twice
4. On the third run, crash Kibana while the action is waiting out the
10s delay; this leaves the action marked as running even though it no
longer is
5. Restart Kibana
6. Wait 5-10m until the task's retryAt is overdue
7. Observe the task and the action_task_params getting deleted

## To verify recurring tasks that continuously fail

1. With this PR codebase, modify a rule type to always throw an error
when it runs
2. Create an alerting rule of that type (with a short interval)
3. Observe the rule continuously running and not being affected by the
PR changes (its recurring task is not deleted)

Flaky test runner:
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/2036

---------

Co-authored-by: kibanamachine <[email protected]>
mikecote added a commit that referenced this issue Mar 29, 2023
Part of #79977 (step 2).
Resolves #79977.

In this PR, I'm removing the recurring task defined by the actions
plugin that removes unused `action_task_params` SOs. With the
#152841 PR, tasks will no longer get marked as failed, and we have a
migration script (`excludeOnUpgrade`) that removes all tasks and
action_task_params that are left over during the migration:
https://github.com/elastic/kibana/blob/main/x-pack/plugins/actions/server/saved_objects/index.ts#L81-L94.

~~NOTE: I will hold off merging this PR until
#152841 is merged.~~ (merged)

## To verify

Not much to test here, but on a Kibana from `main` this task type will be
running in the background, and moving to this PR will cause the task to get
deleted because its type is now part of the `REMOVED_TYPES` array in
Task Manager.

---------

Co-authored-by: Kibana Machine <[email protected]>
jgowdyelastic pushed a commit to jgowdyelastic/kibana that referenced this issue Mar 30, 2023