-
Notifications
You must be signed in to change notification settings - Fork 8.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove tasks with cleanup logic instead of marking them as failed #152841
Remove tasks with cleanup logic instead of marking them as failed #152841
Conversation
…ndle-failed-tasks
…-ref HEAD~1..HEAD --fix'
…e/kibana into task-manager/handle-failed-tasks
…ndle-failed-tasks
@elasticmachine merge upstream |
@elasticmachine merge upstream |
@elasticmachine merge upstream |
@@ -894,38 +889,6 @@ export default function ({ getService }: FtrProviderContext) { | |||
}); | |||
}); | |||
|
|||
it('should allow a failed task to be rerun using runSoon', async () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allowing failed ad-hoc tasks to be re-run was a feature that was never used, so it no longer works after this PR.
if (ctx._source.task.schedule != null || ctx._source.task.attempts < params.taskMaxAttempts[ctx._source.task.taskType]) { | ||
${setScheduledAtAndMarkAsClaimed} | ||
} else { | ||
ctx._source.task.status = "failed"; | ||
} | ||
${setScheduledAtAndMarkAsClaimed} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logic to mark tasks as failed is moved into x-pack/plugins/task_manager/server/polling_lifecycle.ts
so the cleanup
hook can also get called when necessary.
@@ -275,7 +271,7 @@ export class TaskManagerPlugin | |||
taskStore.aggregate(opts), | |||
get: (id: string) => taskStore.get(id), | |||
remove: (id: string) => taskStore.remove(id), | |||
bulkRemoveIfExist: (ids: string[]) => bulkRemoveIfExist(taskStore, ids), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replacing bulkRemoveIfExists
with bulkRemove
since the error path within x-pack/plugins/task_manager/server/lib/bulk_remove_if_exist.ts
would never get reached in a bulk request (you have to loop the responses, no errors thrown in 404 scenario). I added code to handle 404 within x-pack/plugins/alerting/server/rules_client/common/try_to_remove_tasks.ts
.
getUnsecuredSavedObjectsClient: (request: KibanaRequest) => SavedObjectsClientContract; | ||
savedObjectsRepository: ISavedObjectsRepository; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Task runner factory now works with the savedObjectsRepository
given we no longer have a request
object within the cleanup
function. There are comments about having this operation secured but, after thinking of it, it's an implementation detail the system should manage (no need for RBAC, RBAC is on the action SO logic).
@@ -115,12 +118,6 @@ export class TaskRunnerFactory { | |||
const request = getFakeRequest(apiKey); | |||
basePathService.set(request, path); | |||
|
|||
// TM will treat a task as a failure if `attempts >= maxAttempts` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Task runner no longer cares if the task is retried or not. It will throw the appropriate error, log errors messages and leave it up to Task Manager to determine if it's done retrying or not.
@elasticmachine merge upstream |
This reverts commit b0b3d68.
…e/kibana into task-manager/handle-failed-tasks
💚 Build Succeeded
Metrics [docs]Unknown metric groupsESLint disabled line counts
Total ESLint disabled count
History
To update your PR or re-run it, just comment with: cc @mikecote |
// Not 100% sure why, seems the rules need to be loaded separately to avoid the task | ||
// failing to load the rule during execution and deleting itself. Otherwise | ||
// we have flakiness |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Flaky test runner happy with this change: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/2036
Flaky test runner not happy without this change: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/2030
Pinging @elastic/response-ops (Team:ResponseOps) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Verified everything works as expected.
…together (#153803) Resolves #153800 Resolves #142704 Resolves #153801 Resolves #142947 Resolves #140867 Similar to #152841 (comment), the rule and tasks archives don't seem to play nicely when combined. The flakiness goes away when loading the rules then the tasks in sequence. Otherwise, the tasks sometimes run before it can find the rule, causing the task to delete itself. I took a look at why the task would run an not be able to find the rule. My best guess after looking at a failing flaky test is that the task manager migration completes before the .kibana. And while .kibana migrates, the task runs and fails to load the task because the .kibana index is in an interim state. Flaky test runner: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/2045 --------- Co-authored-by: Kibana Machine <[email protected]>
Part of #79977 (step 2). Resolves #79977. In this PR, I'm removing the recurring task defined by the actions plugin that removes unused `action_task_params` SOs. With the #152841 PR, tasks will no longer get marked as failed and we have a migration script (`excludeOnUpgrade`) that removes all tasks and action_task_params that are leftover during the migration https://github.com/elastic/kibana/blob/main/x-pack/plugins/actions/server/saved_objects/index.ts#L81-L94. ~~NOTE: I will hold off merging this PR until #152841 is merged.~~ (merged) ## To verify Not much to test here, but on a Kibana from `main` there will be this task type running in the background and moving to this PR will cause the task to get deleted because it is part of the `REMOVED_TYPES` array in Task Manager. --------- Co-authored-by: Kibana Machine <[email protected]>
…together (elastic#153803) Resolves elastic#153800 Resolves elastic#142704 Resolves elastic#153801 Resolves elastic#142947 Resolves elastic#140867 Similar to elastic#152841 (comment), the rule and tasks archives don't seem to play nicely when combined. The flakiness goes away when loading the rules then the tasks in sequence. Otherwise, the tasks sometimes run before it can find the rule, causing the task to delete itself. I took a look at why the task would run an not be able to find the rule. My best guess after looking at a failing flaky test is that the task manager migration completes before the .kibana. And while .kibana migrates, the task runs and fails to load the task because the .kibana index is in an interim state. Flaky test runner: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/2045 --------- Co-authored-by: Kibana Machine <[email protected]>
…1873) Part of elastic#79977 (step 2). Resolves elastic#79977. In this PR, I'm removing the recurring task defined by the actions plugin that removes unused `action_task_params` SOs. With the elastic#152841 PR, tasks will no longer get marked as failed and we have a migration script (`excludeOnUpgrade`) that removes all tasks and action_task_params that are leftover during the migration https://github.com/elastic/kibana/blob/main/x-pack/plugins/actions/server/saved_objects/index.ts#L81-L94. ~~NOTE: I will hold off merging this PR until elastic#152841 is merged.~~ (merged) ## To verify Not much to test here, but on a Kibana from `main` there will be this task type running in the background and moving to this PR will cause the task to get deleted because it is part of the `REMOVED_TYPES` array in Task Manager. --------- Co-authored-by: Kibana Machine <[email protected]>
Part of #79977 (step 1 and 3).
In this PR, I'm making Task Manager remove tasks instead of updating them with
status: failed
whenever a task is out of attempts. I've also added an optionalcleanup
hook to the task runner that can be defined if additional cleanup is necessary whenever a task has been deleted (ex: deleteaction_task_params
).To verify an ad-hoc task that always fails
To verify Kibana crashing on the last ad-hoc task attempt
await new Promise((resolve) => setTimeout(resolve, 10000));
and a log message before the delay beginsTo verify recurring tasks that continuously fail
Flaky test runner: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/2036