Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ResponseOps] Retry bulk update conflicts in task manager #147808

Merged
merged 6 commits into from
Dec 21, 2022

Conversation

ymao1
Copy link
Contributor

@ymao1 ymao1 commented Dec 19, 2022

Resolves #145316, #141849, #141864

Summary

Adds a retry on conflict error to the saved objects bulk update call made by task manager. Errors are returned by the saved object client inside an array (with a success response). Previously, we were not inspecting the response array, just returning the full data. With this PR, we are inspecting the response array specifically for conflict errors and retrying the update for just those tasks.

This bulkUpdate function is used both internally by task manager and externally by the rules client. I default the number of retries to 0 for bulk updates from the task manager in order to preserve existing behavior (and in order not to increase the amount of time it takes for task manager to run) but use 3 retries when used externally.

Also unskipped the two flaky disable tests and ran them through the flaky test runner 400 times with no failures.

@ymao1 ymao1 changed the title Alerting/check bulk results [ResponseOps] Retry bulk update conflicts in task manager Dec 20, 2022
@ymao1 ymao1 self-assigned this Dec 20, 2022
@ymao1 ymao1 added Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.7.0 release_note:skip Skip the PR/issue when compiling release notes labels Dec 20, 2022
@ymao1 ymao1 marked this pull request as ready for review December 20, 2022 13:25
@ymao1 ymao1 requested a review from a team as a code owner December 20, 2022 13:25
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@ymao1
Copy link
Contributor Author

ymao1 commented Dec 21, 2022

@elasticmachine merge upstream

Copy link
Contributor

@doakalexi doakalexi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@ymao1
Copy link
Contributor Author

ymao1 commented Dec 21, 2022

@elasticmachine merge upstream

const exponentialDelayMultiplier = getExponentialDelayMultiplier(retries);

await new Promise((resolve) =>
setTimeout(resolve, RETRY_IF_CONFLICTS_DELAY * exponentialDelayMultiplier + randomDelayMs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OOC why do we need randomDelayMs here?

Copy link
Contributor

@ersin-erdal ersin-erdal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

Unknown metric groups

ESLint disabled in files

id before after diff
osquery 1 2 +1

ESLint disabled line counts

id before after diff
enterpriseSearch 19 21 +2
fleet 61 67 +6
osquery 109 115 +6
securitySolution 439 445 +6
total +20

Total ESLint disabled count

id before after diff
enterpriseSearch 20 22 +2
fleet 70 76 +6
osquery 110 117 +7
securitySolution 515 521 +6
total +21

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @ymao1

mikecote added a commit that referenced this pull request Jan 26, 2023
Resolves #148914
Resolves #149090
Resolves #149091
Resolves #149092

In this PR, I'm making the following Task Manager bulk APIs retry
whenever conflicts are encountered: `bulkEnable`, `bulkDisable`, and
`bulkUpdateSchedules`.

To accomplish this, the following had to be done:
- Revert the original PR (#147808)
because the retries didn't load the updated documents whenever version
conflicts were encountered and the approached had to be redesigned.
- Create a `retryableBulkUpdate` function that can be re-used among the
bulk APIs.
- Fix a bug in `task_store.ts` where `version` field wasn't passed
through properly (no type safety for some reason)
- Remove `entity` from being returned on bulk update errors. This helped
re-use the same response structure when objects weren't found
- Create a `bulkGet` API on the task store so we get the latest
documents prior to a ES refresh happening
- Create a single mock task function that mocks task manager tasks for
unit test purposes. This was necessary as other places were doing `as
unknown as BulkUpdateTaskResult` and escaping type safety

Flaky test runs:
- [Framework]
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/1776
- [Kibana Security]
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/1786

Co-authored-by: kibanamachine <[email protected]>
kqualters-elastic pushed a commit to kqualters-elastic/kibana that referenced this pull request Feb 6, 2023
Resolves elastic#148914
Resolves elastic#149090
Resolves elastic#149091
Resolves elastic#149092

In this PR, I'm making the following Task Manager bulk APIs retry
whenever conflicts are encountered: `bulkEnable`, `bulkDisable`, and
`bulkUpdateSchedules`.

To accomplish this, the following had to be done:
- Revert the original PR (elastic#147808)
because the retries didn't load the updated documents whenever version
conflicts were encountered and the approached had to be redesigned.
- Create a `retryableBulkUpdate` function that can be re-used among the
bulk APIs.
- Fix a bug in `task_store.ts` where `version` field wasn't passed
through properly (no type safety for some reason)
- Remove `entity` from being returned on bulk update errors. This helped
re-use the same response structure when objects weren't found
- Create a `bulkGet` API on the task store so we get the latest
documents prior to a ES refresh happening
- Create a single mock task function that mocks task manager tasks for
unit test purposes. This was necessary as other places were doing `as
unknown as BulkUpdateTaskResult` and escaping type safety

Flaky test runs:
- [Framework]
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/1776
- [Kibana Security]
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/1786

Co-authored-by: kibanamachine <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting Feature:Task Manager release_note:skip Skip the PR/issue when compiling release notes reverted Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.7.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ResponseOps] results from taskManager.bulk*() calls ignored in rules client
7 participants