[ResponseOps] Retry bulk update conflicts in task manager #147808
Pinging @elastic/response-ops (Team:ResponseOps)
@elasticmachine merge upstream
LGTM!
@elasticmachine merge upstream
const exponentialDelayMultiplier = getExponentialDelayMultiplier(retries);

await new Promise((resolve) =>
  setTimeout(resolve, RETRY_IF_CONFLICTS_DELAY * exponentialDelayMultiplier + randomDelayMs)
);
OOC why do we need randomDelayMs here?
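A common reason to add a random component to a retry delay is jitter: if several callers hit the same conflict at the same moment, a purely exponential delay makes them all retry together and collide again. The sketch below illustrates that idea only. The names `RETRY_IF_CONFLICTS_DELAY`, `getExponentialDelayMultiplier`, and `randomDelayMs` come from the diff above, but the values, the helper body, `getRetryDelayMs`, `waitBeforeRetry`, and the assumption that `retries` counts attempts already made are all hypothetical.

```typescript
// Hypothetical base delay in milliseconds; the real value is not shown in the diff.
const RETRY_IF_CONFLICTS_DELAY = 250;

// Assumed implementation: the multiplier doubles with each attempt (1, 2, 4, ...).
function getExponentialDelayMultiplier(retries: number): number {
  return Math.pow(2, retries);
}

// Computes how long to wait before the next retry.
// `random` is injectable so the result can be checked deterministically;
// it must return a value in [0, 1), like Math.random.
function getRetryDelayMs(
  retries: number,
  maxJitterMs: number = 100,
  random: () => number = Math.random
): number {
  const randomDelayMs = Math.floor(random() * maxJitterMs);
  return RETRY_IF_CONFLICTS_DELAY * getExponentialDelayMultiplier(retries) + randomDelayMs;
}

// Callers that conflicted at the same time now wake up at slightly different moments.
async function waitBeforeRetry(retries: number): Promise<void> {
  await new Promise<void>((resolve) => setTimeout(resolve, getRetryDelayMs(retries)));
}
```

With the jitter source pinned to zero, `getRetryDelayMs(0, 100, () => 0)` is just the base delay; with real randomness, two callers on the same attempt will usually get different delays.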
LGTM
💚 Build Succeeded
To update your PR or re-run it, just comment with: cc @ymao1
…astic#147808)" This reverts commit a6232c4.
Resolves #148914
Resolves #149090
Resolves #149091
Resolves #149092

In this PR, I'm making the following Task Manager bulk APIs retry whenever conflicts are encountered: `bulkEnable`, `bulkDisable`, and `bulkUpdateSchedules`.

To accomplish this, the following had to be done:
- Revert the original PR (#147808) because the retries didn't load the updated documents whenever version conflicts were encountered, and the approach had to be redesigned.
- Create a `retryableBulkUpdate` function that can be reused among the bulk APIs.
- Fix a bug in `task_store.ts` where the `version` field wasn't passed through properly (no type safety for some reason).
- Remove `entity` from being returned on bulk update errors. This helped reuse the same response structure when objects weren't found.
- Create a `bulkGet` API on the task store so we get the latest documents prior to an ES refresh happening.
- Create a single mock task function that mocks task manager tasks for unit test purposes. This was necessary as other places were doing `as unknown as BulkUpdateTaskResult` and escaping type safety.

Flaky test runs:
- [Framework] https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/1776
- [Kibana Security] https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/1786

Co-authored-by: kibanamachine <[email protected]>
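The point about reloading documents on conflict can be illustrated with a small, self-contained sketch. This is not the code from the PR: `InMemoryTaskStore`, `TaskDoc`, the result shapes, and the synchronous signatures are hypothetical stand-ins, and only the names `retryableBulkUpdate`, `bulkGet`, and `version` are taken from the description above. The idea shown is that each retry first reads the latest copy of the conflicted tasks, so the write carries a current `version` instead of the stale one that caused the conflict.

```typescript
interface TaskDoc {
  id: string;
  version: number;
  enabled: boolean;
}

interface BulkError {
  id: string;
  statusCode: number;
}

type UpdateResult = { kind: "ok"; doc: TaskDoc } | { kind: "error"; id: string; statusCode: number };

// Hypothetical in-memory stand-in for the task store, using optimistic concurrency.
class InMemoryTaskStore {
  private docs = new Map<string, TaskDoc>();

  constructor(initial: TaskDoc[]) {
    initial.forEach((doc) => this.docs.set(doc.id, { ...doc }));
  }

  // Returns the latest copy of every requested document that exists.
  bulkGet(ids: string[]): TaskDoc[] {
    const found: TaskDoc[] = [];
    for (const id of ids) {
      const doc = this.docs.get(id);
      if (doc) found.push({ ...doc });
    }
    return found;
  }

  // Rejects any write whose version does not match the stored version.
  bulkUpdate(updates: TaskDoc[]): UpdateResult[] {
    return updates.map((update): UpdateResult => {
      const current = this.docs.get(update.id);
      if (!current) return { kind: "error", id: update.id, statusCode: 404 };
      if (current.version !== update.version) {
        return { kind: "error", id: update.id, statusCode: 409 };
      }
      const saved = { ...update, version: update.version + 1 };
      this.docs.set(update.id, saved);
      return { kind: "ok", doc: { ...saved } };
    });
  }
}

// Applies `map` to each task, re-reading conflicted tasks before every retry.
function retryableBulkUpdate(
  store: InMemoryTaskStore,
  ids: string[],
  map: (doc: TaskDoc) => TaskDoc,
  retries: number = 3
): { updated: TaskDoc[]; errors: BulkError[] } {
  const updated: TaskDoc[] = [];
  const errors: BulkError[] = [];
  let pending = ids;
  for (let attempt = 0; pending.length > 0; attempt++) {
    const latest = store.bulkGet(pending); // fresh versions on every attempt
    const foundIds = new Set(latest.map((doc) => doc.id));
    pending
      .filter((id) => !foundIds.has(id))
      .forEach((id) => errors.push({ id, statusCode: 404 }));
    const conflicted: string[] = [];
    for (const result of store.bulkUpdate(latest.map(map))) {
      if (result.kind === "ok") {
        updated.push(result.doc);
      } else if (result.statusCode === 409 && attempt < retries) {
        conflicted.push(result.id); // retry only the tasks that conflicted
      } else {
        errors.push({ id: result.id, statusCode: result.statusCode });
      }
    }
    pending = conflicted;
  }
  return { updated, errors };
}
```

In this sketch, errors carry only an `id` and a status code, so a task that was never found and a task that failed to update share one response shape, in the spirit of the `entity` removal described above.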
Resolves #145316, #141849, #141864
Summary
Adds a retry on conflict error to the saved objects bulk update call made by task manager. Errors are returned by the saved object client inside an array (with a success response). Previously, we were not inspecting the response array, just returning the full data. With this PR, we are inspecting the response array specifically for conflict errors and retrying the update for just those tasks.
This `bulkUpdate` function is used both internally by task manager and externally by the rules client. I default the number of retries to 0 for bulk updates from the task manager in order to preserve existing behavior (and in order not to increase the amount of time it takes for task manager to run) but use 3 retries when used externally.

Also unskipped the two flaky disable tests and ran them through the flaky test runner 400 times with no failures.
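As a rough illustration of "inspecting the response array specifically for conflict errors", the sketch below splits a bulk response into items that are final and ids that are worth retrying. The `BulkUpdateItem` type is a simplified, hypothetical stand-in for an entry in the saved objects response, `partitionConflicts` is an illustrative name, and treating HTTP 409 as the conflict marker is an assumption.

```typescript
// Simplified, hypothetical stand-in for one entry in a bulk update response.
// A successful overall response can still contain per-item errors.
interface BulkUpdateItem {
  id: string;
  error?: { statusCode: number; message: string };
}

// Assumption: version conflicts are reported with HTTP status 409.
const CONFLICT_STATUS_CODE = 409;

// Splits a bulk response into items that are final (successes and
// non-conflict errors) and the ids of items that failed with a conflict.
function partitionConflicts(items: BulkUpdateItem[]): {
  settled: BulkUpdateItem[];
  conflictIds: string[];
} {
  const settled: BulkUpdateItem[] = [];
  const conflictIds: string[] = [];
  for (const item of items) {
    if (item.error && item.error.statusCode === CONFLICT_STATUS_CODE) {
      conflictIds.push(item.id);
    } else {
      settled.push(item);
    }
  }
  return { settled, conflictIds };
}
```

A caller with a retry budget of 0 would return the conflicted items as errors immediately, while a caller with a budget of 3 would resend only `conflictIds` up to three more times.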