Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Security Solution] Add retryIfConflict util for 409 conflicts in Integration tests #174185

Merged
merged 9 commits into from
Jan 11, 2024

Conversation

jpdjere
Copy link
Contributor

@jpdjere jpdjere commented Jan 3, 2024

Summary

Fixes: #171428

NOTE: the test where this was reported wasn't skipped, so this PR does not unskip any tests. However, the Flaky Test Runs help us determine that the issue is no longer reproducible.

The deleteAllPrebuiltRuleAssets utility reported a 409 Conflict, presumably from security-rule assets that were attempted to be deleted while they were being updated by a parallel process.

This PR wraps the es.deleteByQuery calls in the utils deleteAllPrebuiltRuleAssets and deleteAllTimelines with a new retryIfConflict helper, that will retry the operation if the ES request fails with a 409.

Flaky test run

bundled_prebuilt_rules_package - ESS and Serverless: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4790

large_prebuilt_rules_package - ESS and Serverless: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4791

update_prebuilt_rules_package - ESS and Serverless: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4792

management - ESS and Serverless: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4793

For maintainers

@jpdjere jpdjere changed the title Unskip Integration test - Add retryIfConflict util [Security Solution] Add retryIfConflict util for 409 conflicts in Integration tests Jan 3, 2024
@jpdjere jpdjere marked this pull request as ready for review January 3, 2024 16:39
@jpdjere jpdjere requested review from a team as code owners January 3, 2024 16:39
@jpdjere jpdjere requested a review from banderror January 3, 2024 16:39
@jpdjere jpdjere self-assigned this Jan 3, 2024
@jpdjere jpdjere added test release_note:skip Skip the PR/issue when compiling release notes Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Team:Detection Rule Management Security Detection Rule Management Team Feature:Prebuilt Detection Rules Security Solution Prebuilt Detection Rules area labels Jan 3, 2024
@elasticmachine
Copy link
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@elasticmachine
Copy link
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@elasticmachine
Copy link
Contributor

Pinging @elastic/security-detection-rule-management (Team:Detection Rule Management)

Copy link
Contributor

@banderror banderror left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think retryIfConflicts does what it should 🙂

@jpdjere
Copy link
Contributor Author

jpdjere commented Jan 8, 2024

@banderror

Ok, so I had to rethink this a bit.

Responses from the Elasticsearch client vary from one type of operation to another, so there's no status or statusCode property that I can read from all different types of ES responses to check if they are 409. The response from DeleteByQuery looks like:

export interface DeleteByQueryResponse {
    batches?: long;
    deleted?: long;
    failures?: BulkIndexByScrollFailure[];
    noops?: long;
    requests_per_second?: float;
    // other stuff......
    total?: long;
    version_conflicts?: long;
}

and as you can see from the error in the ticket, the 409 error is reported in the failures array of the request body. And this failure prop doesn't exist in other ES responses.

(Also, I cannot get access to anything else apart from the body when doing requests directly to ES in integration tests)

So I refactored the code so that this utility is specific to retry DeleteByQuery operations if it hits 409 conflict errors, which seems to be, at least until now, the ES operation where this conflict is happening.

I think we can try to generalize this utility, or create other specific version of retries if we need them in the future, and use this as a first step. Let me know what you think.

Also, rerunning the Flaky test runner.

@jpdjere jpdjere requested a review from banderror January 9, 2024 11:15
Comment on lines 33 to 39
for (const failure of operationResult?.failures) {
if (failure.status === 409) {
// if no retries left, throw it
if (retries <= 0) {
logger.error(`${name} conflict, exceeded retries`);
throw new Error(`${name} conflict, exceeded retries`);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we had this in the deleteAllPrebuiltRuleAssets function where we call the delete by query method, we could use the generic retry function and get rid of this retryIfDeleteByQueryConflicts.

If you think that retryIfDeleteByQueryConflicts has enough use cases to exist, maybe we could use the retry function inside of it to avoid conceptual code duplication between retry and retryIfDeleteByQueryConflicts.

Copy link
Contributor

@banderror banderror left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Flaky Test Runner is happy. Approving the PR to not being a blocker for it - posted only a bunch of minor comments. Thanks for addressing the feedback, LGTM 👍

@kibana-ci
Copy link
Collaborator

💛 Build succeeded, but was flaky

Failed CI Steps

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @jpdjere

@jpdjere jpdjere merged commit b8c7306 into elastic:main Jan 11, 2024
20 checks passed
@kibanamachine
Copy link
Contributor

💔 All backports failed

Status Branch Result
8.12 Backport failed because of merge conflicts

Manual backport

To create the backport manually run:

node scripts/backport --pr 174185

Questions ?

Please refer to the Backport tool documentation

jpdjere added a commit to jpdjere/kibana that referenced this pull request Jan 12, 2024
… Integration tests (elastic#174185)

## Summary

Fixes: elastic#171428

**NOTE: the test where this was reported wasn't skipped, so this PR does
not unskip any tests.** However, the Flaky Test Runs help us determine
that the issue is no longer reproducible.

The `deleteAllPrebuiltRuleAssets` utility reported a `409 Conflict`,
presumably from `security-rule` assets that were attempted to be deleted
while they were being updated by a parallel process.

This PR wraps the `es.deleteByQuery` calls in the utils
`deleteAllPrebuiltRuleAssets` and `deleteAllTimelines` with a new
`retryIfConflict` helper, that will retry the operation if the ES
request fails with a `409`.

## Flaky test run

`bundled_prebuilt_rules_package` - **ESS** and **Serverless**:
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4790

`large_prebuilt_rules_package` - **ESS** and **Serverless**:
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4791

`update_prebuilt_rules_package` - **ESS** and **Serverless**:
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4792

`management` - **ESS** and **Serverless**:
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4793

### For maintainers

- [ ] This was checked for breaking API changes and was [labeled
appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

(cherry picked from commit b8c7306)

# Conflicts:
#	x-pack/test/security_solution_api_integration/package.json
@jpdjere
Copy link
Contributor Author

jpdjere commented Jan 12, 2024

💚 All backports created successfully

Status Branch Result
8.12

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

@kibanamachine kibanamachine added the backport missing Added to PRs automatically when the are determined to be missing a backport. label Jan 12, 2024
@kibanamachine
Copy link
Contributor

Looks like this PR has a backport PR but it still hasn't been merged. Please merge it ASAP to keep the branches relatively in sync.

jpdjere added a commit that referenced this pull request Jan 15, 2024
…icts in Integration tests (#174185) (#174762)

# Backport

This will backport the following commits from `main` to `8.12`:
- [[Security Solution] Add `retryIfConflict` util for `409` conflicts in
Integration tests
(#174185)](#174185)

<!--- Backport version: 8.9.8 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Juan Pablo
Djeredjian","email":"[email protected]"},"sourceCommit":{"committedDate":"2024-01-11T12:39:45Z","message":"[Security
Solution] Add `retryIfConflict` util for `409` conflicts in Integration
tests (#174185)\n\n## Summary\r\n\r\nFixes:
https://github.com/elastic/kibana/issues/171428\r\n\r\n**NOTE: the test
where this was reported wasn't skipped, so this PR does\r\nnot unskip
any tests.** However, the Flaky Test Runs help us determine\r\nthat the
issue is no longer reproducible.\r\n\r\nThe
`deleteAllPrebuiltRuleAssets` utility reported a `409
Conflict`,\r\npresumably from `security-rule` assets that were attempted
to be deleted\r\nwhile they were being updated by a parallel
process.\r\n\r\nThis PR wraps the `es.deleteByQuery` calls in the
utils\r\n`deleteAllPrebuiltRuleAssets` and `deleteAllTimelines` with a
new\r\n`retryIfConflict` helper, that will retry the operation if the
ES\r\nrequest fails with a `409`.\r\n\r\n## Flaky test
run\r\n\r\n`bundled_prebuilt_rules_package` - **ESS** and
**Serverless**:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4790\r\n\r\n`large_prebuilt_rules_package`
- **ESS** and
**Serverless**:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4791\r\n\r\n`update_prebuilt_rules_package`
- **ESS** and
**Serverless**:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4792\r\n\r\n`management`
- **ESS** and
**Serverless**:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4793\r\n\r\n###
For maintainers\r\n\r\n- [ ] This was checked for breaking API changes
and was
[labeled\r\nappropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)","sha":"b8c7306d241807b68bedbd477dcec232e203f6ad","branchLabelMapping":{"^v8.13.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["test","release_note:skip","Team:Detections
and Resp","Team: SecuritySolution","Team:Detection Rule
Management","Feature:Prebuilt Detection
Rules","v8.12.0","v8.12.1","v8.13.0"],"number":174185,"url":"https://github.com/elastic/kibana/pull/174185","mergeCommit":{"message":"[Security
Solution] Add `retryIfConflict` util for `409` conflicts in Integration
tests (#174185)\n\n## Summary\r\n\r\nFixes:
https://github.com/elastic/kibana/issues/171428\r\n\r\n**NOTE: the test
where this was reported wasn't skipped, so this PR does\r\nnot unskip
any tests.** However, the Flaky Test Runs help us determine\r\nthat the
issue is no longer reproducible.\r\n\r\nThe
`deleteAllPrebuiltRuleAssets` utility reported a `409
Conflict`,\r\npresumably from `security-rule` assets that were attempted
to be deleted\r\nwhile they were being updated by a parallel
process.\r\n\r\nThis PR wraps the `es.deleteByQuery` calls in the
utils\r\n`deleteAllPrebuiltRuleAssets` and `deleteAllTimelines` with a
new\r\n`retryIfConflict` helper, that will retry the operation if the
ES\r\nrequest fails with a `409`.\r\n\r\n## Flaky test
run\r\n\r\n`bundled_prebuilt_rules_package` - **ESS** and
**Serverless**:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4790\r\n\r\n`large_prebuilt_rules_package`
- **ESS** and
**Serverless**:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4791\r\n\r\n`update_prebuilt_rules_package`
- **ESS** and
**Serverless**:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4792\r\n\r\n`management`
- **ESS** and
**Serverless**:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4793\r\n\r\n###
For maintainers\r\n\r\n- [ ] This was checked for breaking API changes
and was
[labeled\r\nappropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)","sha":"b8c7306d241807b68bedbd477dcec232e203f6ad"}},"sourceBranch":"main","suggestedTargetBranches":["8.12"],"targetPullRequestStates":[{"branch":"8.12","label":"v8.12.0","labelRegex":"^v(\\d+).(\\d+).\\d+$","isSourceBranch":false,"state":"NOT_CREATED"},{"branch":"main","label":"v8.13.0","labelRegex":"^v8.13.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/174185","number":174185,"mergeCommit":{"message":"[Security
Solution] Add `retryIfConflict` util for `409` conflicts in Integration
tests (#174185)\n\n## Summary\r\n\r\nFixes:
https://github.com/elastic/kibana/issues/171428\r\n\r\n**NOTE: the test
where this was reported wasn't skipped, so this PR does\r\nnot unskip
any tests.** However, the Flaky Test Runs help us determine\r\nthat the
issue is no longer reproducible.\r\n\r\nThe
`deleteAllPrebuiltRuleAssets` utility reported a `409
Conflict`,\r\npresumably from `security-rule` assets that were attempted
to be deleted\r\nwhile they were being updated by a parallel
process.\r\n\r\nThis PR wraps the `es.deleteByQuery` calls in the
utils\r\n`deleteAllPrebuiltRuleAssets` and `deleteAllTimelines` with a
new\r\n`retryIfConflict` helper, that will retry the operation if the
ES\r\nrequest fails with a `409`.\r\n\r\n## Flaky test
run\r\n\r\n`bundled_prebuilt_rules_package` - **ESS** and
**Serverless**:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4790\r\n\r\n`large_prebuilt_rules_package`
- **ESS** and
**Serverless**:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4791\r\n\r\n`update_prebuilt_rules_package`
- **ESS** and
**Serverless**:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4792\r\n\r\n`management`
- **ESS** and
**Serverless**:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4793\r\n\r\n###
For maintainers\r\n\r\n- [ ] This was checked for breaking API changes
and was
[labeled\r\nappropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)","sha":"b8c7306d241807b68bedbd477dcec232e203f6ad"}}]}]
BACKPORT-->
@kibanamachine kibanamachine removed the backport missing Added to PRs automatically when the are determined to be missing a backport. label Jan 15, 2024
@mistic mistic removed the v8.12.0 label Jan 17, 2024
@mistic
Copy link
Member

mistic commented Jan 17, 2024

This PR didn't make it into the latest BC for 8.12.0. Updating the labels.

ElenaStoeva added a commit that referenced this pull request Aug 14, 2024
…sts (#189813)

Fixes #176445

## Summary

Flaky test runner:
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6679

This PR fixes the advanced settings API integration tests that seem to
be flaky. The reason for the occasional failures is most likely a
`version_conflict_engine_exception` which is thrown when another node
indexes the same documents. This can happen when we save an advanced
setting since the settings API uses saved objects under the hood, and in
CI, multiple nodes can try to save an advanced setting simultaneously.

The solution in this PR is to retry the request if we encounter a 409
error. This is adapted from the solution in
#174185 which resolves a similar
failure.
ElenaStoeva added a commit to ElenaStoeva/kibana that referenced this pull request Aug 19, 2024
…sts (elastic#189813)

Fixes elastic#176445

Flaky test runner:
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6679

This PR fixes the advanced settings API integration tests that seem to
be flaky. The reason for the occasional failures is most likely a
`version_conflict_engine_exception` which is thrown when another node
indexes the same documents. This can happen when we save an advanced
setting since the settings API uses saved objects under the hood, and in
CI, multiple nodes can try to save an advanced setting simultaneously.

The solution in this PR is to retry the request if we encounter a 409
error. This is adapted from the solution in
elastic#174185 which resolves a similar
failure.

(cherry picked from commit cc29eea)
ElenaStoeva added a commit to ElenaStoeva/kibana that referenced this pull request Aug 19, 2024
…sts (elastic#189813)

Fixes elastic#176445

Flaky test runner:
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6679

This PR fixes the advanced settings API integration tests that seem to
be flaky. The reason for the occasional failures is most likely a
`version_conflict_engine_exception` which is thrown when another node
indexes the same documents. This can happen when we save an advanced
setting since the settings API uses saved objects under the hood, and in
CI, multiple nodes can try to save an advanced setting simultaneously.

The solution in this PR is to retry the request if we encounter a 409
error. This is adapted from the solution in
elastic#174185 which resolves a similar
failure.

(cherry picked from commit cc29eea)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment