Failing test: Detection Engine Serverless / Bundled Prebuilte Rules Package API Integration Tests.x-pack/test/security_solution_api_integration/test_suites/detections_response/default_license/prebuilt_rules/bundled_prebuilt_rules_package/prerelease_packages·ts - Detection Engine API - Bundled Prebuilt Rules Package @ess @serverless @skipInQA prerelease_packages "before each" hook for "should install latest stable version and ignore prerelease packages" #171428

Closed
Tracked by #173469 ...
kibanamachine opened this issue Nov 16, 2023 · 6 comments · Fixed by #174185
Assignees
Labels
8.13 candidate failed-test A test failure on a tracked branch, potentially flaky-test Feature:Prebuilt Detection Rules Security Solution Prebuilt Detection Rules area legit-flake Test was triaged and marked as an actual flake. Team:Detection Rule Management Security Detection Rule Management Team Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc.

Comments

@kibanamachine
Contributor

A test failed on a tracked branch

ResponseError: {"took":11,"timed_out":false,"total":1,"deleted":0,"batches":1,"version_conflicts":1,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[{"index":".kibana_security_solution_1","id":"security-rule:rule_99.0.0","cause":{"type":"version_conflict_engine_exception","reason":"[security-rule:rule_99.0.0]: version conflict, required seqNo [2], primary term [1]. but no document was found","index_uuid":"MUu0D6MmQ0GCQ9BPbptfng","shard":"0","index":".kibana_security_solution_1"},"status":409}]}
    at SniffingTransport.request (node_modules/@elastic/transport/src/Transport.ts:535:17)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at Client.DeleteByQueryApi [as deleteByQuery] (node_modules/@elastic/elasticsearch/src/api/api/delete_by_query.ts:75:10)
    at deleteAllPrebuiltRuleAssets (delete_all_prebuilt_rule_assets.ts:16:3)
    at Context.<anonymous> (prerelease_packages.ts:36:7)
    at Object.apply (wrap_function.js:73:16) {
  meta: {
    body: {
      took: 11,
      timed_out: false,
      total: 1,
      deleted: 0,
      batches: 1,
      version_conflicts: 1,
      noops: 0,
      retries: [Object],
      throttled_millis: 0,
      requests_per_second: -1,
      throttled_until_millis: 0,
      failures: [Array]
    },
    statusCode: 409,
    headers: {
      'x-elastic-product': 'Elasticsearch',
      warning: '299 Elasticsearch-3c6875f81ed7af335f887623d898094a1e572811 "this request accesses system indices: [.kibana_security_solution_1], but in a future major version, direct access to system indices will be prevented by default"',
      'content-type': 'application/vnd.elasticsearch+json;compatible-with=8',
      'content-length': '564'
    },
    meta: {
      context: null,
      request: [Object],
      name: 'elasticsearch-js',
      connection: [Object],
      attempts: 0,
      aborted: false
    },
    warnings: [Getter]
  }
}

First failure: CI Build - main

@kibanamachine kibanamachine added the failed-test A test failure on a tracked branch, potentially flaky-test label Nov 16, 2023
@botelastic botelastic bot added the needs-team Issues missing a team label label Nov 16, 2023
@kibanamachine kibanamachine added the Team:Detection Rule Management Security Detection Rule Management Team label Nov 16, 2023
@botelastic botelastic bot removed the needs-team Issues missing a team label label Nov 16, 2023
@banderror banderror added triage_needed Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. labels Dec 15, 2023
@elasticmachine
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@elasticmachine
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@banderror
Contributor

This seems to be a legit flake, but could be hard to reproduce.

The error happened in this function:

/**
 * Remove all prebuilt rule assets from the security solution savedObjects index
 * @param es The ElasticSearch handle
 */
export const deleteAllPrebuiltRuleAssets = async (es: Client): Promise<void> => {
  await es.deleteByQuery({
    index: SECURITY_SOLUTION_SAVED_OBJECT_INDEX,
    q: 'type:security-rule',
    wait_for_completion: true,
    refresh: true,
    body: {},
  });
};

The error was:

"failures":[{"index":".kibana_security_solution_1","id":"security-rule:rule_99.0.0","cause":{"type":"version_conflict_engine_exception","reason":"[security-rule:rule_99.0.0]: version conflict, required seqNo [2], primary term [1]. but no document was found"

Could Fleet or our code in the app be installing or updating security-rule objects in parallel with the code of the test that calls deleteAllPrebuiltRuleAssets?
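To see which documents were involved, the failure body can be filtered for version conflicts. The following is only a sketch: the `DeleteByQueryBody` and `BulkFailure` types and the `conflictingDocIds` helper are hypothetical names, and the sample data is an abbreviated copy of the response quoted above.

```typescript
// Types assumed from the failure body quoted above; only the fields used here are declared.
interface BulkFailure {
  index: string;
  id: string;
  status: number;
  cause: { type: string; reason: string };
}

interface DeleteByQueryBody {
  deleted: number;
  version_conflicts: number;
  failures: BulkFailure[];
}

// Hypothetical helper: returns the ids of documents that failed with a version conflict.
function conflictingDocIds(body: DeleteByQueryBody): string[] {
  return body.failures
    .filter((f) => f.status === 409 && f.cause.type === 'version_conflict_engine_exception')
    .map((f) => f.id);
}

// Abbreviated copy of the response from the failed run.
const sampleBody: DeleteByQueryBody = {
  deleted: 0,
  version_conflicts: 1,
  failures: [
    {
      index: '.kibana_security_solution_1',
      id: 'security-rule:rule_99.0.0',
      status: 409,
      cause: {
        type: 'version_conflict_engine_exception',
        reason:
          '[security-rule:rule_99.0.0]: version conflict, required seqNo [2], primary term [1]. but no document was found',
      },
    },
  ],
};

const conflictIds = conflictingDocIds(sampleBody);
```

Here the only conflicting document is the `rule_99.0.0` asset, and the "no document was found" reason suggests it had already been removed by something else when the delete reached it.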

@banderror
Contributor

I'm not skipping this test because:

  • The flake seems to be rare and hard to reproduce.
  • The deleteAllPrebuiltRuleAssets function is used in many tests. We'd probably need to skip all of them, which would do more harm than good.

@banderror banderror added legit-flake Test was triaged and marked as an actual flake. and removed triage_needed labels Dec 20, 2023
@banderror banderror removed their assignment Dec 20, 2023
@banderror banderror added the Feature:Prebuilt Detection Rules Security Solution Prebuilt Detection Rules area label Dec 21, 2023
@jpdjere
Contributor

jpdjere commented Dec 27, 2023

@MadameSheema Hi Glo, maybe you can help us elucidate this failing test.

What failed in this particular test suite is the deleteAllPrebuiltRuleAssets helper function, which makes a "delete by query" call directly to ES to delete all security-rule assets (prebuilt rule assets, before they are installed as actual rules in the user's SIEM).

It failed here with a 409 Conflict error, which makes us suspect that the security-rule assets were being updated at the same time we were trying to delete them.

However, since this util is called only once in the whole test file, I would like to understand which test suites, and how many of them, run in parallel in these FTR integration tests.

For this specific directory and its associated config file, we have only two test suites: prerelease_packages.ts (where this failing test happened) and install_latest_bundled_prebuilt_rules.ts:

[image not preserved]

So a couple of questions:

  1. Do these two files (or even more) run in parallel?
  2. If they do, are they instantiated on different ES and Kibana instances? Or are they shared?

EDIT: I could only reproduce this issue in Serverless (1 in 500 runs). Maybe the environment it runs on is related?

@MadameSheema
Member

@jpdjere we don't own the scripts that trigger the API integration tests. The logic behind how groups are picked seems to live in .buildkite/pipeline-utils/ci-stats/pick_test_group_run_order.ts.

@pheyos, can you please help clarify @jpdjere's questions? Thanks!

jpdjere added a commit that referenced this issue Jan 11, 2024

… Integration tests (#174185)

## Summary

Fixes: #171428

**NOTE: the test where this was reported wasn't skipped, so this PR does
not unskip any tests.** However, the Flaky Test Runs help us determine
that the issue is no longer reproducible.

The `deleteAllPrebuiltRuleAssets` utility reported a `409 Conflict`,
presumably because `security-rule` assets were being updated by a
parallel process while the utility was trying to delete them.

This PR wraps the `es.deleteByQuery` calls in the utils
`deleteAllPrebuiltRuleAssets` and `deleteAllTimelines` with a new
`retryIfConflict` helper, that will retry the operation if the ES
request fails with a `409`.
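
The helper itself is not quoted in this thread. As a rough illustration only, a retry wrapper of this kind could look like the sketch below. The `RetryOptions` type, the `retries` and `delayMs` options, their defaults, and the `isConflict` check are assumptions made for the example, not the code merged in the PR.

```typescript
// Hypothetical sketch, not the implementation from the PR.
interface RetryOptions {
  retries?: number; // assumed: maximum number of retries after the first attempt
  delayMs?: number; // assumed: fixed pause between attempts
}

// Assumes the error exposes the HTTP status as `statusCode` or `meta.statusCode`,
// as in the ResponseError dump shown in the issue description.
const isConflict = (error: unknown): boolean => {
  if (typeof error !== 'object' || error === null) return false;
  const e = error as { statusCode?: number; meta?: { statusCode?: number } };
  return e.statusCode === 409 || e.meta?.statusCode === 409;
};

async function retryIfConflict<T>(
  operation: () => Promise<T>,
  { retries = 3, delayMs = 200 }: RetryOptions = {}
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Rethrow anything that is not a 409, or a 409 once the retry budget is used up.
      if (!isConflict(error) || attempt >= retries) throw error;
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Usage sketch, with `es` and the index constant taken from the test utility quoted in the issue:
// await retryIfConflict(() =>
//   es.deleteByQuery({ index: SECURITY_SOLUTION_SAVED_OBJECT_INDEX, q: 'type:security-rule', refresh: true })
// );
```

Only `409` responses are retried in this sketch, so unrelated failures still surface immediately in the test output.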

## Flaky test run

`bundled_prebuilt_rules_package` - **ESS** and **Serverless**:
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4790

`large_prebuilt_rules_package` - **ESS** and **Serverless**:
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4791

`update_prebuilt_rules_package` - **ESS** and **Serverless**:
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4792

`management` - **ESS** and **Serverless**:
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4793

### For maintainers

- [ ] This was checked for breaking API changes and was [labeled
appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)
jpdjere added a commit to jpdjere/kibana that referenced this issue Jan 12, 2024
… Integration tests (elastic#174185)

(cherry picked from commit b8c7306)

# Conflicts:
#	x-pack/test/security_solution_api_integration/package.json
jpdjere referenced this issue Jan 15, 2024
…icts in Integration tests (#174185) (#174762)

# Backport

This will backport the following commits from `main` to `8.12`:
- [[Security Solution] Add `retryIfConflict` util for `409` conflicts in
Integration tests
(#174185)](#174185)

<!--- Backport version: 8.9.8 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

5 participants