[Security Solution][Detections] Fixes Rule Execution Log events potentially being out of order when providing status filters and max events are hit #131675

spong · 2022-05-05T21:13:47Z

Summary

Adds an explicit sort on @timestamp to the initial query (when 1-2 status filters are applied). Currently, when we overflow past 1k docs, the docs returned are ordered by descending _count, which can push Failed executions past the overflow limit, as they often have fewer aggregate documents.
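
To illustrate the ordering problem described in the summary, here is a minimal sketch (not the code from this PR). It assumes executions are grouped by a terms aggregation that is truncated to a maximum size. `Bucket`, `topByCount`, `topByTimestamp`, and `buildTermsAgg` are hypothetical names, and the `field` passed to `buildTermsAgg` is whatever keyword field identifies an execution. The first two helpers simulate the truncation in memory; the third shows the general shape of a terms aggregation ordered by a `max` sub-aggregation on `@timestamp`.

```typescript
// One aggregated rule execution: a bucket of event-log docs sharing an execution id.
interface Bucket {
  key: string; // execution id
  docCount: number; // number of event-log docs in the bucket
  maxTimestamp: number; // newest @timestamp in the bucket (epoch ms)
}

// Default terms ordering: largest buckets first, then truncate to `size`.
function topByCount(buckets: Bucket[], size: number): Bucket[] {
  return buckets
    .slice()
    .sort((a, b) => b.docCount - a.docCount)
    .slice(0, size);
}

// Explicit ordering: newest buckets first, then truncate to `size`.
function topByTimestamp(buckets: Bucket[], size: number): Bucket[] {
  return buckets
    .slice()
    .sort((a, b) => b.maxTimestamp - a.maxTimestamp)
    .slice(0, size);
}

// Shape of a terms aggregation ordered by a max(@timestamp) sub-aggregation.
// `field` is an assumption: any keyword field that identifies an execution.
function buildTermsAgg(field: string, size: number) {
  return {
    terms: { field, size, order: { maxTimestamp: 'desc' as const } },
    aggs: { maxTimestamp: { max: { field: '@timestamp' } } },
  };
}

// Three older successful executions (3 docs each) and one newer failed execution (1 doc).
const buckets: Bucket[] = [
  { key: 'ok-1', docCount: 3, maxTimestamp: 1000 },
  { key: 'ok-2', docCount: 3, maxTimestamp: 2000 },
  { key: 'ok-3', docCount: 3, maxTimestamp: 3000 },
  { key: 'failed-1', docCount: 1, maxTimestamp: 4000 },
];

// With a limit of 3, count ordering drops the failed execution; timestamp ordering keeps it.
const byCount = topByCount(buckets, 3).map((b) => b.key);
const byTimestamp = topByTimestamp(buckets, 3).map((b) => b.key);
```

In this toy data set the failed execution is the newest one but has the fewest documents, so it only survives truncation when the buckets are ordered by timestamp.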


elasticmachine · 2022-05-05T21:13:49Z

Pinging @elastic/security-detections-response (Team:Detections and Resp)

elasticmachine · 2022-05-05T21:13:49Z

Pinging @elastic/security-solution (Team: SecuritySolution)

banderror

Tested on Cloud and managed to filter by Succeeded and Failed correctly. What I did:

I set up a date range that includes the Failed execution, but is big enough to include more than 1000 executions.
Selected Succeeded + Failed in the Status filter.

So that works! 🙏

What I also noticed when testing is that every table reload is extremely slow, both when the table is initially loaded and when it is reloaded due to changes in filters or pagination. I must admit that the whole Details page was slow as well, including the Alerts table, but I think the Rule execution logs table was especially slow. I'm wondering whether it could be partially caused by the changes in this PR, or whether the only reason is the overall slowness of this cloud environment.

@xcrzx could you please test it locally and check the performance with APM?

I think it would also be great to cover this scenario with integration tests.


spong · 2022-07-08T00:42:43Z

@elasticmachine merge upstream

spong · 2022-07-08T14:57:12Z

@elasticmachine merge upstream

spong · 2022-07-09T00:59:49Z

Alrighty @banderror, I've added a fun API integration test for exercising this bug in 7a657ac, and then set up the cloud deploy for testing perf as well. This rule is running every 5 seconds, so we just need to let it run a bit, then put it in an error state and see how things are. TBH, it was really slow getting just that rule configured, similar to what you mentioned, so I wouldn't be surprised if it's just the lack of sizing on these test clusters, but we'll see. Hopefully we can edit the cloud config and scale them to test as well.

kibana-ci · 2022-07-11T15:44:44Z

💛 Build succeeded, but was flaky

Failed CI Steps

Security Solution Tests #3

Test Failures

[job] [logs] Security Solution Tests #3 / Inspect Network stats and tables "before all" hook for "inspects the Source IPs Table"

Metrics [docs]

✅ unchanged

History

💚 Build #56466 succeeded e12fecf
💛 Build #56330 was flaky 5df5283
💔 Build #56143 failed 746cf0f
💚 Build #50020 succeeded 0073f02
💛 Build #48971 was flaky 8a9a829
💚 Build #48496 succeeded addf2e2

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @spong

vitaliidm · 2022-07-11T16:38:57Z

I was able to verify the fix on a cloud environment (number of succeeded executions > 1000):

failed execution present within a selected timeframe

The observed performance issues appear to be related to the cloud environment rather than this PR. For example, enabling or disabling a single rule takes ~3-4s there.

On a separate note (not related to this PR): I noticed that one execution was recorded with conflicting security_status and status fields:

duration_ms: 5995
es_search_duration_ms: 0
execution_uuid: "78e7e44a-7dbe-49c1-8650-6ac55f7762ec"
gap_duration_s: 0
indexing_duration_ms: 0
message: "rule executed: siem.queryRule:869a8540-ff21-11ec-9dc0-8fbfdab2f631: 'Testing that overflow sort bug 🙃'"
num_active_alerts: 0
num_errored_actions: 0
num_new_alerts: 0
num_recovered_alerts: 0
num_succeeded_actions: 0
num_triggered_actions: 0
schedule_delay_ms: 3600
search_duration_ms: 105
security_message: "succeeded"
security_status: "succeeded"
status: "failure"
timed_out: false
timestamp: "2022-07-11T07:53:32.434Z"
total_search_duration_ms: 0

Changes look good to me. Thanks @spong

spong · 2022-07-11T18:30:23Z

Thanks for testing and finding this additional issue @vitaliidm!

So looking at that specific execution, as far as the code is concerned, we're doing what's expected for the given data.

I have confirmed this will happen no matter the code path (either providing filters and pre-fetching IDs, or just fetching all executions), but you can really only tell with the former, as the status difference is clear there since it's in conflict with the selected filter.

That said, the reason this is happening is that, as you pointed out, there's a mismatch between the platform and solution statuses. The stack status is an error, and the solution status was successful.

And since we query against event.outcome and kibana.alert.rule.execution.status, but display the solution status on the UI (and only fall back to the platform status if there is no solution status), this is where any sort of mismatch is going to surface...

I'm thinking the best place to fix this is at the UI layer: if there's a mismatch between platform/solution status, then just fall back to the platform status (and switch to error.message instead of message if it's an error). Once we start passing our status up to the platform to write a single unified execution status (#130966), that should help here and narrow the chance of a circuit breaker error coming in during task execution and splitting these two statuses like this.
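
A rough sketch of the UI-layer fallback proposed here, to make the idea concrete. This is not existing code: `ExecutionResult`, `resolveDisplayed`, and the flattened `error_message` field (standing in for `error.message`) are hypothetical, and the status values are taken from the record quoted earlier in the thread. It also assumes a narrow definition of "mismatch": the platform reports `failure` while the solution status is anything other than `failed`.

```typescript
// Subset of the aggregated execution fields from the record quoted in the thread.
interface ExecutionResult {
  status: string; // platform (stack) status, e.g. "failure"
  message: string; // platform message
  security_status?: string; // solution status, e.g. "succeeded"
  security_message?: string; // solution message
  error_message?: string; // hypothetical flattened form of `error.message`
}

interface Displayed {
  status: string;
  message: string;
}

// Prefer the solution status, but fall back to the platform status when the
// solution status is missing or when the platform failed and the solution disagrees.
function resolveDisplayed(e: ExecutionResult): Displayed {
  const platformFailed = e.status === 'failure';
  const solution = e.security_status;

  if (solution === undefined || (platformFailed && solution !== 'failed')) {
    return {
      // Assumption: a platform "failure" is shown as "failed" in the UI.
      status: platformFailed ? 'failed' : e.status,
      // On error, prefer the error message over the generic platform message.
      message: platformFailed ? (e.error_message ?? e.message) : e.message,
    };
  }

  return { status: solution, message: e.security_message ?? e.message };
}

// Mirrors the conflicting record above; `error_message` is a placeholder value.
const conflicting: ExecutionResult = {
  status: 'failure',
  message: 'rule executed',
  security_status: 'succeeded',
  security_message: 'succeeded',
  error_message: 'placeholder platform error',
};

const displayed = resolveDisplayed(conflicting);
```

With this narrowing, an execution where both layers agree still shows the solution status and message, so only the conflicting case changes what the table displays.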

I'm going to go ahead and create a follow-up issue (#136138) for tracking this one, and we can prioritize accordingly. It would be ideal if we could just get #130966 worked and swap to querying single execution events via the find API (instead of this monster agg), but we'll at least have this issue for tracking if that takes a bit. All in all, this should be low impact since stack/solution statuses should match up in most instances (this being on a resource-constrained CI cloud deploy increased the chances of those circuit breakers coming in and blowing up the executors).

…tially being out of order when providing status filters and max events are hit (#131675)

## Summary

Addresses #131382

Adds an explicit sort on `@timestamp` to the initial query (when 1-2 status) filters are applied as when we currently overflow past 1k docs the docs returned are going to be ordered by [descending _count](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html), which can cause `Failed` execution to be past the overflow limit as they often have less aggregate documents.

(cherry picked from commit 7ffe8a7)

kibanamachine · 2022-07-11T18:47:47Z

💚 All backports created successfully

Status	Branch	Result
✅	8.3

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

…tially being out of order when providing status filters and max events are hit (#131675) (#136140)

## Summary

Addresses #131382

Adds an explicit sort on `@timestamp` to the initial query (when 1-2 status) filters are applied as when we currently overflow past 1k docs the docs returned are going to be ordered by [descending _count](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html), which can cause `Failed` execution to be past the overflow limit as they often have less aggregate documents.

(cherry picked from commit 7ffe8a7)

Co-authored-by: Garrett Spong <[email protected]>

banderror · 2022-07-20T08:55:42Z

Thanks for creating #136138 @spong 👍 I think #130966 is def the way to go. It would not only allow us to fix the issues but simplify the implementation as well.

Btw when working on #126063 I noticed another issue with filtering execution results by status: I was able to select only "Succeeded" but got "Warning" results.

Fixes documents being out of order when providing status filters and …

0ab69a1

…overflow exists

spong requested a review from a team as a code owner May 5, 2022 21:13

spong self-assigned this May 5, 2022

spong mentioned this pull request May 5, 2022

[Security Solution]Failed/Partial Failed log got missed under multiple logs state filter #131382

Closed

Sort on timestamp from sub-agg

2908958

banderror reviewed May 6, 2022


Merge branch 'main' of github.com:elastic/kibana into fix-rule-execut…

2496de2

…ion-log-overflow-sort

spong removed the v8.2.1 label May 16, 2022

Merge branch 'main' into fix-rule-execution-log-overflow-sort

42e813c

spong removed the ci:deploy-cloud label May 31, 2022

spong added 3 commits May 31, 2022 13:19

Merge branch 'main' into fix-rule-execution-log-overflow-sort

addf2e2

Merge branch 'main' into fix-rule-execution-log-overflow-sort

8a9a829

Merge branch 'main' into fix-rule-execution-log-overflow-sort

0073f02

banderror added v8.4.0 v8.3.1 and removed v8.3.0 labels Jun 20, 2022

spong removed the v8.3.1 label Jul 8, 2022

Merge branch 'main' into fix-rule-execution-log-overflow-sort

746cf0f

kibanamachine and others added 2 commits July 8, 2022 10:57

Merge branch 'main' into fix-rule-execution-log-overflow-sort

5df5283

Adds api integration test

7a657ac

spong requested a review from a team as a code owner July 8, 2022 23:36

spong added the ci:deploy-cloud label Jul 8, 2022

Merge branch 'main' into fix-rule-execution-log-overflow-sort

e12fecf

Merge branch 'main' into fix-rule-execution-log-overflow-sort

4f2eb52

vitaliidm approved these changes Jul 11, 2022


spong mentioned this pull request Jul 11, 2022

[Security Solution] Rule Execution Log can display conflicting status when there is a mis-match between platform and solution statuses #136138

Open

spong added the v8.3.3 label Jul 11, 2022

spong merged commit 7ffe8a7 into elastic:main Jul 11, 2022

spong deleted the fix-rule-execution-log-overflow-sort branch July 11, 2022 18:44

kibanamachine mentioned this pull request Jul 11, 2022

[8.3] [Security Solution][Detections] Fixes Rule Execution Log events potentially being out of order when providing status filters and max events are hit (#131675) #136140

Merged

tylersmalley added ci:cloud-deploy Create or update a Cloud deployment and removed ci:deploy-cloud labels Aug 17, 2022
