[Security Solution][Detections] Fixes Rule Execution Log events potentially being out of order when providing status filters and max events are hit #131675

Merged: spong merged 12 commits into elastic:main from fix-rule-execution-log-overflow-sort on Jul 11, 2022

Conversation

spong
Member

@spong spong commented May 5, 2022

Summary

Addresses #131382

Adds an explicit sort on @timestamp to the initial query (the one used when 1-2 status filters are applied). Currently, when we overflow past 1k docs, the docs returned are ordered by descending _count, which can push Failed executions past the overflow limit because they often have fewer aggregate documents.
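
To illustrate why the ordering matters, here is a minimal in-memory sketch. It is not the query from this PR: `ExecutionBucket`, `topBuckets`, and the limit of 3 are hypothetical names and values chosen for illustration, and each bucket is assumed to stand for one rule execution. Ordering by descending document count drops the sparse failed execution once the limit is hit, while ordering by descending timestamp keeps it.

```typescript
// Hypothetical shape: one bucket per rule execution (not the real Kibana types).
interface ExecutionBucket {
  executionUuid: string;
  docCount: number; // number of event-log docs written for this execution
  latestTimestamp: number; // epoch millis of the newest doc in the bucket
  status: "succeeded" | "failed";
}

type BucketOrder = "count_desc" | "timestamp_desc";

// Return the first `limit` buckets under the given ordering, mimicking a
// bucket limit that is applied after sorting.
function topBuckets(
  buckets: ExecutionBucket[],
  order: BucketOrder,
  limit: number
): ExecutionBucket[] {
  const sorted = [...buckets].sort((a, b) =>
    order === "count_desc"
      ? b.docCount - a.docCount
      : b.latestTimestamp - a.latestTimestamp
  );
  return sorted.slice(0, limit);
}

const buckets: ExecutionBucket[] = [
  { executionUuid: "a", docCount: 5, latestTimestamp: 1000, status: "succeeded" },
  { executionUuid: "b", docCount: 5, latestTimestamp: 2000, status: "succeeded" },
  { executionUuid: "c", docCount: 5, latestTimestamp: 3000, status: "succeeded" },
  // The failed execution is the most recent but has the fewest docs.
  { executionUuid: "d", docCount: 2, latestTimestamp: 4000, status: "failed" },
];

const hasFailed = (list: ExecutionBucket[]): boolean =>
  list.some((b) => b.status === "failed");

const byCount = topBuckets(buckets, "count_desc", 3);
const byTimestamp = topBuckets(buckets, "timestamp_desc", 3);

console.log(hasFailed(byCount)); // false: the failed execution fell past the limit
console.log(hasFailed(byTimestamp)); // true: the newest execution is kept
```

With only four buckets the effect is easy to see; the same thing happens at the 1k limit when more than 1000 succeeded executions fall inside the selected date range.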

@spong spong added the bug, release_note:fix, Team:Detections and Resp, Team: SecuritySolution, auto-backport, Feature:Rule Monitoring, Team:Detection Rule Management, ci:deploy-cloud, v8.3.0, and v8.2.1 labels May 5, 2022
@spong spong requested a review from a team as a code owner May 5, 2022 21:13
@spong spong self-assigned this May 5, 2022
@elasticmachine
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@elasticmachine
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

Contributor

@banderror banderror left a comment

Tested on Cloud and managed to filter by Succeeded and Failed correctly. What I did:

  • I set up a date range that includes the Failed execution, but is big enough to include more than 1000 executions.
  • Selected Succeeded + Failed in the Status filter.

Screenshot 2022-05-06 at 18 19 38

So that works! 🙏

What I also noticed when testing is that every table reload is extremely slow, both when the table is initially loaded and when it is reloaded due to changes in filters or pagination. I must admit that the whole Details page was slow as well, including the Alerts table, but I think the Rule execution logs table was especially slow. I'm wondering whether it could be partially caused by the changes in this PR, or whether the only reason is the overall slowness of this cloud environment.

@xcrzx could you please test it locally and check the performance with APM?

I think it would also be great to cover this scenario with integration tests.

@spong spong removed the v8.2.1 label May 16, 2022
@spong
Member Author

spong commented Jul 8, 2022

@elasticmachine merge upstream

@spong
Member Author

spong commented Jul 8, 2022

@elasticmachine merge upstream

@spong spong requested a review from a team as a code owner July 8, 2022 23:36
@spong
Member Author

spong commented Jul 9, 2022

Alrighty @banderror, I've added a fun API integration test for exercising this bug in 7a657ac, and then set up the cloud deploy for testing perf as well. This rule is running every 5 seconds, so we just need to let it run a bit, then put it in an error state and see how things are. TBH, it was really slow getting just that rule configured, similar to what you mentioned, so I wouldn't be surprised if it's just the lack of sizing on these test clusters, but we'll see. Hopefully we can edit the cloud config and scale them to test as well.

@kibana-ci
Collaborator

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] Security Solution Tests #3 / Inspect Network stats and tables "before all" hook for "inspects the Source IPs Table"

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @spong

@vitaliidm
Contributor

Was able to verify the fix for this issue on the cloud environment (number of succeeded executions > 1000):

  • failed execution present within a selected timeframe

Screenshot 2022-07-11 at 13 19 37

The observed performance issues seem to be related to the cloud environment rather than this PR. For example, enabling/disabling a single rule takes ~3-4s there.

On a separate note (not related to this PR): I noticed that one execution was recorded with conflicting security_status and status fields:

duration_ms: 5995
es_search_duration_ms: 0
execution_uuid: "78e7e44a-7dbe-49c1-8650-6ac55f7762ec"
gap_duration_s: 0
indexing_duration_ms: 0
message: "rule executed: siem.queryRule:869a8540-ff21-11ec-9dc0-8fbfdab2f631: 'Testing that overflow sort bug 🙃'"
num_active_alerts: 0
num_errored_actions: 0
num_new_alerts: 0
num_recovered_alerts: 0
num_succeeded_actions: 0
num_triggered_actions: 0
schedule_delay_ms: 3600
search_duration_ms: 105
security_message: "succeeded"
security_status: "succeeded"
status: "failure"
timed_out: false
timestamp: "2022-07-11T07:53:32.434Z"
total_search_duration_ms: 0

Screenshot 2022-07-11 at 13 17 30

Changes look good to me. Thanks @spong

@spong
Member Author

spong commented Jul 11, 2022

Thanks for testing and finding this additional issue @vitaliidm!

So looking at that specific execution, as far as the code is concerned, we're doing what's expected for the given data.

I have confirmed this will happen no matter the code path (either providing filters and pre-fetching IDs, or just fetching all executions), but you can really only tell with the former, since there the status difference is clearly in conflict with the selected filter.

That said, the reason this is happening is that, as you pointed out, there's a mismatch between the platform and solution statuses: the stack status is an error, while the solution status was successful.

And since we query against event.outcome and kibana.alert.rule.execution.status, but display the solution status in the UI (and only fall back to the platform status if there is no solution status), this is where any sort of mismatch is going to surface...

I'm thinking the best place to fix this is at the UI layer: if there's a mismatch between the platform and solution statuses, just fall back to the platform status (and switch to error.message instead of message if it's an error). Once we start passing our status up to the platform to write a single unified execution status (#130966), that should help here and narrow the chance of a circuit breaker error coming in during task execution and splitting these two statuses like this.
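
A rough sketch of that UI-layer fallback, under stated assumptions. `ExecutionEvent`, `DisplayedStatus`, `resolveDisplayStatus`, and the `success`/`failure` to `succeeded`/`failed` mapping are hypothetical and not taken from the Kibana codebase; the field names loosely mirror the execution record quoted earlier in this thread, and warning statuses are ignored for simplicity.

```typescript
// Hypothetical, simplified event shape (not the real Kibana types).
interface ExecutionEvent {
  status: string; // platform (stack) outcome, e.g. "failure"
  message: string; // platform message
  errorMessage?: string; // platform error message, when the outcome is an error
  securityStatus?: string; // solution status, e.g. "succeeded"
  securityMessage?: string; // solution message
}

interface DisplayedStatus {
  status: string;
  message: string;
  source: "platform" | "solution";
}

// Assumed equivalence between platform outcomes and solution statuses.
const PLATFORM_TO_SOLUTION: Record<string, string> = {
  success: "succeeded",
  failure: "failed",
};

function resolveDisplayStatus(event: ExecutionEvent): DisplayedStatus {
  const platformStatus = PLATFORM_TO_SOLUTION[event.status] ?? event.status;

  // Show the solution status only when it exists and agrees with the platform.
  if (event.securityStatus !== undefined && event.securityStatus === platformStatus) {
    return {
      status: event.securityStatus,
      message: event.securityMessage ?? event.message,
      source: "solution",
    };
  }

  // Missing or mismatched solution status: fall back to the platform status,
  // preferring the error message when the platform outcome is an error.
  const isError = event.status === "failure";
  return {
    status: platformStatus,
    message: isError ? (event.errorMessage ?? event.message) : event.message,
    source: "platform",
  };
}

// Mismatch like the one reported above; the error text is made up.
const mismatched: ExecutionEvent = {
  status: "failure",
  message: "rule executed",
  errorMessage: "hypothetical circuit breaker error",
  securityStatus: "succeeded",
  securityMessage: "succeeded",
};

const resolved = resolveDisplayStatus(mismatched);
console.log(resolved.status); // prints: failed
console.log(resolved.source); // prints: platform
```

The point of the sketch is only that the displayed status would then agree with what the status filter queried against, so a mismatched execution no longer shows up under the wrong filter.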

I'm going to go ahead and create a follow-up issue (#136138) for tracking this one, and we can prioritize accordingly. It would be ideal if we could just get #130966 done and swap to querying single execution events via the find API (instead of this monster agg), but we'll at least have this issue for tracking if that takes a bit. All in all, this should be low impact, since stack/solution statuses should match up in most instances (this being on a resource-constrained CI cloud deploy increased our chances of those circuit breakers coming in and blowing up the executors).

@spong spong added the v8.3.3 label Jul 11, 2022
@spong spong merged commit 7ffe8a7 into elastic:main Jul 11, 2022
@spong spong deleted the fix-rule-execution-log-overflow-sort branch July 11, 2022 18:44
kibanamachine pushed a commit that referenced this pull request Jul 11, 2022
…tially being out of order when providing status filters and max events are hit (#131675)

## Summary

Addresses #131382

Adds an explicit sort on `@timestamp` to the initial query (the one used when 1-2 status filters are applied). Currently, when we overflow past 1k docs, the docs returned are ordered by [descending _count](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html), which can push `Failed` executions past the overflow limit because they often have fewer aggregate documents.

(cherry picked from commit 7ffe8a7)
@kibanamachine
Contributor

💚 All backports created successfully

Status Branch Result
8.3

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request Jul 11, 2022
…tially being out of order when providing status filters and max events are hit (#131675) (#136140)

Co-authored-by: Garrett Spong <[email protected]>
@banderror
Contributor

Thanks for creating #136138 @spong 👍 I think #130966 is def the way to go. It would not only allow us to fix the issues but simplify the implementation as well.

Btw when working on #126063 I noticed another issue with filtering execution results by status: I was able to select only "Succeeded" but got "Warning" results.

@tylersmalley tylersmalley added ci:cloud-deploy and removed ci:deploy-cloud labels Aug 17, 2022
Labels
auto-backport, bug, ci:cloud-deploy, Feature:Rule Monitoring, release_note:fix, Team:Detection Rule Management, Team:Detections and Resp, Team: SecuritySolution, v8.3.3, v8.4.0
7 participants