[Security Solution][Detections] Fixes Rule Execution Log events potentially being out of order when providing status filters and max events are hit #131675
Conversation
Pinging @elastic/security-detections-response (Team:Detections and Resp)
Pinging @elastic/security-solution (Team: SecuritySolution)
Tested on Cloud and managed to filter by Succeeded and Failed correctly. What I did:
- I set up a date range that includes the Failed execution, but is big enough to include more than 1000 executions.
- Selected Succeeded + Failed in the Status filter.
So that works! 🙏
What I also noticed when testing is that every table reload is extremely slow, both when the table is initially loaded and when it reloads due to changes in filters or pagination. I must admit that the whole Details page was slow as well, including the Alerts table, but the Rule execution logs table seemed especially slow. I'm wondering whether this could be partially caused by the changes in this PR, or whether the only reason is the overall slowness of this cloud environment.
@xcrzx could you please test it locally and check the performance with APM?
I think it would also be great to cover this scenario with integration tests.
…ion-log-overflow-sort
@elasticmachine merge upstream
@elasticmachine merge upstream
Alrighty @banderror, I've added a fun API integration test for exercising this bug in 7a657ac, and then set up the cloud deploy for testing perf as well. This rule is running every 5 seconds, so we just need to let it run a bit, then put it in an error state and see how things are. TBH, it was really slow getting just that rule configured, similar to what you mentioned, so I wouldn't be surprised if it's just the lack of sizing on these test clusters, but we'll see. Hopefully we can edit the cloud config and scale them up to test as well.
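For reference, a minimal sketch of what an API integration test for this scenario could look like. The route path, query parameters, response shape, and fixture rule are assumptions for illustration only; the actual test added in 7a657ac uses Kibana's FTR harness and its own helpers.

```ts
// Hypothetical sketch only — route, params, and response shape are assumptions,
// not the actual FTR test from 7a657ac.
import supertest from 'supertest';
import { expect } from 'chai';

describe('rule execution log - status filters with >1k executions', () => {
  const kibana = supertest('http://localhost:5601');

  it('still returns the Failed execution when Succeeded + Failed are selected', async () => {
    // Assumes fixtures have indexed >1000 succeeded execution events and a single
    // failed one for RULE_ID within the queried date range.
    const RULE_ID = 'rule-with-many-executions'; // hypothetical fixture rule id

    const { body } = await kibana
      .get(`/internal/detection_engine/rules/${RULE_ID}/execution/events`) // assumed route shape
      .query({
        start: '2022-07-01T00:00:00.000Z',
        end: '2022-07-31T00:00:00.000Z',
        status_filters: 'succeeded,failed',
        sort_field: 'timestamp',
        sort_order: 'desc',
      })
      .set('kbn-xsrf', 'true')
      .expect(200);

    // The failed execution must not be dropped by the 1k bucket/overflow limit.
    const statuses = body.events.map((e: { status: string }) => e.status);
    expect(statuses).to.include('failed');
  });
});
```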
💛 Build succeeded, but was flaky
Failed CI Steps
Test Failures
Metrics [docs]
History
To update your PR or re-run it, just comment with:
cc @spong
Was able to verify the fix of the issue on a cloud environment (number of succeeded executions > 1000).
The observed performance issues are more likely related to the cloud environment itself; for example, enabling/disabling a single rule there takes ~3-4s. On a separate note (not related to this PR): I noticed one execution was recorded with different security_status and status fields:
Changes look good to me. Thanks @spong
Thanks for testing and finding this additional issue @vitaliidm! Looking at that specific execution, as far as the code is concerned, we're doing what's expected for the given data. I have confirmed this will happen no matter the code path (either providing filters and pre-fetching IDs, or just fetching all executions), but you can really only tell with the former, as the status difference is clear since it's in conflict with the selected filter.

That said, the reason this is happening is that, as you pointed out, there's a mismatch between the platform and solution statuses: the stack status is an error, while the solution status was successful. And since we query against […], I'm thinking the best place to fix this is at the UI layer: if there's a mismatch between platform/solution status, just fall back to the platform status (and switch to […]).

I'm going to go ahead and create a follow-up issue (#136138) for tracking this one, and we can prioritize accordingly. It would be ideal if we could just get #130966 worked and swap to querying single execution events via the find API (instead of this monster agg), but we'll at least have this issue for tracking if that takes a bit.

All in all, this should be low impact since stack/solution statuses should match up in most instances (this being on a resource-constrained CI cloud deploy increased our chances of those circuit breakers coming in and blowing up the executors).
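As a rough illustration of the fallback described above, here's a minimal sketch. The type names and field names are hypothetical and don't reflect the actual Security Solution execution event model:

```ts
// Hypothetical sketch of the proposed UI-layer fallback — type and field names are
// illustrative, not the actual Security Solution data model.
type PlatformStatus = 'success' | 'failure' | 'unknown';
type SolutionStatus = 'succeeded' | 'failed' | 'partial failure';

interface ExecutionEvent {
  status: PlatformStatus;          // stack/platform status
  security_status: SolutionStatus; // solution status
}

// Map a platform status onto the solution's vocabulary so the two can be compared.
const toSolutionStatus = (status: PlatformStatus): SolutionStatus =>
  status === 'success' ? 'succeeded' : status === 'failure' ? 'failed' : 'partial failure';

/**
 * If the platform and solution statuses disagree, trust the platform status,
 * since a circuit breaker can kill an executor after the solution has already
 * recorded a "succeeded" status.
 */
export const getDisplayStatus = (event: ExecutionEvent): SolutionStatus => {
  const platformAsSolution = toSolutionStatus(event.status);
  return platformAsSolution === event.security_status
    ? event.security_status
    : platformAsSolution;
};
```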
…tially being out of order when providing status filters and max events are hit (#131675)

## Summary

Addresses #131382

Adds an explicit sort on `@timestamp` to the initial query (when 1-2 status filters are applied), since when we currently overflow past 1k docs the returned docs are ordered by [descending `_count`](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html), which can cause `Failed` executions to land past the overflow limit as they often have fewer aggregate documents.

(cherry picked from commit 7ffe8a7)
💚 All backports created successfully
Note: Successful backport PRs will be merged automatically after passing CI. Questions? Please refer to the Backport tool documentation
…tially being out of order when providing status filters and max events are hit (#131675) (#136140)

## Summary

Addresses #131382

Adds an explicit sort on `@timestamp` to the initial query (when 1-2 status filters are applied), since when we currently overflow past 1k docs the returned docs are ordered by [descending `_count`](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html), which can cause `Failed` executions to land past the overflow limit as they often have fewer aggregate documents.

(cherry picked from commit 7ffe8a7)

Co-authored-by: Garrett Spong <[email protected]>
Thanks for creating #136138 @spong 👍 I think #130966 is def the way to go. It would not only allow us to fix these issues but also simplify the implementation. Btw, when working on #126063 I noticed another issue with filtering execution results by status: I was able to select only "Succeeded" but got "Warning" results.
Summary
Addresses #131382
Adds an explicit sort on `@timestamp` to the initial query (when 1-2 status filters are applied), since when we currently overflow past 1k docs the returned docs are ordered by descending `_count`, which can cause `Failed` executions to land past the overflow limit as they often have fewer aggregate documents.
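As a rough sketch of the technique (not the exact aggregation in this PR), the execution-UUID terms aggregation can be ordered by a max `@timestamp` sub-aggregation instead of the default descending `_count`, so the most recent executions, including sparse `Failed` ones, survive the bucket limit. Index and field names below are assumptions:

```ts
// Illustrative sketch only — index and field names are assumptions, not the exact
// aggregation used by the rule execution log.
const buildExecutionIdsQuery = (ruleId: string, start: string, end: string) => ({
  index: '.kibana-event-log-*', // assumed event log index pattern
  size: 0,
  query: {
    bool: {
      filter: [
        { term: { 'rule.id': ruleId } },
        { range: { '@timestamp': { gte: start, lte: end } } },
      ],
    },
  },
  aggs: {
    executions: {
      terms: {
        field: 'kibana.alert.rule.execution.uuid', // assumed execution id field
        size: 1000, // the 1k cap discussed above
        // Key idea: order buckets by their latest @timestamp rather than the default
        // descending _count, so sparse `Failed` executions aren't pushed past the cap.
        order: { latestTimestamp: 'desc' },
      },
      aggs: {
        latestTimestamp: { max: { field: '@timestamp' } },
      },
    },
  },
});
```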