[Security Solution] Improve RuleExecutionLog performance #118511

Closed
4 tasks done
Tracked by #118324
xcrzx opened this issue Nov 15, 2021 · 5 comments
Assignees
Labels
8.7 candidate Feature:Rule Monitoring Security Solution Detection Rule Monitoring area performance Team:Detection Rule Management Security Detection Rule Management Team Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. v8.0.0 v8.1.0 v8.7.0

Comments

@xcrzx
Contributor

xcrzx commented Nov 15, 2021

Summary

A large amount of time during rule execution is spent on rule status updates. In extreme cases, up to 95% of the total execution time went to logging status changes:

Screenshot 2021-11-11 at 16 05 44

This can reduce the rate at which rules are executed and ultimately lead to execution gaps, and to rule executions being terminated once cancellation of long-running tasks by timeout becomes active. Therefore, we need to find a way to reduce the performance impact of status change writes on rule execution.

Possible solutions

  • Don't wait for rule execution status changes to complete. Currently, a rule execution waits for every status change write operation to complete, but we don't actually need to do that. Instead, we could write logs to a buffer and perform the actual writes to ES once the buffer is full or a timeout elapses. Using a buffer, we can a) unblock the task manager and b) reduce the number of ES write operations when a rule executor doesn't take a long time to complete. For example, when a rule execution starts, it adds a running status to the buffer. If the execution completes shortly after that and adds a succeeded status to the buffer, we can safely drop the previous running status because it has become redundant.
  • Investigate using the Saved Objects Client upsert param to create or update the rule execution status in one go. Currently, we update the rule status in two steps: find the current status and then rewrite it.
  • Use refresh: false for create/update/delete operations whenever possible. See also this ticket.
  • Remove redundant writes when updating execution metrics. Currently, it takes four operations: find the current status object, write new status to it, find the current status object again, write execution metrics to it. Instead, we could write execution status + metrics in one operation.
    Has been addressed in this PR: [Security Solution] Optimized rule execution log performance #118925
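
The buffering idea from the first item could look roughly like the sketch below. It is only an illustration under stated assumptions: `StatusChangeBuffer`, `StatusChange`, `WriteBatch`, and the size and delay defaults are hypothetical names and values, not existing Kibana code, and `writeBatch` stands in for whatever bulk write to ES would actually be used. Keeping only the latest status per rule is what makes an earlier running status disappear when succeeded arrives before the flush.

```typescript
// Hypothetical sketch: none of these names exist in Kibana.
type RuleStatus = 'running' | 'succeeded' | 'failed';

interface StatusChange {
  ruleId: string;
  status: RuleStatus;
  timestamp: number;
}

// Stand-in for the real bulk write to ES (an assumption of this sketch).
type WriteBatch = (batch: StatusChange[]) => Promise<void>;

class StatusChangeBuffer {
  private readonly writeBatch: WriteBatch;
  private readonly maxSize: number;
  private readonly maxDelayMs: number;
  // Keyed by rule id, so a newer status replaces an older, still unwritten one.
  private readonly pending = new Map<string, StatusChange>();
  private timer: ReturnType<typeof setTimeout> | undefined;

  constructor(writeBatch: WriteBatch, maxSize = 100, maxDelayMs = 1000) {
    this.writeBatch = writeBatch;
    this.maxSize = maxSize;
    this.maxDelayMs = maxDelayMs;
  }

  // Returns immediately: the rule executor never awaits the write.
  add(change: StatusChange): void {
    this.pending.set(change.ruleId, change);
    if (this.pending.size >= this.maxSize) {
      this.flushInBackground();
    } else if (this.timer === undefined) {
      this.timer = setTimeout(() => this.flushInBackground(), this.maxDelayMs);
    }
  }

  // Writes everything buffered so far as a single batch.
  async flush(): Promise<void> {
    if (this.timer !== undefined) {
      clearTimeout(this.timer);
      this.timer = undefined;
    }
    if (this.pending.size === 0) return;
    const batch = Array.from(this.pending.values());
    this.pending.clear();
    await this.writeBatch(batch);
  }

  private flushInBackground(): void {
    // A real implementation would log the failure instead of swallowing it.
    this.flush().catch(() => undefined);
  }
}
```

The trade-off is that intermediate statuses are lost, which is acceptable for a "last status" record but would need a different policy if every status change must be kept as history.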
@xcrzx xcrzx added Feature:Detection Rules Security Solution rules and Detection Engine Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Team:Detection Rule Management Security Detection Rule Management Team labels Nov 15, 2021
@elasticmachine
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@elasticmachine
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@xcrzx xcrzx self-assigned this Nov 17, 2021
@banderror banderror added performance 8.1 candidate Feature:Rule Monitoring Security Solution Detection Rule Monitoring area v8.1.0 and removed 8.1 candidate Feature:Detection Rules Security Solution rules and Detection Engine labels Nov 23, 2021
@xcrzx xcrzx added the v8.0.0 label Nov 25, 2021
banderror added a commit that referenced this issue Jan 20, 2022

**Epic:** #118324
**Tickets:** #119603, #119597, #91265, #118511

## Summary

The legacy rule execution logging implementation is replaced by a new one that introduces a new model for execution-related data, a new saved object and a new, cleaner interface and implementation.

- [x] The legacy data model is deleted (`IRuleStatusResponseAttributes`, `IRuleStatusSOAttributes`)
- [x] The legacy `siem-detection-engine-rule-status` saved object type is deleted and marked as deleted in `src/core`
- [x] A new data model is introduced (`x-pack/plugins/security_solution/common/detection_engine/schemas/common/rule_monitoring.ts`). This data model doesn't contain a mixture of successful and failed statuses, which should simplify client-side code (e.g. the code of Rule Management and Monitoring tables, as well as Rule Details page).
- [x] A new `siem-detection-engine-rule-execution-info` saved object is introduced (`x-pack/plugins/security_solution/server/lib/detection_engine/rule_execution_log/rule_execution_info/saved_object.ts`).
  - [x] This SO has a 1:1 association with the rule SO, so every rule can have 0 or 1 execution info objects associated with it. This SO is used to 1) update the last execution status and metrics and 2) fetch execution data for N rules more efficiently than with the legacy SO.
  - [x] The logic of creating or updating this SO is based on the "upsert" approach (planned in #118511). It no longer fetches the SO by rule id before updating it.
- [x] Rule execution logging logic is rewritten (see `x-pack/plugins/security_solution/server/lib/detection_engine/rule_execution_log`). The previous rule execution log client is split into two objects: `IRuleExecutionLogClient` for using it from route handlers, and `IRuleExecutionLogger` for writing logs from rule executors.
  - [x] `IRuleExecutionLogger` instance is scoped to the currently executing rule and space id. There's no need to pass rule id, name, type etc to `.logStatusChange()` every time.
- [x] Rule executors and related functions are updated.
- [x] API routes are updated, including the rule preview route which uses a special "spy" implementation of `IRuleExecutionLogger`. A rule returned from an API endpoint now has an optional `execution_summary` field of type `RuleExecutionSummary`.
- [x] UI is updated to use the new data model of `RuleExecutionSummary`:
  - [x] Rule Management and Monitoring tables
  - [x] Rule Details page
- [x] A new API route is introduced for fetching rule execution events: `/internal/detection_engine/rules/{ruleId}/execution/events`. It is used for rendering the Failure History tab (last 5 failures) and is intended to be used in the coming UI of Rule Execution Log on the Details page.
- [x] Rule Details page and Failure History tab are updated to use the new data models and API routes.
- [x] I used `react-query` for fetching execution events
  - [x] See `x-pack/plugins/security_solution/public/detections/containers/detection_engine/rules/use_rule_execution_events.tsx`
  - [x] The lib is updated to the latest version
- [x] Tests are fixed and updated according to all the changes
- [x] Components related to rule execution statuses are all moved to `x-pack/plugins/security_solution/public/detections/components/rules/rule_execution_status`.
- [x] I left a lot of `// TODO: https://github.com/elastic/kibana/pull/121644` comments in the code which I'm planning to address and remove in a follow-up PR. A lot of cleanup work is needed, but I'd like to unblock the work on Rule Execution Log UI.
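
To illustrate the "upsert" approach and the rule-scoped logger described in the list above, here is a minimal sketch. Everything in it is an assumption made for illustration: `UpsertingStore`, `InMemoryStore`, `ScopedExecutionLogger`, `executionInfoId`, and the attribute names are hypothetical and are not the actual `IRuleExecutionLogger` implementation or the Saved Objects Client API. In particular, the way the id is derived from the space id and rule id is invented here; the real id generation scheme is not described in this PR text.

```typescript
// Hypothetical sketch: all names below are illustrative stand-ins, not Kibana APIs.
interface ExecutionInfo {
  lastStatus: string;
  lastStatusAt: string;
  metrics: Record<string, number>;
}

// Simplified store contract: update the document if it exists,
// otherwise create it from `options.upsert` in the same call.
interface UpsertingStore {
  update(
    id: string,
    attributes: Partial<ExecutionInfo>,
    options: { upsert: ExecutionInfo }
  ): Promise<void>;
}

// In-memory store used only to make the sketch runnable.
class InMemoryStore implements UpsertingStore {
  readonly docs = new Map<string, ExecutionInfo>();
  writeCount = 0;

  async update(
    id: string,
    attributes: Partial<ExecutionInfo>,
    options: { upsert: ExecutionInfo }
  ): Promise<void> {
    const existing = this.docs.get(id);
    this.docs.set(id, existing ? { ...existing, ...attributes } : options.upsert);
    this.writeCount += 1;
  }
}

// Assumed id scheme: a deterministic id means no lookup by rule id is needed.
function executionInfoId(spaceId: string, ruleId: string): string {
  return `${spaceId}:${ruleId}`;
}

// Scoped to one rule and space, so callers pass only what changes.
class ScopedExecutionLogger {
  private readonly store: UpsertingStore;
  private readonly docId: string;

  constructor(store: UpsertingStore, spaceId: string, ruleId: string) {
    this.store = store;
    this.docId = executionInfoId(spaceId, ruleId);
  }

  // Status and metrics go out in a single write, with no preceding read.
  async logStatusChange(status: string, metrics: Record<string, number> = {}): Promise<void> {
    const attributes: ExecutionInfo = {
      lastStatus: status,
      lastStatusAt: new Date().toISOString(),
      metrics,
    };
    await this.store.update(this.docId, attributes, { upsert: attributes });
  }
}
```

With this shape, each status change costs one store operation regardless of whether the document already exists, which is the point of replacing the find-then-rewrite sequence.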

## In the next episodes

- Address and remove `// TODO: https://github.com/elastic/kibana/pull/121644` comments in the code
- Make sure that SO id generation for `siem-detection-engine-rule-execution-info` is safe and future-proof. Sync with the Core team. If there are risks, we will need to choose between risks and performance (reading the SO before updating it). It would be easy to submit a fix if needed.
- Add APM integration. Use `withSecuritySpan` in methods of `rule_execution_log` citizens.
- Add comments to the code and README.
- Add test coverage.
- Etc...

### Checklist

Delete any items that are not applicable to this PR.

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
- [ ] Any UI touched in this PR is usable by keyboard only (learn more about [keyboard accessibility](https://webaim.org/techniques/keyboard/))
- [ ] Any UI touched in this PR does not create any new axe failures (run axe in browser: [FF](https://addons.mozilla.org/en-US/firefox/addon/axe-devtools/), [Chrome](https://chrome.google.com/webstore/detail/axe-web-accessibility-tes/lhdoppojpmngadmnindnejefpokejbdd?hl=en-US))
- [x] If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the [docker list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)
- [ ] This renders correctly on smaller devices using a responsive layout. (You can test this [in your browser](https://www.browserstack.com/guide/responsive-testing-on-local-server))
- [ ] This was checked for [cross-browser compatibility](https://www.elastic.co/support/matrix#matrix_browsers)

### For maintainers

- [x] This was checked for breaking API changes and was [labeled appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)
@banderror banderror added the 8.2 candidate considered, but not committed, for 8.2 release label Feb 15, 2022
@banderror banderror removed the 8.2 candidate considered, but not committed, for 8.2 release label Feb 28, 2022
@banderror
Contributor

When #135127 is in progress or done, we will need to make sure that the implementation on the Alerting Framework side is fast and doesn't block rule executors.

@banderror
Contributor

We will be able to close this when #147759 is addressed.

@banderror
Contributor

The 1st and 3rd items from Possible solutions have been implemented as part of #147759 and #130966
