[Security Solution] Rules management performance and scalability issues #134826

Closed
Tracked by #130971
xcrzx opened this issue Jun 21, 2022 · 2 comments · Fixed by #153244
Labels
8.8 candidate Feature:ML Rule Security Solution Machine Learning rule type Feature:Prebuilt Detection Rules Security Solution Prebuilt Detection Rules area Feature:Rule Management Security Solution Detection Rule Management area performance Team:Detection Rule Management Security Detection Rule Management Team Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. v8.4.0 v8.8.0

Comments

@xcrzx
Contributor

xcrzx commented Jun 21, 2022

Summary

Now that we have started collecting real user monitoring data, we can analyze the performance of Security Solution user interactions using this view. The data set is still small because many production clusters have not yet been updated to 8.3, but we can already conduct an initial analysis.

Security Solution transactions by impact

Screenshot 2022-06-21 at 13 14 38

Rules table filtration

One of the most impactful transactions is securitySolution rulesTable filter, which is triggered every time a user navigates to the rules management table and changes the table filter (selects tags, custom rules, etc.). While the average latency of those transactions is not too high (~1200 ms), they get triggered pretty often, making them one of the top transactions by total performance impact.

If we look deeper at the transactions, we can see that their slowness in the majority of cases is due to slow responses from the detection_engine/rules/prepackages/_status endpoint. Examples:

Screenshot 2022-06-21 at 13 35 24

Screenshot 2022-06-21 at 13 36 39

On the rules management page, we use the response from detection_engine/rules/prepackages/_status to:

  1. Display a callout to users when there are updates to prepackaged rules
  2. Display the Elastic rules and custom rules counters

For both purposes, we don't need to fetch data every time the rule filter changes. For example, it would be enough to fetch prepackaged rule statuses once on the initial page load and cache the result. We can then revalidate the cache periodically (say, every 5 minutes) and after rule mutations.
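A minimal sketch of the caching idea above, under stated assumptions: `CachedQuery`, `STALE_TIME_MS`, and the injected `fetcher` are hypothetical names used only for illustration and do not come from the Kibana codebase. The `fetcher` stands in for the call to the status endpoint, and the clock is injectable so the behavior can be checked without waiting.

```typescript
// Hypothetical stale time, taken from the "let's say 5 minutes" suggestion.
const STALE_TIME_MS = 5 * 60 * 1000;

// Caches the result of an async fetcher and refetches only when the cached
// value is older than staleTimeMs or has been explicitly invalidated.
class CachedQuery<T> {
  private value: T | undefined = undefined;
  private hasValue = false;
  private fetchedAt = 0;
  private fetcher: () => Promise<T>;
  private staleTimeMs: number;
  private now: () => number;

  constructor(
    fetcher: () => Promise<T>,
    staleTimeMs: number,
    now: () => number = () => Date.now()
  ) {
    this.fetcher = fetcher;
    this.staleTimeMs = staleTimeMs;
    this.now = now;
  }

  // Returns the cached value while it is fresh; otherwise refetches.
  async get(): Promise<T> {
    const isFresh =
      this.hasValue && this.now() - this.fetchedAt < this.staleTimeMs;
    if (!isFresh) {
      this.value = await this.fetcher();
      this.fetchedAt = this.now();
      this.hasValue = true;
    }
    return this.value as T;
  }

  // Call after a rule mutation so the next get() refetches.
  invalidate(): void {
    this.hasValue = false;
  }
}
```

With this shape, changing a table filter would call `get()` and hit the cache, while a rule mutation would call `invalidate()` first. The sketch does not merge concurrent calls to `get()`; a data-fetching library would typically handle that as well.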

Further, if we look into the details of how detection_engine/rules/prepackages/_status executes, we'll see more opportunities for optimization.

Screenshot 2022-06-21 at 14 12 38

Almost all asynchronous operations of that endpoint get executed serially. Rewriting them into parallel execution could significantly reduce the total endpoint execution time.
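The difference between the two execution styles can be sketched as follows. This is an illustration only: `Lookup`, `runSerially`, and `runInParallel` are hypothetical names, and the lookups stand in for whatever independent async operations the endpoint performs.

```typescript
// Hypothetical stand-in for one independent async operation of the endpoint.
type Lookup<T> = () => Promise<T>;

// Serial execution: each lookup starts only after the previous one finished,
// so the total time is roughly the sum of the individual durations.
async function runSerially<T>(lookups: Lookup<T>[]): Promise<T[]> {
  const results: T[] = [];
  for (const lookup of lookups) {
    results.push(await lookup());
  }
  return results;
}

// Parallel execution: all lookups start immediately, so the total time is
// roughly the duration of the slowest one. Result order matches input order.
async function runInParallel<T>(lookups: Lookup<T>[]): Promise<T[]> {
  return Promise.all(lookups.map((lookup) => lookup()));
}
```

This only helps when the operations really are independent; anything that consumes the result of an earlier call still has to wait for it.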

Another option would be to split that endpoint into several smaller ones. For example, one endpoint for statuses of prepackaged rules and another for Elastic/custom rules counters.

Prebuilt rules installation

Analysis of securitySolution rulesTable loadPrebuilt transactions shows this operation has several performance issues.

  1. Prebuilt rule installation operations are not parallelized. We call the async methods sequentially (installPrepackagedRules -> installPrepackagedTimelines -> updatePrepackagedRules, see source), although those methods do not depend on each other and could be called in parallel.
  2. The installPrepackagedRules method currently writes all prepackaged rules in parallel. So it sends > 600 parallel write requests if all rules get installed on a fresh Kibana instance. It could lead to performance problems on less powerful instances. This method should leverage promisePool and respect the MAX_RULES_TO_UPDATE_IN_PARALLEL setting.
  3. Similarly, the updatePrepackagedRules should use promisePool instead of a custom chunk implementation.
  4. Consider implementing bulk rule creation on the RulesClient level.
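Items 2 and 3 both come down to bounding how many writes are in flight at once. The sketch below shows one way to do that. It is not Kibana's promisePool: `runWithConcurrencyLimit` and its signature are hypothetical, and the concurrency value passed in would correspond to a setting such as MAX_RULES_TO_UPDATE_IN_PARALLEL.

```typescript
// Runs executor over all items with at most `concurrency` calls in flight.
// Results are returned in the same order as the input items.
// Assumption: the first rejection rejects the whole call; a production
// helper would more likely collect errors per item instead.
async function runWithConcurrencyLimit<Item, Result>(
  items: Item[],
  concurrency: number,
  executor: (item: Item) => Promise<Result>
): Promise<Result[]> {
  const results = new Array<Result>(items.length);
  let nextIndex = 0;

  // Each worker repeatedly claims the next unprocessed index until none remain.
  async function worker(): Promise<void> {
    while (nextIndex < items.length) {
      const index = nextIndex;
      nextIndex += 1;
      results[index] = await executor(items[index]);
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, items.length));
  const workers: Promise<void>[] = [];
  for (let i = 0; i < workerCount; i += 1) {
    workers.push(worker());
  }
  await Promise.all(workers);
  return results;
}
```

Compared with splitting the input into fixed chunks, a pool like this starts the next item as soon as any slot frees up, so one slow write does not hold up the rest of its chunk.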

Requests for installed integrations

Many different transactions captured requests to installed integrations. However, those requests don't seem to be related to user actions. For example:

Screenshot 2022-06-21 at 17 35 35

That probably indicates that the request cache is misconfigured and we fetch installed integrations more often than needed. Installed integrations rarely change, so we could set the stale time to, for example, 15 minutes so that the data is fetched at most once every 15 minutes.

Duplicated ML Jobs Summary requests

During the initial loading of the rules management page we send 2-3 duplicated POST /api/ml/jobs/jobs_summary API requests:

Screenshot 2022-06-21 at 17 53 48
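One common way to avoid duplicates like these is to share a single in-flight promise among all callers that ask for the same data. The sketch below illustrates that technique without any library. `RequestDeduplicator` and the key string are hypothetical, and this is not the implementation used in the fix, which relied on React Query for deduplication and caching.

```typescript
// Shares one in-flight promise per key so that concurrent callers asking for
// the same data trigger a single underlying request.
class RequestDeduplicator<T> {
  private inFlight = new Map<string, Promise<T>>();

  fetch(key: string, request: () => Promise<T>): Promise<T> {
    const existing = this.inFlight.get(key);
    if (existing) {
      return existing;
    }
    // Remove the entry once the request settles, whether it succeeded or
    // failed, so that later calls issue a fresh request.
    const promise = request().then(
      (value) => {
        this.inFlight.delete(key);
        return value;
      },
      (error) => {
        this.inFlight.delete(key);
        throw error;
      }
    );
    this.inFlight.set(key, promise);
    return promise;
  }
}
```

Several components that mount at the same time and each call `fetch("jobs_summary", ...)` would then share one request. Deduplication alone does not cache settled results; that needs a stale-time policy on top.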

@xcrzx xcrzx added performance Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Team:Detection Rule Management Security Detection Rule Management Team labels Jun 21, 2022
@xcrzx xcrzx self-assigned this Jun 21, 2022
@elasticmachine
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@elasticmachine
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@banderror banderror added Feature:Rule Management Security Solution Detection Rule Management area Feature:ML Rule Security Solution Machine Learning rule type Feature:Prebuilt Detection Rules Security Solution Prebuilt Detection Rules area 8.4 candidate v8.4.0 labels Jun 24, 2022
xcrzx added a commit that referenced this issue Mar 23, 2023
**Resolves: #134826**

## Summary

We've recently had a couple of SDHs related to the slow performance of
Security Solution. The performance issues were related to slow responses
from ML API. In this PR, I rewrite data-fetching hooks to React Query to
leverage request deduplication and caching. That significantly reduces
the number of outgoing HTTP requests to ML routes on page load.

**Before**
9 HTTP requests to ML endpoints on initial page load, 5 of which are
duplicates.

![Screenshot 2023-03-20 at 12 09 55](https://user-images.githubusercontent.com/1938181/226322831-b3f594c9-a08a-4c8a-acc9-247df8d4a028.png)

**After**
4 HTTP requests to ML endpoints on initial page load, no duplicates.

![Screenshot 2023-03-20 at 11 59 33](https://user-images.githubusercontent.com/1938181/226322847-a69129a2-afcd-408d-911c-a7b767dc4fdb.png)
@xcrzx xcrzx added the v8.8.0 label Mar 23, 2023