[Security Solution] Rules management performance and scalability issues #134826

xcrzx · 2022-06-21T12:18:17Z

Summary

As we've started collecting real user monitoring data, it is now possible to analyze the performance measurements of Security Solution user interactions using this view. Note that the data set may still be small because many production clusters have not yet been updated to 8.3, but we can already start the analysis.

Security Solution transactions by impact

Rules table filtration

One of the most impactful transactions is securitySolution rulesTable filter, which is triggered every time a user navigates to the rules management table and changes the table filter (selects tags, custom rules, etc.). While the average latency of those transactions is not too high (~1200 ms), they get triggered pretty often, making them one of the top transactions by total performance impact.

If we look deeper at the transactions, we can see that their slowness in the majority of cases is due to slow responses from the detection_engine/rules/prepackages/_status endpoint. Examples:

On the rules management page, we use the response from detection_engine/rules/prepackages/_status to:

Display a callout to users when there are updates to prepackaged rules
Display the Elastic rules and custom rules counters

For both purposes, we don't need to fetch data on every rule filter value change. For example, it would be enough to fetch prepackaged rule statuses only on the initial page load and cache the result. Then, we can revalidate the cache periodically (say, every 5 minutes) and after rule mutations.
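To illustrate the caching idea, here is a minimal sketch of a stale-time cache with explicit invalidation. The names (StaleTimeCache, the fetcher callback, the injectable clock) are hypothetical and not taken from the Kibana codebase; in practice a data-fetching library would likely provide this behavior.

```typescript
// Hypothetical sketch: caches one async result and refetches only when the
// cached value is older than `staleTimeMs` or has been explicitly invalidated.
class StaleTimeCache<T> {
  private fetcher: () => Promise<T>;
  private staleTimeMs: number;
  private now: () => number;
  private value: T | undefined = undefined;
  private fetchedAt = 0;
  private hasValue = false;

  constructor(
    fetcher: () => Promise<T>,
    staleTimeMs: number,
    now: () => number = () => Date.now() // injectable clock, handy for tests
  ) {
    this.fetcher = fetcher;
    this.staleTimeMs = staleTimeMs;
    this.now = now;
  }

  // Returns the cached value while it is fresh, otherwise refetches.
  async get(): Promise<T> {
    const isFresh = this.hasValue && this.now() - this.fetchedAt < this.staleTimeMs;
    if (isFresh) {
      return this.value as T;
    }
    const fresh = await this.fetcher();
    this.value = fresh;
    this.fetchedAt = this.now();
    this.hasValue = true;
    return fresh;
  }

  // Call after a rule mutation so the next get() refetches.
  invalidate(): void {
    this.hasValue = false;
  }
}
```

With a stale time of 5 minutes, changing the table filter repeatedly would reuse the cached status, and only a rule mutation (via invalidate) or an expired stale time would trigger a new request.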

Further, if we look into details of the execution of the detection_engine/rules/prepackages/_status, we'll see more opportunities for optimization.

Almost all asynchronous operations of that endpoint get executed serially. Rewriting them into parallel execution could significantly reduce the total endpoint execution time.
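A rough sketch of the difference, assuming the lookups are independent of each other. The StatusLookups interface and its three functions are hypothetical stand-ins, not the actual operations of the endpoint.

```typescript
// Hypothetical stand-ins for the endpoint's independent async lookups.
interface StatusLookups {
  countCustomRules: () => Promise<number>;
  countInstalledRules: () => Promise<number>;
  countInstalledTimelines: () => Promise<number>;
}

interface StatusSummary {
  customRules: number;
  installedRules: number;
  installedTimelines: number;
}

// Serial: each lookup starts only after the previous one has finished,
// so the total time is roughly the sum of all lookup times.
async function getStatusSerially(lookups: StatusLookups): Promise<StatusSummary> {
  const customRules = await lookups.countCustomRules();
  const installedRules = await lookups.countInstalledRules();
  const installedTimelines = await lookups.countInstalledTimelines();
  return { customRules, installedRules, installedTimelines };
}

// Parallel: all lookups start immediately, so the total time is roughly
// the time of the slowest lookup.
async function getStatusInParallel(lookups: StatusLookups): Promise<StatusSummary> {
  const [customRules, installedRules, installedTimelines] = await Promise.all([
    lookups.countCustomRules(),
    lookups.countInstalledRules(),
    lookups.countInstalledTimelines(),
  ]);
  return { customRules, installedRules, installedTimelines };
}
```

This only helps when the operations really are independent; any lookup that needs the result of another still has to wait for it.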

Another option would be to split that endpoint into several smaller ones. For example, one endpoint for statuses of prepackaged rules and another for Elastic/custom rules counters.

Prebuilt rules installation

Analysis of securitySolution rulesTable loadPrebuilt transactions shows this operation has several performance issues.

Prebuilt rule installation operations are not parallelized. We call the async methods sequentially (installPrepackagedRules -> installPrepackagedTimelines -> updatePrepackagedRules, see source), although those methods do not depend on each other and could be called in parallel.
The installPrepackagedRules method currently writes all prepackaged rules in parallel, so it sends more than 600 parallel write requests if all rules get installed on a fresh Kibana instance. That could lead to performance problems on less powerful instances. This method should leverage promisePool and respect the MAX_RULES_TO_UPDATE_IN_PARALLEL setting.
Similarly, updatePrepackagedRules should use promisePool instead of a custom chunk implementation.
Consider implementing bulk rule creation on the RulesClient level.
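To make the bounded-parallelism point concrete, here is a minimal sketch of a concurrency-limited pool. It is not the promisePool helper mentioned above, whose signature is not shown here; runWithConcurrencyLimit is a hypothetical name, and the limit passed in would play the role of MAX_RULES_TO_UPDATE_IN_PARALLEL.

```typescript
// Hypothetical sketch: processes `items` with at most `concurrency`
// worker calls in flight at any time, preserving the input order in the output.
async function runWithConcurrencyLimit<T, R>(
  items: readonly T[],
  concurrency: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let nextIndex = 0;

  // Each runner repeatedly claims the next unprocessed index until none remain.
  const runNext = async (): Promise<void> => {
    while (nextIndex < items.length) {
      const index = nextIndex++;
      results[index] = await worker(items[index]);
    }
  };

  const runnerCount = Math.max(1, Math.min(concurrency, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runNext()));
  return results;
}
```

Unlike splitting the input into fixed chunks, a pool starts the next item as soon as any slot frees up, so one slow write does not hold back the rest of its chunk. In this sketch a failed worker call rejects the whole operation; a real implementation would probably collect errors per item instead.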

Requests for installed integrations

Many different transactions captured requests to installed integrations. However, those requests don't seem to be related to user actions. For example:

That probably indicates that we have misconfigured the request cache and fetch installed integrations more frequently than needed. Integrations do not change often, so we can set the stale time to, for example, 15 minutes, so that we fetch that data at most once every 15 minutes.

Duplicated ML Jobs Summary requests

During the initial loading of the rules management page we send 2-3 duplicated POST /api/ml/jobs/jobs_summary API requests:
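One way to avoid such duplicates is to share a single in-flight request between all callers that ask for the same data at the same time. The sketch below shows the idea only; dedupeInFlight and the string cache key are hypothetical, and a data-fetching library with built-in request deduplication would normally handle this.

```typescript
// Hypothetical sketch: wraps a fetcher so that concurrent calls with the same
// key share one pending promise instead of issuing separate requests.
function dedupeInFlight<T>(
  fetcher: (key: string) => Promise<T>
): (key: string) => Promise<T> {
  const inFlight = new Map<string, Promise<T>>();

  return (key: string): Promise<T> => {
    const pending = inFlight.get(key);
    if (pending !== undefined) {
      return pending; // reuse the request that is already running
    }
    // Forget the request once it settles so later calls fetch fresh data.
    const request = fetcher(key).finally(() => {
      inFlight.delete(key);
    });
    inFlight.set(key, request);
    return request;
  };
}
```

If several components mount at once and each asks for the jobs summary, only the first call reaches the network and the others wait on the same promise. This removes simultaneous duplicates only; avoiding repeated requests over time additionally needs caching with a stale time.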


elasticmachine · 2022-06-21T12:18:19Z

Pinging @elastic/security-solution (Team: SecuritySolution)

elasticmachine · 2022-06-21T12:18:21Z

Pinging @elastic/security-detections-response (Team:Detections and Resp)

**Resolves: #134826**

## Summary

We've recently had a couple of SDHs related to the slow performance of Security Solution. The performance issues were related to slow responses from ML API. In this PR, I rewrite data-fetching hooks to React Query to leverage request deduplication and caching. That significantly reduces the number of outgoing HTTP requests to ML routes on page load.

**Before**

9 HTTP requests to ML endpoints on initial page load, 5 of which are duplicates.

![Screenshot 2023-03-20 at 12 09 55](https://user-images.githubusercontent.com/1938181/226322831-b3f594c9-a08a-4c8a-acc9-247df8d4a028.png)

**After**

4 HTTP requests to ML endpoints on initial page load, no duplicates.

![Screenshot 2023-03-20 at 11 59 33](https://user-images.githubusercontent.com/1938181/226322847-a69129a2-afcd-408d-911c-a7b767dc4fdb.png)

xcrzx added the performance, Team:Detections and Resp, Team: SecuritySolution, and Team:Detection Rule Management labels Jun 21, 2022

xcrzx self-assigned this Jun 21, 2022

xcrzx mentioned this issue Jun 23, 2022

[Security Solution] Performance measurement and monitoring with Elastic APM #130971

Open

banderror added the Feature:Rule Management, Feature:ML Rule, Feature:Prebuilt Detection Rules, 8.4 candidate, and v8.4.0 labels Jun 24, 2022

xcrzx mentioned this issue Jun 28, 2022

[Security Solution] Fix performance issues affecting rules management #135311

Merged

banderror removed the 8.4 candidate label Aug 16, 2022

xcrzx mentioned this issue Mar 16, 2023

[Security Solution] Deduplicate requests to ML #153244

Merged

banderror added the 8.8 candidate label Mar 21, 2023

xcrzx closed this as completed in #153244 Mar 23, 2023

xcrzx added the v8.8.0 label Mar 23, 2023
