[Security Solution] Rules management performance and scalability issues #134826
Labels
8.8 candidate
Feature:ML Rule
Security Solution Machine Learning rule type
Feature:Prebuilt Detection Rules
Security Solution Prebuilt Detection Rules area
Feature:Rule Management
Security Solution Detection Rule Management area
performance
Team:Detection Rule Management
Security Detection Rule Management Team
Team:Detections and Resp
Security Detection Response Team
Team: SecuritySolution
Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc.
v8.4.0
v8.8.0
Summary
Now that we've started collecting real user monitoring data, we can analyze the performance of Security Solution user interactions using this view. There may not be much data yet, since many production clusters have not been updated to 8.3, but it is already enough for an initial analysis.
Security Solution transactions by impact
Rules table filtration
One of the most impactful transactions is securitySolution rulesTable filter, which is triggered every time a user navigates to the rules management table and changes the table filter (selects tags, custom rules, etc.). While the average latency of those transactions is not too high (~1200 ms), they get triggered pretty often, making them one of the top transactions by total performance impact.
If we look deeper at the transactions, we can see that their slowness in the majority of cases is due to slow responses from the detection_engine/rules/prepackages/_status endpoint.
On the rules management page, we use the response from detection_engine/rules/prepackages/_status for two purposes. For both purposes, we don't need to fetch data on every rule filter value change. For example, it would be enough to fetch prepackaged rule statuses only on the initial page load and cache them. Then, we can revalidate the cache periodically (let's say every 5 minutes) and after rule mutations.
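The caching idea above can be sketched as a small stale-time cache. This is a minimal, dependency-free illustration, not Kibana code: `StaleCache`, its constructor arguments, and the injectable clock are all hypothetical names introduced here. In the actual UI the same effect would more likely come from the data-fetching library's own stale-time setting.

```typescript
// Hypothetical helper (not a Kibana API): caches one async result and
// refetches only when the entry is stale or has been invalidated.
class StaleCache<T> {
  private entry: { value: Promise<T>; fetchedAt: number } | undefined;
  private readonly fetcher: () => Promise<T>;
  private readonly staleTimeMs: number;
  private readonly now: () => number;

  constructor(
    fetcher: () => Promise<T>,
    staleTimeMs: number,
    // Injectable clock so the behaviour can be checked without real waiting.
    now: () => number = () => Date.now()
  ) {
    this.fetcher = fetcher;
    this.staleTimeMs = staleTimeMs;
    this.now = now;
  }

  get(): Promise<T> {
    const cached = this.entry;
    if (cached !== undefined && this.now() - cached.fetchedAt < this.staleTimeMs) {
      return cached.value; // still fresh: no request is sent
    }
    const value = this.fetcher();
    this.entry = { value, fetchedAt: this.now() };
    // Do not keep a failed request cached; the next get() retries.
    value.catch(() => {
      if (this.entry !== undefined && this.entry.value === value) {
        this.entry = undefined;
      }
    });
    return value;
  }

  // Call after rule mutations so the next read refetches.
  invalidate(): void {
    this.entry = undefined;
  }
}
```

With a stale time of `5 * 60 * 1000`, changing the table filter repeatedly would reuse the cached status, while a rule mutation would call `invalidate()` and trigger a single refetch on the next read.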
Further, if we look into the details of the execution of detection_engine/rules/prepackages/_status, we'll see more opportunities for optimization. Almost all asynchronous operations of that endpoint get executed serially. Rewriting them to execute in parallel could significantly reduce the total endpoint execution time.
Another option would be to split that endpoint into several smaller ones. For example, one endpoint for statuses of prepackaged rules and another for Elastic/custom rules counters.
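The serial-versus-parallel point can be illustrated with a short sketch. The three lookups and the `StatusParts` shape are hypothetical stand-ins for the endpoint's independent asynchronous operations, not the real route handler. The only assumption is that the operations do not depend on each other's results.

```typescript
// Hypothetical shape of the combined status; field names are illustrative.
type StatusParts = { customRules: number; installedRules: number; timelines: number };
type Lookup = () => Promise<number>;

// Serial: each lookup starts only after the previous one has finished,
// so total latency is roughly the sum of the three.
async function getStatusSerial(a: Lookup, b: Lookup, c: Lookup): Promise<StatusParts> {
  const customRules = await a();
  const installedRules = await b();
  const timelines = await c();
  return { customRules, installedRules, timelines };
}

// Parallel: all lookups start immediately, so total latency is roughly
// that of the slowest one.
async function getStatusParallel(a: Lookup, b: Lookup, c: Lookup): Promise<StatusParts> {
  const [customRules, installedRules, timelines] = await Promise.all([a(), b(), c()]);
  return { customRules, installedRules, timelines };
}
```

Note that `Promise.all` rejects as soon as any lookup rejects, which matches the serial version's behaviour of failing the whole request on the first error.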
Prebuilt rules installation
Analysis of securitySolution rulesTable loadPrebuilt transactions shows this operation has several performance issues.
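Two of the issues listed for this operation concern how many rule writes are in flight at once. A bounded pool keeps that number under a fixed limit. The sketch below is a hypothetical stand-in for a promisePool-style helper. Kibana's own promisePool may have a different signature, and the concurrency value would come from a setting such as MAX_RULES_TO_UPDATE_IN_PARALLEL rather than a literal.

```typescript
// Hypothetical helper: runs `worker` over `items` with at most
// `concurrency` calls in flight, returning results in input order.
async function runWithConcurrency<T, R>(
  items: readonly T[],
  concurrency: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array<R>(items.length);
  let next = 0; // index of the next item that has not been claimed yet

  // Each lane claims the next unprocessed item until none are left.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next;
      next += 1;
      results[index] = await worker(items[index] as T);
    }
  }

  const laneCount = Math.max(1, Math.min(concurrency, items.length));
  const lanes: Promise<void>[] = [];
  for (let i = 0; i < laneCount; i += 1) {
    lanes.push(lane());
  }
  await Promise.all(lanes);
  return results;
}
```

Unlike fixed-size chunking, a lane picks up a new item as soon as its previous one finishes, so one slow write does not hold up the rest of a chunk. In this sketch the first rejected write rejects the whole call. Collecting per-item errors instead would be a reasonable variation.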
- The methods are called serially (installPrepackagedRules -> installPrepackagedTimelines -> updatePrepackagedRules, see source), while those methods do not depend on each other and could be called in parallel.
- The installPrepackagedRules method currently writes all prepackaged rules in parallel, so it sends more than 600 parallel write requests if all rules get installed on a fresh Kibana instance. That could lead to performance problems on less powerful instances. This method should leverage promisePool and respect the MAX_RULES_TO_UPDATE_IN_PARALLEL setting.
- updatePrepackagedRules should use promisePool instead of a custom chunk implementation.
- ... RulesClient level.
Requests for installed integrations
Many different transactions captured requests to installed integrations. However, those requests don't seem to be related to user actions. For example:
That probably indicates that we have misconfigured the request cache and fetch installed integrations more frequently than needed. Integrations do not change often, so we can set the stale time to, for example, 15 minutes, so that we fetch that data at most once every 15 minutes.
Duplicated ML Jobs Summary requests
During the initial loading of the rules management page, we send 2-3 duplicated POST /api/ml/jobs/jobs_summary API requests.
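One common way to avoid duplicates like these is to share a single in-flight promise between concurrent callers. The sketch below is an assumption-laden illustration, not the fix adopted in Kibana: `dedupe` and the key string are hypothetical. For a POST request, the key would also need to encode the request body.

```typescript
// Hypothetical helper: concurrent callers that ask for the same key
// share one in-flight promise instead of sending separate requests.
const inFlight = new Map<string, Promise<unknown>>();

function dedupe<T>(key: string, request: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key);
  if (existing !== undefined) {
    return existing as Promise<T>;
  }
  // Remove the entry once the request settles so later calls fetch again.
  const pending = request().then(
    (value) => {
      inFlight.delete(key);
      return value;
    },
    (error) => {
      inFlight.delete(key);
      throw error;
    }
  );
  inFlight.set(key, pending);
  return pending;
}
```

If several components request the jobs summary during the same page load, only the first call reaches the network and the others receive the same promise. This removes duplicates only while a request is pending. Combining it with a stale time would also cover repeated calls made shortly after the response arrives.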