[Entity Analytics][Risk Engine] Risk Scoring Task #163216

rylnd · 2023-08-04T21:46:25Z

What this PR does

Adds a new Task Manager task, risk_engine:risk_scoring, responsible for invoking the calculateAndPersistRiskScores API defined in the risk scoring service.
- Unlike an alerting task, we do not encrypt/persist an API key for the user. Instead, we use the internal kibana user to query all alerts in the current space.
- The task configuration is stored as part of the existing risk-engine-configuration Saved Object
Extends the risk-engine-configuration SO to include more configuration fields
- Management of this configuration is not currently exposed to the user. They can only enable/disable the entire "Risk Engine" on the Settings -> Entity Risk Score page
- The settings currently serve mainly as the "default" values for task execution, but also as a way for a customer/SA to modify task execution if necessary.
- We expect to be modifying these default values before release, as part of our planned "tuning" stage.

How to Review

Setup:
- The risk engine acts on Detection engine alerts, and so you will need to create:
  1. some "source" data (logs, filebeat, auditbeat, etc)
  2. Rules looking for the above "source" data, and generating alerts
- The risk engine requires two feature flags, currently: riskScoringPersistence and riskScoringRoutesEnabled
- You will also need a Platinum or greater license.

Test that the task executes correctly
1. With the above data set up, navigate to Settings -> Entity Risk Score page, and enable the task by toggling Entity risk scoring to On
2. Within a few minutes, risk scores should be written to the risk score datastream:
  - GET risk-score.risk-score-default/_search
  - Replace default with the name of your current space, as necessary.
3. Disabling/re-enabling the risk engine should trigger another execution of the task (similar to disabling/enabling a DE rule)
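For reviewers working in a non-default space, the data stream name from step 2 can be derived mechanically. This is a minimal sketch, assuming the name always follows the `risk-score.risk-score-<space>` pattern shown above; the helper names are hypothetical and not part of this PR.

```typescript
// Hypothetical helpers: these names are assumptions for illustration only.
// The naming pattern is taken from the `risk-score.risk-score-default` example.
function riskScoreDataStream(spaceId: string = 'default'): string {
  return `risk-score.risk-score-${spaceId}`;
}

// Builds the Dev Tools console request used to check for persisted scores.
function riskScoreSearchRequest(spaceId?: string): string {
  return `GET ${riskScoreDataStream(spaceId)}/_search`;
}

console.log(riskScoreSearchRequest());
console.log(riskScoreSearchRequest('my-space'));
```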
Enable the risk engine in another space
- The engine (and task) can be enabled/executed in any kibana space.
- Because the engine only acts upon alerts in the current space, you will need to first ensure alerts exist in that space.
Validate the data/mappings of persisted risk scores
- Scores are based on the Stage 1 ECS RFC
- There is no UI reading from these scores, currently (but that is introduced in "Risk score from new Risk Engine showing in UI" #163237)

Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to identify risks that should be tested prior to the change/feature release.

When forming the risk matrix, consider some of the following examples and how they may potentially impact the change:

Risk	Probability	Severity	Mitigation/Notes
Multiple Spaces—unexpected behavior in non-default Kibana Space.	Low	High	Integration tests will verify that all features are still supported in non-default Kibana Space and when user switches between spaces.
Multiple nodes—Elasticsearch polling might have race conditions when multiple Kibana nodes are polling for the same tasks.	High	Low	Tasks are idempotent, so executing them multiple times will not result in logical error, but will degrade performance. To test for this case we add plenty of unit tests around this logic and document manual testing procedure.
Code should gracefully handle cases when feature X or plugin Y are disabled.	Medium	High	Unit tests will verify that any feature flag or plugin combination still results in our service operational.
See more potential risk examples

For maintainers

This was checked for breaking API changes and was labeled appropriately

* Adds the basic registration/start code paths, although we still need to `start()` the task from our management route
* Adds some of the task state, although I'm not entirely sure what we want/need there.
* Still need to pull most of the execution parameters from our new Saved Object.

Rather than creating the transforms/indices on behalf of the internal kibana user, we now do so on behalf of the user who initially set up/enabled the risk engine. If we have need for the internal user in the future, we can add that as a separate dependency. This also refactors some of the setup so that we instantiate the data client when it is requested from the request context. We do not currently have a need to create a data client outside of the context of a user request.

After a previous refactor we were no longer using the `user` argument to a lot of these methods, and the `savedObjectsClient` and `namespace` args were being passed around a bunch because they came from `init`. Since we don't currently have a need to instantiate the client/resources outside of the current space, both the namespace and the soClient are now part of the client's constructor. I left the namespace as an argument to a few calls from routes; this should be identical to the property internal to the client, and I don't quite know what it would mean/whether it would work if they were different. The SO client, for example, is coded for the current namespace.

* Renames a test file because it was identical to the production filename and driving me crazy
* Moves some shared logic into utils: SO management, invoking Risk Engine routes, etc.
* Exercises most of the functionality of the task via tests. I'll probably come up with some more here but I think this will be enough to confidently refactor the task code

One (or more) of these tests would occasionally fail when the first batch of 5 scores caused `waitForRiskScores` to return prematurely (before the second batch of 5 scores was available in ES). By specifying a scoreCount, we can address that race condition. Closes elastic#162736.
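The race described above can be illustrated with a generic polling helper. This is a sketch only: `waitForScoreCount`, its parameters, and the `fetchScores` callback are hypothetical and do not reflect the actual signature of `waitForRiskScores` in the test utilities. The point is that the helper returns only once at least `scoreCount` scores are visible, rather than as soon as any scores exist.

```typescript
// Hypothetical polling helper; not the real `waitForRiskScores` signature.
async function waitForScoreCount<T>(
  fetchScores: () => Promise<T[]>, // assumed callback that searches for scores
  scoreCount: number,
  { retries = 50, delayMs = 100 }: { retries?: number; delayMs?: number } = {}
): Promise<T[]> {
  for (let attempt = 0; attempt < retries; attempt++) {
    const scores = await fetchScores();
    // Returning on `scores.length > 0` would reintroduce the race:
    // the first batch could satisfy the check before the second is indexed.
    if (scores.length >= scoreCount) {
      return scores;
    }
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error(`Timed out waiting for ${scoreCount} risk scores`);
}
```

A test expecting two batches of 5 scores would call this with a `scoreCount` of 10.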

nkhristinin

Thanks for addressing the comments, LGTM!

Rather than defining the SO extensions ourselves, we instead build a fake kibana request for the particular namespace we're in, and use the request-based helpers to build a scoped SO client from that request. In this way, we're not coupled to the extensions implementation, and additionally we have the encryption and spaces extensions defined/enabled for us. We need to disable the security extension, however, because there is not a kibana system user that would be able to access these Saved Objects, normally.

I didn't quite understand what these SSL changes were doing, and although they seem to be unused I'm just not going to touch them for now. This was committed in the course of running some things locally against a non-SSL kibana, but that time has passed. Additionally, this reverts the disabling of observability in integration tests. While this was useful for developing locally, it's not something that should be in main.

kibana-ci · 2023-08-24T21:37:34Z

💚 Build Succeeded

Buildkite Build
Commit: 22938f8

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id	before	after	diff
`securitySolution`	4453	4454	+1

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id	before	after	diff
`securitySolution`	15.7MB	15.7MB	+152.0B

History

💔 Build #153323 failed 9abdb47
💔 Build #153300 failed fe475d4
💛 Build #153255 was flaky bcb1cee
💛 Build #153008 was flaky 8687fff
💔 Build #152988 failed 26a440f

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

rylnd force-pushed the risk_engine_task branch 3 times, most recently from 989cc64 to 9107a52 Compare August 10, 2023 00:46

rylnd and others added 27 commits August 10, 2023 15:59

Add some helpers for converting a relative date range to an absolute one

1ca8915

Most queries that use 'now' cannot be cached by elasticsearch. Ref: https://www.elastic.co/guide/en/elasticsearch/reference/current/shard-request-cache.html
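A rough sketch of what such a conversion might look like follows. The names `DateRange`, `resolveDateMath`, and `toAbsoluteRange` are hypothetical, and only a small subset of date math is handled (`now`, `now-<n><unit>`, `now+<n><unit>`, with no rounding such as `now/d`); the helpers added in this commit may differ.

```typescript
// Hypothetical types and helpers, for illustration only.
interface DateRange {
  start: string;
  end: string;
}

const UNIT_MS: Record<string, number> = {
  s: 1000,
  m: 60 * 1000,
  h: 60 * 60 * 1000,
  d: 24 * 60 * 60 * 1000,
  w: 7 * 24 * 60 * 60 * 1000,
};

// Resolves a supported relative expression against a fixed `now`.
function resolveDateMath(expr: string, now: Date): string {
  const match = /^now(?:([+-])(\d+)([smhdw]))?$/.exec(expr);
  if (match === null) {
    return expr; // already absolute (or unsupported): pass through unchanged
  }
  const sign = match[1];
  if (sign === undefined) {
    return now.toISOString(); // plain `now`
  }
  const offsetMs = Number(match[2]) * UNIT_MS[match[3]];
  const shifted = now.getTime() + (sign === '-' ? -offsetMs : offsetMs);
  return new Date(shifted).toISOString();
}

// Converts both ends of a range using the same reference time.
function toAbsoluteRange(range: DateRange, now: Date = new Date()): DateRange {
  return {
    start: resolveDateMath(range.start, now),
    end: resolveDateMath(range.end, now),
  };
}

console.log(toAbsoluteRange({ start: 'now-30d', end: 'now' }));
```

Resolving both ends against a single reference time keeps the range consistent across the queries of one run, and the resulting timestamps contain no `now` for Elasticsearch to re-evaluate.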

Add (and use) configuration from our saved object

61d3fd3

* Adds helper methods for converting data from configuration
* Adds retrieval of SO config and inputs index data to both dataclient and riskScoreService.

Create our initial configuration SO with reasonable default values

3aab2fb

This includes the alerts index for the current space.

WIP: starting task and writing an integration test

f5d8b8d

[CI] Auto-commit changed files from 'node scripts/precommit_hook.js --ref HEAD~1..HEAD --fix'

b63146e

[CI] Auto-commit changed files from 'node scripts/check_mappings_update --fix'

a1a7614

Add risk engine task to CI test runner

2daf715

More accurate integration test name

2b9a8d5

Update integration test

a29b2b2

We added attributes to the SO, here.

Trying to thread through our init route properly

ecf4fea

I do not like the way this is structured, but right now I'm just trying to execute the task so that I can TDD this. However, local ES/Kibana errors are preventing me from running FTRs locally, so I guess I'll wait for CI tonight.

Remove overriding ssl config in our integration tests

8d67ef0

All our tests invoke this with ssl: true, but in the case where we're trying to run tests against e.g. localhost this override prevents us from being able to use a non-SSL connection.

WIP: Moving dependencies around to get enable/disable of task working

af2b245

This all works, now, but the code is super confusing and I hate it. Will be refactoring once integration tests are written.

Disable o11y plugin in our integration tests

50cfb82

What are the odds that we rely on behavior, there? Let's find out!

[CI] Auto-commit changed files from 'node scripts/eslint --no-cache --fix'

a8c59e4

Fix some basic CI checks

f494399

* Remove console statements
* lint improperly spaced comments :(

Test (and fix) removal of the risk scoring task

060a975

I couldn't find a permutation of `removeIfExists` that actually worked, but now that this is under test I'll take another pass later.

Common description for all our risk engine integration tests

f9ac1a9

So that I can run `--grep "Risk Engine"` to run all these.

Remove security dependency from risk engine routes

5670253

We were previously using this to retrieve the username, but that functionality no longer exists.

Clean up Risk Engine routes

865bd63

* Remove unused dependencies
* Better error messages

Refactors task code into component functions

ee0d4e5

* Adds `interval` to SO configuration
* Adds `namespace` to task state so that a per-space task can be executed (theoretically)
* Updates tests

Add TODO

e9725ef

So's I don't forget

[CI] Auto-commit changed files from 'node scripts/check_mappings_update --fix'

c307603

rylnd added 5 commits August 21, 2023 22:01

Merge branch 'main' into risk_engine_task

343ee3c

Add JSDoc for some ambiguously named test helper functions

672fd00

Re-throw errors from taskmanager when enabling the risk scoring task

2879d8a

This way the UI/API can convey it to the user.

Add some tests around enable/disable routes

0dcb486

Better synchronize enabling of risk engine task

7cf14b9

If an error is raised during enablement (either updating the SO or starting the task), we will undo enablement by disabling the engine.
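The enable-then-undo behavior described in this commit could be sketched roughly as follows. The `EngineDeps` interface and its method names are hypothetical stand-ins for the Saved Object update and Task Manager calls, not the actual code in this PR.

```typescript
// Hypothetical dependencies; the real code talks to the configuration
// Saved Object and Task Manager rather than these stand-ins.
interface EngineDeps {
  setEnabled: (enabled: boolean) => Promise<void>; // assumed: updates the config SO
  startTask: () => Promise<void>; // assumed: schedules the risk scoring task
  removeTask: () => Promise<void>; // assumed: removes the task if it exists
}

async function enableWithRollback(deps: EngineDeps): Promise<void> {
  try {
    await deps.setEnabled(true);
    await deps.startTask();
  } catch (error) {
    // Undo any partial enablement by disabling the engine...
    await deps.removeTask();
    await deps.setEnabled(false);
    // ...then re-throw so the route can convey the failure to the user.
    throw error;
  }
}
```

In this sketch a failure inside the rollback itself would mask the original error; whether that matters depends on how the real disable path reports its own failures.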

rylnd force-pushed the risk_engine_task branch from 7250a06 to 7cf14b9 Compare August 23, 2023 04:31

rylnd requested a review from nkhristinin August 23, 2023 04:32

Merge branch 'main' into risk_engine_task

14eabff

nkhristinin approved these changes Aug 23, 2023

rylnd added 7 commits August 23, 2023 11:42

linting: remove redundant await

904a764

Add risk engine test for non-default space

c8c1c12

This required threading through a `namespace` parameter for a few rule helpers, but I made it optional so it should be a minimal change.

Remove outstanding TODO

d160e0d

We do not need to return a schedule here until we have need to override the default interval. When the risk engine is enabled, the task is scheduled with the interval configured in the saved object.

Merge branch 'main' into risk_engine_task

26a440f

Conflicts: src/core/server/integration_tests/saved_objects/migrations/group2/check_registered_types.test.ts

Add persistence feature flag to our cypress tests

acdec7d

Now that we're actually enabling the task when we enable the risk engine, we need this flag enabled, or else we'll get an error because the task wasn't registered on kibana startup.

Remove unnecessary visit step from our test

8687fff

We can navigate directly to the entity analytics management page; we don't need to visit alerts for these tests.

azasypkin reviewed Aug 24, 2023

x-pack/plugins/security_solution/server/lib/risk_engine/tasks/helpers.ts (outdated)

rylnd added 4 commits August 24, 2023 11:18

Rename and document function to make its purpose more clear

80785e2

Merge branch 'main' into risk_engine_task

bcb1cee

Conflicts: x-pack/test/detection_engine_api_integration/utils/create_rule.ts x-pack/test/detection_engine_api_integration/utils/wait_for_rule_status.ts

Update pending tests

fe475d4

* Rescheduling is inherent to the fact that we define an `interval` for our schedule; no need to test that explicitly
* Also removes tests around failure conditions; the entire task run is wrapped in a try/catch such that we don't need to do that ourselves.

rylnd enabled auto-merge (squash) August 24, 2023 18:12

rylnd added 2 commits August 24, 2023 14:06

Merge branch 'main' into risk_engine_task

9abdb47

Merge branch 'main' into risk_engine_task

22938f8

rylnd merged commit 43b0fab into elastic:main Aug 24, 2023

kibanamachine added the v8.11.0 label Aug 24, 2023

rylnd deleted the risk_engine_task branch August 24, 2023 21:56

ymao1 mentioned this pull request Oct 2, 2023

Make risk engine feature flags default true #167778

Merged
