[Entity Analytics][Risk Engine] Risk Scoring Task #163216

Merged
merged 65 commits into from
Aug 24, 2023

Conversation


@rylnd rylnd commented Aug 4, 2023

What this PR does

  • Adds a new Task Manager task, risk_engine:risk_scoring, responsible for invoking the calculateAndPersistRiskScores API defined in the risk scoring service.
    • Unlike an alerting task, we do not encrypt/persist an API key for the user. Instead, we use the internal kibana user to query all alerts in the current space.
    • The task configuration is stored as part of the existing risk-engine-configuration Saved Object
  • Extends the risk-engine-configuration SO to include more configuration fields
    • Management of this configuration is not currently exposed to the user. They can only enable/disable the entire "Risk Engine" on the Settings -> Entity Risk Score page
    • The settings currently serve mainly as the "default" values for task execution, but also as a way for a customer/SA to modify task execution if necessary.
    • We expect to be modifying these default values before release, as part of our planned "tuning" stage.
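As a rough illustration of how the stored configuration could feed the task, here is a minimal sketch. The interface names, the `enabled` field, and the task id format are assumptions for illustration, not the code in this PR; only the task type `risk_engine:risk_scoring`, the `interval` setting, and the `namespace` task state are mentioned here.

```typescript
// Hypothetical shapes: illustrative only, not copied from the Kibana source.
interface RiskEngineConfiguration {
  enabled: boolean;
  interval: string; // e.g. "1h"
}

interface ScheduledTask {
  id: string;
  taskType: string;
  schedule: { interval: string };
  state: { namespace: string };
}

const TASK_TYPE = 'risk_engine:risk_scoring';

// Builds one task description per space. The id format is an assumption.
function buildRiskScoringTask(
  namespace: string,
  config: RiskEngineConfiguration
): ScheduledTask | undefined {
  if (!config.enabled) {
    return undefined; // nothing to schedule while the engine is disabled
  }
  return {
    id: `${TASK_TYPE}:${namespace}`,
    taskType: TASK_TYPE,
    schedule: { interval: config.interval },
    state: { namespace },
  };
}
```

Keeping the namespace in both the id and the state means each space gets its own task instance, driven by that space's saved configuration.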

How to Review

  • Setup:
    • The risk engine acts on Detection engine alerts, and so you will need to create:
      1. some "source" data (logs, filebeat, auditbeat, etc)
      2. Rules looking for the above "source" data, and generating alerts
    • The risk engine requires two feature flags, currently: riskScoringPersistence and riskScoringRoutesEnabled
    • You will also need a Platinum or greater license.
  1. Test that the task executes correctly
    1. With the above data set up, navigate to Settings -> Entity Risk Score page, and enable the task by toggling Entity risk scoring to On
    2. Within a few minutes, risk scores should be written to the risk score datastream:
      • GET risk-score.risk-score-default/_search
      • Replace default with the name of your current space, as necessary.
    3. Disabling/re-enabling the risk engine should trigger another execution of the task (similar to disabling/enabling a DE rule)
  2. Enable the risk engine in another space
    • The engine (and task) can be enabled/executed in any kibana space.
    • Because the engine only acts upon alerts in the current space, you will need to first ensure alerts exist in that space.
  3. Validate the data/mappings of persisted risk scores
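For reviewers checking several spaces, the search target from step 1.2 can be derived from the space name. This is a small sketch assuming the datastream name always follows the `risk-score.risk-score-<space>` pattern shown above; the helper names are hypothetical.

```typescript
// Assumption: the datastream name follows the pattern in the example request.
const riskScoreIndexForSpace = (spaceId: string = 'default'): string =>
  `risk-score.risk-score-${spaceId}`;

// Path to use with a GET request, e.g. in Dev Tools.
const riskScoreSearchPath = (spaceId?: string): string =>
  `/${riskScoreIndexForSpace(spaceId)}/_search`;
```

For example, `riskScoreSearchPath('my-space')` yields `/risk-score.risk-score-my-space/_search`.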

Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to identify risks that should be tested prior to the change/feature release.

When forming the risk matrix, consider some of the following examples and how they may potentially impact the change:

| Risk | Probability | Severity | Mitigation/Notes |
| --- | --- | --- | --- |
| Multiple Spaces: unexpected behavior in non-default Kibana Space. | Low | High | Integration tests will verify that all features are still supported in non-default Kibana Space and when user switches between spaces. |
| Multiple nodes: Elasticsearch polling might have race conditions when multiple Kibana nodes are polling for the same tasks. | High | Low | Tasks are idempotent, so executing them multiple times will not result in logical error, but will degrade performance. To test for this case we add plenty of unit tests around this logic and document manual testing procedure. |
| Code should gracefully handle cases when feature X or plugin Y are disabled. | Medium | High | Unit tests will verify that any feature flag or plugin combination still leaves our service operational. |

@rylnd rylnd force-pushed the risk_engine_task branch 3 times, most recently from 989cc64 to 9107a52 Compare August 10, 2023 00:46
rylnd and others added 27 commits August 10, 2023 15:59
* Adds the basic registration/start code paths, although we still need to
  `start()` the task from our management route
* Adds some of the task state, although I'm not entirely sure what we
  want/need there.
* Still need to pull most of the execution parameters from our new Saved
  Object.
Rather than creating the transforms/indices on behalf of the internal
kibana user, we now do so on behalf of the user who initially set
up/enabled the risk engine.

If we have need for the internal user in the future, we can add that as
a separate dependency.

This also refactors some of the setup so that we instantiate the data
client when it is requested from the request context. We do not
currently have a need to create a data client outside of the context of
a user request.
After a previous refactor we were no longer using the `user` argument to
a lot of these methods, and the `savedObjectsClient` and `namespace`
args were being passed around a bunch because they came from `init`.

Since we don't currently have a need to instantiate the client/resources
outside of the current space, both the namespace and the soClient are
now part of the client's constructor.

I left the namespace as an argument to a few calls from routes; this
should be identical to the property internal to the client, and I don't
quite know what it would mean/whether it would work if they were
different. The SO client, for example, is coded for the current
namespace.
* Adds helper methods for converting data from configuration
* Adds retrieval of SO config and inputs index data to both dataclient
  and riskScoreService.
This includes the alerts index for the current space.
We added attributes to the SO, here.
I do not like the way this is structured, but right now I'm just trying
to execute the task so that I can TDD this.

However, local ES/Kibana errors are preventing me from running FTRs
locally, so I guess I'll wait for CI tonight.
All our tests invoke this with ssl: true, but in the case where we're
trying to run tests against e.g. localhost this override prevents us
from being able to use a non-SSL connection.
This all works, now, but the code is super confusing and I hate it. Will
be refactoring once integration tests are written.
What are the odds that we rely on behavior, there? Let's find out!
* Renames a test file because it was identical to the production
  filename and driving me crazy
* Moves some shared logic into utils: SO management, invoking Risk
  Engine routes, etc.
* Exercises most of the functionality of the task via tests. I'll
  probably come up with some more here but I think this will be enough
  to confidently refactor the task code
* Remove console statements
* lint improperly spaced comments :(
One (or more) of these tests would occasionally fail when the first batch
of 5 scores caused `waitForRiskScores` to return prematurely (before the
second batch of 5 scores was available in ES).

By specifying a scoreCount, we can address that race condition.

Closes elastic#162736.
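The race condition described in the commit above (returning after the first batch of 5 scores) can be avoided by polling until an expected count is reached. The following is a sketch of that idea only; `waitForScoreCount`, its options, and the fetch callback are hypothetical and are not the test utility from this PR.

```typescript
type FetchScores = () => Promise<unknown[]>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls until at least `scoreCount` scores are available, or gives up.
async function waitForScoreCount(
  fetchScores: FetchScores,
  scoreCount: number,
  { retries = 20, delayMs = 10 } = {}
): Promise<unknown[]> {
  for (let attempt = 0; attempt < retries; attempt++) {
    const scores = await fetchScores();
    if (scores.length >= scoreCount) {
      return scores; // every expected batch has landed
    }
    await sleep(delayMs);
  }
  throw new Error(`Timed out waiting for ${scoreCount} risk scores`);
}
```

Without the explicit count, any non-empty result would satisfy the wait, which is how a partial first batch could end it early.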
I couldn't find a permutation of `removeIfExists` that actually worked,
but now that this is under test I'll take another pass later.
So that I can run `--grep "Risk Engine"` to run all these.
We were previously using this to retrieve the username, but that
functionality no longer exists.
* Remove unused dependencies
* Better error messages
* Adds `interval` to SO configuration
* Adds `namespace` to task state so that a per-space task can be
  executed (theoretically)
* Updates tests
So's I don't forget
@rylnd rylnd force-pushed the risk_engine_task branch from 7250a06 to 7cf14b9 Compare August 23, 2023 04:31
@rylnd rylnd requested a review from nkhristinin August 23, 2023 04:32
@nkhristinin nkhristinin left a comment

Thanks for addressing the comments, LGTM!

rylnd added 7 commits August 23, 2023 11:42
This required threading through a `namespace` parameter for a few rule
helpers, but I made it optional so it should be a minimal change.
We do not need to return a schedule here until we have need to
override the default interval. When the risk engine is enabled,
the task is scheduled with the interval configured in the saved object.
Rather than defining the SO extensions ourselves, we instead build a
fake kibana request for the particular namespace we're in, and use the
request-based helpers to build a scoped SO client from that request. In
this way, we're not coupled to the extensions implementation, and
additionally we have the encryption and spaces extensions
defined/enabled for us.

We need to disable the security extension, however, because there is not
a kibana system user that would be able to access these Saved Objects,
normally.
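To make the approach in the commit above concrete, here is a toy model of it: build a request scoped to a namespace, then derive a client from it with the security extension excluded. Every type and function below is a stand-in written for this sketch; Kibana's real request and Saved Objects APIs have different shapes.

```typescript
// Stand-in types only; these are not Kibana's interfaces.
type Extension = 'encryption' | 'security' | 'spaces';
const ALL_EXTENSIONS: Extension[] = ['encryption', 'security', 'spaces'];

interface FakeRequest {
  headers: Record<string, string>;
  spaceId: string;
}

interface ScopedClient {
  namespace: string;
  extensions: Extension[];
}

// A request with no user credentials, tied to a single space.
const buildFakeRequest = (namespace: string): FakeRequest => ({
  headers: {},
  spaceId: namespace,
});

// Derives a client from the request, leaving out any excluded extensions.
const getScopedClient = (request: FakeRequest, excluded: Extension[] = []): ScopedClient => ({
  namespace: request.spaceId,
  extensions: ALL_EXTENSIONS.filter((ext) => !excluded.includes(ext)),
});

const client = getScopedClient(buildFakeRequest('my-space'), ['security']);
```

In this model the client still has the encryption and spaces extensions, and only security is dropped, which mirrors the trade-off described in the commit message.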
 Conflicts:
	src/core/server/integration_tests/saved_objects/migrations/group2/check_registered_types.test.ts
Now that we're actually enabling the task when we enable the risk
engine, we need this flag enabled, or else we'll get an error because
the task wasn't registered on kibana startup.
We can navigate directly to the entity analytics management page; we
don't need to visit alerts for these tests.
rylnd added 4 commits August 24, 2023 11:18
I didn't quite understand what these SSL changes were doing, and
although they seem to be unused I'm just not going to touch them for
now. This was committed in the course of running some things locally
against a non-SSL kibana, but that time has passed.

Additionally, this reverts the disabling of observability in integration
tests. While this was useful for developing locally, it's not something
that should be in main.
 Conflicts:
	x-pack/test/detection_engine_api_integration/utils/create_rule.ts
	x-pack/test/detection_engine_api_integration/utils/wait_for_rule_status.ts
* Rescheduling is inherent to the fact that we define an `interval` for our
  schedule; no need to test that explicitly
* Also removes tests around failure conditions; the entire task run is
  wrapped in a try/catch such that we don't need to do that ourselves.
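A wrapper of the kind that commit refers to might look like the sketch below. It is an assumption about the general pattern, not the actual wrapping code: `runSafely` and `RunResult` are hypothetical names, and where the real try/catch lives is not stated here.

```typescript
interface RunResult<S> {
  state: S;
  error?: Error;
}

// Runs one task execution; on failure, keeps the previous state and reports the error.
async function runSafely<S>(state: S, run: (state: S) => Promise<S>): Promise<RunResult<S>> {
  try {
    return { state: await run(state) };
  } catch (err) {
    return { state, error: err instanceof Error ? err : new Error(String(err)) };
  }
}
```

With a wrapper like this around every run, a failed execution leaves the state unchanged and the next interval-driven run proceeds as usual, so failure paths do not need their own task-level tests.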
@rylnd rylnd enabled auto-merge (squash) August 24, 2023 18:12
@kibana-ci

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
securitySolution 4453 4454 +1

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
securitySolution 15.7MB 15.7MB +152.0B

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

Labels
  • 8.11 candidate
  • backport:skip (This commit does not require backporting)
  • Feature:Entity Analytics (Security Solution Entity Analytics features)
  • release_note:skip (Skip the PR/issue when compiling release notes)
  • Team:Detection Engine (Security Solution Detection Engine Area)
  • Team: SecuritySolution (Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc.)
  • Theme: entity_analytics
  • v8.11.0
8 participants