
[Security Solution] PoC of Prebuilt Detection Rules package with historical versions #137420

Closed
6 of 8 tasks
Tracked by #174166
xcrzx opened this issue Jul 28, 2022 · 17 comments
Labels: 8.7 candidate, Feature:Prebuilt Detection Rules, Team:Detection Rule Management, Team:Detections and Resp, Team: SecuritySolution, v8.7.0

Comments

xcrzx (Contributor) commented Jul 28, 2022

Epic: https://github.com/elastic/security-team/issues/1974 (internal)
Background info:

Summary

To allow users to customize prebuilt detection rules, we need to find a way to distribute the rules with all their historical versions. Since our main rule distribution method is a Fleet package, we need to ensure that it can handle the increased number of saved objects in the package.


  1. We need to create a copy of the Prebuilt Security Detection Rules Fleet package containing all historical versions of prepackaged rules.
  2. Include the rule version in the names of documents stored in the package: security_rule/[ruleId]:[ruleVersion].json (a concrete example follows this list)
  3. Include the rule version in the saved object ids specified in each rule JSON file inside the package:
    {
      "id": "[ruleId]:[ruleVersion]",
      "type": "security-rule",
      "attributes": {...}
    }
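
For illustration, a rule that has shipped in two versions would appear in the package roughly like this (the ruleId is borrowed from the asset listing later in this thread; the version numbers 104 and 105 are hypothetical):

```
kibana/security_rule/000047bb-b27a-47ec-8b62-ef1a5d2c9e19:104.json
kibana/security_rule/000047bb-b27a-47ec-8b62-ef1a5d2c9e19:105.json
```

Each file would carry the matching saved object id, e.g. "id": "000047bb-b27a-47ec-8b62-ef1a5d2c9e19:105".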

Why we need that change

Currently, we allow a limited set of modifications to prebuilt detection rules. Users can modify only rule exceptions and actions. Other rule fields, like description, query, tags, etc., are not modifiable, which creates unavoidable inconvenience for our users. We call this constraint "rule immutability". We want to remove the immutability constraint of prebuilt rules, allow users to make any necessary adjustments, and still receive rule updates.

More on rule customization in the Architecture Design Document. We recommend reading the following relevant sections:

  • Summary
  • How it’s going to look
  • Possible approaches

Additional context can be found in the epic and the documents linked from it.

Todo

Verify that:

  • It is possible to get the latest prepackaged rule version for a given ruleId (a query sketch is shown after this list).
  • It is possible to access the content of any historical prepackaged rule version for a given ruleId.
  • The package installation time is acceptable for the current number of rules, and Kibana SO read/write operations don't suffer from performance degradation.
  • The approach is scalable. Test a scenario with 10000 rules, 100 historical versions each.
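
A minimal sketch (in TypeScript) of how the first two checks could look against installed security-rule saved objects, assuming the [ruleId]:[ruleVersion] id convention above and assuming the rule attributes expose rule_id and version fields. The attribute names and the import path are assumptions, not the final implementation:

```ts
import type { SavedObjectsClientContract } from '@kbn/core/server';

interface SecurityRuleAttributes {
  rule_id: string;
  version: number;
  // ...the rest of the rule fields
}

// Latest historical version for a given ruleId: sort on the `version` attribute.
export async function getLatestRuleVersion(
  soClient: SavedObjectsClientContract,
  ruleId: string
) {
  const { saved_objects: found } = await soClient.find<SecurityRuleAttributes>({
    type: 'security-rule',
    filter: `security-rule.attributes.rule_id: "${ruleId}"`,
    sortField: 'version',
    sortOrder: 'desc',
    perPage: 1,
  });
  return found[0];
}

// Any historical version can be addressed directly through the proposed id scheme.
export function getRuleVersion(
  soClient: SavedObjectsClientContract,
  ruleId: string,
  version: number
) {
  return soClient.get<SecurityRuleAttributes>('security-rule', `${ruleId}:${version}`);
}
```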

Reach out to stakeholders for feedback (consider doing it via opening an RFC):

  • @elastic/kibana-core on the scalability of storing historical versions in security-rule Saved Objects
  • @elastic/fleet on the scalability of storing historical versions in the fleet package
  • @elastic/threat-research-and-detection-engineering on adding historical rule versions to every new Prebuilt Security Detection Rules package and changes in its build system required to support that
  • @elastic/response-ops-ram on consistency with the Alerting Framework in the future
xcrzx added the Team:Detections and Resp, Team: SecuritySolution, Team:Detection Rule Management, and 8.5 candidate labels on Jul 28, 2022
elasticmachine (Contributor) commented:

Pinging @elastic/security-detections-response (Team:Detections and Resp)

elasticmachine (Contributor) commented:

Pinging @elastic/security-solution (Team: SecuritySolution)

xcrzx (Contributor, Author) commented Aug 17, 2022

I have created a Fleet package containing a large number of detection rules to verify how this approach scales. Here are some limitations that I've encountered along the way.

Max files per package

If we try to build a package with many files using elastic-package build, we could encounter the following error:

Error: building package failed: invalid content found in built zip package: found 1 validation error:
   1. folder [/rules-poc/build/packages/random_detection_rules-0.0.1.zip/kibana/security_rule] exceeds the limit of 65535 files

This limitation is not that significant for our use case: the current number of detection rules in Security Solution, including all historical versions created over the past 2.5 years, is ~4200. So we have a substantial buffer before reaching the 65k file limit.

Saved objects import limit

Further, if we try to install a package that contains more than 10000 saved objects, the installation fails with the following error:

Error installing random_detection_rules 0.0.1: Can't import more than 10000 objects

The max number of objects to import is controlled by the savedObjects.maxImportExportSize config option in kibana.yml. However, if we set that option higher and try to install the package again, we encounter another error:

Error installing random_detection_rules 0.0.1: The number of nested documents has exceeded the allowed limit of [10000].
This limit can be set by changing the [index.mapping.nested_objects.limit] index level setting.

We can try to further tweak Elasticsearch settings, but it doesn't seem practical for our use case as the solution won't be universal. So installing more than 10000 saved objects is not a good option, and we need to consider alternatives.

Alternatives

The current package installation method implies that we import package assets as saved objects and persist them locally. But on the solution side, we only need a fraction of the assets to be available. E.g., if a given detection rule has 10 versions, we don't need to read all of them; we only need two versions to build a diff (more on diffs in #137446). So we could probably read package assets directly without installing them.

An EPR API already allows us to read the list of package assets: GET https://epr-snapshot.elastic.co:443/package/security_detection_engine/8.1.1/.

{
    "name": "security_detection_engine",
    "title": "Prebuilt Security Detection Rules",
    "version": "8.1.1",
    "assets": [
        "/package/security_detection_engine/8.1.1/NOTICE.txt",
        "/package/security_detection_engine/8.1.1/changelog.yml",
        "/package/security_detection_engine/8.1.1/manifest.yml",
        "/package/security_detection_engine/8.1.1/docs/README.md",
        "/package/security_detection_engine/8.1.1/kibana/security_rule/000047bb-b27a-47ec-8b62-ef1a5d2c9e19.json",
        "/package/security_detection_engine/8.1.1/kibana/security_rule/00140285-b827-4aee-aa09-8113f58a08f3.json",
        "/package/security_detection_engine/8.1.1/kibana/security_rule/0022d47d-39c7-4f69-a232-4fe9dc7a3acd.json",
        "/package/security_detection_engine/8.1.1/kibana/security_rule/0136b315-b566-482f-866c-1d8e2477ba16.json",
        "/package/security_detection_engine/8.1.1/kibana/security_rule/015cca13-8832-49ac-a01b-a396114809f6.json"
    ]
}

Given the list of package assets, we could encode rule versions in file names and read only the rules we know were updated. That way, we could skip saved object installation altogether.
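
A rough sketch of how the solution side could filter that asset list (all assumptions: the global fetch API is available, rule file names follow the [ruleId]:[ruleVersion].json convention proposed above, and the listing endpoint shown earlier stays accessible):

```ts
const EPR_BASE = 'https://epr-snapshot.elastic.co';

interface PackageInfo {
  name: string;
  version: string;
  assets: string[];
}

// Return the rule asset paths whose encoded version is newer than what is installed.
export async function findUpdatedRuleAssets(
  pkgVersion: string,
  installedVersions: Map<string, number> // ruleId -> currently installed rule version
): Promise<string[]> {
  const res = await fetch(`${EPR_BASE}/package/security_detection_engine/${pkgVersion}/`);
  const pkg = (await res.json()) as PackageInfo;

  return pkg.assets.filter((path) => {
    // Only rule assets named `[ruleId]:[ruleVersion].json` are of interest here.
    const match = path.match(/security_rule\/(.+):(\d+)\.json$/);
    if (!match) return false; // README, changelog, manifest, etc.
    const [, ruleId, version] = match;
    const installed = installedVersions.get(ruleId);
    return installed === undefined || Number(version) > installed;
  });
}
```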

@elastic/fleet could you please take a look at that approach? Do you have any concerns or limitations associated with it? It could potentially increase the number of requests Kibana makes to epr.elastic.co, as instead of downloading a single .zip archive we'll start fetching .json files one by one.

hop-dev (Contributor) commented Aug 17, 2022

The 10k limit exists because our installation saved object tracks all the installed assets as part of a nested document here. An alternative approach could be to look at a more efficient way of recording these rules on the installation object, one that does not take 1 nested doc per rule but still allows us to find the rules when we come to uninstall the package.

The alternative proposal of not installing them at all would probably need a change to the package spec to add a way to indicate to Fleet not to install the assets; currently we attempt to install everything in the kibana directory. Though really, if these are never going to be installed in Kibana, maybe they should be a new kind of asset altogether and not live in the kibana folder?

xcrzx (Contributor, Author) commented Aug 17, 2022

Though really, if these are never going to be installed in Kibana, maybe they should be a new kind of asset altogether and not live in the kibana folder?

@hop-dev Yeah, that makes sense to me. Would it be a significant change on the Fleet side if we were to introduce a new asset type? And what would be the best way to approach that change? Do we need to create a ticket/proposal to start the discussion?

Meanwhile, I'll continue this PoC to see if we could work efficiently (performance-wise, etc.) with package assets without installing them on the solution side.

joshdover (Contributor) commented Aug 23, 2022

This limit can be set by changing the [index.mapping.nested_objects.limit] index level setting.

This error points to a likely problematic mapping type in your mappings for security rules. We have other saved object types that have scaled to 100k+ objects without any problems, so I don't think this is a fundamental problem with saved objects. I'd take a good look at your mapping types and see if there's anything that could be tweaked there.

That said, I also like @hop-dev's suggestion of putting this into some other opaque saved object type to contain the history. You probably don't need all of the same fields to be mapped on these historical rules. Adding new SO types to the package-spec and Fleet's installation logic is pretty low-effort.

An EPR API already allows us to read the list of package assets GET https://epr-snapshot.elastic.co:443/package/security_detection_engine/8.1.1/.

We are actually planning to remove this direct file access API very soon and already have some beta/testing versions of the registry available where this API is removed. I will DM you the internal email thread about this change if you'd like to add any feedback.

Packages should be completely self-contained and the package registry is not currently intended to be used in this manner. I'd like to explore other options if viable before we consider adding this API back.

xcrzx (Contributor, Author) commented Aug 24, 2022

Expanded the description with the "Why we need that change" section for folks who need more context. You can find even more background here, but it contains too many unrelated details.

xcrzx (Contributor, Author) commented Aug 24, 2022

@joshdover, @banderror, and I have synced on the current package limitations and discussed different approaches we could use to overcome them.

  • Option 1: keep the possibility to directly read package files without installing them as described in the comment above. It is probably the most promising approach, but it requires more consideration and consultation with people working on the package registry — especially considering the plans to disable direct file access.
  • Option 2: pack all historical rule versions into a single file/saved object to minimize the number of saved objects to install. That approach would require reading all rule versions into memory to determine whether there's an update or not. That operation could consume a lot of application memory, as all historical versions could take up to 100 MB, so it could be problematic for some setups (a rough sketch of this option is shown after this list).
  • Option 3: find a way to work around the 10k saved objects limit. That could be relatively easily done on the Fleet side by removing mappings for the installed_kibana field, but still, we will have to find a way to bypass the 10k objects import limit on the core framework level. On top of that, we'll need to ensure that the package installation doesn't take too long, i.e., doesn't fail with a timeout.
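
For reference, a rough sketch of the data shape Option 2 implies; the names and structure here are hypothetical, not a committed design:

```ts
// One saved object per rule, bundling every historical version it has shipped with.
interface RuleVersionsBundle {
  rule_id: string;
  versions: Array<{
    version: number;
    rule: Record<string, unknown>; // full historical rule payload
  }>;
}

// The whole bundle has to be read into memory to pick the two versions a diff needs.
function pickDiffVersions(bundle: RuleVersionsBundle, baseVersion: number) {
  const base = bundle.versions.find((v) => v.version === baseVersion);
  const latest = bundle.versions.reduce((a, b) => (b.version > a.version ? b : a));
  return { base, latest };
}
```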

joshdover (Contributor) commented:

@mtojek I'd like to get your input on this one.

In thinking about how to support this type of use case (looking up old versions of a package asset in order to support a 3-way merge for user customizations), it does seem like the asset download API would be quite advantageous over an ever-growing package that contains all of the historical rules. Let me know if we should chat in more detail about this.

mtojek (Contributor) commented Aug 25, 2022

@joshdover Thanks for the ping!

I'm on the fence, to be honest, as the intention for the "arbitrary files API" is to expose only static resources like docs or images, i.e. artifacts used to render the UI. All package configuration is retrieved from package revisions. On the other hand, we could expand the set of extractable assets and extract them from already published packages. It shouldn't be a big deal for us. As I've written in the email thread, I wouldn't mark this as a blocker for the Package Storage v2 migration, as we can iterate on this.

I'm sure that you have had a deep discussion around this topic, and I wouldn't like to propose "yet another" idea, especially since I'm not working closely with security rules. Looking at the thread, it seems that rules should be handled similarly to the Git model: a user can modify any file and merge/overwrite it with remote changes. Let's consider the following case: if the user's integration is 100 revs behind the latest package revision, Kibana will have to call the Package Storage 100 times (assuming that each revision is stored in its own file), which will take time. I'm not sure option 2 isn't the best choice in terms of predictability and stable implementation. 100 MB doesn't sound like a big issue considering it's only for the purpose of rules conflict resolution.

Option 4: if you don't want to keep that rules-bundle file in the package, maybe we could park it somewhere close to it in the Package Storage? We'll have to figure out something for air-gapped environments.

jsoriano (Member) commented Aug 25, 2022

I think that a package should work the same independently of its source. Making a package depend on its historical versions may bring problems:

  • The package will not work the same if a different set of historical versions is available; that could happen when it is used in air-gapped environments, in development stages, or if the package is ever bundled in Kibana or other products.
  • If a version has a problem, we may need to introduce another mechanism to ignore these problematic versions.
  • If a version has to be removed due to reasons out of our control (like legal or security reasons), we may find situations where users cannot upgrade.

Also, the use of the asset download API brings security concerns. We are trying to introduce signed packages, but there is no way in this API to ensure that the asset is signed, or that it comes without any unexpected modification from a signed package. This is especially relevant for a security solution: an attacker who could alter access to this API could send fake historical assets that produce diffs that turn the detection rules into no-ops.

I think, though, that there are alternatives to including all historical rules in the package or anywhere else:

  • Would it be enough for the diff-building algorithm to have only the versions installed by the user? E.g., if a user has version v1 installed and wants to update to version v4, would it be enough to compare the assets in v1 and v4? Then Kibana would only need to keep a copy of v1 while updating the assets to v4.
  • Use of migrations (as is common practice in database schema migrations): packages don't contain whole copies of the required assets, but the differences between versions, so the current version can be reconstructed from an older one. That would mean storing the diffs in the package instead of the whole assets.

xcrzx (Contributor, Author) commented Aug 26, 2022

@mtojek, @jsoriano Thanks for taking the time to look into the issue. I'll try to answer your questions, but it would probably be more productive to have a separate meeting to ensure we're on the same page.

Let's consider the following case: if the user's integration is 100 revs behind the latest package revision, Kibana will have to call the Package Storage 100 times (assuming that each revision is stored in its own file), which will take time.

We'll need to fetch only two versions to build a diff: the latest available version and the original "forked" rule version. I.e., we'll call the Package Storage only two times. Because we store the entire rule object as a historical version, any versions in between the two we are comparing are unnecessary for the diff algorithm.
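
For context, here is a minimal sketch of the field-level comparison those two fetched versions feed into, together with the user's customized rule. It is a simplified illustration of the three-way merge idea mentioned earlier in this thread, not the actual algorithm from #137446:

```ts
type RuleFields = Record<string, unknown>;

type FieldOutcome =
  | { kind: 'unchanged' }
  | { kind: 'take_target' } // only the package changed the field
  | { kind: 'keep_current' } // only the user changed the field
  | { kind: 'conflict'; current: unknown; target: unknown }; // both changed it differently

// `base` is the original "forked" rule version, `current` is the user's customized rule,
// and `target` is the latest version shipped in the package.
function diffField(base: unknown, current: unknown, target: unknown): FieldOutcome {
  const userChanged = JSON.stringify(current) !== JSON.stringify(base);
  const packageChanged = JSON.stringify(target) !== JSON.stringify(base);
  if (!userChanged && !packageChanged) return { kind: 'unchanged' };
  if (!userChanged) return { kind: 'take_target' };
  if (!packageChanged) return { kind: 'keep_current' };
  // Both sides changed the field; identical changes are not a conflict.
  if (JSON.stringify(current) === JSON.stringify(target)) return { kind: 'unchanged' };
  return { kind: 'conflict', current, target };
}

function threeWayDiff(base: RuleFields, current: RuleFields, target: RuleFields) {
  const fields = new Set([...Object.keys(base), ...Object.keys(current), ...Object.keys(target)]);
  return Object.fromEntries(
    [...fields].map((field) => [field, diffField(base[field], current[field], target[field])])
  );
}
```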

I'm not sure option 2 isn't the best choice in terms of predictability and stable implementation. 100 MB doesn't sound like a big issue considering it's only for the purpose of rules conflict resolution.

Yeah, that option could work out well. We will probably build an experiment to measure the performance, decide whether it is okay for us, and compare it with other available options.

If you don't want to keep that rules-bundle file in a package, maybe we could park it somewhere close to it in the Package Storage? We'll have to figure out something for air-gapped environments.

Could you please elaborate on your proposal? I'm not sure I understand it. As for air-gapped systems, we have a filesystem-based distribution method for detection rules: we bundle all rules with Kibana and use them as the default fallback if rules from Fleet are unavailable. That said, we want to reconsider the rule distribution method in the future and migrate to a Fleet package that gets installed at build time.

I think that a package should work the same independently of its source. Making a package depend on its historical versions may bring problems

The package itself would not depend on its historical versions. What we are planning to do is to start bundling all released rule versions together as package assets. We'll also add a graceful degradation mechanism on the solution side, so the rule upgrade process will work even if some assets are missing or clients continue to use outdated packages.

We are trying to introduce signed packages, but there is no way in this API to ensure that the asset is signed, or that it comes without any unexpected modification from a signed package. This is especially relevant for a security solution: an attacker who could alter access to this API could send fake historical assets that produce diffs that turn the detection rules into no-ops.

Yeah, thanks for highlighting that. We'll have to weigh whether it poses a security risk for us. Ultimately, a human operator makes the final decision on whether rules need to be updated or not. If the detection rules package were compromised, it would be clearly visible in the UI, so the risk might not be that high.

Would it be enough for the diff-building algorithm to have only the versions installed by the user? e.g. if a user has the version v1 installed, and wants to update to version v4, would it be enough with comparing the assets in v1 and v4? Then kibana would only need to keep a copy of v1 while updating the assets to v4.

For the diff algorithm, yes, two versions would be enough. But we also plan to add the ability to roll back to any historical rule version, so we still have to be able to access all rule versions somehow.

Use of migrations (as the common practice used in database schema migrations): packages don't contain whole copies of the required assets, but the differences between versions, so from an older version, the current version can be reconstructed. That would mean to store in the package the diffs, instead of the whole assets.

I'm not sure what problem that would solve; the total number of objects that we store seemingly wouldn't change. We did consider storing diffs in the early stages of technical discussions, but we decided not to follow that path as it offers no clear advantages. Moreover, the reconstruction algorithm itself is much harder to implement and also computationally more expensive.

banderror (Contributor) commented:

@mtojek @jsoriano thank you for your feedback! I agree with @xcrzx; let's schedule a meeting so we can answer your questions and resolve any confusion and concerns around this PoC.

I also feel like we may not have done the best job explaining our needs and goals in the first place. I updated the ticket description and added more information to the documents mentioned in it. I hope it helps with getting more context before the meeting.

@joshdover Dmitrii will schedule something on Monday, please join as well if you have time!

rudolf (Contributor) commented Aug 31, 2022

removing mappings for the installed_kibana field, but still, we will have to find a way to bypass the 10k objects import limit on the core framework level.

Yes, the 10k limit is imposed by Elasticsearch on nested fields. We could theoretically override this index-level setting, but the limit is probably there for a good reason, so we would have to follow up with the Elasticsearch team to understand any repercussions this might have.

The 10k objects import limit is a safeguard for protecting the Kibana server's memory, because all imported objects are loaded into memory. At the time we couldn't agree on the best way to protect memory, so we also added an additional maxImportPayloadBytes option (#91172 (comment)). I'm not sure the maxImportExportSize limit is really effective any more, because canvas workpads can be > 12 MB, which would result in 120 GB. All this to say, I don't think this should be the blocker.

Having said that, adding 10k (or 1M for your worst-case scenario of "10000 rules, 100 historical versions each") saved objects doesn't feel like the right data structure for what we're trying to achieve. I don't fully understand all the details, but it seems to me we should be able to have a dedicated "rule history" saved object type that contains, e.g., 100 revisions of a rule in a format that's more compact than a saved object per rule revision.

xcrzx (Contributor, Author) commented Sep 6, 2022

Hey @rudolf, thanks for joining the discussion!

Yes, the 10k limit is imposed by Elasticsearch on nested fields. We could theoretically override this index-level setting, but the limit is probably there for a good reason, so we would have to follow up with the Elasticsearch team to understand any repercussions this might have.

I think it would be easier to remove the mapping for the installed_kibana field, as it is currently not used, so that we don't need to meddle with the Elasticsearch settings at all.
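
A sketch of what that could look like on the Fleet installation saved object type; the field name comes from the discussion above, while the exact mapping definition is an assumption on my part:

```ts
// `installed_kibana` is currently mapped as `nested`, so every installed asset becomes a
// nested document and counts against `index.mapping.nested_objects.limit` (10k by default).
// Since the field is never queried or aggregated on, it could instead be kept in `_source`
// only and excluded from indexing:
const installedKibanaMapping = {
  installed_kibana: {
    type: 'object',
    enabled: false, // store in _source, do not index or create nested documents
  },
} as const;
```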

At the time we couldn't agree on the best way to protect memory, so we also added an additional maxImportPayloadBytes option (#91172 (comment))

Thanks, that setting could be very useful for our use case. That said, I think we should measure the impact of importing large sets of saved objects on memory consumption. If it turns out to be too high, we will probably need to import assets in chunks to protect Kibana from potential OOMs.

I don't fully understand all the details but it seems to me like we should be able to have a dedicated "rule history" saved object type that could contain e.g. 100 revisions of that rule in a format that's more compact than a saved object per rule revision.

We are considering different options, including storing all historical rule versions in a single saved object. All of them have their pros and cons. For example, the business logic we will implement will require reading individual rule versions, such as the ones matching a specific semver range. Storing all versions in separate saved objects would allow us to read them efficiently in one database request. Otherwise, we would need to implement more complex logic, like reading all rule versions and filtering them in memory. We would also be constrained in the types of requests we can execute against the "compressed" structure; for example, we likely wouldn't be able to aggregate data efficiently if our business logic needs that in the future. That's why we are trying to consider the alternatives first.

@rudolf What do you consider a significant number of saved objects that could affect Kibana's performance? And what are the implications of having, let's say, 20000-30000 saved objects?

rudolf (Contributor) commented Sep 6, 2022

So storing all versions in separate saved objects would allow us to read them efficiently in one database request.

Yes, it makes sense to explore the tradeoff here between how much of the processing happens in Elasticsearch vs Kibana.

What do you consider a significant number of saved objects that could affect Kibana's performance?

We have a handful of clusters with > 500k saved objects and no complaints about performance. But the size of the saved objects is a bigger problem than the number: there's one report where an 11 GB .kibana index with just 60k saved objects causes slow startup times of more than 7 minutes, and probably even longer to complete a full upgrade migration (btw, we could probably improve this a lot with #124946).

So I suspect that, similar to import/export, we would have to benchmark this to come up with some kind of upper limit.

xcrzx added a commit that referenced this issue Jan 3, 2023
…ects (#148141)

**Resolves:** #147695, #148174
**Related to:** #145851, #137420

## Summary

This PR improves the stability of the Fleet package installation process for packages containing many saved objects.

1. Changed mappings of the `installed_kibana` and `package_assets`
fields from `nested` to `object` with `enabled: false`. Values of those
fields were retrieved from `_source`, and no queries or aggregations
were performed against them. So the mappings were unused, while during
the installation of packages containing more than 10,000 saved objects,
an error was thrown due to the nested field limitations:

   ```
   Error installing security_detection_engine 8.4.1: The number of nested documents has exceeded the allowed limit of [10000].
   This limit can be set by changing the [index.mapping.nested_objects.limit] index level setting.
   ```
2. Improved the deletion of previous package assets by switching from
sending multiple `savedObjectsClient.delete` requests in parallel to a
single `savedObjectsClient.bulkDelete` request. Multiple parallel
requests were causing the Elasticsearch cluster to stop responding for
some time; see [this
ticket](#147695) for more info.

**Before**
![Screenshot 2022-12-28 at 11 09 35](https://user-images.githubusercontent.com/1938181/209816219-ade6dd0a-0d56-4acc-929e-b88571f0fe81.png)

**After**
![Screenshot 2022-12-28 at 13 56 44](https://user-images.githubusercontent.com/1938181/209816209-16c69922-4ae2-4589-9aa4-5a28050037f4.png)
xcrzx (Contributor, Author) commented Jan 4, 2023

Closing this issue as completed. See this PR for more info on PoC results: #145851
