
perf: consider putting all rules into a single file to reduce IO #1212

Closed
williballenthin opened this issue Dec 6, 2022 · 11 comments · Fixed by #1291
@williballenthin
Collaborator

it can take a little while for capa to begin analyzing a sample, because it has to load rules, signatures, etc.
we currently have 700+ rules, which i suspect might result in a lot of IO.
we should investigate how much of the loading time can be attributed to rule loading IO, and whether the load time can be improved by optimizing rule loading.
for example, perhaps we could support a zip archive of rules that's read into memory once, versus 700+ individual open/read operations.
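
a minimal sketch of that idea, assuming Python's zipfile module and a hypothetical rules.zip archive; parse_rule is a placeholder, not an existing capa API:

```python
import zipfile

def load_rule_texts_from_zip(archive_path: str) -> dict[str, str]:
    """read every rule file out of one zip archive with a single file open."""
    texts = {}
    with zipfile.ZipFile(archive_path) as z:
        for name in z.namelist():
            if name.endswith(".yml"):
                texts[name] = z.read(name).decode("utf-8")
    return texts

# usage sketch (hypothetical archive path and parse function):
# texts = load_rule_texts_from_zip("rules.zip")
# rules = [parse_rule(text) for text in texts.values()]
```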

@williballenthin added the enhancement and question labels on Dec 6, 2022
@williballenthin
Collaborator Author

on my flimsy codespace (2 cores/4 GB RAM), loading rules takes about 1 second (screenshot omitted).

total runtime against PMA 01-01 is 12 seconds (screenshot omitted).

so if we made rule loading instantaneous, we'd get roughly an 8% improvement for simple samples.

@williballenthin
Collaborator Author

benchmarking the various parts of rule loading (screenshots omitted):

it takes only a small amount of time to read the rule content from disk, but quite a bit of time to parse and validate it. so, the original proposal (bundling rules into a single file to reduce IO) won't really address rule loading performance.
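
for reference, a rough timing sketch along these lines separates the file IO from the parse/validate work; parse_rule again stands in for capa's actual rule parsing:

```python
import time
from pathlib import Path

def benchmark_rule_loading(rule_dir: str, parse_rule) -> None:
    """roughly split rule loading into file IO vs. parse/validate time."""
    paths = sorted(Path(rule_dir).rglob("*.yml"))

    t0 = time.perf_counter()
    texts = [p.read_text(encoding="utf-8") for p in paths]
    t1 = time.perf_counter()
    _rules = [parse_rule(text) for text in texts]
    t2 = time.perf_counter()

    print(f"read {len(paths)} rule files: {t1 - t0:.2f}s")
    print(f"parse + validate:            {t2 - t1:.2f}s")
```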

@mr-tz
Collaborator

mr-tz commented Dec 20, 2022

Reopening to reconsider storing/loading the serialized RuleSet.

@mr-tz reopened this on Dec 20, 2022
@mr-tz added this to the 5.0.0 milestone on Dec 20, 2022
@williballenthin
Collaborator Author

williballenthin commented Dec 21, 2022

we've been discussing the strategy of caching rulesets to improve the startup time of capa.

we currently think that capa spends around five seconds loading rules before analyzing a sample. most of this time is validating rules and statements; this is a CPU task, not an IO task. we note that, most of the time, capa rules don't change from invocation to invocation.

therefore, we hypothesize that introducing a persistent cache of the validated ruleset can speed up capa by 4-5 seconds.

the design might look something like this:

  1. check if cache exists. if not, load rules as normal.
  2. when rules have been validated and the ruleset object is created, write the cache to disk.
  3. the cache can be written into the temp dir or similar (maybe $XDG_CACHE_HOME). it should not be a problem for the cache to be deleted.
  4. the cache is a pickled (protocol 4) representation of the ruleset. it's probably 1-2 MB in size, in a binary format. loading this must be fast.
  5. when the cache is used, capa should validate that the cache matches the rules on disk. it might do this by hashing the rule content and comparing with a hash embedded in the cache. we want to avoid any cache invalidation problems. we'll also want to validate the capa code version used to create the cache.

we'll want to be careful around race conditions when updating the cache. also, the pickle format can include code that's executed upon load, so we should be clear that the cache must be "trusted" somehow. if there's any issue processing the cache file, we should throw out the old one and regenerate it; worst case, the performance is about the same as today. there should be a way to disable the cache via CLI option, which is relevant for users running in docker or other transient environments.
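
a minimal sketch of that flow; build_ruleset stands in for capa's normal parse-and-validate path, the cache id is the hash discussed below, and the header/validation details from the format comment are omitted for brevity:

```python
import pickle
import zlib
from pathlib import Path

def load_ruleset_with_cache(cache_id: str, rule_texts: list, cache_dir: Path, build_ruleset):
    """check the cache first; on a miss or any error, rebuild the ruleset and rewrite the cache."""
    path = cache_dir / f"capa-{cache_id[:8]}.cache"

    if path.exists():
        try:
            return pickle.loads(zlib.decompress(path.read_bytes()))
        except Exception:
            path.unlink(missing_ok=True)  # corrupt or stale cache: discard and rebuild

    ruleset = build_ruleset(rule_texts)   # slow path: parse + validate, as today
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(pickle.dumps(ruleset, protocol=4)))
    return ruleset
```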

we expect the hit rate on the cache should be pretty high because we don't think that many capa users often modify their rules. for example, the standalone capa.exe will always use the embedded rules (unless overridden) so we could distribute a pre-built cache for capa.exe that benefits most users (we'll have to be very careful about versioning and probably cannot distribute pre-built caches for anything but capa.exe).

@mr-tz
Collaborator

mr-tz commented Dec 21, 2022

Pickle is the easiest option vs. creating a custom JSON export format (likely several hundred LOC) for all objects embedded in a RuleSet.

We should be clear about the threat model for code execution via a pickled cache file: if users can't or don't share the cache file (it resides in a mostly hidden location, far away from capa code and rules), then it is hard to imagine the social engineering required for an attacker to convince a user to load a malicious cache file. However, this makes clear that we should not distribute pre-built cache files, except within the standalone binaries (where the fact that a cache is present doesn't matter at all).

In general, users should not be aware there is a cache in use at all.

Next steps:

  • Sketch out source implementation
    • where does the cache reside? Options: tmp or cache special directories, configurable via environment variable
  • adjust standalone build

@williballenthin
Collaborator Author

williballenthin commented Dec 21, 2022

cache file location

The cache file should not be stored alongside capa code nor capa rules (it should be basically impossible to accidentally share). Users shouldn't stumble across the file unless they go looking for it. If the file is deleted accidentally or by the OS to reclaim disk/memory, that's totally ok.

We should try to find existing work on the topic of cache files and follow conventions.

Best choices so far:

  • Linux: $XDG_CACHE_HOME/capa/
  • Windows: %LOCALAPPDATA%\flare\capa\cache
  • MacOS: ~/Library/Caches/capa

filename: capa-XXXXXXXX.cache where XXXXXXXX is a hash derived from the source data.

When writing the cache, create it as a temp file (using an appropriate library) and then atomically move it to its destination (after ensuring the directory exists). we should try to avoid any race condition in cache file creation.

We should enable users to 1) disable the cache via env var, and 2) specify the cache directory via env var, in order to support running capa in an ephemeral environment, such as docker/k8s.
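
a sketch of both pieces; CAPA_CACHE_DIR is a hypothetical override variable, not a decided name:

```python
import os
import sys
import tempfile
from pathlib import Path

def default_cache_directory() -> Path:
    """pick a per-OS cache directory, honoring a (hypothetical) CAPA_CACHE_DIR override."""
    override = os.environ.get("CAPA_CACHE_DIR")
    if override:
        return Path(override)
    if sys.platform == "win32":
        return Path(os.environ["LOCALAPPDATA"]) / "flare" / "capa" / "cache"
    if sys.platform == "darwin":
        return Path.home() / "Library" / "Caches" / "capa"
    # Linux and others: honor $XDG_CACHE_HOME, falling back to ~/.cache
    return Path(os.environ.get("XDG_CACHE_HOME", Path.home() / ".cache")) / "capa"

def write_cache_atomically(destination: Path, data: bytes) -> None:
    """write to a temp file in the destination directory, then atomically rename into place."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=destination.parent)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_path, destination)  # atomic within the same filesystem
    except Exception:
        os.unlink(tmp_path)
        raise
```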

@williballenthin
Collaborator Author

williballenthin commented Dec 21, 2022

cache file format

b"capa" + b"\x00\x00\x00\x01" + id + zlib(pickle(object))

object looks like:

  • id: str hash of source data
  • ruleset: capa.rules.RuleSet
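
a sketch of reading and writing that layout, assuming the id is a 64-character sha256 hex digest:

```python
import pickle
import zlib

MAGIC = b"capa"
VERSION = b"\x00\x00\x00\x01"

def serialize_cache(cache_id: str, ruleset) -> bytes:
    payload = zlib.compress(pickle.dumps({"id": cache_id, "ruleset": ruleset}, protocol=4))
    return MAGIC + VERSION + cache_id.encode("ascii") + payload

def deserialize_cache(data: bytes, expected_id: str):
    assert data[:4] == MAGIC, "not a capa cache file"
    assert data[4:8] == VERSION, "unsupported cache version"
    embedded_id = data[8:72].decode("ascii")  # sha256 hex digest is 64 characters
    if embedded_id != expected_id:
        return None                           # stale: rules or capa version changed
    return pickle.loads(zlib.decompress(data[72:]))["ruleset"]
```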

@williballenthin
Collaborator Author

williballenthin commented Dec 21, 2022

cache file identifier/hash

we want to be able to detect when the cache is invalid, such as when the underlying rules have changed or the capa version has changed (and types/import paths may have changed).

proposed:

sha256(capa.version.encode("utf-8") + sorted(sha256(rule1.text), sha256(rule2.text), ...))

This hash should be validated when loading the cache file. It should also be used to derive the file path of the cache file. This way, there can be multiple versions of capa and its rules on a system, each using a separate cache file.

When an error is encountered with the cache file, such as a failure to load, the cache file should be deleted and capa should fall back to non-cached rule loading.
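
a sketch of that computation, hashing the capa version string together with the sorted per-rule content hashes:

```python
import hashlib

def compute_cache_id(capa_version: str, rule_texts: list) -> str:
    """sha256 over the capa version plus the sorted sha256 hashes of each rule's text."""
    rule_hashes = sorted(hashlib.sha256(text.encode("utf-8")).hexdigest() for text in rule_texts)
    h = hashlib.sha256()
    h.update(capa_version.encode("utf-8"))
    for rule_hash in rule_hashes:
        h.update(rule_hash.encode("ascii"))
    return h.hexdigest()
```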

@williballenthin
Collaborator Author

williballenthin commented Dec 21, 2022

we'll need to build a tool to generate a cache file into a known location so that we can pre-build the cache file for the standalone binaries.

the logic for using the cache file for standalone binaries should be fairly straightforward: if -r is not present, use the embedded cache file. in fact, we don't really need the rules to be embedded too, but i personally think it's good for the source data to be around.

building the cache in CI and adding it to the pyinstaller build will take a few minutes but should be pretty straightforward. it's binary data that should be added alongside the rules, just like the rules and signatures are today.
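
for context, a hypothetical excerpt of the PyInstaller .spec datas list; the paths are illustrative, not capa's actual layout:

```python
# hypothetical excerpt from the PyInstaller .spec file; paths are illustrative
datas = [
    ("rules", "rules"),  # embedded rule source, as today
    ("sigs", "sigs"),    # embedded signatures, as today
    ("cache", "cache"),  # pre-built ruleset cache generated in CI
]
```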

@williballenthin self-assigned this on Dec 21, 2022
@mr-tz
Collaborator

mr-tz commented Dec 22, 2022

Should we also do this for the signatures/signature analyzer? That part also takes a couple of seconds, and the signatures almost never change.

This may be done in viv-utils.

@williballenthin
Collaborator Author

williballenthin commented Dec 22, 2022 via email
