
perf: consider putting all rules into a single file to reduce IO #1212

Closed
williballenthin opened this issue Dec 6, 2022 · 11 comments · Fixed by #1291
@williballenthin
Collaborator

it can take a little while for capa to begin analyzing a sample, because it has to load rules, signatures, etc.
we currently have 700+ rules, which i suspect might result in a lot of IO.
we should investigate how much of the loading time can be attributed to rule loading IO, and whether the load time can be improved by optimizing rule loading.
for example, perhaps we could support a zip archive of rules that's read into memory once, versus 700+ individual open/read operations.
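
a minimal sketch of that idea, assuming Python's zipfile module and a hypothetical rules.zip archive; parse_rule is a placeholder, not an existing capa API:

```python
import zipfile

def load_rule_texts_from_zip(archive_path: str) -> dict[str, str]:
    """read every rule file out of one zip archive with a single file open."""
    texts = {}
    with zipfile.ZipFile(archive_path) as z:
        for name in z.namelist():
            if name.endswith(".yml"):
                texts[name] = z.read(name).decode("utf-8")
    return texts

# usage sketch (hypothetical archive path and parse function):
# texts = load_rule_texts_from_zip("rules.zip")
# rules = [parse_rule(text) for text in texts.values()]
```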

@williballenthin added the enhancement and question labels on Dec 6, 2022
@williballenthin
Collaborator Author

on my flimsy codespace (2 cores/4 GB RAM), loading rules takes about 1 second (screenshot omitted).

total runtime against PMA 01-01 is 12 seconds (screenshot omitted).

so if we made rule loading instantaneous, we'd get roughly an 8% improvement for simple samples.

@williballenthin
Collaborator Author

benchmarking the various parts of rule loading (screenshots omitted):

it takes only a small amount of time to read the rule content from disk, but quite a bit of time to parse and validate it. so, the original proposal (bundling rules into a single file to reduce IO) won't really address rule loading performance.
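
for reference, a rough timing sketch along these lines separates the file IO from the parse/validate work; parse_rule again stands in for capa's actual rule parsing:

```python
import time
from pathlib import Path

def benchmark_rule_loading(rule_dir: str, parse_rule) -> None:
    """roughly split rule loading into file IO vs. parse/validate time."""
    paths = sorted(Path(rule_dir).rglob("*.yml"))

    t0 = time.perf_counter()
    texts = [p.read_text(encoding="utf-8") for p in paths]
    t1 = time.perf_counter()
    _rules = [parse_rule(text) for text in texts]
    t2 = time.perf_counter()

    print(f"read {len(paths)} rule files: {t1 - t0:.2f}s")
    print(f"parse + validate:            {t2 - t1:.2f}s")
```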

@mr-tz
Collaborator

mr-tz commented Dec 20, 2022

Reopening to reconsider storing/loading the serialized RuleSet.

@mr-tz reopened this on Dec 20, 2022
@mr-tz added this to the 5.0.0 milestone on Dec 20, 2022
@williballenthin
Collaborator Author

williballenthin commented Dec 21, 2022

we've been discussing the strategy of caching rulesets to improve the startup time of capa.

we currently think that capa spends around five seconds loading rules before analyzing a sample. most of this time is validating rules and statements; this is a CPU task, not an IO task. we note that, most of the time, capa rules don't change from invocation to invocation.

therefore, we hypothesize that introducing a persistent cache of the validated ruleset can speed up capa by 4-5 seconds.

the design might look something like this:

  1. check if cache exists. if not, load rules as normal.
  2. when rules have been validated and the ruleset object is created, write the cache to disk.
  3. the cache can be written into the temp dir or similar (maybe $XDG_CACHE_HOME). it should not be a problem for the cache to be deleted.
  4. the cache is a pickled (protocol 4) representation of the ruleset. it's probably 1-2 MB in size, in a binary format. loading this must be fast.
  5. when the cache is used, capa should validate that the cache matches the rules on disk. it might do this by hashing the rule content and comparing with a hash embedded in the cache. we want to avoid any cache invalidation problems. we'll also want to validate the capa code version used to create the cache.

we'll want to be careful around race conditions when updating the cache. also, the pickle format can include code that's executed upon load, so we should be clear that the cache must be "trusted" somehow. if there's any issue processing the cache file, we should throw out the old one and regenerate it; worst case, the performance is about the same as today. there should be a way to disable the cache via CLI option, which is relevant for users running in docker or other transient environments.
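
a minimal sketch of that flow; build_ruleset stands in for capa's normal parse-and-validate path, the cache id is the hash discussed below, and the header/validation details from the format comment are omitted for brevity:

```python
import pickle
import zlib
from pathlib import Path

def load_ruleset_with_cache(cache_id: str, rule_texts: list, cache_dir: Path, build_ruleset):
    """check the cache first; on a miss or any error, rebuild the ruleset and rewrite the cache."""
    path = cache_dir / f"capa-{cache_id[:8]}.cache"

    if path.exists():
        try:
            return pickle.loads(zlib.decompress(path.read_bytes()))
        except Exception:
            path.unlink(missing_ok=True)  # corrupt or stale cache: discard and rebuild

    ruleset = build_ruleset(rule_texts)   # slow path: parse + validate, as today
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(pickle.dumps(ruleset, protocol=4)))
    return ruleset
```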

we expect the hit rate on the cache should be pretty high because we don't think that many capa users often modify their rules. for example, the standalone capa.exe will always use the embedded rules (unless overridden) so we could distribute a pre-built cache for capa.exe that benefits most users (we'll have to be very careful about versioning and probably cannot distribute pre-built caches for anything but capa.exe).

@mr-tz
Collaborator

mr-tz commented Dec 21, 2022

Pickle is the easiest option vs. creating a custom JSON export format (likely several hundred LOC) for all objects embedded in a RuleSet.

We should be clear about the threat model for code execution via a pickled cache file: if users can't or don't share the cache file (it resides in a mostly hidden location, far away from capa code and rules), then it is hard to imagine the social engineering required for an attacker to convince a user to load a malicious cache file. However, this makes clear that we should not distribute pre-built cache files, except within the standalone binaries (where the fact that a cache is present doesn't matter at all).

In general, users should not be aware there is a cache in use at all.

Next steps:

  • Sketch out source implementation
    • where does the cache reside? Options: tmp or cache special directories, configurable via environment variable
  • adjust standalone build

@williballenthin
Collaborator Author

williballenthin commented Dec 21, 2022

cache file location

The cache file should not be stored alongside capa code nor capa rules (it should be basically impossible to accidentally share). Users shouldn't stumble across the file unless they go looking for it. If the file is deleted accidentally or by the OS to reclaim disk/memory, that's totally ok.

We should try to find existing work on the topic of cache files and follow conventions.

Best choices so far:

  • Linux: $XDG_CACHE_HOME/capa/
  • Windows: %LOCALAPPDATA%\flare\capa\cache
  • MacOS: ~/Library/Caches/capa

filename: capa-XXXXXXXX.cache where XXXXXXXX is a hash derived from the source data.

When writing the cache, create it as a temp file (using an appropriate library) and then atomically move it to its destination (after ensuring the directory exists). we should try to avoid any race condition in cache file creation.

We should enable users to 1) disable the cache via env var, and 2) specify the cache directory via env var, in order to support running capa in an ephemeral environment, such as docker/k8s.
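
a sketch of both pieces; CAPA_CACHE_DIR is a hypothetical override variable, not a decided name:

```python
import os
import sys
import tempfile
from pathlib import Path

def default_cache_directory() -> Path:
    """pick a per-OS cache directory, honoring a (hypothetical) CAPA_CACHE_DIR override."""
    override = os.environ.get("CAPA_CACHE_DIR")
    if override:
        return Path(override)
    if sys.platform == "win32":
        return Path(os.environ["LOCALAPPDATA"]) / "flare" / "capa" / "cache"
    if sys.platform == "darwin":
        return Path.home() / "Library" / "Caches" / "capa"
    # Linux and others: honor $XDG_CACHE_HOME, falling back to ~/.cache
    return Path(os.environ.get("XDG_CACHE_HOME", Path.home() / ".cache")) / "capa"

def write_cache_atomically(destination: Path, data: bytes) -> None:
    """write to a temp file in the destination directory, then atomically rename into place."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=destination.parent)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_path, destination)  # atomic within the same filesystem
    except Exception:
        os.unlink(tmp_path)
        raise
```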

@williballenthin
Collaborator Author

williballenthin commented Dec 21, 2022

cache file format

b"capa" + b"\x00\x00\x00\x01" + id + zlib(pickle(object))

object looks like:

  • id: str hash of source data
  • ruleset: capa.rules.RuleSet
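
a sketch of reading and writing that layout, assuming the id is a 64-character sha256 hex digest:

```python
import pickle
import zlib

MAGIC = b"capa"
VERSION = b"\x00\x00\x00\x01"

def serialize_cache(cache_id: str, ruleset) -> bytes:
    payload = zlib.compress(pickle.dumps({"id": cache_id, "ruleset": ruleset}, protocol=4))
    return MAGIC + VERSION + cache_id.encode("ascii") + payload

def deserialize_cache(data: bytes, expected_id: str):
    assert data[:4] == MAGIC, "not a capa cache file"
    assert data[4:8] == VERSION, "unsupported cache version"
    embedded_id = data[8:72].decode("ascii")  # sha256 hex digest is 64 characters
    if embedded_id != expected_id:
        return None                           # stale: rules or capa version changed
    return pickle.loads(zlib.decompress(data[72:]))["ruleset"]
```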

@williballenthin
Collaborator Author

williballenthin commented Dec 21, 2022

cache file identifier/hash

we want to be able to detect when the cache is invalid, such as when the underlying rules have changed or the capa version has changed (and types/import paths may have changed).

proposed:

sha256(capa.version.encode("utf-8") + sorted(sha256(rule1.text), sha256(rule2.text), ...))

This hash should be validated when loading the cache file. It should also be used to derive the file path of the cache file. This way, there can be multiple versions of capa and its rules on a system, each using a separate cache file.

When an error is encountered with the cache file, such as a failure to load, the cache file should be deleted and capa should fall back to non-cached rule loading.
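
a sketch of that computation, hashing the capa version string together with the sorted per-rule content hashes:

```python
import hashlib

def compute_cache_id(capa_version: str, rule_texts: list) -> str:
    """sha256 over the capa version plus the sorted sha256 hashes of each rule's text."""
    rule_hashes = sorted(hashlib.sha256(text.encode("utf-8")).hexdigest() for text in rule_texts)
    h = hashlib.sha256()
    h.update(capa_version.encode("utf-8"))
    for rule_hash in rule_hashes:
        h.update(rule_hash.encode("ascii"))
    return h.hexdigest()
```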

@williballenthin
Collaborator Author

williballenthin commented Dec 21, 2022

we'll need to build a tool to generate a cache file into a known location so that we can pre-build the cache file for the standalone binaries.

the logic for using the cache file for standalone binaries should be fairly straightforward: if -r is not present, use the embedded cache file. in fact, we don't really need the rules to be embedded too, but i personally think it's good for the source data to be around.

building the cache in CI and adding it to the pyinstaller build will take a few minutes but should be pretty straightforward. it's binary data that should be added alongside the rules, just like the rules and signatures are today.
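
for context, a hypothetical excerpt of the PyInstaller .spec datas list; the paths are illustrative, not capa's actual layout:

```python
# hypothetical excerpt from the PyInstaller .spec file; paths are illustrative
datas = [
    ("rules", "rules"),  # embedded rule source, as today
    ("sigs", "sigs"),    # embedded signatures, as today
    ("cache", "cache"),  # pre-built ruleset cache generated in CI
]
```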

@williballenthin self-assigned this on Dec 21, 2022
@mr-tz
Collaborator

mr-tz commented Dec 22, 2022

Should we also do this for the signatures/signature analyzer? That part also takes a couple of seconds, and the signatures almost never change.

This may be done in viv-utils.

@williballenthin
Collaborator Author

williballenthin commented Dec 22, 2022 via email
