tighten rule pre-selection #2080

williballenthin · 2024-05-14T20:03:38Z

closes #2074
ref #2063, particularly "tighten rule pre-selection" and "lots of time spent in instancecheck"

Stacked on #1950, so I've marked this as a PR onto that branch so the diff is sensible. I think we can probably rebase onto master, though, if necessary.

This PR implements the "tighten rule pre-selection" algorithm described here: #2063 (comment) . In summary:

Rather than indexing all features from all rules, we should pick and index the minimal set (ideally, one) of features from each rule that must be present for the rule to match. When we have multiple candidates, pick the feature that is probably most uncommon and therefore "selective".
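
To make the summary concrete, here is a minimal sketch of the idea, not the code in this PR. It assumes every rule is a plain "and" over a set of feature strings and uses "how many rules mention this feature" as a stand-in for rarity; `RULES`, `build_index`, and `match` are hypothetical names.

```python
from collections import defaultdict

# Hypothetical stand-ins: each rule is the set of features that must ALL be
# present. Real capa rules also have or/not/count logic, which this ignores.
RULES = {
    "create mutex": {"api(CreateMutexA)", "mnemonic(call)"},
    "rc4 via SystemFunction033": {"api(SystemFunction033)", "mnemonic(call)"},
    "write file": {"api(WriteFile)", "mnemonic(call)", "mnemonic(push)"},
}


def build_index(rules):
    # Count how many rules mention each feature. This is only a crude proxy
    # for how common a feature is (an assumption made for this sketch).
    usage = defaultdict(int)
    for required in rules.values():
        for feature in required:
            usage[feature] += 1

    # Index each rule under exactly one feature: its rarest required one.
    # Sorting first makes tie-breaking deterministic.
    index = defaultdict(list)
    for name, required in rules.items():
        rarest = min(sorted(required), key=lambda f: usage[f])
        index[rarest].append(name)
    return index


def match(rules, index, features):
    # Only rules whose indexed feature was extracted get a full evaluation.
    candidates = [name for f in features for name in index.get(f, [])]
    return sorted(name for name in candidates if rules[name] <= features)
```

In this toy ruleset `mnemonic(call)` appears in every rule, so it is never chosen as an index key, and a function containing only that feature triggers no rule evaluations at all.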

This seems to work pretty well. Total evaluations when running against mimikatz drop from 19.9M to 1.16M (wow!) and capa seems to match around 3x more functions per second (wow wow). I did not expect such a good result - in fact, although the capa matches seem to be the same, I still wonder if something is broken 🤔. More tests needed.

label	count(evaluations)	time
[before any optimization, prehistoric]	104,496,193	86.62s
`8858537` pep8 [before this PR]	19,939,632	25.74s
`a66524a` rules: match: better debug paranoid matching	1,157,514	8.10s
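
As a quick check of the "around 3x" claim, the ratios can be recomputed from the last two rows of the table. This is plain arithmetic on the reported numbers; the variable names are only for illustration.

```python
# Figures copied from the table above: (evaluations, seconds).
before = (19_939_632, 25.74)  # 8858537, before this PR
after = (1_157_514, 8.10)  # a66524a


def ratio(old: float, new: float) -> float:
    """Return how many times smaller `new` is than `old`."""
    return old / new


evaluation_ratio = ratio(before[0], after[0])
time_ratio = ratio(before[1], after[1])
print(f"evaluations: {evaluation_ratio:.1f}x fewer, time: {time_ratio:.1f}x faster")
```

That works out to roughly 17x fewer evaluations and a little over 3x less time for this sample.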

TODO:

add some tests for the feature indexer, if only to show a human how it works
namespace matching
prove that it matches exactly the same as before, just faster
xfail the tests and document the unsupported constructs
inline documentation explaining the algorithm better
wall clock performance numbers

williballenthin · 2024-05-14T20:04:40Z

capa/rules/__init__.py

+    @property
+    def file_rules(self):
+        return self.rules_by_scope[Scope.FILE]
+
+    @property
+    def process_rules(self):
+        return self.rules_by_scope[Scope.PROCESS]
+
+    @property
+    def thread_rules(self):
+        return self.rules_by_scope[Scope.THREAD]
+
+    @property
+    def call_rules(self):
+        return self.rules_by_scope[Scope.CALL]
+
+    @property
+    def function_rules(self):
+        return self.rules_by_scope[Scope.FUNCTION]
+
+    @property
+    def basic_block_rules(self):
+        return self.rules_by_scope[Scope.BASIC_BLOCK]
+
+    @property
+    def instruction_rules(self):
+        return self.rules_by_scope[Scope.INSTRUCTION]


These are for backwards compatibility. In a major version release, we can probably remove them in favor of rules_by_scope.

mr-tz

nice work, we should do extensive tests comparing the results before and after to ensure everything works as expected. the speedup looks promising!

williballenthin · 2024-05-16T13:22:50Z

> we should do extensive tests comparing the results before and after to ensure everything works as expected.

I plan to run this implementation side by side with the ceng.match implementation and assert the results are precisely the same across a wide range of samples. There should be no leaks of abstraction or details in the new one; it should just be faster.
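
A side-by-side check like that could look roughly like the sketch below. The three functions are hypothetical stand-ins, not capa's API: `match_reference` plays the role of the existing matcher, `match_optimized` the new one, and rules are again simplified to sets of required features.

```python
def match_reference(rules, features):
    # Baseline: evaluate every rule against the extracted feature set.
    return {name for name, required in rules.items() if required <= features}


def match_optimized(rules, features):
    # Stand-in for a faster matcher: cheaply skip rules that share no
    # feature with the input, then fully evaluate the remainder.
    return {
        name
        for name, required in rules.items()
        if (not required or not required.isdisjoint(features)) and required <= features
    }


def match_paranoid(rules, features):
    # Run both implementations and fail loudly on any divergence.
    expected = match_reference(rules, features)
    actual = match_optimized(rules, features)
    assert actual == expected, f"mismatch: {sorted(actual ^ expected)}"
    return actual
```

Running every sample through the paranoid wrapper costs about as much as running both matchers, so it is better suited to testing than to normal use.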

CHANGELOG updated or no update needed, thanks! 😄

mike-hunhoff · 2024-06-03T17:17:47Z

capa/rules/__init__.py

+            string_features = [
+                feature
+                for feature in features
+                if isinstance(feature, (capa.features.common.Substring, capa.features.common.Regex))
+            ]
+            bytes_features = [feature for feature in features if isinstance(feature, capa.features.common.Bytes)]
+            hashable_features = [
+                feature
+                for feature in features
+                if not isinstance(
+                    feature, (capa.features.common.Substring, capa.features.common.Regex, capa.features.common.Bytes)
+                )
+            ]


Can this be optimized? We're looping and calling isinstance on every feature three times.

let me try and then run some benchmarks. I agree it looks wasteful, but I'm not sure if it has a real world effect.
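
For reference, a single-pass version of the partitioning might look like the sketch below. The classes are local stand-ins for the `capa.features.common` types so the example is self-contained, and `partition_features` is a hypothetical helper name. Whether this is measurably faster would still need the benchmarks mentioned above.

```python
# Local stand-ins for the capa feature classes (assumptions for this sketch).
class Feature: ...
class Substring(Feature): ...
class Regex(Feature): ...
class Bytes(Feature): ...
class API(Feature): ...


def partition_features(features):
    # Visit each feature once and route it to exactly one bucket,
    # preserving the original order within each bucket.
    string_features, bytes_features, hashable_features = [], [], []
    for feature in features:
        if isinstance(feature, (Substring, Regex)):
            string_features.append(feature)
        elif isinstance(feature, Bytes):
            bytes_features.append(feature)
        else:
            hashable_features.append(feature)
    return string_features, bytes_features, hashable_features
```

This does at most two `isinstance` checks per feature instead of three, and it produces the same three lists as the comprehensions in the diff.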

mike-hunhoff

This looks great @williballenthin - I'm pumped about the improved efficiency. The logic and code that you've implemented here appear sound. Let's get this merged pending a successful paranoid invocation across a wide range of samples.

mike-hunhoff · 2024-06-03T17:29:58Z

Should we rebase this on top of master so that it doesn't depend on BinExport2?

I'm inclined to say "yes" although we lose the intermediate history. This would allow us to do a minor release and get the optimizations out there.

Yes let's rebase on master so we can get this to our users ASAP

mr-tz

amazing work! I've noted a few minor things, and the paranoid run will provide a lot of value

mr-tz · 2024-06-03T18:43:00Z

capa/rules/__init__.py

+        # We may want to try to pre-evaluate these strings, based on their presence in the file,
+        # to reduce the number of evaluations we do here.
+        # See: https://github.com/mandiant/capa/issues/2063#issuecomment-2095639672
+        #
+        # We may also want to specialize case-insensitive strings, which would enable them to
+        # be indexed, and therefore skip the scanning here, improving performance.
+        # This strategy is described here:
+        # https://github.com/mandiant/capa/issues/2063#issuecomment-2107083068


add TODOs for these notes?

yeah, and i'll spin off the original issue comments into dedicated issues we can use to track the idea.
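
One possible reading of the second note, sketched under heavy assumptions: if a case-insensitive string feature only needs whole-string equality, it can be stored under a case-folded key and found with a dictionary lookup instead of a scan. The linked comments may describe something different; `build_ci_index` and `candidate_rules` are hypothetical names, and this does not cover substring or regex features.

```python
def build_ci_index(rule_strings):
    # rule_strings: iterable of (rule name, case-insensitive string) pairs.
    # Keys are case-folded so lookups ignore case.
    index = {}
    for rule_name, text in rule_strings:
        index.setdefault(text.casefold(), []).append(rule_name)
    return index


def candidate_rules(index, extracted_strings):
    # One dictionary lookup per extracted string, no per-rule scanning.
    found = set()
    for text in extracted_strings:
        found.update(index.get(text.casefold(), ()))
    return found
```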

Co-authored-by: Moritz <[email protected]>

Co-authored-by: Mike Hunhoff <[email protected]>

fariss

Very good improvements, I just have a question below.

Unrelated to this PR, I think we can replace

capa/capa/rules/__init__.py

Line 459 in 960ee86

b = codecs.decode(s.replace(" ", "").encode("ascii"), "hex")

with:

b = bytes.fromhex(s)

https://docs.python.org/3/library/stdtypes.html#bytes.fromhex
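
A small sketch comparing the two spellings on a space-separated hex string. The wrapper function names are made up for the comparison; the calls themselves are standard library.

```python
import codecs


def decode_hex_codecs(s: str) -> bytes:
    # Current approach: strip spaces, then decode via the "hex" codec.
    return codecs.decode(s.replace(" ", "").encode("ascii"), "hex")


def decode_hex_fromhex(s: str) -> bytes:
    # Suggested approach: bytes.fromhex ignores whitespace between bytes.
    return bytes.fromhex(s)


sample = "4D 5A 90 00"
assert decode_hex_codecs(sample) == decode_hex_fromhex(sample) == b"MZ\x90\x00"
```

The two are not identical on every input, so it is worth checking against the rule corpus: `bytes.fromhex` also skips other ASCII whitespace such as tabs (Python 3.7+), and, as far as I know, it rejects a space placed between the two digits of a single byte, which the `replace`-based version would accept.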

williballenthin · 2024-06-04T10:29:11Z

paranoid linting succeeded!

❯ time python scripts/lint.py rules/ --thorough
INFO:lint:collecting potentially referenced samples
                                                                                                                                                                                                     
encrypt data using RC4 via SystemFunction033
FAIL: referenced example doesn't exist: Add the referenced example to samples directory ($capa-root/tests/data or supplied via --samples)

(nursery)  linked against hp-socket
WARN: referenced example doesn't exist: Add the referenced example to samples directory ($capa-root/tests/data or supplied via --samples)

rules with WARN:
  - linked against hp-socket

rules with FAIL:
  - encrypt data using RC4 via SystemFunction033

________________________________________________________
Executed in  125.20 mins    fish           external
   usr time  124.04 mins   66.00 micros  124.04 mins
   sys time    0.98 mins  898.00 micros    0.98 mins

	time
paranoid	125 minutes
master	62 minutes
this PR	44 minutes

So, this improves the performance of capa (with the vivisect backend) by about 30%. When using the BinExport2 backend, I think the performance improvement will be closer to 2-3x, since less time is spent doing analysis.
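
The "about 30%" figure follows directly from the master and PR rows of the table; the snippet below just redoes that arithmetic with illustrative variable names.

```python
# Wall-clock minutes for the thorough lint run, copied from the table above.
master_minutes = 62
pr_minutes = 44


def percent_reduction(old: float, new: float) -> float:
    """Return the percentage by which `new` is smaller than `old`."""
    return 100 * (old - new) / old


reduction = percent_reduction(master_minutes, pr_minutes)
speedup = master_minutes / pr_minutes
print(f"{reduction:.0f}% less time, {speedup:.2f}x speedup")
```

So the run takes about 29% less time, which is the same thing as a speedup of roughly 1.4x.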

mr-tz · 2024-06-04T10:52:08Z

awesome, big performance improvement!

williballenthin · 2024-06-06T08:21:28Z

new PR that's rebased against master: #2125

williballenthin added 3 commits May 14, 2024 21:41

features: mark format as a global feature

8b0076b

pep8

8858537

rules: optimize rule pre-filtering, first revision

9c0c662

williballenthin added the enhancement New feature or request label May 14, 2024

williballenthin commented May 14, 2024

mr-tz reviewed May 16, 2024

williballenthin added 3 commits May 22, 2024 09:59

Merge branch 'feat/1755' into perf-rule-pre-selection

07f347b

lints

2d9c82f

rules: add documentation for optimized match routine

0dc0c51

williballenthin mentioned this pull request May 22, 2024

investigate optimization of rule matching (May, 2024) #2063

Closed

williballenthin added 3 commits May 22, 2024 14:40

bytes: log length of bytes evaluations

f86a60c

ruleset: document optimized match behavior

6e50f48

changelog

b7d0734

williballenthin and others added 12 commits May 22, 2024 15:40

ruleset: infrastructure to test optimized matcher

f853214

Merge branch 'feat/1755' into perf-rule-pre-selection

0bc9cb5

Merge branch 'feat/1755' into perf-rule-pre-selection

d25d74f

Merge branch 'perf-rule-pre-selection' of github.com:mandiant/capa in…

6f9c34b

…to perf-rule-pre-selection

pep8

9b7fb4e

linters

e8ef897

rules: match: handle namespace match statements

e49d47d

rules: more tests for logic edge cases

a4f4f0b

rules: match paranoid true

bff7f0a

rules: document logic edge cases

d20f040

Merge branch 'feat/1755' into perf-rule-pre-selection

ad3643b

Merge branch 'perf-rule-pre-selection' of github.com:mandiant/capa in…

ced0226

…to perf-rule-pre-selection

williballenthin requested a review from fariss June 3, 2024 16:26