Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tighten rule pre-selection #2080

Closed
wants to merge 37 commits into from
Closed

Conversation

williballenthin
Copy link
Collaborator

@williballenthin williballenthin commented May 14, 2024

closes #2074
ref #2063, particularly "tighten rule pre-selection" and "lots of time spent in instancecheck"

Stacked on #1950, so I've marked this as a PR onto that branch so the diff is sensible. I think we can probably rebase onto master, though, if necessary.


This PR implements the "tighten rule pre-selection" algorithm described here: #2063 (comment) . In summary:

Rather than indexing all features from all rules, we should pick and index the minimal set (ideally, one) of features from each rule that must be present for the rule to match. When we have multiple candidates, pick the feature that is probably most uncommon and therefore "selective".

This seems to work pretty well. Total evaluations when running against mimikatz drop from 19M to 1.1M (wow!) and capa seems to match around 3x more functions per second (wow wow). I did not expect such a good result - in fact, although the capa matches seem the be the same, I still wonder if something is broken 🤔. More tests needed.

label count(evaluations) time
[before any optimization, prehistoric] 104,496,193 86.62s
8858537 pep8 [before this PR] 19,939,632 25.74s
a66524a rules: match: better debug paranoid matching 1,157,514 8.10s

TODO:

  • add some tests for the feature indexer, if only to show a human how it works
  • namespace matching
  • prove that it matches exactly the same as before, just faster
  • xfail the tests and document the unsupported constructs
  • inline documentation explaining the algorithm better
  • wall clock performance numbers

@williballenthin williballenthin added the enhancement New feature or request label May 14, 2024
github-actions[bot]

This comment was marked as resolved.

Comment on lines +1389 to +1415
@property
def file_rules(self):
return self.rules_by_scope[Scope.FILE]

@property
def process_rules(self):
return self.rules_by_scope[Scope.PROCESS]

@property
def thread_rules(self):
return self.rules_by_scope[Scope.THREAD]

@property
def call_rules(self):
return self.rules_by_scope[Scope.CALL]

@property
def function_rules(self):
return self.rules_by_scope[Scope.FUNCTION]

@property
def basic_block_rules(self):
return self.rules_by_scope[Scope.BASIC_BLOCK]

@property
def instruction_rules(self):
return self.rules_by_scope[Scope.INSTRUCTION]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for backwards compatibility. during a major version, we can probably remove these with preference to rules_by_scope.

@williballenthin

This comment was marked as outdated.

Copy link
Collaborator

@mr-tz mr-tz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work, we should do extensive tests comparing the results before and after to ensure everything works as expected. the speedup looks promising!

capa/rules/__init__.py Outdated Show resolved Hide resolved
capa/rules/__init__.py Outdated Show resolved Hide resolved
@williballenthin
Copy link
Collaborator Author

williballenthin commented May 16, 2024

we should do extensive tests comparing the results before and after to ensure everything works as expected.

I plan to run this implementation side by side with the ceng.match implementation and assert the results are precisely the same across a wide range of samples. There should be no leaks of abstraction or details in the new one, it should just be faster.

@github-actions github-actions bot dismissed their stale review May 22, 2024 13:23

CHANGELOG updated or no update needed, thanks! 😄

capa/rules/__init__.py Outdated Show resolved Hide resolved
capa/rules/__init__.py Outdated Show resolved Hide resolved
capa/rules/__init__.py Outdated Show resolved Hide resolved
Comment on lines +1702 to +1714
string_features = [
feature
for feature in features
if isinstance(feature, (capa.features.common.Substring, capa.features.common.Regex))
]
bytes_features = [feature for feature in features if isinstance(feature, capa.features.common.Bytes)]
hashable_features = [
feature
for feature in features
if not isinstance(
feature, (capa.features.common.Substring, capa.features.common.Regex, capa.features.common.Bytes)
)
]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be optimized? We're looping and calling isinstance on every feature three times.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me try and then run some benchmarks. I agree it looks wasteful, but I'm not sure if it has a real world effect.

capa/rules/__init__.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@mike-hunhoff mike-hunhoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great @williballenthin - I'm pumped about the improved efficiency. The logic and code that you've implemented here appears sound. Let's get this merged pending successful paranoid invocation across a wide range of samples

@mike-hunhoff
Copy link
Collaborator

Should we rebase this on top of master so that it doesn't depend on BinExport2?

I'm inclined to say "yes" although we lose the intermediate history. This would allow us to do a minor release and get the optimizations out there.

Yes let's rebase on master so we can get this to our users ASAP

Copy link
Collaborator

@mr-tz mr-tz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

amazing work! noted a few minor things I've noticed and the paranoid run will provide a lot of value

capa/rules/__init__.py Show resolved Hide resolved
capa/rules/__init__.py Outdated Show resolved Hide resolved
capa/rules/__init__.py Outdated Show resolved Hide resolved
capa/rules/__init__.py Outdated Show resolved Hide resolved
Comment on lines +1874 to +1881
# We may want to try to pre-evaluate these strings, based on their presence in the file,
# to reduce the number of evaluations we do here.
# See: https://github.com/mandiant/capa/issues/2063#issuecomment-2095639672
#
# We may also want to specialize case-insensitive strings, which would enable them to
# be indexed, and therefore skip the scanning here, improving performance.
# This strategy is described here:
# https://github.com/mandiant/capa/issues/2063#issuecomment-2107083068
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add TODOs for these notes?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, and i'll spin off the original issue comments into dedicated issues we can use to track the idea.

capa/rules/__init__.py Outdated Show resolved Hide resolved
williballenthin and others added 2 commits June 3, 2024 21:29
Copy link
Collaborator

@fariss fariss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good improvmenets, I just have question below.


Unrelated to this PR, I think we can replace

b = codecs.decode(s.replace(" ", "").encode("ascii"), "hex")

with:

b = bytes.fromhex(s)

https://docs.python.org/3/library/stdtypes.html#bytes.fromhex

capa/rules/__init__.py Show resolved Hide resolved
capa/rules/__init__.py Outdated Show resolved Hide resolved
@williballenthin
Copy link
Collaborator Author

williballenthin commented Jun 4, 2024

paranoid linting succeeded!

❯ time python scripts/lint.py rules/ --thorough
INFO:lint:collecting potentially referenced samples
                                                                                                                                                                                                     
encrypt data using RC4 via SystemFunction033                                                                                                                                                         
FAIL: referenced example doesn't exist: Add the referenced example to samples directory ($capa-root/tests/data or supplied via --samples)                                                         
                                                                                                                                                                                                        
(nursery)  linked against hp-socket                                                                                                                                                                   
WARN: referenced example doesn't exist: Add the referenced example to samples directory ($capa-root/tests/data or supplied via --samples)                                                                                                                                                                                                                                                         rules with WARN:                                                                                                                                                                                      - linked against hp-socket

rules with FAIL:
  - encrypt data using RC4 via SystemFunction033

________________________________________________________
Executed in  125.20 mins    fish           external
   usr time  124.04 mins   66.00 micros  124.04 mins
   sys time    0.98 mins  898.00 micros    0.98 mins
time
paranoid 125 minutes
master 62 minutes
this PR 44 minutes

So, this improves the performance of capa (with the vivisect backend) by about 30%. When using the BinExport2 backend, I think the performance improvement will be closer to 2-3x, since less time is spent doing analysis.

@mr-tz
Copy link
Collaborator

mr-tz commented Jun 4, 2024

awesome, big performance improvement!

@williballenthin
Copy link
Collaborator Author

new PR that's rebased against master: #2125

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants