Add the ability to select which functions or processes you wish to extract capabilities from #2156

Merged
merged 46 commits into from
Aug 20, 2024

Conversation

yelhamer
Collaborator

This PR adds the ability to select which function/process capa should extract capabilities from. The proposed syntax is as follows:

$ capa malware.exe --functions 0x645fa0,0x543dd0,0x630ac0 # static analysis
$ capa malware.log --processes 3288,4321,3234 # dynamic analysis
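
A minimal sketch of how comma-separated values like these could be parsed, assuming an `argparse`-based command line. The helper `parse_int_set`, the `build_parser` function, and the `capa-sketch` program name are hypothetical and only illustrate the proposed syntax; they are not capa's actual option handling.

```python
import argparse
from typing import Set


def parse_int_set(value: str) -> Set[int]:
    """Parse a comma-separated list such as "0x645fa0,3288" into a set of ints."""
    # int(x, 0) accepts both hexadecimal ("0x...") and plain decimal literals
    return {int(part.strip(), 0) for part in value.split(",") if part.strip()}


def build_parser() -> argparse.ArgumentParser:
    # hypothetical parser: it only mirrors the two options proposed above
    parser = argparse.ArgumentParser(prog="capa-sketch")
    parser.add_argument("input_file")
    parser.add_argument("--functions", type=parse_int_set, default=set())
    parser.add_argument("--processes", type=parse_int_set, default=set())
    return parser


args = build_parser().parse_args(["malware.exe", "--functions", "0x645fa0,0x543dd0"])
print(sorted(hex(address) for address in args.functions))
```

Because `argparse` reports a `ValueError` raised by the `type` callable as a usage error, a malformed address would be rejected before any analysis starts.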

I haven't added a test case for dynamic analysis yet. I plan to do so once the Drakvuf feature extractor is merged (#2143), since that is a major motivation for this PR.

One caveat: I couldn't settle on good names for some of the internal variables, so please feel free to rename them.

Thanks :)

Checklist

  • No CHANGELOG update needed
  • No new tests needed
  • No documentation update needed

@williballenthin
Collaborator

First, I think this is a very reasonable feature to add, especially with the Drakvuf sandbox support! I'm happy that we should be able to remove similar logic in show-features (here) with these changes. It seems like there are multiple places that this would fit in.

There are two things to discuss:

  1. the command line argument syntax, and
  2. the design of the filtering.

For (1), I don't imagine any major problems, though we may want to consider if these arguments are commonly used enough to be shown all the time, or only in an "expert" mode, or suggested when the Drakvuf sandbox is encountered, etc. That being said, let's be careful not to get derailed by tiny details and hold up the merge. If there's debate, the arguments could always be considered "experimental" for a bit until we stabilize them.

For (2), the question is: how do we filter the functions/scopes/features, particularly in a way that enables further extension, if necessary? That is, say we want to also filter on threads or basic blocks, could we build this easily?

The primary place to consider is the function signature to find_capabilities (and related routines). I am concerned about adding too many optional arguments that become difficult to reason about. I'd prefer the signature of this core routine to be as succinct as possible.

An alternative design is to introduce wrappers around the feature extractors that can filter the scopes/features. They would act just like the underlying feature extractor, but would yield only the entries that are requested. Then, the wrapped feature extractor can be passed around as-is today. Notably, we can trivially create wrappers for different scopes/features without threading optional arguments around. Also, they could potentially be combined (though I don't think we're likely to really need this functionality).

For example:

class StaticFeatureExtractorFilter:
    def __init__(self, inner: StaticFeatureExtractor):
        self.inner = inner

    def __getattr__(self, name):
        if hasattr(self, name):
            # if the filter has an override, use that
            return getattr(self, name)
        # otherwise use the inner feature extractor
        return getattr(self.inner, name)

class FunctionFilter(StaticFeatureExtractorFilter):
    def __init__(self, inner: StaticFeatureExtractor, functions: Set[Address]):
        super().__init__(inner)
        self.functions = functions

    def get_functions(self):
        yield from (f for f in self.inner.get_functions() if f.address in self.functions)

Then we can use the filter like so:

wanted_functions: Set[Address] = get_wanted_functions_from_cli(args)
if wanted_functions:
    # if the user wants to only show specific functions, we filter them down
    feature_extractor = FunctionFilter(feature_extractor, wanted_functions)
else:
    # otherwise, we use the full feature extractor
    pass

...

# use either the filtered extractor or the full extractor interchangeably
find_capabilities(feature_extractor, ...)

Of course, we can filter processes/threads/basic blocks/etc. in the same way.
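
As a rough illustration of the same wrapper idea applied to processes, here is a self-contained sketch. `ProcessHandle`, `ToyDynamicExtractor`, and `ProcessFilter` are hypothetical stand-ins written for this example, not capa's real classes, and the wrapper delegates with a plain `__getattr__` rather than subclassing anything.

```python
from dataclasses import dataclass
from typing import Iterator, List, Set


@dataclass(frozen=True)
class ProcessHandle:
    # stand-in for a process handle; only the pid matters in this sketch
    pid: int


class ToyDynamicExtractor:
    """Hypothetical stand-in for a dynamic feature extractor."""

    def __init__(self, pids: List[int]):
        self._pids = pids

    def get_processes(self) -> Iterator[ProcessHandle]:
        for pid in self._pids:
            yield ProcessHandle(pid)

    def get_sample_name(self) -> str:
        return "malware.log"


class ProcessFilter:
    """Wrap an extractor and yield only the requested processes."""

    def __init__(self, inner: ToyDynamicExtractor, pids: Set[int]):
        self.inner = inner
        self.pids = pids

    def __getattr__(self, name):
        # __getattr__ runs only when normal lookup fails, so get_processes
        # below takes precedence and everything else is forwarded to inner
        return getattr(self.inner, name)

    def get_processes(self) -> Iterator[ProcessHandle]:
        yield from (p for p in self.inner.get_processes() if p.pid in self.pids)


extractor = ProcessFilter(ToyDynamicExtractor([3288, 4321, 3234, 999]), {3288, 4321})
print([p.pid for p in extractor.get_processes()])
print(extractor.get_sample_name())
```

Callers see the same interface either way, so code that consumes the extractor does not need to know whether a filter is installed.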

Would you be open to discuss alternative designs like this @yelhamer? Or any thoughts @mr-tz @mike-hunhoff

@yelhamer
Collaborator Author

@williballenthin I really like the design you proposed and I'm willing to implement it. Would the filter classes reside in the base_extractor.py file?

Also, should we make the base "StaticFeatureExtractorFilter" a child of the "StaticFeatureExtractor" class? I ask because I think there are some cases (if I am not mistaken) where we use the instance of the extractor we're passing to determine whether the analysis is static or dynamic.

@williballenthin
Collaborator

I think there are some cases (if I am not mistaken) where we use the instance of the extractor we're passing to determine whether the analysis is static or dynamic.

Good point. Also I think this will better satisfy mypy.

OTOH, I'm not sure that the hasattr checks will work as written (since the base class has empty implementations of these) so that will take some tweaking. Should still be possible.

We should also pass along the inner name appropriately, since I think the metadata structure includes the feature extractor name.
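
To make the concern about attribute lookup concrete, the following sketch shows what happens once the filter subclasses a base class that already defines every method. `BaseExtractor`, `ToyExtractor`, `NaiveFilter`, and `ExplicitFilter` are hypothetical stand-ins, and the base class here simply raises `NotImplementedError`; capa's real base class may differ.

```python
from typing import Iterator, List


class BaseExtractor:
    """Hypothetical stand-in for a base class with default method bodies."""

    def get_functions(self) -> Iterator[int]:
        raise NotImplementedError()

    def get_base_address(self) -> int:
        raise NotImplementedError()


class ToyExtractor(BaseExtractor):
    def __init__(self, functions: List[int]):
        self._functions = functions

    def get_functions(self) -> Iterator[int]:
        yield from self._functions

    def get_base_address(self) -> int:
        return 0x400000


class NaiveFilter(BaseExtractor):
    """Subclasses the base and relies on __getattr__ alone for delegation."""

    def __init__(self, inner: BaseExtractor):
        self.inner = inner

    def __getattr__(self, name):
        return getattr(self.inner, name)


class ExplicitFilter(BaseExtractor):
    """Subclasses the base and forwards each inherited method explicitly."""

    def __init__(self, inner: BaseExtractor):
        self.inner = inner

    def get_functions(self) -> Iterator[int]:
        return self.inner.get_functions()

    def get_base_address(self) -> int:
        return self.inner.get_base_address()


inner = ToyExtractor([0x401000, 0x402000])

try:
    NaiveFilter(inner).get_base_address()
    naive_delegated = True
except NotImplementedError:
    # the inherited stub was found first, so __getattr__ never ran
    naive_delegated = False

print(naive_delegated)
print(hex(ExplicitFilter(inner).get_base_address()))
```

Explicit forwarding is more verbose, but it keeps `isinstance` checks working and avoids silently hitting the base class stubs.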

@yelhamer
Collaborator Author

@williballenthin I have also thought of the following possible implementation using a function factory:

def FunctionFilter(extractor: StaticFeatureExtractor, functions: Set) -> StaticFeatureExtractor:
    from types import MethodType

    get_functions = extractor.get_functions  # fetch original get_functions()

    def filtered_get_functions(self):
        yield from (f for f in get_functions() if f.address in functions)

    extractor.get_functions = MethodType(filtered_get_functions, extractor)
    return extractor

Then we can do as you suggested:

wanted_functions: Set[Address] = get_wanted_functions_from_cli(args)
if wanted_functions:
    # if the user wants to only show specific functions, we filter them down
    feature_extractor = FunctionFilter(feature_extractor, wanted_functions)
else:
    # otherwise, we use the full feature extractor
    pass

...

# use either the filtered extractor or the full extractor interchangeably
find_capabilities(feature_extractor, ...)

Another open question is whether we want to record which filters an extractor has installed. If so, we might simply store the set of desired functions as an attribute on the extractor and reference it internally in the get_functions() routine, without needing any wrapping.
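
A minimal sketch of that attribute-based alternative, assuming a toy extractor that works on plain integer addresses. `ToyExtractor` and its `wanted_functions` attribute are hypothetical names chosen for illustration and are not part of capa.

```python
from typing import Iterator, List, Optional, Set


class ToyExtractor:
    """Hypothetical stand-in for a static feature extractor."""

    def __init__(self, functions: List[int]):
        self._functions = functions
        # None means "no filter installed"; a set restricts get_functions()
        self.wanted_functions: Optional[Set[int]] = None

    def get_functions(self) -> Iterator[int]:
        for address in self._functions:
            if self.wanted_functions is None or address in self.wanted_functions:
                yield address


extractor = ToyExtractor([0x645FA0, 0x543DD0, 0x630AC0])
extractor.wanted_functions = {0x645FA0, 0x630AC0}
print([hex(address) for address in extractor.get_functions()])
```

The installed filter is easy to inspect this way, at the cost that every extractor backend has to honor the attribute itself.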

@williballenthin
Collaborator

I have also thought of the following possible implementation using a function factory

This looks like it would also work. I'm not sure of the pros/cons right now, so perhaps try one of the new implementations and let's see how it feels?

Collaborator

@williballenthin williballenthin left a comment

This implementation looks pretty good! I've left comments inline. Please address the comments tagged 👀 and optionally the remaining ones. We can merge this as soon as tomorrow assuming the points are addressed :-)

@yelhamer
Collaborator Author

@williballenthin I think I have addressed all of the comments. Let me know if there's anything else I need to do :)

Co-authored-by: Willi Ballenthin <[email protected]>
Collaborator

@williballenthin williballenthin left a comment

🥳

@williballenthin
Collaborator

ah, the CLA failure is because GH associated the wrong email with my suggestions. well, it's me, and i'm covered under the CLA.

@williballenthin
Collaborator

let's go!

@williballenthin williballenthin merged commit 791f5e2 into mandiant:master Aug 20, 2024
24 of 25 checks passed