Add the ability to select which functions or processes you wish to extract capabilities from #2156

yelhamer · 2024-06-19T01:33:36Z

This PR adds the ability to select which function/process capa should extract capabilities from. The proposed syntax is as follows:

$ capa malware.exe --functions 0x645fa0,0x543dd0,0x630ac0 # static analysis
$ capa malware.log --processes 3288,4321,3234 # dynamic analysis
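
As a rough illustration of the proposed syntax, the comma-separated values could be parsed along the lines below. This is only a sketch: the helper names (`parse_address_list`, `parse_pid_list`, `build_parser`), the `capa-sketch` program name, and the choice to make the two flags mutually exclusive are assumptions made for illustration, not capa's actual argument handling.

```python
import argparse
from typing import List


def parse_address_list(value: str) -> List[int]:
    # hypothetical helper: "0x645fa0,0x543dd0" -> [0x645fa0, 0x543dd0]
    # int(text, 0) accepts both 0x-prefixed hex and plain decimal text
    return [int(part.strip(), 0) for part in value.split(",") if part.strip()]


def parse_pid_list(value: str) -> List[int]:
    # hypothetical helper: "3288,4321" -> [3288, 4321]
    return [int(part.strip()) for part in value.split(",") if part.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="capa-sketch")
    parser.add_argument("input_file")
    # assumption: a single run filters either functions (static) or processes (dynamic)
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--functions", type=parse_address_list, default=None)
    group.add_argument("--processes", type=parse_pid_list, default=None)
    return parser


args = build_parser().parse_args(["malware.exe", "--functions", "0x645fa0,0x543dd0,0x630ac0"])
print([hex(address) for address in args.functions])
```

Leaving both options at `None` by default makes it easy to distinguish "no filter requested" from "an empty filter was requested".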

I haven't added a test case for dynamic analysis. I am planning to do so once the Drakvuf feature extractor gets merged (#2143), since that's a big motivation for this PR.

Note: I couldn't find the right names for some of the internal variables, so please feel free to rename them as you wish.

Thanks :)

Checklist

No CHANGELOG update needed

No new tests needed

No documentation update needed

williballenthin · 2024-06-19T09:40:13Z

First, I think this is a very reasonable feature to add, especially with the Drakvuf sandbox support! I'm glad that these changes should let us remove similar logic in show-features. It seems like there are multiple places that this would fit in.

There are two things to discuss:

(1) the command line argument syntax, and
(2) the design of the filtering.

For (1), I don't imagine any major problems, though we may want to consider if these arguments are commonly used enough to be shown all the time, or only in an "expert" mode, or suggested when the Drakvuf sandbox is encountered, etc. That being said, let's be careful not to get derailed by tiny details and hold up the merge. If there's debate, the arguments could always be considered "experimental" for a bit until we stabilize them.

For (2), the question is: how do we filter the functions/scopes/features, particularly in a way that enables further extension, if necessary? That is, say we want to also filter on threads or basic blocks, could we build this easily?

The primary place to consider is the function signature to find_capabilities (and related routines). I am concerned about adding too many optional arguments that become difficult to reason about. I'd prefer the signature of this core routine to be as succinct as possible.

An alternative design is to introduce wrappers around the feature extractors that can filter the scopes/features. They would act just like the underlying feature extractor, but would yield only the entries that are requested. Then, the wrapped feature extractor can be passed around as-is today. Notably, we can trivially create wrappers for different scopes/features without threading optional arguments around. Also, they could potentially be combined (though I don't think we're likely to really need this functionality).

For example:

from typing import Set

from capa.features.address import Address
from capa.features.extractors.base_extractor import StaticFeatureExtractor


class StaticFeatureExtractorFilter:
    def __init__(self, inner: StaticFeatureExtractor):
        self.inner = inner

    def __getattr__(self, name):
        # __getattr__ is only invoked when normal lookup fails, so any override
        # defined on the filter already takes precedence. Calling hasattr(self, name)
        # here would recurse, so simply fall back to the inner feature extractor.
        return getattr(self.inner, name)


class FunctionFilter(StaticFeatureExtractorFilter):
    def __init__(self, inner: StaticFeatureExtractor, functions: Set[Address]):
        super().__init__(inner)
        self.functions = functions

    def get_functions(self):
        # yield only the functions whose address was requested
        yield from (f for f in self.inner.get_functions() if f.address in self.functions)

Then we can use the filter like so:

wanted_functions: Set[Address] = get_wanted_functions_from_cli(args)
if wanted_functions:
    # if the user wants to only show specific functions, we filter them down
    feature_extractor = FunctionFilter(feature_extractor, wanted_functions)
else:
    # otherwise, we use the full feature extractor
    pass

...

# use either the filtered extractor or the full extractor interchangeably
find_capabilities(feature_extractor, ...)

Of course, we can filter processes/threads/basic blocks/etc. in the same way.
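
To make the "same way" concrete for processes, here is a minimal, self-contained sketch. `ProcessHandle` and `ToyDynamicExtractor` are hypothetical stand-ins invented for this example (they are not capa's classes), and the `pid` attribute is an assumption about what a process handle exposes.

```python
from dataclasses import dataclass
from typing import Iterator, List, Set


@dataclass(frozen=True)
class ProcessHandle:
    # hypothetical stand-in for a process handle; only the pid matters here
    pid: int


class ToyDynamicExtractor:
    # hypothetical stand-in for a dynamic feature extractor
    def __init__(self, pids: List[int]):
        self._pids = pids

    def get_processes(self) -> Iterator[ProcessHandle]:
        for pid in self._pids:
            yield ProcessHandle(pid)

    def get_sample_name(self) -> str:
        return "malware.log"


class ProcessFilter:
    def __init__(self, inner: ToyDynamicExtractor, pids: Set[int]):
        self.inner = inner
        self.pids = pids

    def __getattr__(self, name):
        # only reached when the filter itself lacks the attribute
        return getattr(self.inner, name)

    def get_processes(self) -> Iterator[ProcessHandle]:
        # yield only the processes whose pid was requested
        yield from (p for p in self.inner.get_processes() if p.pid in self.pids)


extractor = ProcessFilter(ToyDynamicExtractor([3288, 4321, 3234, 999]), {3288, 4321})
print([p.pid for p in extractor.get_processes()])  # [3288, 4321]
print(extractor.get_sample_name())  # forwarded to the inner extractor
```

Only `get_processes()` is overridden; every other call falls through to the wrapped extractor, which is what lets the wrapper be passed around in place of the original.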

Would you be open to discussing alternative designs like this, @yelhamer? Any thoughts, @mr-tz @mike-hunhoff?

yelhamer · 2024-06-19T14:22:40Z

@williballenthin I really like the design you proposed and I'm willing to implement it. Would the filter classes reside in the base_extractor.py file?

Also, should we make the base "StaticFeatureExtractorFilter" a child of the "StaticFeatureExtractor" class? I think there are some cases (if I am not mistaken) where we use the instance of the extractor we're passing to determine whether the analysis is static or dynamic.

williballenthin · 2024-06-19T18:28:00Z

> I think there are some cases (if I am not mistaken) where we use the instance of the extractor we're passing to determine whether the analysis is static or dynamic.

Good point. Also I think this will better satisfy mypy.

OTOH, I'm not sure that the hasattr checks will work as written (since the base class has empty implementations of these) so that will take some tweaking. Should still be possible.

We should also pass along the inner name appropriately, since I think the metadata structure includes the feature extractor name.
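
The points above can be sketched with toy classes. Everything here is a hypothetical stand-in: `BaseExtractor` plays the role of the base extractor class, integers stand in for addresses, and the `inner_name` property is an invented way of exposing the wrapped extractor's name. The sketch shows that once the filter subclasses the base, `isinstance` checks pass, but methods already defined on the base are never routed through `__getattr__` and so need explicit forwarding.

```python
from typing import Iterator, List, Set


class BaseExtractor:
    # hypothetical base class; its methods exist but are not implemented
    def get_functions(self) -> Iterator[int]:
        raise NotImplementedError

    def get_base_address(self) -> int:
        raise NotImplementedError


class ConcreteExtractor(BaseExtractor):
    def __init__(self, functions: List[int]):
        self._functions = functions

    def get_functions(self) -> Iterator[int]:
        yield from self._functions

    def get_base_address(self) -> int:
        return 0x400000


class NaiveFilter(BaseExtractor):
    # relies on __getattr__ alone, which is not enough once we subclass the base
    def __init__(self, inner: BaseExtractor):
        self.inner = inner

    def __getattr__(self, name):
        return getattr(self.inner, name)


class FunctionFilter(BaseExtractor):
    def __init__(self, inner: BaseExtractor, functions: Set[int]):
        self.inner = inner
        self.functions = functions

    @property
    def inner_name(self) -> str:
        # hypothetical: report the wrapped extractor's class name
        return type(self.inner).__name__

    def get_functions(self) -> Iterator[int]:
        yield from (f for f in self.inner.get_functions() if f in self.functions)

    def get_base_address(self) -> int:
        # forwarded explicitly because the base class already defines this method
        return self.inner.get_base_address()


inner = ConcreteExtractor([0x10, 0x20, 0x30])
filtered = FunctionFilter(inner, {0x20})
print(isinstance(filtered, BaseExtractor))  # True
print(list(filtered.get_functions()))  # [32]
print(filtered.inner_name)  # ConcreteExtractor

try:
    NaiveFilter(inner).get_base_address()
except NotImplementedError:
    print("base-class method shadowed the __getattr__ fallback")
```

The trade-off is more boilerplate per forwarded method in exchange for type checks and `isinstance` tests that behave as callers expect.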

yelhamer · 2024-06-20T03:19:04Z

@williballenthin I have also thought of the following possible implementation using a function factory:

from types import MethodType
from typing import Set


def FunctionFilter(extractor: StaticFeatureExtractor, functions: Set[Address]) -> StaticFeatureExtractor:
    # keep a reference to the original bound get_functions()
    get_functions = extractor.get_functions

    def filtered_get_functions(self):
        # yield only the functions whose address was requested
        yield from (f for f in get_functions() if f.address in functions)

    # note: this patches the passed-in extractor in place and returns the same object
    extractor.get_functions = MethodType(filtered_get_functions, extractor)
    return extractor

Then we can do as you suggested:

wanted_functions: Set[Address] = get_wanted_functions_from_cli(args)
if wanted_functions:
    # if the user wants to only show specific functions, we filter them down
    feature_extractor = FunctionFilter(feature_extractor, wanted_functions)
else:
    # otherwise, we use the full feature extractor
    pass

...

# use either the filtered extractor or the full extractor interchangeably
find_capabilities(feature_extractor, ...)
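
One property of the factory approach worth keeping in mind is that it patches the extractor it is given rather than wrapping it. The sketch below uses a hypothetical `ToyExtractor` with integer addresses and a lowercase `function_filter` helper (both invented for this example) to show the effect, and how filtering a shallow copy leaves a shared extractor untouched.

```python
import copy
from types import MethodType
from typing import Iterator, List, Set


class ToyExtractor:
    # hypothetical stand-in for a static feature extractor
    def __init__(self, functions: List[int]):
        self._functions = functions

    def get_functions(self) -> Iterator[int]:
        yield from self._functions


def function_filter(extractor: ToyExtractor, functions: Set[int]) -> ToyExtractor:
    original = extractor.get_functions  # bound method of the object being patched

    def filtered(self) -> Iterator[int]:
        yield from (f for f in original() if f in functions)

    # replaces the method on this instance only, in place
    extractor.get_functions = MethodType(filtered, extractor)
    return extractor


shared = ToyExtractor([0x10, 0x20, 0x30])

# filtering a copy leaves the shared extractor as it was
filtered_copy = function_filter(copy.copy(shared), {0x20})
print(list(filtered_copy.get_functions()))  # [32]
print(list(shared.get_functions()))  # [16, 32, 48]

# filtering the shared object directly changes it for every later user
patched = function_filter(shared, {0x10})
print(patched is shared)  # True
print(list(shared.get_functions()))  # [16]
```

This matters mostly where one extractor instance is reused, for example a fixture shared between tests.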

Another remaining question is whether we want to record which filters an extractor has installed. If so, we might consider storing the set of desired functions as an attribute on the extractor and referencing it internally in the get_functions() routine (without needing any wrapping).
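
A minimal sketch of that attribute-based alternative follows, assuming a hypothetical `ToyExtractor` and an invented `wanted_functions` attribute where `None` means "no filter installed". It is meant only to show the shape of the idea, not how any existing extractor is written.

```python
from typing import Iterator, List, Optional, Set


class ToyExtractor:
    # hypothetical stand-in for a static feature extractor
    def __init__(self, functions: List[int]):
        self._functions = functions
        # None means no filter is installed; a set restricts the results
        self.wanted_functions: Optional[Set[int]] = None

    def get_functions(self) -> Iterator[int]:
        for function in self._functions:
            if self.wanted_functions is None or function in self.wanted_functions:
                yield function


extractor = ToyExtractor([0x10, 0x20, 0x30])
print(list(extractor.get_functions()))  # [16, 32, 48]

extractor.wanted_functions = {0x10, 0x30}
print(list(extractor.get_functions()))  # [16, 48]
print(extractor.wanted_functions is not None)  # the installed filter is inspectable
```

The installed filter is trivially discoverable this way, at the cost of every extractor implementation having to honor the attribute itself.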

williballenthin · 2024-06-20T07:12:05Z

> I have also thought of the following possible implementation using a function factory

This looks like it would also work. I'm not sure of the pros/cons right now, so perhaps try one of the new implementations and let's see how it feels?

williballenthin

This implementation looks pretty good! I've left comments inline. Please address the comments tagged 👀 and optionally the remaining ones. We can merge this as soon as tomorrow assuming the points are addressed :-)

yelhamer · 2024-08-20T05:55:24Z

@williballenthin I think I have addressed all of the comments. Let me know if there's anything else I need to do :)

williballenthin

🥳

williballenthin · 2024-08-20T07:25:16Z

ah, the CLA failure is because GH associated the wrong email with my suggestions. well, it's me, and i'm covered under the CLA.

williballenthin · 2024-08-20T12:09:36Z

let's go!

yelhamer added 4 commits June 19, 2024 01:47

initial commit

38c6623

test_capabilities.py: add tests

154afe1

CHANGELOG.md: update changelog

1ae174b

usage.md: updated documentation

acd69a3

yelhamer mentioned this pull request Jun 19, 2024

Add a Feature Extractor for the Drakvuf Sandbox #2143

Merged

3 tasks

yelhamer added 6 commits June 19, 2024 02:55

main.py: use input_format instead of file_extractors to determine analysis flavor

3aaae2e

fix linting

f7c43e9

apply flake8 suggestions

b7e345d

main.py: Use Optional typehint

8c8321b

main.py: bugfix for return instead of raise

1642e7e

main.py: add errorcode for invalid input format

090ade5

yelhamer added 3 commits June 20, 2024 08:40

Function/Process filtering: use a function to filter

8e8e0ec

Function/Process filtering: ignore mypy errors for method reassignment

1d52600

function/proc filtering tests: use a copy of the extractor in order to not interfer with following tests

d78272f

yelhamer requested a review from williballenthin June 20, 2024 08:45

williballenthin reviewed Jun 20, 2024

tests/test_capabilities.py (outdated, resolved)

mr-tz reviewed Jun 20, 2024

capa/main.py (outdated, resolved)

yelhamer added 6 commits June 21, 2024 06:47

Extractor Filters: wrap classes and overwrite __class__ instead of using function factories

e3071f8

Extractor Filters: fix mypy errors

c2058bf

function/proc filtering: overwrite __instancecheck__() for extractor filters

c54bafc

base_extractor: update FeatureExtractor type to include filters

fe9f332

capa/loader.py: update assert_never() for mypy

b329f3f

capa/loader.py: use tuple in isinstance() for flake8

1a79591

yelhamer requested review from williballenthin and mr-tz June 21, 2024 07:32

williballenthin requested changes Aug 19, 2024

yelhamer and others added 14 commits August 19, 2024 19:17

Update CHANGELOG.md: typo

1168996

Co-authored-by: Willi Ballenthin

Update capa/main.py

2f00b7f

Co-authored-by: Willi Ballenthin

Update capa/main.py

79f3097

Co-authored-by: Willi Ballenthin

Update doc/usage.md

9ce2a3c

Co-authored-by: Willi Ballenthin

Update doc/usage.md

28e274f

Co-authored-by: Willi Ballenthin

update changelog

fa61273

Update capa/features/extractors/base_extractor.py

0640ba9

Co-authored-by: Willi Ballenthin

base_extractor.py: rename variable

b693aa0

base_extractor.py: update comments

10a26a8

main.py: add FilterConfig type

b0d8071

main.py: add asserts for checking filters are not empty

ac50103

main.py: remove unused Set import

a194a13

main.py: move filters extractor into get_extractor_from_cli() routine

e80f474

doc/usage.md: update usage according to reviews

88d9d67

yelhamer requested a review from williballenthin August 20, 2024 05:27