This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Extensions for Mask R-CNN for improved performance #379

Closed

Conversation

seryilmaz

This PR adds C++/CUDA extensions for some functions used by Mask R-CNN (and by some other models as well). The extensions cover mask-target generation for the mask-head loss, box IoU computation, the box encode function, and the proposal matcher.
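For context, the Python side would call into the compiled extension roughly like this (a sketch; _C.box_iou is referenced later in this thread, but the import path and tensor layout are assumptions):

from maskrcnn_benchmark import _C  # compiled C++/CUDA extension

# boxes are Nx4 / Mx4 fp32 CUDA tensors, assumed here in (x1, y1, x2, y2) form
iou_matrix = _C.box_iou(boxlist1.bbox, boxlist2.bbox)  # NxM IoU matrix on GPU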

@facebook-github-bot

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign up at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need the corporate CLA signed.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@fmassa (Contributor) left a comment

Thanks for the PR!

I have an initial question: does this PR work with fp16 on the GPU? I have seen that the kernels use float, but there are no checks for fp32 when dispatching to the kernels (only is_cuda).

@seryilmaz
Author

We keep coordinate and IoU values in fp32 even in mixed-precision training, so these kernels work in our mixed-precision setup. If you cast the coordinates (either bounding-box coordinates or mask polygon coordinates as fetched from the dataset) or the IoU matrix (the input to the proposal matcher) to fp16, some of these kernels won't work.
Do you want us to extend these functions to fp16 inputs?
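For illustration, a dispatch guard along these lines would keep fp16 inputs from reaching the fp32-only kernel (a minimal sketch; python_box_iou and the exact checks are assumptions, only _C.box_iou appears in this PR):

import torch
from maskrcnn_benchmark import _C

def box_iou_dispatch(boxes1, boxes2):
    # The CUDA kernel assumes fp32 coordinates, so anything else
    # falls back to the pure-Python implementation.
    if (boxes1.is_cuda and boxes2.is_cuda
            and boxes1.dtype == torch.float32
            and boxes2.dtype == torch.float32):
        return _C.box_iou(boxes1, boxes2)
    return python_box_iou(boxes1, boxes2)  # hypothetical Python fallback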

@fmassa
Contributor

fmassa commented Jan 25, 2019

@seryilmaz thanks for the explanation!

I think it's OK to leave these kernels fp32-only for now, but it could be a bit confusing for users if they move the IoU values to fp16.

I'll do a pass on the implementation later today and let you know.

@fmassa
Contributor

fmassa commented Jan 31, 2019

I was thinking about a way to be less intrusive to the Python implementation; what about the following:

1 - all the parts of the Python implementation that you have currently wrapped inside a conditional become functions.
2 - each function body contains only the Python implementation for it.
3 - each function has a decorator, which points to the optimized C implementation and dispatches to it if the criteria are satisfied (input types are right, the global flag is activated).

Here is an example implementation:

def criterion(b1, b2):
    global enable_optim
    if not enable_optim:
        return False
    if b1.bbox.is_cuda and b2.bbox.is_cuda:
        return True
    return False

@optimized_codepath(
    _C.box_iou,
    lambda boxlist1, boxlist2: (boxlist1.bbox, boxlist2.bbox),  # unwrap the tensor args
    criterion)
def boxlist_iou(boxlist1, boxlist2):
    # all the python stays here
    return iou

The basic idea is that if the criterion is satisfied for the inputs (and a global flag that enables/disables the optimizations), the optimized implementation is used.
Otherwise, the pure-Python implementation is used.

This has the benefit of reducing the number of places where users need to change something to disable the optimizations, and it keeps a clean separation between the two implementations.
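A minimal sketch of what the decorator itself could look like under this scheme (untested; unwrap_args mirrors the lambda in the example above):

import functools

def optimized_codepath(opt_fn, unwrap_args, criterion):
    # If criterion(*args) holds, call the optimized C/CUDA function on
    # the unwrapped tensor arguments; otherwise run the Python body.
    def decorator(py_fn):
        @functools.wraps(py_fn)
        def wrapper(*args):
            if criterion(*args):
                return opt_fn(*unwrap_args(*args))
            return py_fn(*args)
        return wrapper
    return decorator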

Thoughts?

@slayton58
Contributor

That could work; I'd be a little concerned about its verbosity as the number of arguments increases, though.

@botcs
Contributor

botcs commented Feb 25, 2019

Hi @slayton58,

This PR seems really valuable, since the CPU cropping / resizing is causing some serious overhead.
Are you planning to address @fmassa's review?

Thank you :)

@slayton58
Contributor

@botcs Well, this is @seryilmaz's PR, so...

FWIW, I feel that a concrete implementation framework to handle this kind of optimisation would be a big help.

@botcs
Contributor

botcs commented Feb 25, 2019

@slayton58 oh sorry, I forgot to mention @seryilmaz as well.

BTW, what do you mean by:

a concrete implementation framework

@slayton58
Contributor

We have a concept from Francisco above for a framework to handle these optimizations, but no actual implementation to work from.

@zimenglan-sysu-512
Contributor

Hi @seryilmaz,
Curious, what's the current state of this PR?

@botcs closed this Sep 17, 2019
@gottbrath

@fmassa -- I got a ping on this the other day from NVIDIA. Can you look at this again? It looks like @slayton58 didn't think he had enough detail from your comments above to carry this forward.

@fmassa
Contributor

fmassa commented Oct 21, 2019

@gottbrath @slayton58 maskrcnn-benchmark is now deprecated in favor of detectron2. I'd recommend opening an issue to discuss those optimizations in detectron2.
