Non-Maximum Supression on the GPU #392

ghost · 2018-01-17T22:32:37Z

Is there any interest in an NMS layer that runs on the GPU for torchvision? I have one implemented; it gives a 1-2 order of magnitude speedup over a naive version composed from pytorch ops. Would be happy to contribute it if anyone's interested.

soumith · 2018-01-17T22:38:30Z

that's definitely interesting. We should add it into pytorch, until we figure out ATen extensions.
For example like how we added ROIPooling: pytorch/pytorch#3672

ghost · 2018-01-17T22:46:21Z

Great, I'll read over that PR for starters.

ghost · 2018-02-20T22:31:26Z

This is all ready now. Where in the Python API would you like it go? I actually have two implementations ready; one is quicker in principle but it requires allocating global memory to store intermediate results, and freeing the memory at the end of each call completely kills performance. Is there anything one can do about this? If not, the other implementation (which doesn't allocate any intermediate global memory) is still fine and gives the kind of speedup I mentioned anyway, so it's no big deal.

aosokin · 2018-03-13T09:01:43Z

Hi, is this going to be added soon? (I'm asking since pytorch/pytorch/#5404 is closed now)

fmassa · 2018-03-14T14:12:09Z

Yes, I think we should add efficient NMS in here, following the Cpp extensions that were recently added.
Because Cpp extensions are only present in master, I'd wait until 0.4.0 is released before merging any C++ code into torchvision.
How does that sound?

aosokin · 2018-03-14T14:14:06Z

@fmassa That's fine anyway.
Btw, is 0.4 coming soon? :-)

fmassa · 2018-03-14T14:14:57Z

Yes, once pytorch/pytorch#5673 is addressed

aosokin · 2018-03-14T14:16:04Z

Thanks!

ruotianluo · 2018-04-25T23:06:19Z

Whats the current status of this issue?

fmassa · 2018-04-26T09:01:29Z

@ruotianluo I'm drafting the layers part of torchvision in https://github.com/pytorch/vision/tree/layers
I've added a CPU version of NMS, and I'll look into porting one of the following versions: from py-faster-rcnn, from this PR and a few others.
Have you tried any of those?

ruotianluo · 2018-04-26T22:08:49Z

When I build my faster-rcnn project, I ported the nms from py-faster-rcnn; the main reason is I'm not familiar with cuda code and it's simpler to just copy.
With regard to https://github.com/pytorch/pytorch/pull/5404/files, it's using a full byte to save a boolean value(mask), while py-faster-rcnn uses each bit. (I can't comment on anything else, 'cause as I said I'm not familiar.)

fmassa · 2018-04-27T09:43:38Z

Thanks, I'll take that into account when porting NMS to GPU.

ghost · 2018-04-27T10:34:54Z

On the other hand, the py-faster-rcnn version throws data back and fourth between GPU and CPU, which may not be ideal. Also it requires a workspace of size O(n_boxes^2), and dynamically allocating/freeing this messes with performance.

fmassa · 2018-04-27T12:38:50Z

We can use the PyTorch caching allocator to handle memory allocation (which will save us from the sync points from freeing the memory).
I'll have a closer look at both implementations this weekend, but thanks for the feedbacks!

ghost · 2018-04-27T13:08:14Z

Cool, in that case you might also be interested in the implementation here, https://github.com/dssa56/projects/tree/master/nms under cuda_nms_workspace.cu. That uses the same technique as py-faster-rcnn, but keeps all data on the GPU. I didn't use it in the PR because of the workspace, but if that's a non-issue then it's potentially worth a look.

ahkarami · 2018-07-01T12:14:34Z

Has the current version of Torchvision the GPU-based NMS?

fmassa · 2018-07-03T09:08:59Z

No, it doesn't.
I have it implemented locally (based on the implementations that were shared previously), but I haven't pushed it yet.

ahkarami · 2018-07-03T17:05:30Z

@fmassa,
Ok. Thank you for your reply. I hope the PyTorch team push it as soon as possible.

Sucran · 2018-07-27T02:27:30Z

@fmassa Can I do you a favor if you need for implementing the GPU-based NMS ? I'm also working on it. I'm also using the GPU-based RoI align and RoI pool from your torchvision branch.

fmassa · 2018-07-27T04:30:36Z

I have a version of it that I adapted from https://github.com/rbgirshick/py-faster-rcnn/blob/master/lib/nms/nms_kernel.cu
But because pytorch is so rapidly evolving, it has already been broken by pytorch refactoring at least 3 times.
I'll hold on before pushing it to torchvision because I expect those breakage to still happen.
I can push the current version I have in a gist though, if you want to try it out

Sucran · 2018-07-27T07:06:04Z

@fmassa Thank you. Actually, I prefer to try it out, then consider how to fix the breakage in my production env. Looking forward to the gist version.

fmassa · 2018-07-27T18:56:56Z

@Sucran here is the implementation https://gist.github.com/fmassa/cf9ab87e4bd71e849655d5abb8cfd6c6
It was working, but I'm not sure anymore.
Also, it required PyTorch compiled from source

fmassa · 2018-10-25T12:36:15Z

FYI, we have released our implementation of {Faster, Mask} R-CNN in https://github.com/facebookresearch/maskrcnn-benchmark , which contains a CPU and CUDA implementation for Non Maximum suppression.

I suggest we move this discussion there for now.

0phoff · 2018-10-25T20:16:12Z

@fmassa, Why close this issue though?
Many people use torchvision, but do not care about or need the maskrcnn repository.
I still think having a pure cuda implementation of NMS in torch/torchvision would be highly appreciated by lots of people!

fmassa · 2018-10-25T20:43:39Z

I'm curious in which cases do people use NMS outside of object detection?
But at some point we might indeed add layers to torchvision, it's just that packaging torchvision with those is a bit more involved so I'll let it for when the Cpp extensions API is more stabilized.

But I'm opening it anyway as a reminder

fmassa · 2019-05-07T18:14:36Z

#826 has been merged and adds support for NMS on both CPU and GPU

neighthan · 2019-07-11T18:54:22Z

FYI, this is located at torchvision.ops.nms.

neighthan · 2019-07-11T19:14:42Z

It would be nice to mention the format for the boxes in the docstring, though. Is it [y1, x1, y2, x2] for some diagonal corners, as in TensorFlow? Or perhaps just top-left and bottom-right corners?

varunagrawal · 2019-07-11T20:16:17Z

It should be [x1, y1, x2, y2] but adding this to the docstring makes a lot of sense.

varunagrawal · 2019-07-11T20:40:52Z

@fmassa @neighthan I've made the changes in #1110. It should be quick to review since it is mostly doc changes.

dashesy · 2019-08-01T21:34:46Z

@neighthan torchvision.ops.nms seems to also support CUDA? the only issue is that it does not support batch numbers, that would be another dimension to benefit from CUDA parallelisation.
I have this nms implementation that does and I use it to apply NMS on multiple images at once, each independently.

dishank-b · 2019-10-23T00:49:26Z

@varunagrawal @fmassa Can anyone help me what should be used now to run NMS (cuda) in batch mode while training? The NMS implementation provided by torchvision.ops is not in Batch mode and will be slow to use at training time.

fmassa · 2019-10-23T14:39:39Z

@dishank-b what do you mean by batch mode?

Also note that we have a batched_nms implementation in torch.ops.batched_nms.

dishank-b · 2019-10-23T22:02:36Z

@fmassa By batch mode, I mean taking input in the form of [batch_size, N, 4]. The current torchvision.ops.nms takes the boxes in form [N,4] where N is the number of boxes in single Image. Whereas let's say during training Faster RCNN if your batch_size is 16, then we would like the NMS to run for those 16 images in a batch. Although there is a problem that N may not be the same for all 16 images.

Also, the torch.ops.batched_nms is different, it for the case when there is more than 1 classes to in the image.

dashesy · 2019-10-23T22:10:04Z

ONNX implementation of NMS is generic enough, I wish torchvision had something similar.
There is the batched-nms as described by @dishank-b and there is per-class NMS that is what torch.ops.batched_nms seems to do. ONNX implementation supports both cases using one API.

fmassa · 2019-10-24T08:35:39Z

@dishank-b @dashesy torchvision.ops.batched_nms can be used to perform NMS in a batch, but it doesn't matter if it is per class or per image actually. It just performs NMS independently "per category", which can mean image, class, etc.

The limitation with having NMS taking [batch_size, N, 4] inputs is that you need to set a max number of boxes, and pad the images that do not have enough boxes with something (like zeros). This is actually a subset of what torchvision.ops.batched_nms can do. Let me show you an example of how to implement this NMS with torchvision.ops.batched_nms:

def batched_nms(boxes, scores, iou_threshold):
    # boxes is a [batch_size, N, 4] tensor, and scores a
    # [batch_size, N] tensor.
    batch_size, N, _ = boxes.shape
    indices = torch.arange(batch_size, device=boxes.device)
    indices = indices[:, None].expand(batch_size, N).flatten()
    boxes_flat = boxes.flatten(0, 1)
    scores_flat = scores.flatten()
    indices_flat = torchvision.ops.boxes.batched_nms(
        boxes_flat, scores_flat, indices, iou_threshold)
    # now reshape the indices as you want, maybe
    # projecting back to the [batch_size, N] space
    # I'm omitting this here
    indices = indices_flat
    return indices

Let me know if you have questions

dishank-b · 2019-10-24T22:04:28Z

@fmassa Thanks, this seems to work with batch either in images or class. However what if we have both the cases i.e do the image batching as well as classes?
I can think of a possible solution that we create indices = torch.arange(batch_size*classes) where each element of the indices will represent each class in each image of the batch.
But this seems quite naive to me, do you have any other way of doing the same?

dashesy · 2019-10-24T22:09:58Z

@dishank-b in that case you can squash the 2 dimensions together with view

boxes = boxes.view(-1, N, 4)

Since they are independent, it would not matter if they are from different batch, or different class then.

Still would be nice if nms had a max_boxes_per_batch parameter to avoid the zero padding.

dishank-b · 2019-10-24T22:14:55Z

@dashesy I am not sure that would work because let's say there are two boxes in two different images but of the same category, now if they are overlapping then they would be merged using above method, which should not be the case as they are from different images.

Let me know if I am wrong somewhere.

dashesy · 2019-10-24T22:23:43Z

I think they will not be merged.
N is the maximum number of boxes per-image-per-class. Now lets say you have 2 classes and 3 images then you have boxes with shape 3x2xNx4 after doing the view you will have 6xNx4 and first dimension will be done independently.

dishank-b · 2019-10-25T04:36:46Z

@dashesy Yeah, you are right, I get it. I was kind of doing the same but using the indices. Thanks your's way is easier.

fmassa · 2019-10-25T09:55:07Z

@dashesy

Still would be nice if nms had a max_boxes_per_batch parameter to avoid the zero padding.

This makes a few assumptions though, which might not always be desired. We need to either remove some boxes via a criterion (low scores? small boxes? something else?) or pad with zeros to the maximum size, which is not very convenient.

I considered those cases when implementing batched_nms and Faster R-CNN, and I came up with what we currently have, which seems like a good compromise on flexibility and ease-of-use.

MarkusAmann · 2019-11-12T09:49:23Z

@fmassa Is or will there be any implementation of Soft-NMS in pytorch too?

fmassa · 2019-11-13T18:41:32Z

@fortunex3000

Is or will there be any implementation of Soft-NMS in pytorch too?

it's not in the plans for torchvision in the near future

varunagrawal · 2019-11-13T19:11:00Z

Soft-NMS is pretty straightforward. I might add it once academic winter break starts.

rmcavoy · 2019-11-24T19:44:32Z

Is the documentation/code for torchvision.ops.boxes.batched_nms correct? The documentation suggests that the nms iou threshold discards boxes with iou < iou threshold (i.e. it only discards boxes that don't overlap which would normally be considered unique). My own testing of the function showed that the function does give fewer boxes with a smaller threshold (0.0 gave 1 box per image) and gives all boxes back with a threshold of 1.0. Is there something I am not understanding about the implementation or the documentation? This seems to be a non-standard choice as you should want to keep boxes with a small iou (i.e. less overlap and more likely to be non unique) and not discard them. See the soft-nms paper Figure 2 comparing NMS and soft-NMS for how it seems to me like it should behave. http://www.cs.umd.edu/~bharat/snms.pdf

fmassa · 2019-11-26T10:55:42Z

@rmcavoy there is an error in the documentation.

vision/torchvision/ops/boxes.py

Line 27 in c05da2a

boxes with IoU < iou_threshold

should read

        boxes with IoU > iou_threshold

I've sent a PR fixing the documentation in #1614, thanks!

Edwardmark · 2020-01-07T02:24:39Z

@fmassa how can I flatten back the indices_flat to [batch_size, N] if each image has different number of valid boxes after nms.

fmassa · 2020-01-08T10:17:33Z

@Edwardmark can you open a new issue describing what you are trying to do?

dishank-b · 2020-01-08T20:57:54Z

@dishank-b @dashesy torchvision.ops.batched_nms can be used to perform NMS in a batch, but it doesn't matter if it is per class or per image actually. It just performs NMS independently "per category", which can mean image, class, etc.

The limitation with having NMS taking [batch_size, N, 4] inputs is that you need to set a max number of boxes, and pad the images that do not have enough boxes with something (like zeros). This is actually a subset of what torchvision.ops.batched_nms can do. Let me show you an example of how to implement this NMS with torchvision.ops.batched_nms:
def batched_nms(boxes, scores, iou_threshold):
    # boxes is a [batch_size, N, 4] tensor, and scores a
    # [batch_size, N] tensor.
    batch_size, N, _ = boxes.shape
    indices = torch.arange(batch_size, device=boxes.device)
    indices = indices[:, None].expand(batch_size, N).flatten()
    boxes_flat = boxes.flatten(0, 1)
    scores_flat = scores.flatten()
    indices_flat = torchvision.ops.boxes.batched_nms(
        boxes_flat, scores_flat, indices, iou_threshold)
    # now reshape the indices as you want, maybe
    # projecting back to the [batch_size, N] space
    # I'm omitting this here
    indices = indices_flat
    return indices
Let me know if you have questions

@fmassa, I think @Edwardmark wants to ask how to unflatten the indices in above reply, so that you get the detections for each image, given you don't know the number of final detections in each image.

Edwardmark · 2020-01-09T08:50:04Z

@fmassa @dishank-b Yes, and I figure it out. We can unflatten the indices based on the returned indices range. For example, indices within range(0, N) belongs to img1, and indices within range(N, 2N) belongs to img2. Thank you all very much.

lamhoangtung · 2021-02-28T03:43:57Z

@Edwardmark. How did you unflatten the indices efficiently, without looping over each returned indices value and check if in range ... ?

ghost mentioned this issue Feb 25, 2018

Non Maximum Suppression pytorch/pytorch#5404

Closed

fmassa added enhancement help wanted labels Apr 20, 2018

fmassa closed this as completed Oct 25, 2018

fmassa reopened this Oct 25, 2018

fmassa closed this as completed May 7, 2019

fmassa mentioned this issue Nov 26, 2019

Fix documentation for NMS #1614

Merged

NicolasHug mentioned this issue Jan 29, 2024

RuntimeError: "nms_kernel" not implemented for 'BFloat16' #8241

Closed

Non-Maximum Supression on the GPU #392

Non-Maximum Supression on the GPU #392

Comments

ghost commented Jan 17, 2018

soumith commented Jan 17, 2018

ghost commented Jan 17, 2018

ghost commented Feb 20, 2018

aosokin commented Mar 13, 2018 • edited Loading

fmassa commented Mar 14, 2018

aosokin commented Mar 14, 2018

fmassa commented Mar 14, 2018

aosokin commented Mar 14, 2018

ruotianluo commented Apr 25, 2018

fmassa commented Apr 26, 2018

ruotianluo commented Apr 26, 2018 • edited Loading

fmassa commented Apr 27, 2018

ghost commented Apr 27, 2018 via email • edited by ghost Loading

fmassa commented Apr 27, 2018

ghost commented Apr 27, 2018

ahkarami commented Jul 1, 2018

fmassa commented Jul 3, 2018

ahkarami commented Jul 3, 2018

Sucran commented Jul 27, 2018

fmassa commented Jul 27, 2018

Sucran commented Jul 27, 2018 • edited Loading

fmassa commented Jul 27, 2018 • edited Loading

fmassa commented Oct 25, 2018

0phoff commented Oct 25, 2018 • edited Loading

fmassa commented Oct 25, 2018

fmassa commented May 7, 2019

neighthan commented Jul 11, 2019

neighthan commented Jul 11, 2019

varunagrawal commented Jul 11, 2019

varunagrawal commented Jul 11, 2019

dashesy commented Aug 1, 2019 • edited Loading

dishank-b commented Oct 23, 2019

fmassa commented Oct 23, 2019

dishank-b commented Oct 23, 2019

dashesy commented Oct 23, 2019

fmassa commented Oct 24, 2019

dishank-b commented Oct 24, 2019

dashesy commented Oct 24, 2019

dishank-b commented Oct 24, 2019

dashesy commented Oct 24, 2019 • edited Loading

dishank-b commented Oct 25, 2019

fmassa commented Oct 25, 2019

MarkusAmann commented Nov 12, 2019

fmassa commented Nov 13, 2019

varunagrawal commented Nov 13, 2019

rmcavoy commented Nov 24, 2019

fmassa commented Nov 26, 2019 • edited Loading

Edwardmark commented Jan 7, 2020

fmassa commented Jan 8, 2020

dishank-b commented Jan 8, 2020

Edwardmark commented Jan 9, 2020

lamhoangtung commented Feb 28, 2021

aosokin commented Mar 13, 2018 •

edited

Loading

ruotianluo commented Apr 26, 2018 •

edited

Loading

ghost commented Apr 27, 2018 via email •

edited by ghost

Loading

Sucran commented Jul 27, 2018 •

edited

Loading

fmassa commented Jul 27, 2018 •

edited

Loading

0phoff commented Oct 25, 2018 •

edited

Loading

dashesy commented Aug 1, 2019 •

edited

Loading

dashesy commented Oct 24, 2019 •

edited

Loading

fmassa commented Nov 26, 2019 •

edited

Loading