Add C++ ops to torchvision #826

Merged
merged 47 commits from layers-v2 into master on May 7, 2019

Conversation

fmassa
Member

@fmassa fmassa commented Mar 31, 2019

This PR adds C++ / CUDA extensions to torchvision for ROIAlign, ROIPool and nms.

I'll still be doing some cleanups / linting, as well as moving the files from the layers folder to the ops folder, so this is not yet ready to be merged.

Thanks a lot to @varunagrawal, who added a number of improvements on top of the layers branch, such as CPU support for the ROIAlign and ROIPool backward passes, as well as unit tests for those operations!
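
(For quick reference, a minimal usage sketch of the Python-facing ops this PR adds. The signatures below follow the torchvision.ops API as it later stabilized, so treat the exact argument names and defaults as assumptions rather than the state of this branch.)

import torch
from torchvision.ops import nms, roi_align, roi_pool

boxes = torch.tensor([[0., 0., 10., 10.],
                      [1., 1., 11., 11.],
                      [50., 50., 60., 60.]])
scores = torch.tensor([0.9, 0.8, 0.7])
keep = nms(boxes, scores, iou_threshold=0.5)  # indices of the boxes kept after NMS

x = torch.rand(1, 3, 32, 32)
rois = torch.tensor([[0., 0., 0., 16., 16.]])  # (batch_index, x1, y1, x2, y2)
pooled = roi_pool(x, rois, output_size=(7, 7), spatial_scale=1.0)
aligned = roi_align(x, rois, output_size=(7, 7), spatial_scale=1.0, sampling_ratio=2)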

@varunagrawal
Contributor

@fmassa are you sure about removing those +1s from the area computations in NMS? Assuming we are working in the XYXY box format, a box (0, 0, 15, 15) is technically 16x16 in size.

@fmassa
Member Author

fmassa commented Apr 5, 2019

@varunagrawal it depends on what we consider the coordinates of the bounding box to represent, and on whether cropping is inclusive or exclusive at the boundary.

If indexing is inclusive (like in MATLAB), then yes, the +1 is necessary. But if it's exclusive (like in Python), then x[0:15] has 15 elements.

This is up for discussion, but I believe that we should be consistent everywhere.

Tagging @rbgirshick in case he has any comments on whether we should include the +1 in the dimension / area calculations of bounding boxes.
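
(For concreteness, a tiny sketch of the two area conventions under discussion, using the (0, 0, 15, 15) box from above; this is illustrative only.)

# XYXY box; which area is "correct" depends on whether x2/y2 is inclusive or exclusive.
x1, y1, x2, y2 = 0, 0, 15, 15

# Inclusive x2/y2 (Detectron-style): the box covers pixels 0..15, i.e. 16x16.
area_inclusive = (x2 - x1 + 1) * (y2 - y1 + 1)  # 256

# Exclusive x2/y2 (Python-slicing-style): x[0:15] covers 15 elements, i.e. 15x15.
area_exclusive = (x2 - x1) * (y2 - y1)  # 225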

@varunagrawal
Contributor

When I looked through the implementation of IoU in maskrcnn-benchmark, the +1 is always included, so there's precedent for requiring it.

More importantly, I don't think it is an indexing issue. For example, in COCO the boxes are XYWH, e.g. (0, 0, 16, 16), and to get the RoI, you are right that we would need to use x[X:W, Y:W] if it's exclusive like Python.
However, as per the detection literature, the model gives us the bounding boxes in XYXY format, and to get the width and height of the bounding box correctly, we need to offset by +1. Using the example of (0, 0, 16, 16) in XYWH mode, the bottom-right pixel in the RoI would be (15, 15), giving us (0, 0, 15, 15) in XYXY mode, since if you count the number of pixels, 0-15 gives us 16 pixels, which is the correct size.


dim3 blocks(
    at::cuda::ATenCeilDiv(boxes_num, threadsPerBlock),
    at::cuda::ATenCeilDiv(boxes_num, threadsPerBlock));
Contributor

We can replace the above two lines directly with col_blocks.

at::cuda::CUDAGuard device_guard(dets.device());
return at::empty({0}, dets.options().dtype(at::kLong));
}
auto b = at::cat({dets, scores.unsqueeze(1)}, 1);
Contributor

@varunagrawal varunagrawal Apr 9, 2019

I've been wondering about this. It seems unnecessary to waste memory creating a new tensor here when we can easily just update the function signature to accept 3 arguments. This would also lead to a more consistent signature between nms_cuda and nms_cpu.

Thoughts?

Member Author

I totally agree; I just wanted to minimize the amount of work for now, but this is definitely something I'd like to have done.

Contributor

I'll add a PR on top of this branch to add that later today.

@fmassa
Member Author

fmassa commented Apr 9, 2019

@varunagrawal thanks for the comments. To address your points:

> When I looked through the implementation of IoU in maskrcnn-benchmark, the +1 is always included, so there's precedent for requiring it.

maskrcnn-benchmark tried to be as close as possible to Detectron, to provide baselines in PyTorch for detection models. In particular, it was backwards-compatible with Detectron models. But Detectron itself carries a lot of baggage from a long time ago, and some of the decisions there I do not think apply anymore.

> More importantly, I don't think it is an indexing issue. For example, in COCO the boxes are XYWH, e.g. (0, 0, 16, 16), and to get the RoI, you are right that we would need to use x[X:W, Y:W] if it's exclusive like Python.
> However, as per the detection literature, the model gives us the bounding boxes in XYXY format, and to get the width and height of the bounding box correctly, we need to offset by +1. Using the example of (0, 0, 16, 16) in XYWH mode, the bottom-right pixel in the RoI would be (15, 15), giving us (0, 0, 15, 15) in XYXY mode, since if you count the number of pixels, 0-15 gives us 16 pixels, which is the correct size.

I believe this is just a matter of convention. For example, in the TF object detection library, area and width/height do not include the +1. This is particularly important because they use normalized coordinates (so that boxes are within 0-1). Plus, when converting the COCO dataset to their format, no +1 is added either.
I tend to agree with their approach, which just means that in x1y1x2y2, the x1y1 is inclusive and x2y2 is exclusive.

If you look, the xyxy<->xywh conversion functions from maskrcnn-benchmark exactly reproduce those from Detectron. Still, when going from xywh to xyxy, we subtract the 1; then, when computing the area, the +1 is added back. This is weird to me, and I'd like to remove those inconsistencies in torchvision (as we are not tied to matching Detectron exactly; there is maskrcnn-benchmark for that).

There are plenty of inconsistencies in maskrcnn-benchmark that were carried over from Detectron: in BoxList.crop, the width/height computation does not add a +1.

Let me know if you disagree
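
(For reference, a minimal sketch of the convention described above, where x1y1 is inclusive, x2y2 is exclusive, and no +1 appears anywhere. The helper names are illustrative, not part of the torchvision API.)

def xywh_to_xyxy(x, y, w, h):
    # x2/y2 are exclusive, so no -1 correction is needed
    return x, y, x + w, y + h

def xyxy_to_xywh(x1, y1, x2, y2):
    return x1, y1, x2 - x1, y2 - y1

def box_area(x1, y1, x2, y2):
    # consistent with the conversions above: no +1
    return (x2 - x1) * (y2 - y1)

# A COCO-style XYWH box (0, 0, 16, 16) round-trips to XYXY (0, 0, 16, 16) and back,
# with area 256 and no off-by-one anywhere.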

@varunagrawal
Contributor

> maskrcnn-benchmark tried to be as close as possible to Detectron, to provide baselines in PyTorch for detection models. In particular, it was backwards-compatible with Detectron models. But Detectron itself carries a lot of baggage from a long time ago, and some of the decisions there I do not think apply anymore.

I see. I guess having Ross' comments on this would be helpful then.

> I believe this is just a matter of convention. For example, in the TF object detection library, area and width/height do not include the +1. This is particularly important because they use normalized coordinates (so that boxes are within 0-1). Plus, when converting the COCO dataset to their format, no +1 is added either.
> I tend to agree with their approach, which just means that in x1y1x2y2, the x1y1 is inclusive and x2y2 is exclusive.

Okay, this is what I was looking for. If x2y2 is exclusive, then that solves our problem. I was worried it was not. We should make a note of this in the docs, since quite a few people write code expecting it to be inclusive.

@rbgirshick

The "+1" history is described here: https://github.com/facebookresearch/Detectron/blob/master/detectron/utils/boxes.py#L28-L40.

I think this is baggage that should be shed. I believe the correct thing to do is to consider a box as a geometric entity defined in terms of continuous coordinates. In this case the width is simply x2 - x1. Discrete coordinates and indexing only need to be considered when rasterizing a continuous entity, in which case a pixel should be thought of as a point sample, not a little square (as described in http://alvyray.com/Memos/CG/Microsoft/6_pixel.pdf).
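
(To make the continuous-coordinate view concrete, a small IoU sketch with width = x2 - x1 and no +1. This is a plain-Python illustration, not the C++ kernel from this PR.)

def iou(box_a, box_b):
    # boxes are (x1, y1, x2, y2) in continuous coordinates
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union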

@lopuhin
Contributor

lopuhin commented Apr 24, 2019

Thanks @fmassa, it would be great to have officially supported layers for object detection.
When doing some testing, I noticed a difference between the GPU and CPU forward for RoI pooling, with the GPU being correct. I'll try to put together a reproducible example.

@fmassa
Member Author

fmassa commented Apr 24, 2019

@lopuhin this might be the same issue as facebookresearch/maskrcnn-benchmark#331

Also, I'm putting together an MWE of object detection and semantic segmentation, to be included in the torchvision references.

@lopuhin
Contributor

lopuhin commented Apr 24, 2019

Thanks @fmassa, in my case the difference looks larger, e.g. 1.4492 vs 2.5916 for some values (and equal for others). It could be some bug on my side; I will try to investigate soon.

@lopuhin
Contributor

lopuhin commented Apr 24, 2019

@fmassa here is an example of roi pooling forward difference between GPU and CPU:

import torch
from torchvision.ops import roi_pool
x = torch.tensor(
    [[[[0.1767, 1.2851, 4.2325, 4.8645, 7.1496]],
      [[2.5916, 4.3361, 3.8143, 6.1329, 2.0230]],
      [[1.4492, 3.3384, 4.0816, 6.3116, 5.1068]]]],
    dtype=torch.float32)
rois = torch.tensor(
    [[0., 1., 0., 4., 0.],
     [0., 2., 0., 3., 0.],
     [0., 0., 0., 0., 0.],
     [0., 0., 0., 0., 0.],
     [0., 2., 0., 2., 0.]],
    dtype=torch.float32)
a = roi_pool(x, rois, (1, 2))
b = roi_pool(x.to('cuda'), rois.to('cuda'), (1, 2))
print(a - b.cpu())

which gives for me (blank lines removed for clarity):

tensor(
       [[[[0.0000, 0.0000]],
         [[0.0000, 1.0167]],
         [[0.2545, 0.8380]]],

        [[[0.0000, 0.0000]],
         [[0.4182, 0.0000]],
         [[0.1509, 0.0000]]],

        [[[0.0000, 0.0000]],
         [[0.0000, 0.0000]],
         [[1.1424, 1.1424]]],

        [[[0.0000, 0.0000]],
         [[0.0000, 0.0000]],
         [[1.1424, 1.1424]]],

        [[[0.0000, 0.0000]],
         [[0.4182, 0.4182]],
         [[0.1509, 0.1509]]]])

while I expected a - b to be close to zero here.
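
(Once the discrepancy is fixed, the repro above translates directly into a parity assertion, e.g. along these lines, assuming a CUDA device is available:)

if torch.cuda.is_available():
    assert torch.allclose(a, b.cpu(), atol=1e-5), "CPU and GPU roi_pool forward disagree"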

@fmassa
Member Author

fmassa commented Apr 24, 2019

@lopuhin this looks indeed like a bug, I'll have a look before merging this

@codecov-io

codecov-io commented Apr 25, 2019

Codecov Report

Merging #826 into master will increase coverage by 0.72%.
The diff coverage is 61.31%.


@@            Coverage Diff             @@
##           master     #826      +/-   ##
==========================================
+ Coverage   56.61%   57.34%   +0.72%     
==========================================
  Files          38       43       +5     
  Lines        3432     3603     +171     
  Branches      540      553      +13     
==========================================
+ Hits         1943     2066     +123     
- Misses       1372     1418      +46     
- Partials      117      119       +2
Impacted Files Coverage Δ
torchvision/__init__.py 62.5% <100%> (+2.5%) ⬆️
torchvision/ops/__init__.py 100% <100%> (ø)
torchvision/ops/_utils.py 27.27% <27.27%> (ø)
torchvision/ops/boxes.py 51.42% <51.42%> (ø)
torchvision/ops/roi_align.py 66.66% <66.66%> (ø)
torchvision/ops/roi_pool.py 68.29% <68.29%> (ø)
... and 3 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 78ed423...06fd8ae.

@soumith
Member

soumith commented Apr 25, 2019

fyi, the CI is now good to go on this.

@fmassa
Member Author

fmassa commented Apr 26, 2019

@lopuhin I believe I've fixed the bug you pointed out. Thanks a lot for the repro!

@lopuhin
Contributor

lopuhin commented Apr 26, 2019

Thank you very much for the fix and for the feature @fmassa, I confirm that it works as expected for me 👍

@TheCodez
Contributor

TheCodez commented May 4, 2019

@fmassa can those ops be used from torchscript?

@fmassa
Member Author

fmassa commented May 6, 2019

@TheCodez probably not as of now, but I'll want to make them work in torchscript at some point in time

@TheCodez
Contributor

TheCodez commented May 6, 2019

@fmassa nms can probably just be added using the code below and loading the correct lib
for the right target platform in __init__.py:

static auto registry = torch::jit::RegisterOperators("vision::nms", &nms);

The other ops seem to be more involved.
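
(For context, if the op were registered as sketched above, it would be reachable from Python and TorchScript through the generic custom-op namespace. The snippet below is an assumption about how that would look, not something this PR ships.)

import torch
import torchvision  # assumed to load the extension that performs the registration

@torch.jit.script
def keep_boxes(boxes: torch.Tensor, scores: torch.Tensor, iou_threshold: float) -> torch.Tensor:
    # C++ ops registered under "vision::nms" surface as torch.ops.vision.nms
    return torch.ops.vision.nms(boxes, scores, iou_threshold)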

@fmassa
Member Author

fmassa commented May 7, 2019

@TheCodez yes, and I'll follow the modifications from facebookresearch/maskrcnn-benchmark#138 to make the operators here work with JIT, but that might require a few modifications elsewhere in PyTorch so that both can coexist, so I'm postponing this change.

fmassa added 2 commits May 7, 2019 17:12
Also fixes a bug in the clip_boxes_to_image -- this function needs a test!
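
(Since the commit note asks for a test of clip_boxes_to_image, here is a sketch of what one might look like, assuming the function clamps XYXY boxes to an image of size (height, width); the import location is an assumption.)

import torch
from torchvision.ops.boxes import clip_boxes_to_image  # location assumed

def test_clip_boxes_to_image():
    boxes = torch.tensor([[-5., -5., 10., 10.],
                          [ 2.,  3., 50., 60.]])
    clipped = clip_boxes_to_image(boxes, size=(20, 30))  # (height, width)
    assert (clipped[:, 0::2] >= 0).all() and (clipped[:, 0::2] <= 30).all()  # x in [0, width]
    assert (clipped[:, 1::2] >= 0).all() and (clipped[:, 1::2] <= 20).all()  # y in [0, height]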
@fmassa fmassa merged commit dc3ac29 into master May 7, 2019
@fmassa fmassa deleted the layers-v2 branch May 7, 2019 18:13
@lopuhin
Contributor

lopuhin commented May 17, 2019

FWIW I'm getting RuntimeError: CUDA error: an illegal memory access was encountered in ROIPool_backward_cuda; I'll try to reduce this to a minimal repro and post a separate issue.

    @staticmethod
    @once_differentiable
    def backward(ctx, grad_output):
        rois, argmax = ctx.saved_tensors
        output_size = ctx.output_size
        spatial_scale = ctx.spatial_scale
        bs, ch, h, w = ctx.input_shape
        grad_input = _C.roi_pool_backward(
            grad_output, rois, argmax, spatial_scale,
>           output_size[0], output_size[1], bs, ch, h, w)
E       RuntimeError: CUDA error: an illegal memory access was encountered (ROIPool_backward_cuda at /tmp/pip-install-nz4z64qf/torchvision/torchvision/csrc/cuda/ROIPool_cuda.cu:240)
E       frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7ff03cd24441 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
E       frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7ff03cd23d7a in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
E       frame #2: ROIPool_backward_cuda(at::Tensor const&, at::Tensor const&, at::Tensor const&, float, int, int, int, int, int, int) + 0xad3 (0x7ff029c08feb in /usr/local/lib/python3.6/dist-packages/torchvision/_C.cpython-36m-x86_64-linux-gnu.so)
E       frame #3: ROIPool_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, float, int, int, int, int, int, int) + 0x140 (0x7ff029bd3a60 in /usr/local/lib/python3.6/dist-packages/torchvision/_C.cpython-36m-x86_64-linux-gnu.so)
E       frame #4: <unknown function> + 0x22157 (0x7ff029be1157 in /usr/local/lib/python3.6/dist-packages/torchvision/_C.cpython-36m-x86_64-linux-gnu.so)
E       frame #5: <unknown function> + 0x1e461 (0x7ff029bdd461 in /usr/local/lib/python3.6/dist-packages/torchvision/_C.cpython-36m-x86_64-linux-gnu.so)
E       frame #6: /usr/bin/python3() [0x5030d5]
E       frame #7: _PyEval_EvalFrameDefault + 0x449 (0x506859 in /usr/bin/python3)
E       frame #8: /usr/bin/python3() [0x504c28]
E       frame #9: /usr/bin/python3() [0x58644b]
E       frame #10: PyObject_Call + 0x3e (0x59ebbe in /usr/bin/python3)
E       frame #11: _PyEval_EvalFrameDefault + 0x1807 (0x507c17 in /usr/bin/python3)
E       frame #12: /usr/bin/python3() [0x504c28]
E       frame #13: /usr/bin/python3() [0x58644b]
E       frame #14: PyObject_Call + 0x3e (0x59ebbe in /usr/bin/python3)
E       frame #15: _PyEval_EvalFrameDefault + 0x1807 (0x507c17 in /usr/bin/python3)
E       frame #16: /usr/bin/python3() [0x504c28]
E       frame #17: _PyFunction_FastCallDict + 0x2de (0x501b2e in /usr/bin/python3)
E       frame #18: /usr/bin/python3() [0x591461]
E       frame #19: PyObject_Call + 0x3e (0x59ebbe in /usr/bin/python3)
E       frame #20: torch::autograd::PyFunction::apply(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> >&&) + 0x193 (0x7ff085755823 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
E       frame #21: <unknown function> + 0x3108aa (0x7ff03b1ae8aa in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so.1)
E       frame #22: torch::autograd::Engine::evaluate_function(torch::autograd::FunctionTask&) + 0x385 (0x7ff03b1a7975 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so.1)
E       frame #23: torch::autograd::Engine::thread_main(torch::autograd::GraphTask*) + 0xc0 (0x7ff03b1a9970 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so.1)
E       frame #24: torch::autograd::Engine::thread_init(int) + 0x136 (0x7ff03b1a6d46 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so.1)
E       frame #25: torch::autograd::python::PythonEngine::thread_init(int) + 0x2a (0x7ff0857502ea in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
E       frame #26: <unknown function> + 0xbd8f0 (0x7ff03c8278f0 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
E       frame #27: <unknown function> + 0x76db (0x7ff09eb066db in /lib/x86_64-linux-gnu/libpthread.so.0)
E       frame #28: clone + 0x3f (0x7ff09ee3f88f in /lib/x86_64-linux-gnu/libc.so.6)

/usr/local/lib/python3.6/dist-packages/torchvision/ops/roi_pool.py:34: RuntimeError

@fmassa
Member Author

fmassa commented May 17, 2019

Yes, a repro on a new issue would be great!

@lopuhin
Contributor

lopuhin commented May 17, 2019

Thanks @fmassa, so far it looks a bit more challenging to reproduce :)
Even adding time.sleep(1) before line 34 makes the issue much less likely to happen (without it, it always happens), and adding grad_output = grad_output.contiguous() makes the issue go away. Also, I have tried on two machines so far and it happens only on one (although the environments are slightly different). cuDNN is disabled and CUDA_LAUNCH_BLOCKING is set.
I still hope to reduce it, as it happens reliably.

@fmassa
Member Author

fmassa commented May 17, 2019

Don't we add a contiguous call to grad_output already?

@fmassa
Member Author

fmassa commented May 17, 2019

If not, we should

@lopuhin
Contributor

lopuhin commented May 17, 2019

Oh interesting, I didn't know it was required. This is the place where adding it helps for me:

grad_input = _C.roi_pool_backward(

@fmassa
Member Author

fmassa commented May 17, 2019

There is a call to contiguous in the CUDA file, just before the kernel launch, so the problem should be elsewhere.

@lopuhin
Contributor

lopuhin commented May 17, 2019

Right, found it here

grad.contiguous().data<scalar_t>(),

Just for my own sanity: I see that we first get the strides (to be passed into the kernel) and only then call grad.contiguous; is this right? As I understand it, making the tensor contiguous should change the strides.

int n_stride = grad.stride(0);
int c_stride = grad.stride(1);
int h_stride = grad.stride(2);
int w_stride = grad.stride(3);
AT_DISPATCH_FLOATING_TYPES_AND_HALF(grad.type(), "ROIPool_backward", [&] {
  RoIPoolBackward<scalar_t><<<grid, block, 0, stream>>>(
      grad.numel(),
      grad.contiguous().data<scalar_t>(),
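
(To illustrate the concern: contiguous() on a non-contiguous tensor returns a copy with different strides, so strides read before the call no longer describe the buffer the kernel actually receives. A quick Python check:)

import torch

grad = torch.arange(24.).reshape(2, 3, 4).permute(0, 2, 1)  # non-contiguous view
print(grad.is_contiguous())            # False
print(grad.stride())                   # (12, 1, 4)
print(grad.contiguous().stride())      # (12, 3, 1) -- different strides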

@fmassa
Member Author

fmassa commented May 17, 2019

Yes, you are right, this is a bug! Removing the contiguous in this case should fix it. Can you also check if the same happens in ROIAlign, and send a PR? Thanks for the catch!

@lopuhin
Contributor

lopuhin commented May 17, 2019

Thanks! Sure, will do.

@fmassa
Member Author

fmassa commented May 17, 2019

And please also add a test for this case, if possible
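
(One way such a test could look: feed a gradient with the right shape but a non-contiguous layout through the backward pass and compare against the contiguous case. This is a sketch and assumes a CUDA device.)

import torch
from torchvision.ops import roi_pool

def test_roi_pool_non_contiguous_grad():
    if not torch.cuda.is_available():
        return
    x = torch.rand(1, 3, 16, 16, device='cuda', requires_grad=True)
    rois = torch.tensor([[0., 0., 0., 8., 8.]], device='cuda')
    y = roi_pool(x, rois, (4, 4))
    # gradient with y's shape but a non-contiguous memory layout
    grad = torch.rand(tuple(y.shape)[::-1], device='cuda').permute(3, 2, 1, 0)
    assert not grad.is_contiguous()
    g_ref, = torch.autograd.grad(y, x, grad_outputs=grad.contiguous(), retain_graph=True)
    g, = torch.autograd.grad(y, x, grad_outputs=grad)
    assert torch.allclose(g, g_ref)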

@Naman-ntc

Naman-ntc commented Jul 31, 2019

Hi, thanks a lot for the PR adding these ops.
However, I see that these ops haven't been documented in either the stable or the master docs. Could someone add docs for these operations? It took me a lot of time to find out about their existence and their signatures. I would have loved to do so myself, but I don't know enough about them to write the docs.
Thanks

@varunagrawal
Contributor

@fmassa I guess we should rerun Sphinx and generate the docs afresh?

@fmassa
Member Author

fmassa commented Aug 1, 2019

@Naman-ntc I'm adding the doc entries for those functions now.
