Add C++ ops to torchvision #826

Merged
merged 47 commits from layers-v2 into master on May 7, 2019

Conversation

fmassa
Member

@fmassa fmassa commented Mar 31, 2019

This PR adds C++ / CUDA extensions to torchvision for ROIAlign, ROIPool and nms.

I'll still be doing some cleanups / linting, as well as moving the files from the layers folder to the ops folder, so this is not yet ready to be merged.

Thanks a lot to @varunagrawal, who added a number of improvements on top of the layers branch, such as CPU support for the ROIAlign and ROIPool backward passes, as well as unit tests for those operations!
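
(For quick reference, a minimal usage sketch of the Python-facing ops this PR adds. The signatures below follow the torchvision.ops API as it later stabilized, so treat the exact argument names and defaults as assumptions rather than the state of this branch.)

import torch
from torchvision.ops import nms, roi_align, roi_pool

boxes = torch.tensor([[0., 0., 10., 10.],
                      [1., 1., 11., 11.],
                      [50., 50., 60., 60.]])
scores = torch.tensor([0.9, 0.8, 0.7])
keep = nms(boxes, scores, iou_threshold=0.5)  # indices of the boxes kept after NMS

x = torch.rand(1, 3, 32, 32)
rois = torch.tensor([[0., 0., 0., 16., 16.]])  # (batch_index, x1, y1, x2, y2)
pooled = roi_pool(x, rois, output_size=(7, 7), spatial_scale=1.0)
aligned = roi_align(x, rois, output_size=(7, 7), spatial_scale=1.0, sampling_ratio=2)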

@varunagrawal
Contributor

@fmassa are you sure about removing those +1s from the area computations in NMS? Assuming we are working in the XYXY box format, a box (0, 0, 15, 15) is technically 16x16 in size.

@fmassa
Member Author

fmassa commented Apr 5, 2019

@varunagrawal it depends on what we consider the coordinates of the bounding box to represent, and on whether cropping is inclusive or exclusive at the boundary.

If indexing is inclusive (like in MATLAB), then yes, the +1 is necessary. But if it's exclusive (like in Python), then x[0:15] has 15 elements.

This is up for discussion, but I believe that we should be consistent everywhere.

Tagging @rbgirshick in case he has any comments on whether we should include the +1 in the dimension / area calculations of bounding boxes.
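
(For concreteness, a tiny sketch of the two area conventions under discussion, using the (0, 0, 15, 15) box from above; this is illustrative only.)

# XYXY box; which area is "correct" depends on whether x2/y2 is inclusive or exclusive.
x1, y1, x2, y2 = 0, 0, 15, 15

# Inclusive x2/y2 (Detectron-style): the box covers pixels 0..15, i.e. 16x16.
area_inclusive = (x2 - x1 + 1) * (y2 - y1 + 1)  # 256

# Exclusive x2/y2 (Python-slicing-style): x[0:15] covers 15 elements, i.e. 15x15.
area_exclusive = (x2 - x1) * (y2 - y1)  # 225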

@varunagrawal
Contributor

When I looked through the implementation of IoU in maskrcnn-benchmark, the +1 is always included, so there's precedent for requiring it.

More importantly, I don't think it is an indexing issue. For example, in COCO the boxes are XYWH, e.g. (0, 0, 16, 16), and to get the RoI, you are right that we would need to use x[X:W, Y:W] if it's exclusive like Python.
However, as per the detection literature, the model gives us the bounding boxes in XYXY format, and to get the width and height of the bounding box correctly, we need to offset by +1. Using the example of (0, 0, 16, 16) in XYWH mode, the bottom-right pixel in the RoI would be (15, 15), giving us (0, 0, 15, 15) in XYXY mode, since if you count the number of pixels, 0-15 gives us 16 pixels, which is the correct size.


dim3 blocks(
    at::cuda::ATenCeilDiv(boxes_num, threadsPerBlock),
    at::cuda::ATenCeilDiv(boxes_num, threadsPerBlock));
Contributor

We can replace the above two lines directly with col_blocks.

at::cuda::CUDAGuard device_guard(dets.device());
return at::empty({0}, dets.options().dtype(at::kLong));
}
auto b = at::cat({dets, scores.unsqueeze(1)}, 1);
Contributor

@varunagrawal varunagrawal Apr 9, 2019

I've been wondering about this. It seems unnecessary to waste memory creating a new tensor here when we can easily just update the function signature to accept 3 arguments. This would also lead to a more consistent signature between nms_cuda and nms_cpu.

Thoughts?

Member Author

I totally agree; I just wanted to minimize the amount of work for now, but this is definitely something I'd like to have done.

Contributor

I'll add a PR on top of this branch to add that later today.

@fmassa
Member Author

fmassa commented Apr 9, 2019

@varunagrawal thanks for the comments. To address your points:

> When I looked through the implementation of IoU in maskrcnn-benchmark, the +1 is always included, so there's precedent for requiring it.

maskrcnn-benchmark tried to be as close as possible to Detectron, to provide baselines in PyTorch for detection models. In particular, it was backwards-compatible with Detectron models. But Detectron itself carries a lot of baggage from a long time ago, and some of the decisions there I do not think apply anymore.

> More importantly, I don't think it is an indexing issue. For example, in COCO the boxes are XYWH, e.g. (0, 0, 16, 16), and to get the RoI, you are right that we would need to use x[X:W, Y:W] if it's exclusive like Python.
> However, as per the detection literature, the model gives us the bounding boxes in XYXY format, and to get the width and height of the bounding box correctly, we need to offset by +1. Using the example of (0, 0, 16, 16) in XYWH mode, the bottom-right pixel in the RoI would be (15, 15), giving us (0, 0, 15, 15) in XYXY mode, since if you count the number of pixels, 0-15 gives us 16 pixels, which is the correct size.

I believe this is just a matter of convention. For example, in the TF object detection library, area and width/height do not include the +1. This is particularly important because they use normalized coordinates (so that boxes are within 0-1). Plus, when converting the COCO dataset to their format, no +1 is added either.
I tend to agree with their approach, which just means that in x1y1x2y2, the x1y1 is inclusive and x2y2 is exclusive.

If you look, the xyxy<->xywh conversion functions from maskrcnn-benchmark exactly reproduce those from Detectron. Still, when going from xywh to xyxy, we subtract the 1; then, when computing the area, the +1 is added back. This is weird to me, and I'd like to remove those inconsistencies in torchvision (as we are not tied to matching Detectron exactly; there is maskrcnn-benchmark for that).

There are plenty of inconsistencies in maskrcnn-benchmark that were carried over from Detectron: in BoxList.crop, the width/height computation does not add a +1.

Let me know if you disagree
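
(For reference, a minimal sketch of the convention described above, where x1y1 is inclusive, x2y2 is exclusive, and no +1 appears anywhere. The helper names are illustrative, not part of the torchvision API.)

def xywh_to_xyxy(x, y, w, h):
    # x2/y2 are exclusive, so no -1 correction is needed
    return x, y, x + w, y + h

def xyxy_to_xywh(x1, y1, x2, y2):
    return x1, y1, x2 - x1, y2 - y1

def box_area(x1, y1, x2, y2):
    # consistent with the conversions above: no +1
    return (x2 - x1) * (y2 - y1)

# A COCO-style XYWH box (0, 0, 16, 16) round-trips to XYXY (0, 0, 16, 16) and back,
# with area 256 and no off-by-one anywhere.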

@varunagrawal
Contributor

> maskrcnn-benchmark tried to be as close as possible to Detectron, to provide baselines in PyTorch for detection models. In particular, it was backwards-compatible with Detectron models. But Detectron itself carries a lot of baggage from a long time ago, and some of the decisions there I do not think apply anymore.

I see. I guess having Ross' comments on this would be helpful then.

> I believe this is just a matter of convention. For example, in the TF object detection library, area and width/height do not include the +1. This is particularly important because they use normalized coordinates (so that boxes are within 0-1). Plus, when converting the COCO dataset to their format, no +1 is added either.
> I tend to agree with their approach, which just means that in x1y1x2y2, the x1y1 is inclusive and x2y2 is exclusive.

Okay, this is what I was looking for. If x2y2 is exclusive, then that solves our problem. I was worried it was not. We should make a note of this in the docs, since quite a few people write code expecting it to be inclusive.

@rbgirshick

The "+1" history is described here: https://github.com/facebookresearch/Detectron/blob/master/detectron/utils/boxes.py#L28-L40.

I think this is baggage that should be shed. I believe the correct thing to do is to consider a box as a geometric entity defined in terms of continuous coordinates. In this case the width is simply x2 - x1. Discrete coordinates and indexing only need to be considered when rasterizing a continuous entity, in which case a pixel should be thought of as a point sample, not a little square (as described in http://alvyray.com/Memos/CG/Microsoft/6_pixel.pdf).
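
(To make the continuous-coordinate view concrete, a small IoU sketch with width = x2 - x1 and no +1. This is a plain-Python illustration, not the C++ kernel from this PR.)

def iou(box_a, box_b):
    # boxes are (x1, y1, x2, y2) in continuous coordinates
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union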

@lopuhin
Contributor

lopuhin commented Apr 24, 2019

Thanks @fmassa, it would be great to have officially supported layers for object detection.
When doing some testing, I noticed a difference between the GPU and CPU forward for RoI pooling, with the GPU being correct. I'll try to put together a reproducible example.

@fmassa
Member Author

fmassa commented Apr 24, 2019

@lopuhin this might be the same issue as facebookresearch/maskrcnn-benchmark#331

Also, I'm putting together an MWE of object detection and semantic segmentation, to be included in the torchvision references.

@lopuhin
Contributor

lopuhin commented Apr 24, 2019

Thanks @fmassa, in my case the difference looks larger, e.g. 1.4492 vs 2.5916 for some values (and equal for others). It could be some bug on my side; I will try to investigate soon.

@lopuhin
Contributor

lopuhin commented Apr 24, 2019

@fmassa here is an example of roi pooling forward difference between GPU and CPU:

import torch
from torchvision.ops import roi_pool
x = torch.tensor(
    [[[[0.1767, 1.2851, 4.2325, 4.8645, 7.1496]],
      [[2.5916, 4.3361, 3.8143, 6.1329, 2.0230]],
      [[1.4492, 3.3384, 4.0816, 6.3116, 5.1068]]]],
    dtype=torch.float32)
rois = torch.tensor(
    [[0., 1., 0., 4., 0.],
     [0., 2., 0., 3., 0.],
     [0., 0., 0., 0., 0.],
     [0., 0., 0., 0., 0.],
     [0., 2., 0., 2., 0.]],
    dtype=torch.float32)
a = roi_pool(x, rois, (1, 2))
b = roi_pool(x.to('cuda'), rois.to('cuda'), (1, 2))
print(a - b.cpu())

which gives for me (blank lines removed for clarity):

tensor(
       [[[[0.0000, 0.0000]],
         [[0.0000, 1.0167]],
         [[0.2545, 0.8380]]],

        [[[0.0000, 0.0000]],
         [[0.4182, 0.0000]],
         [[0.1509, 0.0000]]],

        [[[0.0000, 0.0000]],
         [[0.0000, 0.0000]],
         [[1.1424, 1.1424]]],

        [[[0.0000, 0.0000]],
         [[0.0000, 0.0000]],
         [[1.1424, 1.1424]]],

        [[[0.0000, 0.0000]],
         [[0.4182, 0.4182]],
         [[0.1509, 0.1509]]]])

while I expected a - b to be close to zero here.
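
(Once the discrepancy is fixed, the repro above translates directly into a parity assertion, e.g. along these lines, assuming a CUDA device is available:)

if torch.cuda.is_available():
    assert torch.allclose(a, b.cpu(), atol=1e-5), "CPU and GPU roi_pool forward disagree"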

@fmassa
Member Author

fmassa commented Apr 24, 2019

@lopuhin this looks indeed like a bug, I'll have a look before merging this

@codecov-io

codecov-io commented Apr 25, 2019

Codecov Report

Merging #826 into master will increase coverage by 0.72%.
The diff coverage is 61.31%.


@@            Coverage Diff             @@
##           master     #826      +/-   ##
==========================================
+ Coverage   56.61%   57.34%   +0.72%     
==========================================
  Files          38       43       +5     
  Lines        3432     3603     +171     
  Branches      540      553      +13     
==========================================
+ Hits         1943     2066     +123     
- Misses       1372     1418      +46     
- Partials      117      119       +2
Impacted Files Coverage Δ
torchvision/__init__.py 62.5% <100%> (+2.5%) ⬆️
torchvision/ops/__init__.py 100% <100%> (ø)
torchvision/ops/_utils.py 27.27% <27.27%> (ø)
torchvision/ops/boxes.py 51.42% <51.42%> (ø)
torchvision/ops/roi_align.py 66.66% <66.66%> (ø)
torchvision/ops/roi_pool.py 68.29% <68.29%> (ø)
... and 3 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 78ed423...06fd8ae.

@soumith
Member

soumith commented Apr 25, 2019

fyi, the CI is now good to go on this.

@fmassa
Member Author

fmassa commented Apr 26, 2019

@lopuhin I believe I've fixed the bug you pointed out. Thanks a lot for the repro!

@lopuhin
Contributor

lopuhin commented Apr 26, 2019

Thank you very much for the fix and for the feature @fmassa, I confirm that it works as expected for me 👍

@TheCodez
Contributor

TheCodez commented May 4, 2019

@fmassa can those ops be used from torchscript?

@fmassa
Member Author

fmassa commented May 6, 2019

@TheCodez probably not as of now, but I'll want to make them work in torchscript at some point in time

@TheCodez
Contributor

TheCodez commented May 6, 2019

@fmassa nms can probably just be added using the code below and loading the correct lib
for the right target platform in __init__.py:

static auto registry = torch::jit::RegisterOperators("vision::nms", &nms);

The other ops seem to be more involved.
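
(For context, if the op were registered as sketched above, it would be reachable from Python and TorchScript through the generic custom-op namespace. The snippet below is an assumption about how that would look, not something this PR ships.)

import torch
import torchvision  # assumed to load the extension that performs the registration

@torch.jit.script
def keep_boxes(boxes: torch.Tensor, scores: torch.Tensor, iou_threshold: float) -> torch.Tensor:
    # C++ ops registered under "vision::nms" surface as torch.ops.vision.nms
    return torch.ops.vision.nms(boxes, scores, iou_threshold)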

@fmassa
Member Author

fmassa commented May 7, 2019

@TheCodez yes, and I'll follow the modifications from facebookresearch/maskrcnn-benchmark#138 to make the operators here work with JIT, but that might require a few modifications elsewhere in PyTorch so that both can coexist, so I'm postponing this change.

fmassa added 2 commits May 7, 2019 17:12
Also fixes a bug in the clip_boxes_to_image -- this function needs a test!
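
(Since the commit note asks for a test of clip_boxes_to_image, here is a sketch of what one might look like, assuming the function clamps XYXY boxes to an image of size (height, width); the import location is an assumption.)

import torch
from torchvision.ops.boxes import clip_boxes_to_image  # location assumed

def test_clip_boxes_to_image():
    boxes = torch.tensor([[-5., -5., 10., 10.],
                          [ 2.,  3., 50., 60.]])
    clipped = clip_boxes_to_image(boxes, size=(20, 30))  # (height, width)
    assert (clipped[:, 0::2] >= 0).all() and (clipped[:, 0::2] <= 30).all()  # x in [0, width]
    assert (clipped[:, 1::2] >= 0).all() and (clipped[:, 1::2] <= 20).all()  # y in [0, height]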
@fmassa fmassa merged commit dc3ac29 into master May 7, 2019
@fmassa fmassa deleted the layers-v2 branch May 7, 2019 18:13
@lopuhin
Contributor

lopuhin commented May 17, 2019

FWIW I'm getting RuntimeError: CUDA error: an illegal memory access was encountered in ROIPool_backward_cuda; I'll try to reduce this to a minimal repro and post a separate issue.

    @staticmethod
    @once_differentiable
    def backward(ctx, grad_output):
        rois, argmax = ctx.saved_tensors
        output_size = ctx.output_size
        spatial_scale = ctx.spatial_scale
        bs, ch, h, w = ctx.input_shape
        grad_input = _C.roi_pool_backward(
            grad_output, rois, argmax, spatial_scale,
>           output_size[0], output_size[1], bs, ch, h, w)
E       RuntimeError: CUDA error: an illegal memory access was encountered (ROIPool_backward_cuda at /tmp/pip-install-nz4z64qf/torchvision/torchvision/csrc/cuda/ROIPool_cuda.cu:240)
E       frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7ff03cd24441 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
E       frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7ff03cd23d7a in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
E       frame #2: ROIPool_backward_cuda(at::Tensor const&, at::Tensor const&, at::Tensor const&, float, int, int, int, int, int, int) + 0xad3 (0x7ff029c08feb in /usr/local/lib/python3.6/dist-packages/torchvision/_C.cpython-36m-x86_64-linux-gnu.so)
E       frame #3: ROIPool_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, float, int, int, int, int, int, int) + 0x140 (0x7ff029bd3a60 in /usr/local/lib/python3.6/dist-packages/torchvision/_C.cpython-36m-x86_64-linux-gnu.so)
E       frame #4: <unknown function> + 0x22157 (0x7ff029be1157 in /usr/local/lib/python3.6/dist-packages/torchvision/_C.cpython-36m-x86_64-linux-gnu.so)
E       frame #5: <unknown function> + 0x1e461 (0x7ff029bdd461 in /usr/local/lib/python3.6/dist-packages/torchvision/_C.cpython-36m-x86_64-linux-gnu.so)
E       frame #6: /usr/bin/python3() [0x5030d5]
E       frame #7: _PyEval_EvalFrameDefault + 0x449 (0x506859 in /usr/bin/python3)
E       frame #8: /usr/bin/python3() [0x504c28]
E       frame #9: /usr/bin/python3() [0x58644b]
E       frame #10: PyObject_Call + 0x3e (0x59ebbe in /usr/bin/python3)
E       frame #11: _PyEval_EvalFrameDefault + 0x1807 (0x507c17 in /usr/bin/python3)
E       frame #12: /usr/bin/python3() [0x504c28]
E       frame #13: /usr/bin/python3() [0x58644b]
E       frame #14: PyObject_Call + 0x3e (0x59ebbe in /usr/bin/python3)
E       frame #15: _PyEval_EvalFrameDefault + 0x1807 (0x507c17 in /usr/bin/python3)
E       frame #16: /usr/bin/python3() [0x504c28]
E       frame #17: _PyFunction_FastCallDict + 0x2de (0x501b2e in /usr/bin/python3)
E       frame #18: /usr/bin/python3() [0x591461]
E       frame #19: PyObject_Call + 0x3e (0x59ebbe in /usr/bin/python3)
E       frame #20: torch::autograd::PyFunction::apply(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> >&&) + 0x193 (0x7ff085755823 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
E       frame #21: <unknown function> + 0x3108aa (0x7ff03b1ae8aa in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so.1)
E       frame #22: torch::autograd::Engine::evaluate_function(torch::autograd::FunctionTask&) + 0x385 (0x7ff03b1a7975 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so.1)
E       frame #23: torch::autograd::Engine::thread_main(torch::autograd::GraphTask*) + 0xc0 (0x7ff03b1a9970 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so.1)
E       frame #24: torch::autograd::Engine::thread_init(int) + 0x136 (0x7ff03b1a6d46 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so.1)
E       frame #25: torch::autograd::python::PythonEngine::thread_init(int) + 0x2a (0x7ff0857502ea in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
E       frame #26: <unknown function> + 0xbd8f0 (0x7ff03c8278f0 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
E       frame #27: <unknown function> + 0x76db (0x7ff09eb066db in /lib/x86_64-linux-gnu/libpthread.so.0)
E       frame #28: clone + 0x3f (0x7ff09ee3f88f in /lib/x86_64-linux-gnu/libc.so.6)

/usr/local/lib/python3.6/dist-packages/torchvision/ops/roi_pool.py:34: RuntimeError

@fmassa
Member Author

fmassa commented May 17, 2019

Yes, a repro on a new issue would be great!

@lopuhin
Contributor

lopuhin commented May 17, 2019

Thanks @fmassa, so far it looks a bit more challenging to reproduce :)
Even adding time.sleep(1) before line 34 makes the issue much less likely to happen (without it, it always happens), and adding grad_output = grad_output.contiguous() makes the issue go away. Also, I have tried on two machines so far and it happens only on one (although the environments are slightly different). cuDNN is disabled and CUDA_LAUNCH_BLOCKING is set.
I still hope to reduce it, as it happens reliably.

@fmassa
Member Author

fmassa commented May 17, 2019

Don't we add a contiguous call to grad_output already?

@fmassa
Member Author

fmassa commented May 17, 2019

If not, we should

@lopuhin
Contributor

lopuhin commented May 17, 2019

Oh interesting, I didn't know it was required. This is the place where adding it helps for me:

grad_input = _C.roi_pool_backward(

@fmassa
Member Author

fmassa commented May 17, 2019

There is a call to contiguous in the CUDA file, just before the kernel launch, so the problem should be elsewhere.

@lopuhin
Contributor

lopuhin commented May 17, 2019

Right, found it here

grad.contiguous().data<scalar_t>(),

Just for my own sanity: I see that we first get the strides (to be passed into the kernel) and only then call grad.contiguous; is this right? As I understand it, making the tensor contiguous should change the strides.

int n_stride = grad.stride(0);
int c_stride = grad.stride(1);
int h_stride = grad.stride(2);
int w_stride = grad.stride(3);
AT_DISPATCH_FLOATING_TYPES_AND_HALF(grad.type(), "ROIPool_backward", [&] {
  RoIPoolBackward<scalar_t><<<grid, block, 0, stream>>>(
      grad.numel(),
      grad.contiguous().data<scalar_t>(),
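
(To illustrate the concern: contiguous() on a non-contiguous tensor returns a copy with different strides, so strides read before the call no longer describe the buffer the kernel actually receives. A quick Python check:)

import torch

grad = torch.arange(24.).reshape(2, 3, 4).permute(0, 2, 1)  # non-contiguous view
print(grad.is_contiguous())            # False
print(grad.stride())                   # (12, 1, 4)
print(grad.contiguous().stride())      # (12, 3, 1) -- different strides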

@fmassa
Member Author

fmassa commented May 17, 2019

Yes, you are right, this is a bug! Removing the contiguous in this case should fix it. Can you also check if the same happens in ROIAlign, and send a PR? Thanks for the catch!

@lopuhin
Contributor

lopuhin commented May 17, 2019

Thanks! Sure, will do.

@fmassa
Member Author

fmassa commented May 17, 2019

And please also add a test for this case, if possible
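
(One way such a test could look: feed a gradient with the right shape but a non-contiguous layout through the backward pass and compare against the contiguous case. This is a sketch and assumes a CUDA device.)

import torch
from torchvision.ops import roi_pool

def test_roi_pool_non_contiguous_grad():
    if not torch.cuda.is_available():
        return
    x = torch.rand(1, 3, 16, 16, device='cuda', requires_grad=True)
    rois = torch.tensor([[0., 0., 0., 8., 8.]], device='cuda')
    y = roi_pool(x, rois, (4, 4))
    # gradient with y's shape but a non-contiguous memory layout
    grad = torch.rand(tuple(y.shape)[::-1], device='cuda').permute(3, 2, 1, 0)
    assert not grad.is_contiguous()
    g_ref, = torch.autograd.grad(y, x, grad_outputs=grad.contiguous(), retain_graph=True)
    g, = torch.autograd.grad(y, x, grad_outputs=grad)
    assert torch.allclose(g, g_ref)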

@Naman-ntc

Naman-ntc commented Jul 31, 2019

Hi, thanks a lot for the PR adding these ops.
However, I see that these ops haven't been documented in either the stable or the master docs. Could someone add docs for these operations? It took me a lot of time to find out about their existence and their signatures. I would have loved to do so myself, but I don't know enough about them to write the docs.
Thanks

@varunagrawal
Contributor

@fmassa I guess we should rerun Sphinx and generate the docs afresh?

@fmassa
Member Author

fmassa commented Aug 1, 2019

@Naman-ntc I'm adding the doc entries for those functions now.
