🐛 [Bug] torchvision.ops.roi_align is throwing an internal assert failure #1157

Closed
BrettRyland opened this issue Jul 1, 2022 · 5 comments

@BrettRyland

Bug Description

torchvision.ops.roi_align throws an internal assert failure when a model that uses it is compiled with torch_tensorrt.compile.

To Reproduce

Repro script: trt_bug3.py
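
Since the script itself is only linked above, the following is a minimal sketch of the failing pattern (the module structure and box values are illustrative assumptions; the input spec and the truncate_long_and_double flag match the traceback below):

import torch
import torchvision
import torch_tensorrt

class RoiModel(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # boxes passed as a single Tensor[K, 5]: (batch_index, x1, y1, x2, y2) per row
        rois = torch.tensor(
            [[0.0, 0.0, 0.0, 9.0, 9.0],
             [1.0, 1.0, 1.0, 8.0, 8.0]],
            device=x.device,
        )
        return torchvision.ops.roi_align(x, rois, output_size=(6, 9))

m = RoiModel().cuda().eval()
scr_m = torch.jit.script(m)
# compilation aborts inside check_roi_boxes_shape (see traceback below)
trt_m = torch_tensorrt.compile(
    scr_m,
    inputs=[torch_tensorrt.Input((2, 3, 10, 10))],
    truncate_long_and_double=True,
)

Running the repro produces: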

brett@br-workhorse:~/repos/Autosensor/NN/tmp$ python3 trt_bug3.py 
torch.Size([20, 3, 6, 9])
torch.Size([20, 3, 6, 9])
WARNING: [Torch-TensorRT] - Cannot infer input type from calculations in graph for input x.1. Assuming it is Float32. If not, specify input type explicitly
WARNING: [Torch-TensorRT] - Input type for doing shape analysis could not be determined, defaulting to F32
WARNING: [Torch-TensorRT] - Truncating graph input type from at::kLong to at::kInt
(the preceding warning is repeated 16 times in total)
WARNING: [Torch-TensorRT TorchScript Conversion Context] - TensorRT was linked against cuBLAS/cuBLAS LT 11.8.0 but loaded cuBLAS/cuBLAS LT 111.0.1
WARNING: [Torch-TensorRT] - There may be undefined behavior using dynamic shape and aten::size
WARNING: [Torch-TensorRT TorchScript Conversion Context] - TensorRT was linked against cuBLAS/cuBLAS LT 11.8.0 but loaded cuBLAS/cuBLAS LT 111.0.1
WARNING: [Torch-TensorRT] - There may be undefined behavior using dynamic shape and aten::size
WARNING: [Torch-TensorRT TorchScript Conversion Context] - TensorRT was linked against cuBLAS/cuBLAS LT 11.8.0 but loaded cuBLAS/cuBLAS LT 111.0.1
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Unused Input: input_1
WARNING: [Torch-TensorRT TorchScript Conversion Context] - [RemoveDeadLayers] Input Tensor input_1 is unused or used only at compile-time, but is not being removed.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - TensorRT was linked against cuBLAS/cuBLAS LT 11.8.0 but loaded cuBLAS/cuBLAS LT 111.0.1
WARNING: [Torch-TensorRT] - There may be undefined behavior using dynamic shape and aten::size
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Unused Input: input_3
WARNING: [Torch-TensorRT TorchScript Conversion Context] - [RemoveDeadLayers] Input Tensor input_3 is unused or used only at compile-time, but is not being removed.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - TensorRT was linked against cuBLAS/cuBLAS LT 11.8.0 but loaded cuBLAS/cuBLAS LT 111.0.1
WARNING: [Torch-TensorRT TorchScript Conversion Context] - TensorRT was linked against cuBLAS/cuBLAS LT 11.8.0 but loaded cuBLAS/cuBLAS LT 111.0.1
WARNING: [Torch-TensorRT] - There may be undefined behavior using dynamic shape and aten::size
WARNING: [Torch-TensorRT TorchScript Conversion Context] - TensorRT was linked against cuBLAS/cuBLAS LT 11.8.0 but loaded cuBLAS/cuBLAS LT 111.0.1
WARNING: [Torch-TensorRT] - There may be undefined behavior using dynamic shape and aten::size
WARNING: [Torch-TensorRT TorchScript Conversion Context] - TensorRT was linked against cuBLAS/cuBLAS LT 11.8.0 but loaded cuBLAS/cuBLAS LT 111.0.1
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Unused Input: input_1
WARNING: [Torch-TensorRT TorchScript Conversion Context] - [RemoveDeadLayers] Input Tensor input_1 is unused or used only at compile-time, but is not being removed.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - TensorRT was linked against cuBLAS/cuBLAS LT 11.8.0 but loaded cuBLAS/cuBLAS LT 111.0.1
WARNING: [Torch-TensorRT] - There may be undefined behavior using dynamic shape and aten::size
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Unused Input: input_3
WARNING: [Torch-TensorRT TorchScript Conversion Context] - [RemoveDeadLayers] Input Tensor input_3 is unused or used only at compile-time, but is not being removed.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - TensorRT was linked against cuBLAS/cuBLAS LT 11.8.0 but loaded cuBLAS/cuBLAS LT 111.0.1
Traceback (most recent call last):
  File "/home/brett/repos/Autosensor/NN/tmp/trt_bug3.py", line 79, in <module>
    trt_m = torch_tensorrt.compile(scr_m, inputs=[torch_tensorrt.Input((2, 3, 10, 10)), ], truncate_long_and_double=True)
  File "/home/brett/.local/lib/python3.10/site-packages/torch_tensorrt/_compile.py", line 115, in compile
    return torch_tensorrt.ts.compile(ts_mod, inputs=inputs, enabled_precisions=enabled_precisions, **kwargs)
  File "/home/brett/.local/lib/python3.10/site-packages/torch_tensorrt/ts/_compiler.py", line 113, in compile
    compiled_cpp_mod = _C.compile_graph(module._c, _parse_compile_spec(spec))
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/home/brett/.local/lib/python3.10/site-packages/torchvision/ops/_utils.py", line 30, in forward
def check_roi_boxes_shape(boxes: Union[Tensor, List[Tensor]]):
    if isinstance(boxes, (list, tuple)):
        for _tensor in boxes:
        ~~~~~~~~~~~~~~~~~~~~~
            assert (
            ~~~~~~~~
                _tensor.size(1) == 4
                ~~~~~~~~~~~~~~~~~~~~
            ), "The shape of the tensor in the boxes list is not correct as List[Tensor[L, 4]]"
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    elif isinstance(boxes, torch.Tensor):
        assert boxes.size(1) == 5, "The boxes tensor shape is not correct as Tensor[K, 5]"
RuntimeError: isList()INTERNAL ASSERT FAILED at "/home/brett/github/pytorch/aten/src/ATen/core/ivalue_inl.h":1884, please report a bug to PyTorch. Expected GenericList but got Tensor

Expected behavior

Expected graph compilation to succeed.

Environment

brett@br-workhorse:~/repos/Autosensor/NN/tmp$ python3 -c "from torch.utils.collect_env import main; main()"
Collecting environment information...
PyTorch version: 1.11.0a0+gitbc2c6ed
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04 LTS (x86_64)
GCC version: (Ubuntu 11.2.0-19ubuntu1) 11.2.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.10.4 (main, Apr  2 2022, 09:04:19) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-37-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 11.7.64
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 1080 Ti
Nvidia driver version: 515.48.07
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.3.2
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.6
[pip3] numpy-quaternion==2022.4.1
[pip3] pytorch-ranger==0.1.1
[pip3] torch==1.11.0
[pip3] torch-optimizer==0.1.0
[pip3] torch-tensorrt==1.2.0a0+22d91f5e
[pip3] torchmetrics==0.7.3
[pip3] torchsummary==1.5.1
[pip3] torchvision==0.12.0
[conda] Could not collect

Torch-TensorRT (pytorch/TensorRT) is built from 91a92ca with PR #1067 merged in:

brett@br-workhorse:~/github/TensorRT$ git log --oneline --graph
* 22d91f5e (HEAD -> master) fix: fix the bug that tag Constant node as fallback node
*   ccb826e7 Merge remote-tracking branch 'origin' into refactor_segmentation
|\  
| * 91a92ca4 (origin/master, origin/HEAD) docs: [Automated] Regenerating documenation for dcf3386
| *   dcf3386e Merge pull request #1057 from pytorch/anuragd/reduce_nox_sessions
BrettRyland added the bug (Something isn't working) label on Jul 1, 2022
@BrettRyland
Author

Note: I've been able to partially work around this (partially, because I run into a different issue afterwards) by modifying torchvision/ops as follows:

diff -x __pycache__ /home/brett/github/torchvision/torchvision/ops/ps_roi_align.py /home/brett/.local/lib/python3.10/site-packages/torchvision/ops/ps_roi_align.py
2a3
> from torch.jit.annotations import BroadcastingList2
13c14
<     output_size: int,
---
>     output_size: BroadcastingList2[int],
67c68
<         output_size: int,
---
>         output_size: BroadcastingList2[int],
diff -x __pycache__ /home/brett/github/torchvision/torchvision/ops/roi_align.py /home/brett/.local/lib/python3.10/site-packages/torchvision/ops/roi_align.py
59c59,61
<     if not isinstance(rois, torch.Tensor):
---
>     if isinstance(rois, torch.Tensor):
>         pass
>     else:
diff -x __pycache__ /home/brett/github/torchvision/torchvision/ops/_utils.py /home/brett/.local/lib/python3.10/site-packages/torchvision/ops/_utils.py
29c29,31
<     if isinstance(boxes, (list, tuple)):
---
>     if isinstance(boxes, torch.Tensor):
>         assert boxes.size(1) == 5, "The boxes tensor shape is not correct as Tensor[K, 5]"
>     elif isinstance(boxes, (list, tuple)):
34,35d35
<     elif isinstance(boxes, torch.Tensor):
<         assert boxes.size(1) == 5, "The boxes tensor shape is not correct as Tensor[K, 5]"

So it looks like something is wrong with how isinstance is handled for tensors in the TorchScript interpreter.
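
For clarity, this is how the patched check_roi_boxes_shape in _utils.py reads after the diff above (reconstructed from the diff hunks and the traceback; the only behavioral change is that the Tensor branch is tested first):

from typing import List, Union

import torch
from torch import Tensor

def check_roi_boxes_shape(boxes: Union[Tensor, List[Tensor]]):
    if isinstance(boxes, torch.Tensor):
        # a single Tensor[K, 5]: (batch_index, x1, y1, x2, y2) per row
        assert boxes.size(1) == 5, "The boxes tensor shape is not correct as Tensor[K, 5]"
    elif isinstance(boxes, (list, tuple)):
        # a list with one Tensor[L, 4] of (x1, y1, x2, y2) boxes per image
        for _tensor in boxes:
            assert (
                _tensor.size(1) == 4
            ), "The shape of the tensor in the boxes list is not correct as List[Tensor[L, 4]]"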

@BrettRyland
Author

BrettRyland commented Aug 15, 2022

As a temporary workaround while this and other issues are being fixed, is it possible to tell torch_tensorrt.compile to ignore certain sections of the model? E.g., to compile the backbone of the network, but leave the rest as scripted or traced?

@BrettRyland
Author

> As a temporary workaround while this and other issues are being fixed, is it possible to tell torch_tensorrt.compile to ignore certain sections of the model? E.g., to compile the backbone of the network, but leave the rest as scripted or traced?

I see that torch_executed_modules is intended for this, but I haven't been able to get it to work. How does one specify these modules?
E.g., in this simple script trt_bug4.py, everything always gets compiled despite all the variants I'm passing to torch_executed_modules; the sketch below shows the kind of variants tried.
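
(A sketch of the variants only; the module names are hypothetical placeholders, since trt_bug4.py itself is only linked above.)

trt_m = torch_tensorrt.compile(
    scr_m,
    inputs=[torch_tensorrt.Input((2, 3, 10, 10))],
    require_full_compilation=False,  # allow Torch fallback blocks
    torch_executed_modules=["Backbone"],  # bare class name
    # torch_executed_modules=["mymodel.Backbone"],  # module-qualified class name
    # torch_executed_modules=["__torch__.mymodel.Backbone"],  # scripted type name
)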

@github-actions

This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.

@github-actions

This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.
