torchvision.ops.batched_nms() crashes with pytorch 1.9.0 and torchvision 0.10.0 #4071

Closed
immanuelweber opened this issue Jun 16, 2021 · 19 comments

Comments

@immanuelweber

immanuelweber commented Jun 16, 2021

🐛 Bug

With the just-released pytorch 1.9.0 and torchvision 0.10.0, torchvision.ops.batched_nms() crashes on my machine with the following error:

RuntimeError: Couldn't load custom C++ ops. This can happen if your PyTorch and torchvision versions are incompatible, or if you had errors while compiling torchvision from source. For further information on the compatible versions, check https://github.com/pytorch/vision#installation for the compatibility matrix. Please check your PyTorch version with torch.__version__ and your torchvision version with torchvision.__version__ and verify if they are compatible, and if not please reinstall torchvision so that it matches your PyTorch install.

Since both are the current releases, I assumed they would be compatible (although they are not yet listed in the compatibility matrix).
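As a first sanity check, the pairing that the error message asks about can be verified programmatically. The following is a minimal sketch, not an official tool: the `KNOWN_PAIRS` table and the helper names are illustrative assumptions, and the table lists only a few release pairings. A matching pair is necessary but does not by itself prove that the binaries were built against each other.

```python
from importlib import metadata

# Illustrative subset of release pairings (torch -> torchvision).
# This table is an assumption for the sketch; extend it from the official matrix.
KNOWN_PAIRS = {
    "1.9.0": "0.10.0",
    "1.8.1": "0.9.1",
    "1.8.0": "0.9.0",
}


def base_version(version):
    """Drop a local build suffix such as '+cu102'."""
    return version.split("+")[0]


def check_pair(torch_version, torchvision_version):
    """Return 'ok', 'mismatch', or 'unknown' for a torch/torchvision pair."""
    expected = KNOWN_PAIRS.get(base_version(torch_version))
    if expected is None:
        return "unknown"
    return "ok" if base_version(torchvision_version) == expected else "mismatch"


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    torch_v = installed_version("torch")
    tv_v = installed_version("torchvision")
    if torch_v is None or tv_v is None:
        print("torch and/or torchvision not installed")
    else:
        print(torch_v, tv_v, check_pair(torch_v, tv_v))
```

Reading the versions through `importlib.metadata` avoids importing torchvision itself, so the check still runs when the compiled extension cannot be loaded.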

To Reproduce

Steps to reproduce the behavior:

This example code shows the behavior on my machine:

import torch as th
import torchvision as tv

boxes = th.zeros(1000, 4)
scores = th.zeros(1000)
idxs = th.zeros(1000)

tv.ops.batched_nms(boxes, scores, idxs, 0.5)

Expected behavior

This should not result in an error.

Environment

Collecting environment information...
PyTorch version: 1.9.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.27

Python version: 3.9 (64-bit runtime)
Python platform: Linux-4.15.0-144-generic-x86_64-with-glibc2.27
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB

Nvidia driver version: 460.32.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.2
[pip3] torch==1.9.0
[pip3] torchaudio==0.9.0a0+33b2469
[pip3] torchvision==0.10.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.2.89 h8f6ccaa_8 conda-forge
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.2.0 h726a3e6_389 conda-forge
[conda] mkl-service 2.4.0 py39h3811e60_0 conda-forge
[conda] mkl_fft 1.3.0 py39h42c9631_2
[conda] mkl_random 1.2.2 py39hde0f152_0 conda-forge
[conda] numpy 1.20.2 py39h2d18471_0
[conda] numpy-base 1.20.2 py39hfae3a4d_0
[conda] pytorch 1.9.0 py3.9_cuda10.2_cudnn7.6.5_0 pytorch
[conda] torchaudio 0.9.0 py39 pytorch
[conda] torchvision 0.10.0 py39_cu102 pytorch

Additional context

@KonstantinKhabarlak

KonstantinKhabarlak commented Jun 16, 2021

I can also report a REGRESSION.
A similar issue occurred for me when running torch.jit.script.
Code that worked with pytorch 1.8.0 and torchvision 0.9.1 now fails after updating to pytorch 1.9.0 and torchvision 0.10.0 with:

RuntimeError: 
object has no attribute nms:
  File "C:\tools\Anaconda3\lib\site-packages\torchvision\ops\boxes.py", line 35
    """
    _assert_has_ops()
    return torch.ops.torchvision.nms(boxes, scores, iou_threshold)
           ~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
'nms' is being compiled since it was called from '_batched_nms_vanilla'
  File "C:\tools\Anaconda3\lib\site-packages\torchvision\ops\boxes.py", line 102
    for class_id in torch.unique(idxs):
        curr_indices = torch.where(idxs == class_id)[0]
        curr_keep_indices = nms(boxes[curr_indices], scores[curr_indices], iou_threshold)
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        keep_mask[curr_indices[curr_keep_indices]] = True
    keep_indices = torch.where(keep_mask)[0]
'_batched_nms_vanilla' is being compiled since it was called from 'batched_nms'
  File "C:\tools\Anaconda3\lib\site-packages\torchvision\ops\boxes.py", line 66
    # Ideally for GPU we'd use a higher threshold
    if boxes.numel() > 4_000 and not torchvision._is_tracing():
        return _batched_nms_vanilla(boxes, scores, idxs, iou_threshold)
               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    else:
        return _batched_nms_coordinate_trick(boxes, scores, idxs, iou_threshold)
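For readers unfamiliar with the code path in the traceback, the per-class loop can be sketched in pure Python. This is an illustrative reimplementation on plain lists, not torchvision's code: the functions `iou`, `nms`, and `batched_nms_vanilla` below are hypothetical stand-ins, and treating zero-area boxes as having an IoU of 0 is an assumption of this sketch.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Assumption: degenerate (zero-area) boxes never overlap.
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold):
    """Greedy NMS: keep a box unless it overlaps an already kept, higher-scoring one."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


def batched_nms_vanilla(boxes, scores, idxs, iou_threshold):
    """Run NMS separately for every class id, then sort survivors by score."""
    keep = []
    for class_id in sorted(set(idxs)):
        curr = [i for i, c in enumerate(idxs) if c == class_id]
        kept = nms([boxes[i] for i in curr], [scores[i] for i in curr], iou_threshold)
        keep.extend(curr[k] for k in kept)
    return sorted(keep, key=lambda i: scores[i], reverse=True)
```

The point of the sketch is that boxes from different classes never suppress each other. In torchvision the inner `nms` call is the compiled custom op, which is why the scripted version fails when that op is not registered.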

@fmassa
Member

fmassa commented Jun 16, 2021

Thanks for the reports.

We are looking into this

@NicolasHug
Member

For reference, I'm unable to reproduce on OSX with conda create -n new pytorch torchvision -c pytorch; the tests pass just fine.

PyTorch version: 1.9.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 11.3.1 (x86_64)
GCC version: Could not collect
Clang version: 12.0.0 (clang-1200.0.32.29)
CMake version: Could not collect
Libc version: N/A

Python version: 3.8 (64-bit runtime)
Python platform: macOS-10.16-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.2
[pip3] torch==1.9.0
[pip3] torchvision==0.10.0
[conda] blas                      1.0                         mkl
[conda] ffmpeg                    4.3                  h0a44026_0    pytorch
[conda] mkl                       2021.2.0           hecd8cb5_269
[conda] mkl-service               2.3.0            py38h9ed2024_1
[conda] mkl_fft                   1.3.0            py38h4a7008c_2
[conda] mkl_random                1.2.1            py38hb2f4e1b_2
[conda] numpy                     1.20.2           py38h4b4dc7a_0
[conda] numpy-base                1.20.2           py38he0bd621_0
[conda] pytorch                   1.9.0                   py3.8_0    pytorch
[conda] torchvision               0.10.0                 py38_cpu    pytorch

@fmassa
Member

fmassa commented Jun 16, 2021

FYI I've also tried with pip by doing

conda create -n test python=3.9
pip install torch torchvision

on a GPU machine and it worked fine.

Collecting environment information...
PyTorch version: 1.9.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.1 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.9 (64-bit runtime)
Python platform: Linux-5.4.0-52-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Quadro GP100
GPU 1: Quadro GP100

Nvidia driver version: 450.80.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.3
[pip3] torch==1.9.0
[pip3] torchvision==0.10.0
[conda] numpy                     1.20.3                    <pip>
[conda] torch                     1.9.0                     <pip>
[conda] torchvision               0.10.0                    <pip>

I'm now trying on conda with the same environment

@dodobyte

Have the same issue, installed with conda, also a GPU machine.

@NicolasHug
Member

On a Linux GPU machine it looks like torchvision 0.2.2 gets installed. I tried both with cuda 10.2 and 11.1 and both fail with AttributeError: module 'torchvision' has no attribute 'ops'.

conda create -n new python=3.9 pytorch torchvision cudatoolkit=10.2 -c pytorch

Collecting environment information...
PyTorch version: 1.9.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.18.4
Libc version: glibc-2.27

Python version: 3.9 (64-bit runtime)
Python platform: Linux-5.4.0-1041-aws-x86_64-with-glibc2.27
Is CUDA available: True
CUDA runtime version: 11.1.105
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB

Nvidia driver version: 450.80.02
cuDNN version: Probably one of the following:
/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7.6.5
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn.so.7.6.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.2
[pip3] torch==1.9.0
[pip3] torchvision==0.2.2
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               10.2.89              hfd86e86_1
[conda] mkl                       2021.2.0           h06a4308_296
[conda] mkl-service               2.3.0            py39h27cfd23_1
[conda] mkl_fft                   1.3.0            py39h42c9631_2
[conda] mkl_random                1.2.1            py39ha9443f7_2
[conda] numpy                     1.20.2           py39h2d18471_0
[conda] numpy-base                1.20.2           py39hfae3a4d_0
[conda] pytorch                   1.9.0           py3.9_cuda10.2_cudnn7.6.5_0    pytorch
[conda] torchvision               0.2.2                      py_3    pytorch

conda create -n new python=3.9 pytorch torchvision cudatoolkit=11.1 -c pytorch -c nvidia

PyTorch version: 1.9.0
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.18.4
Libc version: glibc-2.27

Python version: 3.9 (64-bit runtime)
Python platform: Linux-5.4.0-1041-aws-x86_64-with-glibc2.27
Is CUDA available: True
CUDA runtime version: 11.1.105
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB

Nvidia driver version: 450.80.02
cuDNN version: Probably one of the following:
/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7.6.5
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn.so.7.6.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.2
[pip3] torch==1.9.0
[pip3] torchvision==0.2.2
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               11.1.74              h6bb024c_0    nvidia
[conda] mkl                       2021.2.0           h06a4308_296
[conda] mkl-service               2.3.0            py39h27cfd23_1
[conda] mkl_fft                   1.3.0            py39h42c9631_2
[conda] mkl_random                1.2.1            py39ha9443f7_2
[conda] numpy                     1.20.2           py39h2d18471_0
[conda] numpy-base                1.20.2           py39hfae3a4d_0
[conda] pytorch                   1.9.0           py3.9_cuda11.1_cudnn8.0.5_0    pytorch
[conda] torchvision               0.2.2                      py_3    pytorch
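The two reports above share a recognizable pattern: the pytorch build string carries a CUDA tag, while the torchvision build string (`py_3`) does not. A rough sketch of how that pattern could be flagged from the `[conda]` lines of the environment report follows. The function names and the `cu<digits>` heuristic are assumptions made for illustration; this is not an official diagnostic.

```python
import re


def parse_conda_lines(report):
    """Extract name -> {version, build, channel} from '[conda] ...' report lines."""
    packages = {}
    for line in report.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[0] == "[conda]":
            packages[parts[1]] = {
                "version": parts[2],
                "build": parts[3],
                "channel": parts[4] if len(parts) > 4 else "",
            }
    return packages


def cuda_build_mismatch(packages):
    """Heuristic: pytorch is a CUDA build but torchvision's build has no cuXXX tag."""
    pt = packages.get("pytorch")
    tv = packages.get("torchvision")
    if pt is None or tv is None:
        return False
    pt_has_cuda = "cuda" in pt["build"]
    tv_has_cuda = re.search(r"cu\d+", tv["build"]) is not None
    return pt_has_cuda and not tv_has_cuda


if __name__ == "__main__":
    report = """\
[conda] pytorch                   1.9.0           py3.9_cuda11.1_cudnn8.0.5_0    pytorch
[conda] torchvision               0.2.2                      py_3    pytorch
"""
    print(cuda_build_mismatch(parse_conda_lines(report)))
```

A flagged environment suggests that the solver fell back to an old, pure-Python torchvision package rather than the CUDA build that matches the installed pytorch.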

@immanuelweber
Author

immanuelweber commented Jun 16, 2021

@NicolasHug regarding 0.2.2: yesterday I also observed that conda sometimes found only this version when I uninstalled torchvision 0.10.0 and reinstalled it, but I am unable to reproduce this at the moment.
The installation line you posted results in 0.10.0 being installed on my machine.

@fmassa
Member

fmassa commented Jun 16, 2021

Looking at https://anaconda.org/pytorch/torchvision/files, it seems that the py39_cu102 and py39_cu111 are available, so I'm not sure why it's not being found.

@malfet @seemethere there are problems with torchvision CUDA binaries on Linux for Python 3.9 (details in #4071 (comment)).

And I've just tried with Python 3.8, and even though I'm able to install matching versions, I get the same issue as originally reported in #4071 (comment)

In https://anaconda.org/pytorch/torchvision/files, the torchvision binaries are dated 14 days ago. Are we sure we copied the new ones that were regenerated? The torchvision RCs in https://anaconda.org/pytorch-test/torchvision/files were generated yesterday, so maybe we copied the wrong files when promoting the binaries?

@malfet
Contributor

malfet commented Jun 16, 2021

Hmm, sample code fails for me with

RuntimeError: boxes should be a 2d tensor, got 3D

This one works as expected:

$ python -c "import torch as th; import torchvision as tv; print(tv.ops.batched_nms(th.zeros(100, 4), th.zeros(100), th.zeros(100), 0.5))"
tensor([62, 74, 73, 72, 71, 70, 69, 68, 67, 66, 65, 64, 63, 75, 61, 60, 59, 58,
        57, 56, 55, 54, 53, 52, 51, 87, 99, 98, 97, 96, 95, 94, 93, 92, 91, 90,
        89, 88, 50, 86, 85, 84, 83, 82, 81, 80, 79, 78, 77, 76, 12, 24, 23, 22,
        21, 20, 19, 18, 17, 16, 15, 14, 13, 25, 11, 10,  9,  8,  7,  6,  5,  4,
         3,  2,  1, 37, 49, 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38,  0, 36,
        35, 34, 33, 32, 31, 30, 29, 28, 27, 26])

@immanuelweber
Author

@malfet you are right. I should have mentioned that I just created these tensors to satisfy the inputs without caring about actually correct input. It still fails with the above-mentioned error on my machine. I have, however, updated the sample code above accordingly.

@immanuelweber
Author

immanuelweber commented Jun 16, 2021

Since @fmassa pointed to https://anaconda.org/pytorch-test/torchvision/files, I just installed from there and the sample works.

@malfet
Contributor

malfet commented Jun 16, 2021

torchvision in the https://anaconda.org/pytorch channel was built against 9d5561b, whereas the one in https://anaconda.org/pytorch-test was built against ae9963f.
I guess promoting the package from one channel to the other should resolve the issue.

@fmassa
Member

fmassa commented Jun 16, 2021

@malfet yes, I've tested by installing torchvision from the pytorch-test channel and it works, so promoting the packages should fix the issue

@malfet note that there are no functional differences in the torchvision code between 9d5561b and ae9963f; the only difference is that the PyTorch version changed between the times the RCs were cut.
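To compare which commits two installed packages were built from, the build metadata can be printed from Python. The sketch below assumes that official binaries expose a `version` submodule with `git_version` and `cuda` attributes; every lookup falls back to `None`, so it also runs where those attributes or the packages themselves are missing. `describe_build` is a hypothetical helper name.

```python
import importlib


def describe_build(module_name):
    """Collect version and build metadata for a module, tolerating failures."""
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:  # missing package or broken binary
        return {"module": module_name, "installed": False, "error": repr(exc)}
    # Assumption: torch and torchvision ship a generated `version` submodule.
    version_info = getattr(module, "version", None)
    return {
        "module": module_name,
        "installed": True,
        "version": getattr(module, "__version__", None),
        "git_version": getattr(version_info, "git_version", None),
        "cuda": getattr(version_info, "cuda", None),
    }


if __name__ == "__main__":
    for name in ("torch", "torchvision"):
        print(describe_build(name))
```

Running this in an environment created from each channel and comparing the `git_version` fields is one way to tell whether two packages with the same version number actually come from the same build.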

@immanuelweber
Author

I just checked the packages on PyTorch channel, and they are up-to-date now and the code is working. Am I allowed to close this issue then?

NicolasHug added the release-issue (For release-related issues) label Jun 21, 2021
@NicolasHug
Member

@egonuel Could you please detail the command that you run and that's now working?

When I run conda create -n test python=3.9 pytorch torchvision cudatoolkit=11.1 -c pytorch -c nvidia, I still get torchvision 0.2.2, so it seems that not everything is fixed yet

@immanuelweber
Author

@NicolasHug mmhh, this seems to be a different issue. When I run the line you posted on my machine, everything is fine and 0.10.0 is being installed

@malfet
Contributor

malfet commented Jun 21, 2021

I think the difference can be explained by the presence or absence of conda-forge in one's .condarc. I got the repro after removing the conda-forge dependency, but then fixed it by enabling it in the install command as follows:

conda create -n test python=3.9 pytorch torchvision cudatoolkit=11.1 -c pytorch -c nvidia -c conda-forge

@ChouCHou-y

with the just released pytorch 1.9.0 and torchvision 0.10.0 torchvision.ops.batched_nms() crashes on my machine with the following error:

RuntimeError: Couldn't load custom C++ ops. This can happen if your PyTorch and torchvision versions are incompatible, or if you had errors while compiling torchvision from source. For further information on the compatible versions, check https://github.com/pytorch/vision#installation for the compatibility matrix. Please check your PyTorch version with torch.__version__ and your torchvision version with torchvision.__version__ and verify if they are compatible, and if not please reinstall torchvision so that it matches your PyTorch install.

How can I solve this, please?

@fmassa
Member

fmassa commented Aug 9, 2021

@ChouCHou-y this issue should have been fixed in #4240 (comment) , can you try uninstalling torchvision and installing it again?
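After reinstalling, a small smoke test can confirm whether the custom ops now load. The following is a minimal sketch; `smoke_test_nms` is a hypothetical helper, and the boxes are arbitrary valid inputs chosen for illustration. It reports a message instead of raising, so it can be run in any environment.

```python
def smoke_test_nms():
    """Try a tiny batched_nms call and describe the outcome as a string."""
    try:
        import torch
        import torchvision
    except Exception as exc:  # not installed, or the binary fails to import
        return f"import failed: {exc!r}"

    # Three valid (x1, y1, x2, y2) boxes; the first two overlap heavily.
    boxes = torch.tensor(
        [[0.0, 0.0, 10.0, 10.0], [1.0, 1.0, 11.0, 11.0], [50.0, 50.0, 60.0, 60.0]]
    )
    scores = torch.tensor([0.9, 0.8, 0.7])
    idxs = torch.tensor([0, 0, 1])  # class id per box

    try:
        keep = torchvision.ops.batched_nms(boxes, scores, idxs, 0.5)
    except (RuntimeError, AttributeError) as exc:
        # RuntimeError: custom C++ ops could not be loaded.
        # AttributeError: very old torchvision without the `ops` module.
        return f"batched_nms failed: {exc}"
    return f"ok, kept indices {keep.tolist()}"


if __name__ == "__main__":
    print(smoke_test_nms())
```

If the result starts with "batched_nms failed", the installed torchvision binary still does not match the installed PyTorch, and the environment report is the next thing to inspect.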
