
CUDA error trying to run demo following requirements #82

Open
AlverGant opened this issue May 21, 2020 · 10 comments
@AlverGant

Hi,
I installed following the requirements:
Ubuntu 16 on an AWS p2.xlarge (Tesla K80)
NVIDIA driver 384.130
CUDA release 9.0, V9.0.176
cuDNN libcudnn7_7.6.5.32-1+cuda9.0_amd64.deb
PyTorch==1.0.0, scipy==1.2.0
Python 3

All modules compiled without errors, but when running the example

CUDA_VISIBLE_DEVICES=0 python demo_MiddleBury.py

I got the following error:

revise the unique id to a random numer 91561
Namespace(SAVED_MODEL=None, alpha=[0.0, 1.0], arg='./model_weights/91561-Thu-May-21-18:36/args.txt', batch_size=1, channels=3, ctx_lr_coe=1.0, datasetName='Vimeo_90K_interp', datasetPath='', dataset_split=97, debug=False, depth_lr_coe=0.001, dtype=<class 'torch.cuda.FloatTensor'>, end_frame=100, epsilon=1e-06, factor=0.2, filter_lr_coe=1.0, filter_size=4, flow_lr_coe=0.01, force=False, frame_input_dir='/content/DAIN/input_frames', frame_output_dir='/content/DAIN/output_frames', log='./model_weights/91561-Thu-May-21-18:36/log.txt', lr=0.002, netName='DAIN', no_date=False, numEpoch=100, occ_lr_coe=1.0, patience=5, rectify_lr=0.001, save_path='./model_weights/91561-Thu-May-21-18:36', save_which=1, seed=1, start_frame=1, time_step=0.5, uid=None, use_cuda=True, use_cudnn=1, weight_decay=0, workers=8)
cudnn is used
The testing model weight is: ./model_weights/best.pth
The unique id for current testing is: 85504
RubberWhale
/home/ubuntu/.local/lib/python3.5/site-packages/torch/nn/modules/upsampling.py:129: UserWarning: nn.UpsamplingNearest2d is deprecated. Use nn.functional.interpolate instead.
warnings.warn("nn.{} is deprecated. Use nn.functional.interpolate instead.".format(self.name))
/home/ubuntu/.local/lib/python3.5/site-packages/torch/nn/modules/upsampling.py:129: UserWarning: nn.Upsample is deprecated. Use nn.functional.interpolate instead.
warnings.warn("nn.{} is deprecated. Use nn.functional.interpolate instead.".format(self.name))
/home/ubuntu/.local/lib/python3.5/site-packages/torch/nn/functional.py:2423: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
"See the documentation of nn.Upsample for details.".format(mode))
error in correlation_forward_cuda_kernel: no kernel image is available for execution on the device
Traceback (most recent call last):
File "demo_MiddleBury.py", line 131, in <module>
y_s,offset,filter = model(torch.stack((X0, X1),dim = 0))
File "/home/ubuntu/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/DAIN/networks/DAIN.py", line 152, in forward
time_offsets=time_offsets[::-1])
File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
self.gen.throw(type, value, traceback)
File "/home/ubuntu/.local/lib/python3.5/site-packages/torch/cuda/__init__.py", line 326, in stream
yield
File "/home/ubuntu/DAIN/networks/DAIN.py", line 149, in forward
self.forward_flownets(self.flownets, cur_offset_input, time_offsets=time_offsets),
File "/home/ubuntu/DAIN/networks/DAIN.py", line 205, in forward_flownets
temp = model(input) # this is a single direction motion results, but not a bidirectional one
File "/home/ubuntu/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/DAIN/PWCNet/PWCNet.py", line 221, in forward
corr6 = self.corr(c16, c26)
File "/home/ubuntu/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/DAIN/PWCNet/correlation_package_pytorch1_0/correlation.py", line 59, in forward
result = CorrelationFunction(self.pad_size, self.kernel_size, self.max_displacement,self.stride1, self.stride2, self.corr_multiply)(input1, input2)
File "/home/ubuntu/DAIN/PWCNet/correlation_package_pytorch1_0/correlation.py", line 27, in forward
self.pad_size, self.kernel_size, self.max_displacement,self.stride1, self.stride2, self.corr_multiply)
RuntimeError: CUDA call failed (correlation_forward_cuda at correlation_cuda.cc:80)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f3d75f03fe1 in /home/ubuntu/.local/lib/python3.5/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f3d75f03dfa in /home/ubuntu/.local/lib/python3.5/site-packages/torch/lib/libc10.so)
frame #2: correlation_forward_cuda(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, int, int, int, int, int, int) + 0x5e6 (0x7f3d7299ef26 in /usr/local/lib/python3.5/dist-packages/correlation_cuda-0.0.0-py3.5-linux-x86_64.egg/correlation_cuda.cpython-35m-x86_64-linux-gnu.so)
frame #3: + 0x16042 (0x7f3d729ab042 in /usr/local/lib/python3.5/dist-packages/correlation_cuda-0.0.0-py3.5-linux-x86_64.egg/correlation_cuda.cpython-35m-x86_64-linux-gnu.so)
frame #4: + 0x1627e (0x7f3d729ab27e in /usr/local/lib/python3.5/dist-packages/correlation_cuda-0.0.0-py3.5-linux-x86_64.egg/correlation_cuda.cpython-35m-x86_64-linux-gnu.so)
frame #5: + 0x12e76 (0x7f3d729a7e76 in /usr/local/lib/python3.5/dist-packages/correlation_cuda-0.0.0-py3.5-linux-x86_64.egg/correlation_cuda.cpython-35m-x86_64-linux-gnu.so)

frame #9: python3() [0x4ec9a3]
frame #11: python3() [0x4fc63e]
frame #14: THPFunction_do_forward(THPFunction*, _object*) + 0x15c (0x7f3db0235bdc in /home/ubuntu/.local/lib/python3.5/site-packages/torch/lib/libtorch_python.so)
frame #17: python3() [0x5b4846]
frame #21: python3() [0x4ecab7]
frame #25: python3() [0x4ec9a3]
frame #27: python3() [0x4fc63e]
frame #29: python3() [0x5b4846]
frame #33: python3() [0x4ecab7]
frame #37: python3() [0x4ec9a3]
frame #39: python3() [0x4fc63e]
frame #41: python3() [0x5b4846]
frame #44: python3() [0x54548f]
frame #47: python3() [0x4ecab7]
frame #51: python3() [0x4ec9a3]
frame #53: python3() [0x4fc63e]
frame #55: python3() [0x5b4846]
frame #58: python3() [0x544f43]
frame #60: python3() [0x622642]

Please help!

@AlverGant
Author

I have tested CUDA and PyTorch using this demo code and it ran correctly on the GPU:
https://github.com/jcjohnson/pytorch-examples (PyTorch tensors)
I also tested cuDNN with the MNIST sample code that ships with cuDNN.
The whole installation appears to be normal.
Python version is 3.5.2.

@AlverGant
Author

Solved the issue by adding compute_30 to every setup.py and recompiling all the torch modules.
It was necessary for the Tesla K80.
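For anyone following along, here is a minimal sketch of that kind of edit, assuming DAIN's usual per-extension setup.py layout. The exact flag list is an assumption, not the repo's verbatim contents; note that the K80 itself reports compute capability 3.7, so compute_30 reaches it via the PTX JIT.

```python
# Sketch of the nvcc flag list appended in each extension's setup.py
# (e.g. my_package/*/setup.py and the PWCNet correlation package).
# The exact entries are an assumption; compute_30 is what worked here.
nvcc_args = [
    '-gencode', 'arch=compute_30,code=sm_30',  # Kepler; covers K80 via PTX JIT
    '-gencode', 'arch=compute_37,code=sm_37',  # K80's native architecture
    '-gencode', 'arch=compute_60,code=sm_60',  # Pascal (e.g. P100)
    '-gencode', 'arch=compute_70,code=sm_70',  # Volta (e.g. V100)
]

# These are then passed to torch.utils.cpp_extension.CUDAExtension, roughly:
#   CUDAExtension('correlation_cuda', sources,
#                 extra_compile_args={'cxx': [], 'nvcc': nvcc_args})
print(len(nvcc_args) // 2, 'gencode pairs')
```

After editing, each extension has to be rebuilt (python setup.py install) for the new kernels to exist in the compiled .so files.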

@Exspiravit

@AlverGant Hi, I tried adding compute_30 like this:
[image]
and recompiled the modules, but the error persists.

@AlverGant AlverGant reopened this May 25, 2020
@AlverGant
Author

> @AlverGant Hi, I tried adding compute_30 like this: [image] and recompiled the modules, but the error persists.

Hi @Exspiravit, that is exactly what I did. Is your GPU a Tesla K80? If not, you have to set the architecture accordingly; see http://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/
Are you using CUDA 9.0 and cuDNN 7.6.5.32-1?
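As a sanity check on the architecture question: the right flag can be derived from the device's compute capability. The gencode_flags helper below is hypothetical (not part of DAIN); on a live machine you would feed it the tuple returned by torch.cuda.get_device_capability(0), which for a Tesla K80 is (3, 7).

```python
def gencode_flags(major, minor):
    """Format the nvcc -gencode pair for a compute capability (major, minor)."""
    arch = 'compute_{0}{1}'.format(major, minor)
    code = 'sm_{0}{1}'.format(major, minor)
    return ['-gencode', 'arch={0},code={1}'.format(arch, code)]

# e.g. feed in torch.cuda.get_device_capability(0); a Tesla K80 is (3, 7):
print(gencode_flags(3, 7))  # ['-gencode', 'arch=compute_37,code=sm_37']
```

If the flag baked into the extension does not match (or PTX-cover) the device's capability, you get exactly the "no kernel image is available for execution on the device" error above.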

@Exspiravit

[images]
Yes, I use a Tesla K80 GPU, with cuDNN and CUDA 10.0; I don't know what differences there are from your setup.

@AlverGant
Author

You are getting the same error I had ("CUDA call failed"); the difference is that I am using CUDA 9 and torch 1.0.0, as recommended in the developer's requirements.

@AlverGant
Author

And I am using an older driver, NVIDIA-SMI 384.130.

@AlphaGit
Contributor

AlphaGit commented Jun 7, 2020

Hey all, I just created a PR with an updated version of the Colab notebook that should give you the right dependencies and gencode versions. I was able to test it successfully with T4, P4 and K80 GPUs.

You can check it out at #87, or check the Colab file before it's merged from (here).

@RaspberryProgramming

Colab Pro allows for Tesla V100 GPUs, which require '-gencode', 'arch=compute_70,code=sm_70' in the compiler-args file, DAIN/my_package/compiler_args.py.

By default, only '-gencode', 'arch=compute_75,code=sm_75' is uncommented. To fix it, I needed to uncomment line 44, which has '-gencode', 'arch=compute_70,code=sm_70'; maybe this isn't the correct way, but I also commented out line 42, which had '-gencode', 'arch=compute_75,code=sm_75'.

I then recompiled DAIN and the DAIN PyTorch extensions. Sorry if this is confusing; I'm not exactly experienced at this stuff.
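In list form, the relevant part of compiler_args.py ends up looking roughly like this after the edit. This is a sketch, not the file verbatim; only the two flags discussed are shown.

```python
# Sketch of the nvcc_args list in DAIN/my_package/compiler_args.py after the
# edit for a V100: the Volta line enabled, the Turing line commented out.
nvcc_args = [
    '-gencode', 'arch=compute_70,code=sm_70',    # V100 (Volta), uncommented
    # '-gencode', 'arch=compute_75,code=sm_75',  # T4 (Turing), commented out
]
print(nvcc_args)
```

Keeping both lines uncommented would also work, at the cost of a slightly longer build and larger binaries, and would cover both GPU types in one compile.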

@laomao0

laomao0 commented Nov 7, 2021

If you do not want to build the CUDA programs, we provide CuPy versions of those packages; the CuPy files do not need to be built. Please refer to:
https://github.com/laomao0/cupy_packages
