
CUDA error trying to run demo following requirements #82

Open
AlverGant opened this issue May 21, 2020 · 10 comments
@AlverGant

Hi,
I installed following the requirements:
Ubuntu 16 on an AWS p2.xlarge (Tesla K80)
NVIDIA driver 384.130
CUDA release 9.0, V9.0.176
cuDNN libcudnn7_7.6.5.32-1+cuda9.0_amd64.deb
PyTorch==1.0.0, scipy==1.2.0
Python 3

All modules compiled without errors, but when running the example

CUDA_VISIBLE_DEVICES=0 python demo_MiddleBury.py

I got the following error:

revise the unique id to a random numer 91561
Namespace(SAVED_MODEL=None, alpha=[0.0, 1.0], arg='./model_weights/91561-Thu-May-21-18:36/args.txt', batch_size=1, channels=3, ctx_lr_coe=1.0, datasetName='Vimeo_90K_interp', datasetPath='', dataset_split=97, debug=False, depth_lr_coe=0.001, dtype=<class 'torch.cuda.FloatTensor'>, end_frame=100, epsilon=1e-06, factor=0.2, filter_lr_coe=1.0, filter_size=4, flow_lr_coe=0.01, force=False, frame_input_dir='/content/DAIN/input_frames', frame_output_dir='/content/DAIN/output_frames', log='./model_weights/91561-Thu-May-21-18:36/log.txt', lr=0.002, netName='DAIN', no_date=False, numEpoch=100, occ_lr_coe=1.0, patience=5, rectify_lr=0.001, save_path='./model_weights/91561-Thu-May-21-18:36', save_which=1, seed=1, start_frame=1, time_step=0.5, uid=None, use_cuda=True, use_cudnn=1, weight_decay=0, workers=8)
cudnn is used
The testing model weight is: ./model_weights/best.pth
The unique id for current testing is: 85504
RubberWhale
/home/ubuntu/.local/lib/python3.5/site-packages/torch/nn/modules/upsampling.py:129: UserWarning: nn.UpsamplingNearest2d is deprecated. Use nn.functional.interpolate instead.
warnings.warn("nn.{} is deprecated. Use nn.functional.interpolate instead.".format(self.name))
/home/ubuntu/.local/lib/python3.5/site-packages/torch/nn/modules/upsampling.py:129: UserWarning: nn.Upsample is deprecated. Use nn.functional.interpolate instead.
warnings.warn("nn.{} is deprecated. Use nn.functional.interpolate instead.".format(self.name))
/home/ubuntu/.local/lib/python3.5/site-packages/torch/nn/functional.py:2423: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
"See the documentation of nn.Upsample for details.".format(mode))
error in correlation_forward_cuda_kernel: no kernel image is available for execution on the device
Traceback (most recent call last):
File "demo_MiddleBury.py", line 131, in <module>
y_s,offset,filter = model(torch.stack((X0, X1),dim = 0))
File "/home/ubuntu/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/DAIN/networks/DAIN.py", line 152, in forward
time_offsets=time_offsets[::-1])
File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
self.gen.throw(type, value, traceback)
File "/home/ubuntu/.local/lib/python3.5/site-packages/torch/cuda/__init__.py", line 326, in stream
yield
File "/home/ubuntu/DAIN/networks/DAIN.py", line 149, in forward
self.forward_flownets(self.flownets, cur_offset_input, time_offsets=time_offsets),
File "/home/ubuntu/DAIN/networks/DAIN.py", line 205, in forward_flownets
temp = model(input) # this is a single direction motion results, but not a bidirectional one
File "/home/ubuntu/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/DAIN/PWCNet/PWCNet.py", line 221, in forward
corr6 = self.corr(c16, c26)
File "/home/ubuntu/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/DAIN/PWCNet/correlation_package_pytorch1_0/correlation.py", line 59, in forward
result = CorrelationFunction(self.pad_size, self.kernel_size, self.max_displacement,self.stride1, self.stride2, self.corr_multiply)(input1, input2)
File "/home/ubuntu/DAIN/PWCNet/correlation_package_pytorch1_0/correlation.py", line 27, in forward
self.pad_size, self.kernel_size, self.max_displacement,self.stride1, self.stride2, self.corr_multiply)
RuntimeError: CUDA call failed (correlation_forward_cuda at correlation_cuda.cc:80)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f3d75f03fe1 in /home/ubuntu/.local/lib/python3.5/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f3d75f03dfa in /home/ubuntu/.local/lib/python3.5/site-packages/torch/lib/libc10.so)
frame #2: correlation_forward_cuda(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, int, int, int, int, int, int) + 0x5e6 (0x7f3d7299ef26 in /usr/local/lib/python3.5/dist-packages/correlation_cuda-0.0.0-py3.5-linux-x86_64.egg/correlation_cuda.cpython-35m-x86_64-linux-gnu.so)
frame #3: + 0x16042 (0x7f3d729ab042 in /usr/local/lib/python3.5/dist-packages/correlation_cuda-0.0.0-py3.5-linux-x86_64.egg/correlation_cuda.cpython-35m-x86_64-linux-gnu.so)
frame #4: + 0x1627e (0x7f3d729ab27e in /usr/local/lib/python3.5/dist-packages/correlation_cuda-0.0.0-py3.5-linux-x86_64.egg/correlation_cuda.cpython-35m-x86_64-linux-gnu.so)
frame #5: + 0x12e76 (0x7f3d729a7e76 in /usr/local/lib/python3.5/dist-packages/correlation_cuda-0.0.0-py3.5-linux-x86_64.egg/correlation_cuda.cpython-35m-x86_64-linux-gnu.so)

frame #9: python3() [0x4ec9a3]
frame #11: python3() [0x4fc63e]
frame #14: THPFunction_do_forward(THPFunction*, _object*) + 0x15c (0x7f3db0235bdc in /home/ubuntu/.local/lib/python3.5/site-packages/torch/lib/libtorch_python.so)
frame #17: python3() [0x5b4846]
frame #21: python3() [0x4ecab7]
frame #25: python3() [0x4ec9a3]
frame #27: python3() [0x4fc63e]
frame #29: python3() [0x5b4846]
frame #33: python3() [0x4ecab7]
frame #37: python3() [0x4ec9a3]
frame #39: python3() [0x4fc63e]
frame #41: python3() [0x5b4846]
frame #44: python3() [0x54548f]
frame #47: python3() [0x4ecab7]
frame #51: python3() [0x4ec9a3]
frame #53: python3() [0x4fc63e]
frame #55: python3() [0x5b4846]
frame #58: python3() [0x544f43]
frame #60: python3() [0x622642]

Please help!

@AlverGant
Author

I have tested CUDA and PyTorch using this demo code and it ran correctly on the GPU:
https://github.com/jcjohnson/pytorch-examples (PyTorch tensors)
I also tested cuDNN with the MNIST sample code that ships with cuDNN.
The whole installation appears to be normal.
Python version is 3.5.2.

@AlverGant
Author

Solved the issue by adding compute_30 to every setup.py and recompiling all the torch modules.
It was necessary for the Tesla K80.
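For anyone following along, here is a minimal sketch of that kind of edit, assuming DAIN's usual per-extension setup.py layout. The exact flag list is an assumption, not the repo's verbatim contents; note that the K80 itself reports compute capability 3.7, so compute_30 reaches it via the PTX JIT.

```python
# Sketch of the nvcc flag list appended in each extension's setup.py
# (e.g. my_package/*/setup.py and the PWCNet correlation package).
# The exact entries are an assumption; compute_30 is what worked here.
nvcc_args = [
    '-gencode', 'arch=compute_30,code=sm_30',  # Kepler; covers K80 via PTX JIT
    '-gencode', 'arch=compute_37,code=sm_37',  # K80's native architecture
    '-gencode', 'arch=compute_60,code=sm_60',  # Pascal (e.g. P100)
    '-gencode', 'arch=compute_70,code=sm_70',  # Volta (e.g. V100)
]

# These are then passed to torch.utils.cpp_extension.CUDAExtension, roughly:
#   CUDAExtension('correlation_cuda', sources,
#                 extra_compile_args={'cxx': [], 'nvcc': nvcc_args})
print(len(nvcc_args) // 2, 'gencode pairs')
```

After editing, each extension has to be rebuilt (python setup.py install) for the new kernels to exist in the compiled .so files.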

@Exspiravit

@AlverGant Hi, I tried adding compute_30 like this:
[image]
and recompiled the modules, but the error persists.

@AlverGant AlverGant reopened this May 25, 2020
@AlverGant
Author

> @AlverGant Hi, I tried adding compute_30 like this: [image] and recompiled the modules, but the error persists.

Hi @Exspiravit, that is exactly what I did. Is your GPU a Tesla K80? If not, you have to set the architecture accordingly; see http://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/
Are you using CUDA 9.0 and cuDNN 7.6.5.32-1?
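As a sanity check on the architecture question: the right flag can be derived from the device's compute capability. The gencode_flags helper below is hypothetical (not part of DAIN); on a live machine you would feed it the tuple returned by torch.cuda.get_device_capability(0), which for a Tesla K80 is (3, 7).

```python
def gencode_flags(major, minor):
    """Format the nvcc -gencode pair for a compute capability (major, minor)."""
    arch = 'compute_{0}{1}'.format(major, minor)
    code = 'sm_{0}{1}'.format(major, minor)
    return ['-gencode', 'arch={0},code={1}'.format(arch, code)]

# e.g. feed in torch.cuda.get_device_capability(0); a Tesla K80 is (3, 7):
print(gencode_flags(3, 7))  # ['-gencode', 'arch=compute_37,code=sm_37']
```

If the flag baked into the extension does not match (or PTX-cover) the device's capability, you get exactly the "no kernel image is available for execution on the device" error above.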

@Exspiravit

[images]
Yes, I use a Tesla K80 GPU, with cuDNN and CUDA 10.0; I don't know what differences there are from your setup.

@AlverGant
Author

You are getting the same error I had ("CUDA call failed"); the difference is that I am using CUDA 9 and torch 1.0.0, as recommended in the developer's requirements.

@AlverGant
Author

And I am using an older driver, NVIDIA-SMI 384.130.

@AlphaGit
Contributor

AlphaGit commented Jun 7, 2020

Hey all, I just created a PR with an updated version of the Colab notebook that should give you the right dependencies and gencode versions. I was able to test it successfully with T4, P4 and K80 GPUs.

You can check it out at #87, or check the Colab file before it's merged from (here).

@RaspberryProgramming

Colab Pro allows for Tesla V100 GPUs, which require '-gencode', 'arch=compute_70,code=sm_70' in the compiler-args file, DAIN/my_package/compiler_args.py.

By default, only '-gencode', 'arch=compute_75,code=sm_75' is uncommented. To fix it, I needed to uncomment line 44, which has '-gencode', 'arch=compute_70,code=sm_70'; maybe this isn't the correct way, but I also commented out line 42, which had '-gencode', 'arch=compute_75,code=sm_75'.

I then recompiled DAIN and the DAIN PyTorch extensions. Sorry if this is confusing; I'm not exactly experienced at this stuff.
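In list form, the relevant part of compiler_args.py ends up looking roughly like this after the edit. This is a sketch, not the file verbatim; only the two flags discussed are shown.

```python
# Sketch of the nvcc_args list in DAIN/my_package/compiler_args.py after the
# edit for a V100: the Volta line enabled, the Turing line commented out.
nvcc_args = [
    '-gencode', 'arch=compute_70,code=sm_70',    # V100 (Volta), uncommented
    # '-gencode', 'arch=compute_75,code=sm_75',  # T4 (Turing), commented out
]
print(nvcc_args)
```

Keeping both lines uncommented would also work, at the cost of a slightly longer build and larger binaries, and would cover both GPU types in one compile.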

@laomao0

laomao0 commented Nov 7, 2021

If you do not want to build the CUDA programs, we provide CuPy versions of those packages; the CuPy files do not need to be built. Please refer to:
https://github.com/laomao0/cupy_packages
