This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

rcnn example throws CUDNN_STATUS_BAD_PARAM when running under cudnn 6.0 #11240

Closed
ghost opened this issue Jun 12, 2018 · 8 comments

Comments

@ghost

ghost commented Jun 12, 2018

After I updated MXNet from 1.1.0 to 1.2.0 and built the repository with CUDA 8.0.61 and cuDNN 6.0, rcnn training throws the following error when evaluating the RPN accuracy.

check failed: e == cuDNN: CUDNN_STATUS_SUCCESS(3 vs. 0) cuDNN: CUDNN_STATUS_BAD_PARAM

The error occurred when executing the following code in example/rcnn/rcnn/core/metric.py:
pred_label = mx.ndarray.argmax_channel(pred).asnumpy().astype('int32')
Any ideas on how to address this without disabling cuDNN or rolling back to an earlier version?

@kalyc
Contributor

kalyc commented Jun 12, 2018

Thanks for submitting this issue, @xioryu. You could also post this on discuss.mxnet.io for further help with MXNet usage.

@kalyc
Contributor

kalyc commented Jun 15, 2018

@nswamy could you add the labels "Question" and "CUDA" to this?

@vrakesh
Contributor

vrakesh commented Jun 18, 2018

@nswamy requesting a "Question" or "CUDA" label for this issue.

@ijkguo
Contributor

ijkguo commented Jul 13, 2018

This was probably related to the old SoftmaxActivation layer, which was replaced with mx.sym.softmax in #11373.

@thomelane
Contributor

@xioryu are you able to provide some sample code that reproduces this issue? Many thanks!

It's not possible to diagnose from just knowing the error occurred on line pred_label = mx.ndarray.argmax_channel(pred).asnumpy().astype('int32'). I'd expect any fatal error in the network to appear when this line is run, just because .asnumpy() blocks and waits for all the async operations to complete (i.e. waits for the network computation to complete).
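To illustrate why the traceback points at the blocking line rather than at the operator that actually failed, here is a minimal sketch that uses only the Python standard library as an analogy. It does not use MXNet or reproduce its engine: `failing_forward` and `run_pipeline` are hypothetical names, the error message is simulated, and `future.result()` merely plays the role that `.asnumpy()` plays in the line above.

```python
from concurrent.futures import ThreadPoolExecutor


def failing_forward():
    # Hypothetical stand-in for an asynchronously executed operator that fails.
    # The message is simulated; no cuDNN call is made here.
    raise RuntimeError("CUDNN_STATUS_BAD_PARAM (simulated)")


def run_pipeline():
    events = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Submitting the work returns immediately and raises nothing,
        # even though the work itself is going to fail.
        future = pool.submit(failing_forward)
        events.append("submitted without error")
        try:
            # Blocking on the result is where the failure becomes visible,
            # analogous to a blocking .asnumpy() call.
            future.result()
        except RuntimeError as exc:
            events.append(f"error surfaced at blocking call: {exc}")
    return events


if __name__ == "__main__":
    for line in run_pipeline():
        print(line)
```

In MXNet itself, adding `mx.nd.waitall()` after suspect steps, or running with the environment variable `MXNET_ENGINE_TYPE=NaiveEngine` so that operators execute synchronously, may help narrow down which operator fails. Whether that is enough to localize this particular cuDNN error is an assumption.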

@ghost
Author

ghost commented Aug 20, 2018

@thomelane The new simplified repo is OK.

@ijkguo
Contributor

ijkguo commented Aug 20, 2018

The cause of this issue is the SoftmaxActivation operator, which was used in the old, more complex rcnn example. Two fixes were made, either of which resolves this issue:

@thomelane
Contributor

@xioryu @ijkguo great, and thanks for confirming!

@sandeep-krishnamurthy good to close this ticket now, cheers.
