
[Test Failure] GPU Test failures across different CUDA versions #14502

Closed
perdasilva opened this issue Mar 22, 2019 · 6 comments

Comments

@perdasilva
Contributor

perdasilva commented Mar 22, 2019

Description

I'm testing the mxnet library, compiled for the Python distribution, against different versions of CUDA.

I'm getting strange failures on all CUDA versions. The tests are run on a g3.8xlarge instance, within a docker container based on the nvidia/cuda:XXX-cudnn7-devel-ubuntu16.04 image (where XXX is the particular CUDA version).

======================================================================
ERROR: test_gluon_gpu.test_lstmp
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 177, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 110, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/test_gluon_gpu.py", line 124, in test_lstmp
    check_rnn_layer_forward(gluon.rnn.LSTM(10, 2, projection_size=5), mx.nd.ones((8, 3, 20)))
  File "/work/mxnet/tests/python/gpu/../unittest/test_gluon_rnn.py", line 441, in check_rnn_layer_forward
    out = layer(inputs)
  File "/work/mxnet/python/mxnet/gluon/block.py", line 540, in __call__
    out = self.forward(*args)
  File "/work/mxnet/python/mxnet/gluon/block.py", line 917, in forward
    return self.hybrid_forward(ndarray, x, *args, **params)
  File "/work/mxnet/python/mxnet/gluon/rnn/rnn_layer.py", line 239, in hybrid_forward
    out = self._forward_kernel(F, inputs, states, **kwargs)
  File "/work/mxnet/python/mxnet/gluon/rnn/rnn_layer.py", line 270, in _forward_kernel
    lstm_state_clip_nan=self._lstm_state_clip_nan)
  File "<string>", line 145, in RNN
  File "/work/mxnet/python/mxnet/_ctypes/ndarray.py", line 92, in _imperative_invoke
    ctypes.byref(out_stypes)))
  File "/work/mxnet/python/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [14:22:01] src/operator/./rnn-inl.h:385: hidden layer projection is only supported for GPU with CuDNN later than 7.1.1

Stack trace returned 10 entries:
[bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x42c70a) [0x7fcb9e25170a]
[bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x42cd31) [0x7fcb9e251d31]
[bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x3495ea8) [0x7fcba12baea8]
[bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x349612e) [0x7fcba12bb12e]
[bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x30ea87f) [0x7fcba0f0f87f]
[bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x75f9c5) [0x7fcb9e5849c5]
[bt] (6) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::Imperative::InvokeOp(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode, mxnet::OpStatePtr)+0xb35) [0x7fcba0cdcc45]
[bt] (7) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::Imperative::Invoke(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&)+0x38c) [0x7fcba0cdd1cc]
[bt] (8) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x2db2d09) [0x7fcba0bd7d09]
[bt] (9) /work/mxnet/python/mxnet/../../lib/libmxnet.so(MXImperativeInvokeEx+0x6f) [0x7fcba0bd82ff]


-------------------- >> begin captured stdout << ---------------------
checking gradient for lstm0_l0_h2h_bias
checking gradient for lstm0_l0_h2h_weight
checking gradient for lstm0_l0_i2h_weight
checking gradient for lstm0_l0_i2h_bias
checking gradient for lstm0_l0_h2r_weight

--------------------- >> end captured stdout << ----------------------
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1414687138 to reproduce.
--------------------- >> end captured logging << ---------------------
======================================================================
ERROR: test_gluon_gpu.test_gluon_ctc_consistency
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/gpu/test_gluon_gpu.py", line 248, in test_gluon_ctc_consistency
    assert_almost_equal(cpu_data.grad.asnumpy(), gpu_data.grad.asnumpy(), atol=1e-3, rtol=1e-3)
  File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 1995, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/work/mxnet/python/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:27:32] src/operator/./cudnn_rnn-inl.h:759: Check failed: e == CUDNN_STATUS_SUCCESS (6 vs. 0) cuDNN: CUDNN_STATUS_ARCH_MISMATCH

Stack trace returned 10 entries:
[bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace()+0x54) [0x7f8f4dee2fe4]
[bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x2b) [0x7f8f4dee354b]
[bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::CuDNNRNNOp<mshadow::half::half_t>::Init(mshadow::Stream<mshadow::gpu>*, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x23be) [0x7f8f545bd02e]
[bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::CuDNNRNNOp<mshadow::half::half_t>::Forward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x1373) [0x7f8f545c0323]
[bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::OperatorState::Forward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x4ac) [0x7f8f50d01afc]
[bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::LegacyOpForward(mxnet::OpStatePtr const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x18) [0x7f8f50cf6e98]
[bt] (6) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::OpStatePtr const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&), void (*)(mxnet::OpStatePtr const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)>::_M_invoke(std::_Any_data const&, mxnet::OpStatePtr const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x20) [0x7f8f50b04140]
[bt] (7) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#3}::operator()(mxnet::RunContext, mxnet::engine::CallbackOnComplete) const+0x386) [0x7f8f50dae686]
[bt] (8) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext)#4}>::_M_invoke(std::_Any_data const&, mxnet::RunContext)+0x6d) [0x7f8f50dafd4d]
[bt] (9) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x5c69fad) [0x7f8f5161bfad]

======================================================================
ERROR: test_gluon_gpu.test_rnn_layers_fp16
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/usr/lib/python3.6/site-packages/nose/util.py", line 620, in newfunc
    return func(*arg, **kw)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 110, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/../unittest/test_gluon_rnn.py", line 545, in test_rnn_layers_fp16
    run_rnn_layers('float16', 'float32', mx.gpu())
  File "/work/mxnet/tests/python/gpu/../unittest/test_gluon_rnn.py", line 479, in run_rnn_layers
    check_rnn_layer_forward(gluon.rnn.RNN(10, 2, dtype=dtype), mx.nd.ones((8, 3, 20), dtype=dtype), ctx=ctx)
  File "/work/mxnet/tests/python/gpu/../unittest/test_gluon_rnn.py", line 451, in check_rnn_layer_forward
    np_out = out.asnumpy()
  File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 1995, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/work/mxnet/python/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:29:19] src/operator/./cudnn_rnn-inl.h:759: Check failed: e == CUDNN_STATUS_SUCCESS (6 vs. 0) cuDNN: CUDNN_STATUS_ARCH_MISMATCH

Stack trace returned 10 entries:
[bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace()+0x54) [0x7f8f4dee2fe4]
[bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x2b) [0x7f8f4dee354b]
[bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::CuDNNRNNOp<mshadow::half::half_t>::Init(mshadow::Stream<mshadow::gpu>*, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x23be) [0x7f8f545bd02e]
[bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::CuDNNRNNOp<mshadow::half::half_t>::Forward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x1373) [0x7f8f545c0323]
[bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::OperatorState::Forward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x4ac) [0x7f8f50d01afc]
[bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::LegacyOpForward(mxnet::OpStatePtr const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x18) [0x7f8f50cf6e98]
[bt] (6) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::OpStatePtr const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&), void (*)(mxnet::OpStatePtr const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)>::_M_invoke(std::_Any_data const&, mxnet::OpStatePtr const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x20) [0x7f8f50b04140]
[bt] (7) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#3}::operator()(mxnet::RunContext, mxnet::engine::CallbackOnComplete) const+0x386) [0x7f8f50dae686]
[bt] (8) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext)#4}>::_M_invoke(std::_Any_data const&, mxnet::RunContext)+0x6d) [0x7f8f50dafd4d]
[bt] (9) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x5c69fad) [0x7f8f5161bfad]
----------------------------------------------------------------------

I have not yet tried to reproduce it separately, outside of Docker, on a GPU machine using the current pip package for 1.4.0 or the nightly build; it could be that this is only an issue on master.

I find it strange that the PRs aren't breaking, since they seem to be based off the same docker image I'm using and run on the same instance type.
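
For reference, this is a minimal sketch of what the failing test_lstmp case boils down to, based on the check_rnn_layer_forward call in the traceback above (the layer configuration and input shape come from the test; according to the error message, the projection path needs cuDNN newer than 7.1.1, so on a build that doesn't meet that requirement this should raise the same MXNetError):

import mxnet as mx
from mxnet import gluon

# Same configuration as test_lstmp / check_rnn_layer_forward:
# a 2-layer LSTM with hidden size 10 and projection_size=5, run on the GPU.
layer = gluon.rnn.LSTM(10, 2, projection_size=5)
layer.initialize(ctx=mx.gpu(0))

inputs = mx.nd.ones((8, 3, 20), ctx=mx.gpu(0))
out = layer(inputs)   # fails here when the projection path is unsupported
out.wait_to_read()    # force the async engine to surface the error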

@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Test, Cuda

@zachgk
Contributor

zachgk commented Mar 22, 2019

@mxnet-label-bot add [CUDA, Docker, Bug]

@perdasilva
Contributor Author

perdasilva commented Mar 25, 2019

Perhaps the reason CI isn't reporting any issues is that both the cuDNN version we set in the environment and the cuDNN version set by the image are lower than what the test requires.

So, as it stands, the test just ensures that an error is raised (a rough sketch of what that amounts to is below). I don't know if (m)any of the cuDNN functions are actually being exercised right now, so there may be bugs slipping through.
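
Roughly, this is what the test effectively checks depending on the build's cuDNN (illustrative only, not the real code in test_gluon_gpu.py):

import mxnet as mx
from mxnet import gluon
from mxnet.base import MXNetError

def run_lstmp():
    layer = gluon.rnn.LSTM(10, 2, projection_size=5)
    layer.initialize(ctx=mx.gpu(0))
    return layer(mx.nd.ones((8, 3, 20), ctx=mx.gpu(0)))

# With a cuDNN older than the projection path requires, the call raises and
# the only thing "covered" is the error path:
try:
    run_lstmp().wait_to_read()
except MXNetError:
    pass
# With a new enough cuDNN, the forward pass actually executes, and that is the
# code path CI is currently not exercising.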

@perdasilva
Copy link
Contributor Author

perdasilva commented Mar 25, 2019

Created a PR (#14513) updating CI to use CUDA v10.0 and to use the CUDNN_VERSION environment variable value set by the base container, to see if I can catch this issue in CI.
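
For illustration, using the CUDNN_VERSION value from the base container would look roughly like this (the parsing and the helper name are just a sketch, not the actual CI code):

import os

def cudnn_version_from_env():
    # The nvidia/cuda *-cudnn7-* base images export CUDNN_VERSION, e.g. "7.x.y.z".
    ver = os.environ.get('CUDNN_VERSION', '')
    return tuple(int(p) for p in ver.split('.')[:3]) if ver else None

# CI (or a test) could then decide whether the build's cuDNN actually supports
# features like the LSTM projection path instead of hard-coding a version.
print(cudnn_version_from_env())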

@perdasilva
Contributor Author

perdasilva commented Mar 26, 2019

I've figured out why the test_lstmp test was failing and have submitted a PR (#14529) to fix that issue.

I have also reached out to @DickJC123 to see if his CUDA expertise can help with the other errors related to e == CUDNN_STATUS_SUCCESS (6 vs. 0) cuDNN: CUDNN_STATUS_ARCH_MISMATCH, which I've started experiencing in the CI PR (#14513).

Here are some Jenkins logs:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-14513/6/pipeline/271/

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-14513/6/pipeline/77

@perdasilva
Contributor Author

Closing this issue in favor of #14652
