Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

keras-mxnet training failed with FusedOP #16785

Closed
roywei opened this issue Nov 12, 2019 · 3 comments · Fixed by #16798
Closed

keras-mxnet training failed with FusedOP #16785

roywei opened this issue Nov 12, 2019 · 3 comments · Fixed by #16798
Assignees
Labels

Comments

@roywei
Copy link
Member

roywei commented Nov 12, 2019

The following Keras-MXNet example break after FusedOP is introduced in mxnet #15167

https://github.com/awslabs/keras-apache-mxnet/blob/master/examples/cifar10_resnet.py

error:

mxnet.base.MXNetError: [22:37:39] src/nnvm/gradient.cc:213: Operator _FusedOp is non-differentiable because it didn't register FGradient attribute.
Stack trace:
  [bt] (0) /home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x6b476b) [0x7f3e7ebcc76b]
  [bt] (1) /home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x3b5bd04) [0x7f3e82073d04]
  [bt] (2) /home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x3b563a1) [0x7f3e8206e3a1]
  [bt] (3) /home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x93e2fad) [0x7f3e878fafad]
  [bt] (4) /home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x38487cd) [0x7f3e81d607cd]
  [bt] (5) /home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x38eeec6) [0x7f3e81e06ec6]
  [bt] (6) /home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x38eff66) [0x7f3e81e07f66]
  [bt] (7) /home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x38f0a4a) [0x7f3e81e08a4a]
  [bt] (8) /home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/libmxnet.so(mxnet::exec::GraphExecutor::Init(nnvm::Symbol, mxnet::Context const&, std::map<std::string, mxnet::Context, std::less<std::string>, std::allocator<std::pair<std::string const, mxnet::Context> > > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, mxnet::Executor*, std::unordered_map<nnvm::NodeEntry, mxnet::NDArray, nnvm::NodeEntryHash, nnvm::NodeEntryEqual, std::allocator<std::pair<nnvm::NodeEntry const, mxnet::NDArray> > > const&)+0x7d6) [0x7f3e81e09d06]

Still working on why it failed and trying to reproduce using pure MXNet.

@roywei roywei added the Bug label Nov 12, 2019
@roywei
Copy link
Member Author

roywei commented Nov 12, 2019

This is still failing after #16781

more stack trace:

[07:21:45] /home/ubuntu/mxnet/src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDN
N_AUTOTUNE_DEFAULT to 0 to disable)
 984/1563 [=================>............] - ETA: 40s - loss: 1.6950 - acc: 0.3762Traceback (most recent call last):
  File "cifar10_resnet.py", line 426, in <module>
    callbacks=callbacks)
  File "/usr/local/lib/python2.7/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1433, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training_generator.py", line 217, in fit_generator
    class_weight=class_weight)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1232, in train_on_batch
    outputs = self.train_function(ins)
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/mxnet_backend.py", line 5590, in train_function
    data, label, _, data_shapes, label_shapes = self._adjust_module(inputs, 'train')
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/mxnet_backend.py", line 5534, in _adjust_module
    self._module._curr_module.reshape(data_shapes, label_shapes)
  File "/home/ubuntu/mxnet/python/mxnet/module/module.py", line 472, in reshape
    self._exec_group.reshape(self._data_shapes, self._label_shapes)
  File "/home/ubuntu/mxnet/python/mxnet/module/executor_group.py", line 397, in reshape
    self.bind_exec(data_shapes, label_shapes, reshape=True)
  File "/home/ubuntu/mxnet/python/mxnet/module/executor_group.py", line 373, in bind_exec
    allow_up_sizing=True, **dict(data_shapes_i + label_shapes_i))
  File "/home/ubuntu/mxnet/python/mxnet/executor.py", line 458, in reshape
    ctypes.byref(handle)))
  File "/home/ubuntu/mxnet/python/mxnet/base.py", line 255, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [07:22:37] ../src/nnvm/gradient.cc:213: Operator _FusedOp is non-differentiable because it didn't register FGradient attribute.
Stack trace:
  [bt] (0) /home/ubuntu/mxnet/python/mxnet/../../build/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x34) [0x7fe5ac33dc80]
  [bt] (1) /home/ubuntu/mxnet/python/mxnet/../../build/libmxnet.so(+0x812cd86) [0x7fe5b1402d86]
  [bt] (2) /home/ubuntu/mxnet/python/mxnet/../../build/libmxnet.so(std::_Function_handler<nnvm::Graph (nnvm::Graph), nnvm::Graph (*)(nnvm::Graph)>::_M_invoke(std::_Any_data const&, nnvm::Graph&&)+0x76) [0x7fe5b$
409119]
  [bt] (3) /home/ubuntu/mxnet/python/mxnet/../../build/libmxnet.so(std::function<nnvm::Graph (nnvm::Graph)>::operator()(nnvm::Graph) const+0x60) [0x7fe5b1412934]
[bt] (4) /home/ubuntu/mxnet/python/mxnet/../../build/libmxnet.so(nnvm::ApplyPasses(nnvm::Graph, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::
__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&)+0x424) [0x7fe5b4cabea5]
  [bt] (5) /home/ubuntu/mxnet/python/mxnet/../../build/libmxnet.so(nnvm::ApplyPass(nnvm::Graph, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xca) [0x7fe5b1047e78]
  [bt] (6) /home/ubuntu/mxnet/python/mxnet/../../build/libmxnet.so(nnvm::pass::MXGradient(nnvm::Graph, std::vector<nnvm::NodeEntry, std::allocator<nnvm::NodeEntry> >, std::vector<nnvm::NodeEntry, std::allocator<
nnvm::NodeEntry> >, std::vector<nnvm::NodeEntry, std::allocator<nnvm::NodeEntry> >, std::function<nnvm::NodeEntry (std::vector<nnvm::NodeEntry, std::allocator<nnvm::NodeEntry> >&&)>, std::function<int (nnvm::Nod
e const&)>, std::function<nnvm::NodeEntry (nnvm::NodeEntry const&, nnvm::NodeEntry const&)>, std::vector<nnvm::Op const*, std::allocator<nnvm::Op const*> >, std::__cxx11::basic_string<char, std::char_traits<char
>, std::allocator<char> >)+0x693) [0x7fe5b114905c]
  [bt] (7) /home/ubuntu/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::exec::GraphExecutor::InitFullGraph(nnvm::Symbol, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)+0x6fd) [0x7fe5b
1132135]
  [bt] (8) /home/ubuntu/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::exec::GraphExecutor::InitGraph(nnvm::Symbol, mxnet::Context const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std
::allocator<char> >, mxnet::Context, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>,
 std::allocator<char> > const, mxnet::Context> > > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::vector<mx
net::Context, std::allocator<mxnet::Context> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)+0xa5) [0x7fe5b1138403]

@roywei
Copy link
Member Author

roywei commented Nov 12, 2019

cc @ptrendx

@ptrendx ptrendx self-assigned this Nov 12, 2019
@roywei
Copy link
Member Author

roywei commented Nov 12, 2019

setting MXNET_USE_FUSION=0 works. maybe we can disable it by default in MXNet before proper fix.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants