This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

inference results unstable in mxnet_mkl-1.2.0b20180416 #10580

Closed
dwSun opened this issue Apr 17, 2018 · 21 comments

Comments

@dwSun
Contributor

dwSun commented Apr 17, 2018

System info

Python 3.6.5
Debian sid

Description

I am using mx.mod.Module to build my inference program.
When I run the same inference several times without restarting the program, the results are unstable: only the first one is correct, and the rest differ from each other.
When I switch back to mxnet-mkl 1.1.0, the results are consistent again.
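Roughly, the inference loop looks like the minimal sketch below (assuming a checkpoint saved as model-symbol.json / model-0000.params and a 1x3x224x224 input; these names are placeholders, the actual script is attached later in the thread):

import mxnet as mx
import numpy as np

# Load a trained checkpoint and bind a Module for CPU inference.
sym, arg_params, aux_params = mx.model.load_checkpoint('model', 0)
mod = mx.mod.Module(symbol=sym, data_names=['data'], label_names=None, context=mx.cpu())
mod.bind(for_training=False, data_shapes=[('data', (1, 3, 224, 224))])
mod.set_params(arg_params, aux_params)

# Run forward on the same input several times without restarting the process.
data = mx.nd.array(np.random.uniform(size=(1, 3, 224, 224)))
outputs = []
for _ in range(5):
    mod.forward(mx.io.DataBatch([data]), is_train=False)
    outputs.append(mod.get_outputs()[0].asnumpy())

# With mxnet-mkl 1.1.0 every run matches the first; with 1.2.0b20180416 they drift.
for i, out in enumerate(outputs[1:], start=1):
    print(i, np.allclose(outputs[0], out))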

@TaoLv
Member

TaoLv commented Apr 17, 2018

Is it possible for you to provide a model that reproduces this issue? Thanks.

@dwSun
Contributor Author

dwSun commented Apr 17, 2018

Try this:
issue_10580.gz

python3 inference.py

@pengzhao-intel
Contributor

Did you try the new MKL-DNN backend?

@zheng-da
Contributor

Could you provide us with a minimal script to reproduce the error? Thanks.

@TaoLv
Member

TaoLv commented Apr 17, 2018

@dwSun Thanks for the scripts. Here is my update:
This issue can be reproduced on the master branch with MKL-DNN enabled. I have tried several older commits, and it seems the issue was introduced by #9918, which only updates the MKL-DNN version. So I suspect there is a bug in MKL-DNN; this needs more investigation.
@xinyu-intel @pengzhao-intel

@zheng-da
Contributor

Does it mean the bug is in the MKLDNN library?

@TaoLv
Member

TaoLv commented Apr 17, 2018

I guess so, but I'm not sure. I enabled MXNET_MKLDNN_DEBUG but saw no complaints. We still need a minimal case to reproduce it.
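(For reference: MXNET_MKLDNN_DEBUG makes MXNet compare MKL-DNN operator outputs against the default CPU implementation. A minimal sketch of enabling it from Python, assuming the flag is set before the workload runs:)

import os
os.environ['MXNET_MKLDNN_DEBUG'] = '1'   # verify MKL-DNN operator results against the fallback path
import mxnet as mx                       # set the flag before the workload is executed
# ... run the inference script as usual; any mismatch is reported by the operator check.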

@TaoLv
Member

TaoLv commented Apr 18, 2018

On the latest master, the issue with @dwSun's script can be resolved by removing the line below from mkldnn_convolution.cc:
https://github.com/apache/incubator-mxnet/blob/master/src/operator/nn/mkldnn/mkldnn_convolution.cc#L286

weight.MKLDNNDataReorderAsync(fwd.fwd_pd.weights_primitive_desc());

However, this change only works for inference; we still need a more comprehensive solution.
I feel that we can't push an async operation to the execution engine from inside an operator body, as it may cause a deadlock or a data race. @zheng-da @pengzhao-intel please take a look.

@zheng-da
Contributor

The reason we push an async operation here is to convert the layout of the weight arrays once during inference, so that we don't need to convert the layout on every forward pass.
Originally, the code changed the data layout inside the array directly, and that caused a race condition.
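(Purely as an illustration of that idea, not the actual C++ code in mkldnn_convolution.cc: the weight is converted to the MKL-DNN-preferred layout once and the converted copy is reused on later forward passes, instead of rewriting the array's layout in place.)

# Illustrative pseudo-Python only; names and structure are made up for clarity.
class ConvWeight:
    def __init__(self, data):
        self.data = data          # weight values in the default layout
        self.mkldnn_data = None   # cached copy in the MKL-DNN-preferred layout

    def for_mkldnn(self, reorder):
        # Reorder once and cache the result instead of mutating self.data in
        # place (the in-place rewrite is what raced in the original code).
        if self.mkldnn_data is None:
            self.mkldnn_data = reorder(self.data)
        return self.mkldnn_data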

How is @dwSun's code different from https://github.com/apache/incubator-mxnet/blob/master/tests/python/gpu/test_gluon_model_zoo_gpu.py#L41 ?
The code works there.

@TaoLv
Member

TaoLv commented Apr 19, 2018

Update:
@dwSun could you try this branch to see if the issue is still there?
https://github.com/TaoLv/incubator-mxnet/tree/fix-SetMKLMem

Thanks.

@dwSun
Contributor Author

dwSun commented Apr 20, 2018

@TaoLv can you provide a pre-compiled pip package?
Thanks

@TaoLv
Member

TaoLv commented Apr 20, 2018

Please ignore that. @zheng-da has submitted the final fix in #10624. You can try the nightly build after that PR is merged. Thanks.

@zheng-da
Contributor

@dwSun the bug should be fixed now; the code has been merged into the master branch. Could you please try it and see if the fix solves your problem? Thanks.

@dwSun
Contributor Author

dwSun commented Apr 27, 2018

Is there any nightly build or something like that?
I compiled it following the instructions from https://mxnet.incubator.apache.org/install/index.html and https://zh.mxnet.io/blog/mkldnn.
It crashed like this:

% python3 inference.py                                                                    ✹ ✭
[16:16:44] src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v1.0.0. Attempting to upgrade...
[16:16:44] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
inference.py:26: DeprecationWarning: The binary mode of fromstring is deprecated, as it behaves surprisingly on unicode inputs. Use frombuffer instead
  img = cv2.imdecode(np.fromstring(img_bytes, np.uint8), 0)
terminate called after throwing an instance of 'dmlc::Error'
  what():  [16:16:44] src/engine/./threaded_engine.h:379: std::exception
A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 8 entries:
[bt] (0) /home/david/Code/ml/mx/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x40) [0x7fda168f3640]
[bt] (1) /home/david/Code/ml/mx/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x29) [0x7fda168f40e9]
[bt] (2) /home/david/Code/ml/mx/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0xa52) [0x7fda1926ee82]
[bt] (3) /home/david/Code/ml/mx/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x126) [0x7fda19282326]
[bt] (4) /home/david/Code/ml/mx/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)>, std::shared_ptr<dmlc::ManualEvent> > > >::_M_run()+0x3a) [0x7fda192816ca]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb96f) [0x7fda290d696f]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x75aa) [0x7fda2f93f5aa]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fda2ea8ecbf]


[1]    31728 abort      python3 inference.py

No idea what I should do next (。﹏。).
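(One way to follow the engine's suggestion in the error above, as a sketch: force the synchronous NaiveEngine before MXNet is imported, then rerun the failing script, ideally under gdb.)

import os
os.environ['MXNET_ENGINE_TYPE'] = 'NaiveEngine'   # make all operations run synchronously
import mxnet as mx
# ... run the same inference code; the backtrace now points at the failing operator.
# Remember to unset MXNET_ENGINE_TYPE again after debugging.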

@zheng-da
Contributor

zheng-da commented Apr 27, 2018

What is inference.py? I need to reproduce your problem locally.

@dwSun
Contributor Author

dwSun commented Apr 28, 2018

https://github.com/apache/incubator-mxnet/files/1919272/issue_10580.gz
It is the code to reproduce this bug.

@zheng-da
Contributor

@dwSun when I unzip the file, it's a single file of 13.4MB. It doesn't contain inference.py.

@dwSun
Contributor Author

dwSun commented Apr 28, 2018

@zheng-da try this command:

tar -vxf issue_10580.gz

@zheng-da
Contributor

zheng-da commented Apr 29, 2018

I tried your script on my own machine with PR #10731, and it works fine. Maybe you can try again after the PR is merged?

@dwSun
Contributor Author

dwSun commented May 1, 2018

Waiting for PR #10731 to be merged.

@dwSun
Contributor Author

dwSun commented May 8, 2018

Just tested with mxnet-mkl-1.2.0b20180508; it works well.

@szha closed this as completed May 8, 2018