This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

inference results unstable in mxnet_mkl-1.2.0b20180416 #10580

Closed
dwSun opened this issue Apr 17, 2018 · 21 comments

Comments

@dwSun
Contributor

dwSun commented Apr 17, 2018

System info

Python 3.6.5
Debian sid

Description

I am using mx.mod.Module to build my inference program.
When I run the same inference several times without restarting the program, the results are unstable: only the first one is correct, and the rest differ from each other.
When I switch back to mxnet-mkl 1.1.0, the results are consistent again.
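Roughly, the inference loop looks like the minimal sketch below (assuming a checkpoint saved as model-symbol.json / model-0000.params and a 1x3x224x224 input; these names are placeholders, the actual script is attached later in the thread):

import mxnet as mx
import numpy as np

# Load a trained checkpoint and bind a Module for CPU inference.
sym, arg_params, aux_params = mx.model.load_checkpoint('model', 0)
mod = mx.mod.Module(symbol=sym, data_names=['data'], label_names=None, context=mx.cpu())
mod.bind(for_training=False, data_shapes=[('data', (1, 3, 224, 224))])
mod.set_params(arg_params, aux_params)

# Run forward on the same input several times without restarting the process.
data = mx.nd.array(np.random.uniform(size=(1, 3, 224, 224)))
outputs = []
for _ in range(5):
    mod.forward(mx.io.DataBatch([data]), is_train=False)
    outputs.append(mod.get_outputs()[0].asnumpy())

# With mxnet-mkl 1.1.0 every run matches the first; with 1.2.0b20180416 they drift.
for i, out in enumerate(outputs[1:], start=1):
    print(i, np.allclose(outputs[0], out))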

@TaoLv
Member

TaoLv commented Apr 17, 2018

Is it possible for you to provide a model that reproduces this issue? Thanks.

@dwSun
Contributor Author

dwSun commented Apr 17, 2018

Try this:
issue_10580.gz

python3 inference.py

@pengzhao-intel
Contributor

Did you try the new MKL-DNN backend?

@zheng-da
Contributor

Could you provide us with a minimal script to reproduce the error? Thanks.

@TaoLv
Member

TaoLv commented Apr 17, 2018

@dwSun Thanks for the scripts. Here is my update:
This issue can be reproduced on the master branch with MKL-DNN enabled. I have tried several older commits, and it seems the issue was introduced by #9918, which only updates the MKL-DNN version. So I suspect there is a bug in MKL-DNN; this needs more investigation.
@xinyu-intel @pengzhao-intel

@zheng-da
Contributor

Does it mean the bug is in the MKLDNN library?

@TaoLv
Member

TaoLv commented Apr 17, 2018

I guess so, but I'm not sure. I enabled MXNET_MKLDNN_DEBUG but saw no complaints. We still need a minimal case to reproduce it.
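(For reference: MXNET_MKLDNN_DEBUG makes MXNet compare MKL-DNN operator outputs against the default CPU implementation. A minimal sketch of enabling it from Python, assuming the flag is set before the workload runs:)

import os
os.environ['MXNET_MKLDNN_DEBUG'] = '1'   # verify MKL-DNN operator results against the fallback path
import mxnet as mx                       # set the flag before the workload is executed
# ... run the inference script as usual; any mismatch is reported by the operator check.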

@TaoLv
Member

TaoLv commented Apr 18, 2018

On the latest master, the issue with @dwSun's script can be resolved by removing the line below from mkldnn_convolution.cc:
https://github.com/apache/incubator-mxnet/blob/master/src/operator/nn/mkldnn/mkldnn_convolution.cc#L286

weight.MKLDNNDataReorderAsync(fwd.fwd_pd.weights_primitive_desc());

However, this change only works for inference; we still need a more comprehensive solution.
I feel that we can't push an async operation to the execution engine from inside an operator body, as it may cause a deadlock or a data race. @zheng-da @pengzhao-intel please take a look.

@zheng-da
Contributor

The reason we push an async operation here is to convert the layout of the weight arrays once during inference, so that we don't need to convert the layout on every forward pass.
Originally, the code changed the data layout inside the array directly, and that caused a race condition.
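(Purely as an illustration of that idea, not the actual C++ code in mkldnn_convolution.cc: the weight is converted to the MKL-DNN-preferred layout once and the converted copy is reused on later forward passes, instead of rewriting the array's layout in place.)

# Illustrative pseudo-Python only; names and structure are made up for clarity.
class ConvWeight:
    def __init__(self, data):
        self.data = data          # weight values in the default layout
        self.mkldnn_data = None   # cached copy in the MKL-DNN-preferred layout

    def for_mkldnn(self, reorder):
        # Reorder once and cache the result instead of mutating self.data in
        # place (the in-place rewrite is what raced in the original code).
        if self.mkldnn_data is None:
            self.mkldnn_data = reorder(self.data)
        return self.mkldnn_data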

How is @dwSun's code different from https://github.com/apache/incubator-mxnet/blob/master/tests/python/gpu/test_gluon_model_zoo_gpu.py#L41 ?
The code works there.

@TaoLv
Member

TaoLv commented Apr 19, 2018

Update:
@dwSun could you try this branch to see if the issue is still there?
https://github.com/TaoLv/incubator-mxnet/tree/fix-SetMKLMem

Thanks.

@dwSun
Contributor Author

dwSun commented Apr 20, 2018

@TaoLv can you provide a pre-compiled pip package?
Thanks

@TaoLv
Member

TaoLv commented Apr 20, 2018

Please ignore that. @zheng-da has submitted the final fix in #10624. You can try the nightly build after that PR is merged. Thanks.

@zheng-da
Contributor

@dwSun the bug should be fixed now; the code has been merged into the master branch. Could you please try it and see if the fix solves your problem? Thanks.

@dwSun
Contributor Author

dwSun commented Apr 27, 2018

Is there any nightly build or something like that?
I compiled it following the instructions from https://mxnet.incubator.apache.org/install/index.html and https://zh.mxnet.io/blog/mkldnn.
It crashed like this:

% python3 inference.py                                                                    ✹ ✭
[16:16:44] src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v1.0.0. Attempting to upgrade...
[16:16:44] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
inference.py:26: DeprecationWarning: The binary mode of fromstring is deprecated, as it behaves surprisingly on unicode inputs. Use frombuffer instead
  img = cv2.imdecode(np.fromstring(img_bytes, np.uint8), 0)
terminate called after throwing an instance of 'dmlc::Error'
  what():  [16:16:44] src/engine/./threaded_engine.h:379: std::exception
A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 8 entries:
[bt] (0) /home/david/Code/ml/mx/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x40) [0x7fda168f3640]
[bt] (1) /home/david/Code/ml/mx/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x29) [0x7fda168f40e9]
[bt] (2) /home/david/Code/ml/mx/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0xa52) [0x7fda1926ee82]
[bt] (3) /home/david/Code/ml/mx/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x126) [0x7fda19282326]
[bt] (4) /home/david/Code/ml/mx/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)>, std::shared_ptr<dmlc::ManualEvent> > > >::_M_run()+0x3a) [0x7fda192816ca]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb96f) [0x7fda290d696f]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x75aa) [0x7fda2f93f5aa]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fda2ea8ecbf]


[1]    31728 abort      python3 inference.py

No idea what I should do next (。﹏。).
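(One way to follow the engine's suggestion in the error above, as a sketch: force the synchronous NaiveEngine before MXNet is imported, then rerun the failing script, ideally under gdb.)

import os
os.environ['MXNET_ENGINE_TYPE'] = 'NaiveEngine'   # make all operations run synchronously
import mxnet as mx
# ... run the same inference code; the backtrace now points at the failing operator.
# Remember to unset MXNET_ENGINE_TYPE again after debugging.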

@zheng-da
Contributor

zheng-da commented Apr 27, 2018

What is inference.py? I need to reproduce your problem locally.

@dwSun
Contributor Author

dwSun commented Apr 28, 2018

https://github.com/apache/incubator-mxnet/files/1919272/issue_10580.gz
It is the code to reproduce this bug.

@zheng-da
Contributor

@dwSun when I unzip the file, it's a single file of 13.4MB. It doesn't contain inference.py.

@dwSun
Contributor Author

dwSun commented Apr 28, 2018

@zheng-da try this command:

tar -vxf issue_10580.gz

@zheng-da
Contributor

zheng-da commented Apr 29, 2018

I tried your script on my own machine with PR #10731, and it works fine. Maybe you can try again after the PR is merged?

@dwSun
Contributor Author

dwSun commented May 1, 2018

Waiting for PR #10731 to be merged.

@dwSun
Contributor Author

dwSun commented May 8, 2018

Just tested with mxnet-mkl-1.2.0b20180508; it works well.

@szha closed this as completed May 8, 2018