Deadlock happened while calling MXNDArraySyncCopyToCPU()? #12923
Are there any hints as to why the thread blocks when calling MXNDArraySyncCopyToCPU(), for example under some special situations or usage patterns?
Hi, MXNet does not support multithreaded interaction with its frontend APIs; we require a sticky thread for this. This means you have to follow the dispatcher model, which dedicates one thread for the entire lifecycle of your application to interact with MXNet. It's important that you don't rely on a mutex alone, since we depend on thread-local variables that are bound to the dispatcher thread.
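A minimal sketch of the dispatcher model described above, assuming Python 3's standard threading and queue modules; the MXNetDispatcher class, its submit method, and the _infer helper are illustrative names, not part of any MXNet API:

```python
import threading
import queue

import numpy as np
import mxnet as mx


class MXNetDispatcher:
    """Funnels every MXNet call through one dedicated thread."""

    def __init__(self):
        self._tasks = queue.Queue()
        self._thread = threading.Thread(target=self._worker, daemon=True)
        self._thread.start()

    def _worker(self):
        # The only thread that ever touches MXNet objects, so MXNet's
        # thread-local state always lives on this thread.
        while True:
            fn, args, done, result = self._tasks.get()
            result.append(fn(*args))
            done.set()

    def submit(self, fn, *args):
        # Called from any other thread; blocks until the dispatcher is done.
        done = threading.Event()
        result = []
        self._tasks.put((fn, args, done, result))
        done.wait()
        return result[0]


def _infer(data):
    # Example MXNet work, done entirely on the dispatcher thread; asnumpy()
    # is the call that ends up in MXNDArraySyncCopyToCPU().
    arr = mx.nd.array(data)
    return (arr * 2).asnumpy()


dispatcher = MXNetDispatcher()
out = dispatcher.submit(_infer, np.ones((2, 3)))
```

With this pattern the worker threads never call into libmxnet.so themselves; they only hand work to the dispatcher thread and wait for the result.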
Thanks for the reply; I did not make it clear before. We start 8 processes on one machine, and only 1 thread per process uses MXNet (the other threads handle different work). We embed the Python engine in a C++ program and call the mxnet Python API from it. Is there a problem with this usage?
That sounds good to me. Could you maybe show a minimal example that reproduces the problem? I'll let somebody else follow up on your issue, since we're now getting into the Python API.
@mxnet-label-bot [Python, Thread Safety]
Hello, sorry for the late reply. The problem no longer occurs after changing MXNet's engine type from the default ThreadedEnginePerDevice to ThreadedEngine. I hope this gives you some clues.
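For reference, the engine type is selected with the MXNET_ENGINE_TYPE environment variable, which must be set before libmxnet.so is loaded; a minimal sketch of the workaround described above:

```python
import os

# MXNET_ENGINE_TYPE must be set before MXNet is imported (i.e. before
# libmxnet.so is loaded). This switches from the default
# ThreadedEnginePerDevice to the plain ThreadedEngine.
os.environ["MXNET_ENGINE_TYPE"] = "ThreadedEngine"

import mxnet as mx  # noqa: E402  (imported after setting the variable)
```

Other documented values are NaiveEngine and the default ThreadedEnginePerDevice.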
@coconutyao Good to see that your issue was resolved. I'm closing this issue. Please feel free to re-open if closed in error. @lanking520 Can you please close this issue? Thanks!
@coconutyao Closing the issue for now. Please feel free to reopen it if you are still facing the problem.
This problem has been troubling us for a few days, so we would appreciate everyone's help. Thank you!
Environment:
GPU: Tesla P4; CPU: Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz.
Symptoms:
The program runs as a server receiving image data. After running for a while, it starts to hang in what looks like a deadlock (possibly triggered by certain requests, but we cannot reproduce it reliably).
We tested MXNet versions 1.0, 1.2, and 1.3, and the program showed the same behavior.
How the program runs:
We embed the Python engine in a multithreaded C++ program and call the mxnet Python API from it. As the stack trace shows, MXNDArraySyncCopyToCPU() waits on a condition variable during execution, and the program stays stuck there.
Stack information:
Thread 85 (Thread 0x7f3cba52f700 (LWP 41394)):
#0 0x00007f3d582fd6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f3d580979bc in __gthread_cond_wait (__mutex=, __cond=) at /data/home/xxx/gcc-build/gcc-4.9.4/build/x86_64-redhat-linux/libstdc++-v3/include/x86_64-redhat-linux/bits/gthr-default.h:864
#2 std::condition_variable::wait (this=, __lock=...) at ../../../../../libstdc++-v3/src/c++11/condition_variable.cc:52
#3 0x00007f3c7bcb86d5 in ?? () from my_app/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so
#4 0x00007f3c7bd94b4d in ?? () from my_app/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so
#5 0x00007f3c7be7e9c3 in ?? () from my_app/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so
#6 0x00007f3c7bc516db in MXNDArraySyncCopyToCPU () from my_app/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so
#7 0x00007f3d53e15adc in ffi_call_unix64 () from my_app/libs/./libffi.so.6
#8 0x00007f3d53e15282 in ffi_call () from my_app/libs/./libffi.so.6
#9 0x00007f3bfdd09376 in _call_function_pointer (argcount=3, resmem=0x7f3b3c1c4040, restype=, atypes=, avalues=0x7f3b3c1c4010, pProc=0x7f3c7bc516b0 , flags=4353) at /home/xxx/minonda/conda-bld/python-2.7_1482296880985/work/Python-2.7.13/Modules/_ctypes/callproc.c:841
#10 _ctypes_callproc (pProc=0x7f3c7bc516b0 , argtuple=0x7f3b3c1c4130, flags=4353, argtypes=, restype=0x1616b80, checker=0x0) at /home/xxx/minonda/conda-bld/python-2.7_1482296880985/work/Python-2.7.13/Modules/_ctypes/callproc.c:1184
#11 0x00007f3bfdd00db3 in PyCFuncPtr_call (self=, inargs=, kwds=0x0) at /home/xxx/minonda/conda-bld/python-2.7_1482296880985/work/Python-2.7.13/Modules/_ctypes/_ctypes.c:3979
#12 0x00007f3d52c42e93 in PyObject_Call (func=0x7f3d2a11a050, arg=, kw=) at Objects/abstract.c:2547
#13 0x00007f3d52cf580d in do_call (nk=, na=, pp_stack=0x7f3b3c1c43b8, func=0x7f3d2a11a050) at Python/ceval.c:4569
#14 call_function (oparg=, pp_stack=0x7f3b3c1c43b8) at Python/ceval.c:4374
#15 PyEval_EvalFrameEx (f=, throwflag=) at Python/ceval.c:2989
#16 0x00007f3d52cf7c3e in PyEval_EvalCodeEx (co=0x7f3d3f730030, globals=, locals=, args=, argcount=1, kws=0x7f3d2a186fd0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:3584
#17 0x00007f3d52cf71f7 in fast_function (nk=, na=1, n=, pp_stack=0x7f3b3c1c45d8, func=0x7f3d3f6ee5f0) at Python/ceval.c:4447
#18 call_function (oparg=, pp_stack=0x7f3b3c1c45d8) at Python/ceval.c:4372
#19 PyEval_EvalFrameEx (f=, throwflag=) at Python/ceval.c:2989
#20 0x00007f3d52cf7345 in fast_function (nk=, na=, n=, pp_stack=0x7f3b3c1c4748, func=0x7f3d2aea9c80) at Python/ceval.c:4437
#21 call_function (oparg=, pp_stack=0x7f3b3c1c4748) at Python/ceval.c:4372
#22 PyEval_EvalFrameEx (f=, throwflag=) at Python/ceval.c:2989
#23 0x00007f3d52cf7c3e in PyEval_EvalCodeEx (co=0x7f3d528fcc30, globals=, locals=, args=, argcount=2, kws=0x7f3d2a18dc68, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:3584
#24 0x00007f3d52cf71f7 in fast_function (nk=, na=2, n=, pp_stack=0x7f3b3c1c4968, func=0x7f3d2a33f0c8) at Python/ceval.c:4447
#25 call_function (oparg=, pp_stack=0x7f3b3c1c4968) at Python/ceval.c:4372
#26 PyEval_EvalFrameEx (f=, throwflag=) at Python/ceval.c:2989
#27 0x00007f3d52cf7345 in fast_function (nk=, na=, n=, pp_stack=0x7f3b3c1c4ad8, func=0x7f3d2a33f410) at Python/ceval.c:4437
#28 call_function (oparg=, pp_stack=0x7f3b3c1c4ad8) at Python/ceval.c:4372
#29 PyEval_EvalFrameEx (f=, throwflag=) at Python/ceval.c:2989
#30 0x00007f3d52cf7c3e in PyEval_EvalCodeEx (co=0x7f3d52963db0, globals=, locals=, args=, argcount=1, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:3584
#31 0x00007f3d52c72a61 in function_call (func=0x7f3d2a33f8c0, arg=0x7f3d529377d0, kw=0x0) at Objects/funcobject.c:523
#32 0x00007f3d52c42e93 in PyObject_Call (func=0x7f3d2a33f8c0, arg=, kw=) at Objects/abstract.c:2547
#33 0x00007f3d52ced7b3 in PyEval_CallObjectWithKeywords (func=0x7f3d2a33f8c0, arg=0x7f3d529377d0, kw=) at Python/ceval.c:4221
#34 0x00007f3d52d13468 in PyEval_CallMethod (obj=, methodname=, format=) at Python/modsupport.c:612
#35 0x00007f3d5303141f in ?? ()
#36 0x0000000000000000 in ?? ()
In addition:
Occasionally other threads are blocked at the same time. For example, the stack below is from an unrelated CPU thread; the strange thing is that libmxnet.so still appears in it:
Thread 70 (Thread 0x7f3b0bff6700 (LWP 41409)):
#0 0x00007f3d582fd6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f3d580979bc in __gthread_cond_wait (__mutex=, __cond=) at /data/home/xxx/gcc-build/gcc-4.9.4/build/x86_64-redhat-linux/libstdc++-v3/include/x86_64-redhat-linux/bits/gthr-default.h:864
#2 std::condition_variable::wait (this=, __lock=...) at ../../../../../libstdc++-v3/src/c++11/condition_variable.cc:52
#3 0x00007f3c7bcb88a3 in ?? () from my_app/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so
#4 0x00007f3c7bcc0339 in ?? () from my_app/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so
#5 0x00007f3d577c4702 in fork () from /lib64/libc.so.6
......