This repository has been archived by the owner on Nov 17, 2023. It is now read-only.
Hangs training on P100 #8695
Comments
I met the same problem when training with multiple machines.
There was a bug fix for deadlock a few days ago. Is it still happening for
most recent version of master?
On Tue, Nov 21, 2017 at 00:29, SheSung ***@***.***> wrote:
I met the same problem when training with multiple machines.
--
Best Regards,
Haibin Lin
Department of Computer Science
School of Computer Science
Carnegie Mellon University
I also encountered the deadlock.
@eric-haibin-lin when was the deadlock fixed? Can you give the commit hash or a link to the commit?
Please try the latest version of MXNet and create a new issue if you encounter the problem again.
I am trying to train ImageNet using the default ResNet on a single node with up to 4 P100s. When I use the master branch, I see hangs. When I attached gdb, I saw the following stack trace. If more information would be useful, I can debug the problem further. The problem happens with more than 2 GPUs. With 2 GPUs, I can run for several epochs. However, when I use 4 GPUs, it hangs within the first epoch.
(gdb) bt
#0 0x00003fffac2cdd60 in pthread_cond_wait@@GLIBC_2.17 () at /lib64/libpthread.so.0
#1 0x00003fff4777608c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () at /lib64/libstdc++.so.6
#2 0x00003fff6a3e236c in std::condition_variable::wait<mxnet::engine::ThreadedEngine::WaitForVar(mxnet::Engine::VarHandle)::__lambda18>(std::unique_lock<std::mutex> &, mxnet::engine::ThreadedEngine::__lambda18) (this=0x3fff2c001198, __lock=..., __p=...) at /usr/include/c++/4.8.2/condition_variable:93
#3 0x00003fff6a3e1d10 in mxnet::engine::ThreadedEngine::WaitForVar(mxnet::engine::Var*) (this=0x3fff2c001150, var=0x3bff50a6a900) at src/engine/threaded_engine.cc:358
#4 0x00003fff699b6cc8 in mxnet::NDArray::WaitToWrite() const (this=0x3bff49fa0cf0) at include/mxnet/./ndarray.h:330
#5 0x00003fff69be4c88 in mxnet::NDArray::SyncCopyToCPU(void*, unsigned long) const (this=0x3bff49fa0cf0, data=0x3bff9c9862c0, size=32) at src/ndarray/ndarray.cc:1210
#6 0x00003fff6a44d190 in MXNDArraySyncCopyToCPU(NDArrayHandle, void*, size_t) (handle=0x3bff49fa0cf0, data=0x3bff9c9862c0, size=32) at src/c_api/c_api.cc:253
#7 0x00003fffabed7254 in () at /lib64/libffi.so.6
#8 0x00003fffabed5f50 in ffi_call () at /lib64/libffi.so.6
#9 0x00003fffa5247b24 in _ctypes_callproc () at /usr/lib64/python2.7/lib-dynload/_ctypes.so
#10 0x00003fffa523a6ac in PyCFuncPtr_call () at /usr/lib64/python2.7/lib-dynload/_ctypes.so
#11 0x00003fffac361444 in PyObject_Call () at /lib64/libpython2.7.so.1.0
#12 0x00003fffac4669f0 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
#13 0x00003fffac46cb40 in PyEval_EvalCodeEx () at /lib64/libpython2.7.so.1.0
#14 0x00003fffac4684f0 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
#15 0x00003fffac46cb40 in PyEval_EvalCodeEx () at /lib64/libpython2.7.so.1.0
#16 0x00003fffac4684f0 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
#17 0x00003fffac46cb40 in PyEval_EvalCodeEx () at /lib64/libpython2.7.so.1.0
#18 0x00003fffac4684f0 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
#19 0x00003fffac46cb40 in PyEval_EvalCodeEx () at /lib64/libpython2.7.so.1.0
#20 0x00003fffac4684f0 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
#21 0x00003fffac468c70 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
#22 0x00003fffac468c70 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
#23 0x00003fffac46cb40 in PyEval_EvalCodeEx () at /lib64/libpython2.7.so.1.0
#24 0x00003fffac4684f0 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
#25 0x00003fffac46cb40 in PyEval_EvalCodeEx () at /lib64/libpython2.7.so.1.0
#26 0x00003fffac4684f0 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
#27 0x00003fffac46cb40 in PyEval_EvalCodeEx () at /lib64/libpython2.7.so.1.0
#28 0x00003fffac46cc64 in PyEval_EvalCode () at /lib64/libpython2.7.so.1.0
#29 0x00003fffac4a0528 in PyRun_FileExFlags () at /lib64/libpython2.7.so.1.0
#30 0x00003fffac4a274c in PyRun_SimpleFileExFlags () at /lib64/libpython2.7.so.1.0
#31 0x00003fffac4a2e9c in PyRun_AnyFileExFlags () at /lib64/libpython2.7.so.1.0
#32 0x00003fffac4beb7c in Py_Main () at /lib64/libpython2.7.so.1.0
#33 0x0000000010000738 in main ()
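The backtrace above shows the main thread blocked in `MXNDArraySyncCopyToCPU`, but the Python frames are opaque (`PyEval_EvalFrameEx`), so it does not say which line of the training script is waiting. One possible way to recover that, sketched below, is a watchdog built on Python's standard `faulthandler` module, which dumps the Python-level stack of every thread when a single step takes too long. This is only a sketch under assumptions: `run_with_hang_watchdog`, `step_fn`, and the timeout value are hypothetical names chosen for illustration, and `faulthandler` is part of the standard library only from Python 3.3, whereas the run above used Python 2.7.

```python
import faulthandler
import sys
import time


def run_with_hang_watchdog(step_fn, num_steps, timeout_s=300.0, log_file=sys.stderr):
    """Call step_fn(i) for each step and dump Python stacks if a step stalls.

    step_fn is a hypothetical stand-in for one training iteration
    (it is not an MXNet API). log_file must be a real file with a
    file descriptor, because faulthandler writes to it directly.
    """
    completed = 0
    for i in range(num_steps):
        # Arm the watchdog: if this step runs longer than timeout_s seconds,
        # faulthandler writes the traceback of every Python thread to
        # log_file, and repeats every timeout_s seconds while still stuck.
        faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
        try:
            step_fn(i)
            completed += 1
        finally:
            # Disarm the watchdog so a step that finishes in time is silent.
            faulthandler.cancel_dump_traceback_later()
    return completed


if __name__ == "__main__":
    # Simulated "training": step 1 stalls for longer than the timeout,
    # so its Python stack is written to stderr while it sleeps.
    def fake_step(i):
        if i == 1:
            time.sleep(0.5)

    run_with_hang_watchdog(fake_step, num_steps=3, timeout_s=0.1)
```

In a real script, `step_fn` would wrap whatever one batch of the training loop does. The dump does not stop the process, so gdb can still be attached afterwards to inspect the native frames.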