This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

imagenet hanging in the end? #4391

Closed
amithr1 opened this issue Dec 27, 2016 · 6 comments

Comments

@amithr1

amithr1 commented Dec 27, 2016

I have cloned the latest MXNet and tried running the ImageNet example on a very small subset of the images.
What I found is that distributed training with dist_device_sync hangs at the end when using two workers.

Running CIFAR was fine and didn't hang at the end.
I get something like the following (on the two workers):

[11:19:09] src/io/iter_image_recordio.cc:221: ImageRecordIOParser: /tmp/amithr/imagenet_mini_train.rec, use 4 threads for decoding..
[11:16:32] src/io/iter_image_recordio.cc:221: ImageRecordIOParser: /tmp/amithr/imagenet_mini_train.rec, use 4 threads for decoding..
INFO:root:Start training with [gpu(1), gpu(2)]
INFO:root:Start training with [gpu(1), gpu(2)]
INFO:root:Epoch[0] Batch [20] Speed: 33.72 samples/sec Train-accuracy=0.515625
INFO:root:Epoch[0] Batch [20] Speed: 33.80 samples/sec Train-accuracy=0.467187
INFO:root:Epoch[0] Resetting Data Iterator
INFO:root:Epoch[0] Time cost=35.018

@amithr1
Author

amithr1 commented Dec 27, 2016

When I attach gdb to the two workers, I get the following trace from the first worker:

#0 0x00003fffb410dd30 in pthread_cond_wait@@GLIBC_2.17 () from /usr/lib64/power8/libpthread.so.0
#1 0x00003fffb3965f6c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib64/libstdc++.so.6
#2 0x00003fffa6d2a3e8 in mxnet::engine::ThreadedEngine::WaitForVar(mxnet::engine::Var*) () from /tmp/amithr/mxnet-new/mxnet/python/mxnet/../../lib/libmxnet.so
#3 0x00003fffa6710500 in mxnet::NDArray::SyncCopyToCPU(void*, unsigned long) const () from /tmp/amithr/mxnet-new/mxnet/python/mxnet/../../lib/libmxnet.so
#4 0x00003fffa6da8458 in MXNDArraySyncCopyToCPU () from /tmp/amithr/mxnet-new/mxnet/python/mxnet/../../lib/libmxnet.so
#5 0x00003fffaca57294 in ?? () from /usr/lib64/libffi.so.6
#6 0x00003fffaca55f90 in ffi_call () from /usr/lib64/libffi.so.6
#7 0x00003fffb3687b24 in _ctypes_callproc () from /usr/lib64/python2.7/lib-dynload/_ctypes.so

Second worker:

#0 0x00003fff82c5dd30 in pthread_cond_wait@@GLIBC_2.17 () from /usr/lib64/power8/libpthread.so.0
#1 0x00003fff824b5f6c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib64/libstdc++.so.6
#2 0x00003fff7577d6ec in wait<ps::Postoffice::Barrier(int)::__lambda6> (__p=..., __lock=..., this=<optimized out>) at /usr/include/c++/4.8.2/condition_variable:93
#3 ps::Postoffice::Barrier (this=0x3fff799ecdd8 <ps::Postoffice::Get()::e>, node_group=<optimized out>) at src/postoffice.cc:131
#4 0x00003fff75764794 in mxnet::kvstore::KVStoreDist::~KVStoreDist() () from /tmp/amithr/mxnet-new/mxnet/python/mxnet/../../lib/libmxnet.so
#5 0x00003fff75764830 in mxnet::kvstore::KVStoreDist::~KVStoreDist() () from /tmp/amithr/mxnet-new/mxnet/python/mxnet/../../lib/libmxnet.so
#6 0x00003fff75739954 in MXKVStoreFree () from /tmp/amithr/mxnet-new/mxnet/python/mxnet/../../lib/libmxnet.so

@piiswrong
Contributor

It's because each worker gets a different number of batches, so the final synchronization never completes. You can kill the hanging ones.

@mli
Member

mli commented Dec 27, 2016

@amithr1
Author

amithr1 commented Dec 28, 2016

I tried commenting out Engine::Get()->WaitForAll() in the destructor in kvstore_dist.h. That seemed to help, after I also tried to even out the batches between the two workers by choosing the number of images to be processed.

@mli
Member

mli commented Dec 28, 2016

Deleting the WaitForAll could be dangerous. A better method is to have a data iterator wrapper so that it outputs exactly num_examples / batch_size / kv.num_workers batches.
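
For reference, a minimal sketch of the kind of wrapper described above, assuming the old mx.io.DataIter interface; the class name FixedSizeIter and the names in the usage comment (train_iter, num_examples, batch_size, kv) are illustrative, not from this issue:

```python
# Minimal sketch (not tested here): cap every worker at the same number of
# batches so no worker runs out of data before the others in
# dist_device_sync training.
import mxnet as mx

class FixedSizeIter(mx.io.DataIter):
    """Wraps a DataIter and yields exactly `num_batches` batches per epoch."""

    def __init__(self, base_iter, num_batches):
        super(FixedSizeIter, self).__init__()
        self.base_iter = base_iter
        self.num_batches = num_batches
        self.batch_size = getattr(base_iter, 'batch_size', 0)
        self.cur = 0

    @property
    def provide_data(self):
        return self.base_iter.provide_data

    @property
    def provide_label(self):
        return self.base_iter.provide_label

    def reset(self):
        self.cur = 0
        self.base_iter.reset()

    def next(self):
        # Stop the epoch after a fixed number of batches, regardless of how
        # many examples this worker's shard actually contains.
        if self.cur >= self.num_batches:
            raise StopIteration
        self.cur += 1
        return self.base_iter.next()

# Hypothetical usage: give every worker the same epoch length.
# kv = mx.kvstore.create('dist_device_sync')
# batches_per_worker = num_examples // batch_size // kv.num_workers
# train = FixedSizeIter(train_iter, batches_per_worker)
```

If your MXNet build has it, mx.io.ResizeIter(train_iter, batches_per_worker) should do essentially the same thing.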

@yajiedesign
Contributor

This issue is closed due to lack of activity in the last 90 days. Feel free to reopen if this is still an active issue. Thanks!
