This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

imagenet hanging in the end? #4391

Closed
amithr1 opened this issue Dec 27, 2016 · 6 comments

Comments

@amithr1

amithr1 commented Dec 27, 2016

I have cloned the latest MXNet and tried running the ImageNet example on a very small subset of the images.
What I found is that distributed training with dist_device_sync hangs at the end when using two workers.

Running CIFAR was fine and didn't hang at the end.
I get something like the following (on the two workers):

[11:19:09] src/io/iter_image_recordio.cc:221: ImageRecordIOParser: /tmp/amithr/imagenet_mini_train.rec, use 4 threads for decoding..
[11:16:32] src/io/iter_image_recordio.cc:221: ImageRecordIOParser: /tmp/amithr/imagenet_mini_train.rec, use 4 threads for decoding..
INFO:root:Start training with [gpu(1), gpu(2)]
INFO:root:Start training with [gpu(1), gpu(2)]
INFO:root:Epoch[0] Batch [20] Speed: 33.72 samples/sec Train-accuracy=0.515625
INFO:root:Epoch[0] Batch [20] Speed: 33.80 samples/sec Train-accuracy=0.467187
INFO:root:Epoch[0] Resetting Data Iterator
INFO:root:Epoch[0] Time cost=35.018

@amithr1
Author

amithr1 commented Dec 27, 2016

When I attach gdb to the two workers, I get the following trace from the first worker:

#0 0x00003fffb410dd30 in pthread_cond_wait@@GLIBC_2.17 () from /usr/lib64/power8/libpthread.so.0
#1 0x00003fffb3965f6c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib64/libstdc++.so.6
#2 0x00003fffa6d2a3e8 in mxnet::engine::ThreadedEngine::WaitForVar(mxnet::engine::Var*) () from /tmp/amithr/mxnet-new/mxnet/python/mxnet/../../lib/libmxnet.so
#3 0x00003fffa6710500 in mxnet::NDArray::SyncCopyToCPU(void*, unsigned long) const () from /tmp/amithr/mxnet-new/mxnet/python/mxnet/../../lib/libmxnet.so
#4 0x00003fffa6da8458 in MXNDArraySyncCopyToCPU () from /tmp/amithr/mxnet-new/mxnet/python/mxnet/../../lib/libmxnet.so
#5 0x00003fffaca57294 in ?? () from /usr/lib64/libffi.so.6
#6 0x00003fffaca55f90 in ffi_call () from /usr/lib64/libffi.so.6
#7 0x00003fffb3687b24 in _ctypes_callproc () from /usr/lib64/python2.7/lib-dynload/_ctypes.so

Second worker:

#0 0x00003fff82c5dd30 in pthread_cond_wait@@GLIBC_2.17 () from /usr/lib64/power8/libpthread.so.0
#1 0x00003fff824b5f6c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib64/libstdc++.so.6
#2 0x00003fff7577d6ec in wait<ps::Postoffice::Barrier(int)::__lambda6> (__p=..., __lock=..., this=<optimized out>) at /usr/include/c++/4.8.2/condition_variable:93
#3 ps::Postoffice::Barrier (this=0x3fff799ecdd8 <ps::Postoffice::Get()::e>, node_group=<optimized out>) at src/postoffice.cc:131
#4 0x00003fff75764794 in mxnet::kvstore::KVStoreDist::~KVStoreDist() () from /tmp/amithr/mxnet-new/mxnet/python/mxnet/../../lib/libmxnet.so
#5 0x00003fff75764830 in mxnet::kvstore::KVStoreDist::~KVStoreDist() () from /tmp/amithr/mxnet-new/mxnet/python/mxnet/../../lib/libmxnet.so
#6 0x00003fff75739954 in MXKVStoreFree () from /tmp/amithr/mxnet-new/mxnet/python/mxnet/../../lib/libmxnet.so

@piiswrong
Contributor

It's because each worker gets a different number of batches, so the final synchronization never completes. You can kill the hanging ones.

@mli
Member

mli commented Dec 27, 2016

@amithr1
Author

amithr1 commented Dec 28, 2016

I tried commenting out Engine::Get()->WaitForAll() in the destructor in kvstore_dist.h. That seemed to help, after I also tried to even out the batches between the two workers by choosing the number of images to be processed.

@mli
Member

mli commented Dec 28, 2016

Deleting the WaitForAll could be dangerous. A better method is to have a data iterator wrapper so that it outputs exactly num_examples / batch_size / kv.num_workers batches.
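
For reference, a minimal sketch of the kind of wrapper described above, assuming the old mx.io.DataIter interface; the class name FixedSizeIter and the names in the usage comment (train_iter, num_examples, batch_size, kv) are illustrative, not from this issue:

```python
# Minimal sketch (not tested here): cap every worker at the same number of
# batches so no worker runs out of data before the others in
# dist_device_sync training.
import mxnet as mx

class FixedSizeIter(mx.io.DataIter):
    """Wraps a DataIter and yields exactly `num_batches` batches per epoch."""

    def __init__(self, base_iter, num_batches):
        super(FixedSizeIter, self).__init__()
        self.base_iter = base_iter
        self.num_batches = num_batches
        self.batch_size = getattr(base_iter, 'batch_size', 0)
        self.cur = 0

    @property
    def provide_data(self):
        return self.base_iter.provide_data

    @property
    def provide_label(self):
        return self.base_iter.provide_label

    def reset(self):
        self.cur = 0
        self.base_iter.reset()

    def next(self):
        # Stop the epoch after a fixed number of batches, regardless of how
        # many examples this worker's shard actually contains.
        if self.cur >= self.num_batches:
            raise StopIteration
        self.cur += 1
        return self.base_iter.next()

# Hypothetical usage: give every worker the same epoch length.
# kv = mx.kvstore.create('dist_device_sync')
# batches_per_worker = num_examples // batch_size // kv.num_workers
# train = FixedSizeIter(train_iter, batches_per_worker)
```

If your MXNet build has it, mx.io.ResizeIter(train_iter, batches_per_worker) should do essentially the same thing.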

@yajiedesign
Contributor

This issue is closed due to lack of activity in the last 90 days. Feel free to reopen if this is still an active issue. Thanks!
