imagenet hanging in the end? #4391
When I attach gdb to the two workers, I get the following traces.
First worker: #0 0x00003fffb410dd30 in pthread_cond_wait@@GLIBC_2.17 () from /usr/lib64/power8/libpthread.so.0
Second worker: #0 0x00003fff82c5dd30 in pthread_cond_wait@@GLIBC_2.17 () from /usr/lib64/power8/libpthread.so.0
It's because each worker gets a different number of batches. You can kill the hanging ones.
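To see how the mismatch can cause the hang, here is a small arithmetic sketch with made-up numbers (the image counts are hypothetical, not taken from this issue):

```python
import math

batch_size = 64
# Hypothetical uneven split of the record file across two workers.
images_per_worker = {'worker0': 704, 'worker1': 640}

for name, n in images_per_worker.items():
    print(name, math.ceil(n / batch_size), 'batches')
# worker0 performs 11 gradient pushes, worker1 only 10; with a synchronous
# kvstore the servers then wait forever for the missing 11th push.
```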
I tried commenting out Engine::Get()->WaitForAll() in the destructor in kvstore_dist.h. That seemed to help, after I also tried to even out the batches between the two workers by choosing the number of images to be processed.
Deleting WaitForAll() could be dangerous. A better method is a data iterator wrapper so that it outputs exactly the same number of batches on every worker.
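A minimal sketch of such a wrapper, assuming the Python mx.io.DataIter interface of that MXNet generation; FixedBatchIter and its num_batches argument are illustrative names, not an existing MXNet class:

```python
import mxnet as mx

class FixedBatchIter(mx.io.DataIter):
    """Illustrative wrapper: yield exactly num_batches batches per epoch,
    recycling the underlying iterator if its shard runs out early."""
    def __init__(self, base_iter, num_batches):
        super(FixedBatchIter, self).__init__()
        self.base_iter = base_iter
        self.num_batches = num_batches
        self._count = 0
        self.batch_size = base_iter.batch_size

    @property
    def provide_data(self):
        return self.base_iter.provide_data

    @property
    def provide_label(self):
        return self.base_iter.provide_label

    def reset(self):
        self._count = 0
        self.base_iter.reset()

    def next(self):
        if self._count >= self.num_batches:
            raise StopIteration
        try:
            batch = self.base_iter.next()
        except StopIteration:
            # This worker's shard is exhausted; restart it so the worker
            # still produces the agreed-upon number of batches.
            self.base_iter.reset()
            batch = self.base_iter.next()
        self._count += 1
        return batch
```

Each worker would agree on a common num_batches (for example, the minimum across workers) and train on FixedBatchIter(train_iter, num_batches), so every worker performs the same number of kvstore pushes per epoch.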
This issue is closed due to lack of activity in the last 90 days. Feel free to reopen if this is still an active issue. Thanks!
I have cloned the latest mxnet and tried running imagenet (a very small subset of the images).
What I found is that mxnet distributed training using dist_device_sync hangs at the end with two workers.
Running cifar was OK and didn't hang at the end.
I get something like this (on two workers):
[11:19:09] src/io/iter_image_recordio.cc:221: ImageRecordIOParser: /tmp/amithr/imagenet_mini_train.rec, use 4 threads for decoding..
[11:16:32] src/io/iter_image_recordio.cc:221: ImageRecordIOParser: /tmp/amithr/imagenet_mini_train.rec, use 4 threads for decoding..
INFO:root:Start training with [gpu(1), gpu(2)]
INFO:root:Start training with [gpu(1), gpu(2)]
INFO:root:Epoch[0] Batch [20] Speed: 33.72 samples/sec Train-accuracy=0.515625
INFO:root:Epoch[0] Batch [20] Speed: 33.80 samples/sec Train-accuracy=0.467187
INFO:root:Epoch[0] Resetting Data Iterator
INFO:root:Epoch[0] Time cost=35.018
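For context, the kvstore type is selected in the training script roughly as below; mx.kvstore.create('dist_device_sync') is the real API, while the fit call shown in the comment is only illustrative:

```python
import mxnet as mx

# Synchronous distributed training with gradient aggregation on the devices.
# Every worker must push/pull the same number of times per epoch, which is
# why an uneven batch count can leave the final synchronization hanging.
kv = mx.kvstore.create('dist_device_sync')

# model.fit(X=train_iter, eval_data=val_iter, kvstore=kv)  # illustrative call
```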