
dist_train keeps waiting with multiple GPUs and samples_per_gpu = 1 #4

zvadaszi opened this issue Oct 8, 2020 · 1 comment
zvadaszi commented Oct 8, 2020

First of all, thank you for your work and for your repo.

Environment:

PyTorch 1.5.1
CUDA 10.2
cuDNN 7.6.5
mmdetection 2.3.0
4x V100 16 GB

My config file is based on vfnet_r50_fpn_mstrain_2x and modified for a custom dataset with large images (2560x1440) and mostly small objects (10-60 px); a sketch of the kind of override is included after the list below.

  1. Training with multiple GPUs and samples_per_gpu = 1, workers_per_gpu = 1: training hangs at the very beginning with all GPUs at 100% utilization.

  2. Training with multiple GPUs, samples_per_gpu = 2, workers_per_gpu = 2 (and a smaller image size): training runs fine.
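
For context, the kind of override described above would typically look like the sketch below in an mmdetection 2.x config. This is only an illustration of the reported settings; the base-config filename and dataset paths are assumptions, not the reporter's actual file.

```python
# Illustrative sketch only: the base-config filename and dataset paths are
# assumptions; samples_per_gpu / workers_per_gpu match setting 1 above.
_base_ = './vfnet_r50_fpn_mstrain_2x_coco.py'

data = dict(
    # One 2560x1440 image per GPU and one dataloader worker per GPU (the
    # case that hangs); 2/2 with smaller images reportedly trains fine.
    samples_per_gpu=1,
    workers_per_gpu=1,
    train=dict(
        ann_file='data/custom/annotations/train.json',
        img_prefix='data/custom/images/'),
    val=dict(
        ann_file='data/custom/annotations/val.json',
        img_prefix='data/custom/images/'))
```

Training would then be launched the usual mmdetection way, e.g. `bash tools/dist_train.sh <config> 4` for 4 GPUs.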

Somewhat similar to this issue: 2193

hyz-xmaster added a commit that referenced this issue Oct 9, 2020
hyz-xmaster (Owner) commented

Hi @zvadaszi, thank you for the information. I think this bug is most likely related to the known issues around ATSS. I have updated the repo according to those fixes, so you may try it again.
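
For readers hitting the same hang: the ATSS-related distributed hangs in mmdetection typically come from ranks taking different code paths when one GPU sees no positive samples, so the GPUs issue mismatched collective operations and DDP blocks forever. The usual remedy is to synchronize the positive-sample count across ranks before using it as a loss normalizer. Below is a minimal sketch of that pattern, assuming a standard torch.distributed (NCCL) setup; it is not the exact code committed to this repo, and the commented-out usage names (e.g. `pos_inds`) are illustrative.

```python
# Minimal sketch of the cross-GPU synchronization pattern, assuming a standard
# torch.distributed (e.g. NCCL) setup; not the exact code from this repo's fix.
import torch
import torch.distributed as dist


def reduce_mean(tensor: torch.Tensor) -> torch.Tensor:
    """Average a tensor across all ranks (no-op when not running distributed)."""
    if not (dist.is_available() and dist.is_initialized()):
        return tensor
    tensor = tensor.clone()
    # Summing the pre-divided values yields the mean over all ranks.
    dist.all_reduce(tensor.div_(dist.get_world_size()), op=dist.ReduceOp.SUM)
    return tensor


# Usage inside a loss function: normalize by the rank-averaged number of
# positive samples, clamped so a rank with zero positives still runs the same
# collective ops instead of skipping the loss term and stalling the others.
# num_pos = reduce_mean(pos_inds.sum().float()).clamp(min=1.0)
# loss_bbox = loss_bbox_sum / num_pos
```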
