
dist_train keeps waiting with multiple GPUs and samples_per_gpu = 1 #4

zvadaszi opened this issue Oct 8, 2020 · 1 comment
zvadaszi commented Oct 8, 2020

First of all, thank you for your work and for your repo.

Environment:

PyTorch 1.5.1
CUDA 10.2
cuDNN 7.6.5
mmdetection 2.3.0
4x V100 16 GB

My config file is based on vfnet_r50_fpn_mstrain_2x and modified for a custom dataset with large images (2560x1440) and mostly small objects (10-60 px); a sketch of the kind of override is included after the list below.

  1. Training with multiple GPUs and samples_per_gpu = 1, workers_per_gpu = 1: training hangs at the very beginning with all GPUs at 100% utilization.

  2. Training with multiple GPUs, samples_per_gpu = 2, workers_per_gpu = 2 (and a smaller image size): training runs fine.
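
For context, the kind of override described above would typically look like the sketch below in an mmdetection 2.x config. This is only an illustration of the reported settings; the base-config filename and dataset paths are assumptions, not the reporter's actual file.

```python
# Illustrative sketch only: the base-config filename and dataset paths are
# assumptions; samples_per_gpu / workers_per_gpu match setting 1 above.
_base_ = './vfnet_r50_fpn_mstrain_2x_coco.py'

data = dict(
    # One 2560x1440 image per GPU and one dataloader worker per GPU (the
    # case that hangs); 2/2 with smaller images reportedly trains fine.
    samples_per_gpu=1,
    workers_per_gpu=1,
    train=dict(
        ann_file='data/custom/annotations/train.json',
        img_prefix='data/custom/images/'),
    val=dict(
        ann_file='data/custom/annotations/val.json',
        img_prefix='data/custom/images/'))
```

Training would then be launched the usual mmdetection way, e.g. `bash tools/dist_train.sh <config> 4` for 4 GPUs.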

Somewhat similar to this issue: 2193

hyz-xmaster added a commit that referenced this issue Oct 9, 2020
hyz-xmaster (Owner) commented

Hi @zvadaszi, thank you for the information. I think this bug is most likely related to the known issues around ATSS. I have updated the repo according to those fixes, so you may try it again.
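
For readers hitting the same hang: the ATSS-related distributed hangs in mmdetection typically come from ranks taking different code paths when one GPU sees no positive samples, so the GPUs issue mismatched collective operations and DDP blocks forever. The usual remedy is to synchronize the positive-sample count across ranks before using it as a loss normalizer. Below is a minimal sketch of that pattern, assuming a standard torch.distributed (NCCL) setup; it is not the exact code committed to this repo, and the commented-out usage names (e.g. `pos_inds`) are illustrative.

```python
# Minimal sketch of the cross-GPU synchronization pattern, assuming a standard
# torch.distributed (e.g. NCCL) setup; not the exact code from this repo's fix.
import torch
import torch.distributed as dist


def reduce_mean(tensor: torch.Tensor) -> torch.Tensor:
    """Average a tensor across all ranks (no-op when not running distributed)."""
    if not (dist.is_available() and dist.is_initialized()):
        return tensor
    tensor = tensor.clone()
    # Summing the pre-divided values yields the mean over all ranks.
    dist.all_reduce(tensor.div_(dist.get_world_size()), op=dist.ReduceOp.SUM)
    return tensor


# Usage inside a loss function: normalize by the rank-averaged number of
# positive samples, clamped so a rank with zero positives still runs the same
# collective ops instead of skipping the loss term and stalling the others.
# num_pos = reduce_mean(pos_inds.sum().float()).clamp(min=1.0)
# loss_bbox = loss_bbox_sum / num_pos
```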
