dist_train keeps waiting when filter_empty_gt=False #2193
Comments
Hi @panjianning,
Thanks for your reply. I use CocoDataset and change nothing except the annotation path. BTW, single-GPU training is ok when filter_empty_gt=False.
@ZwwWayne I printed gt_labels in
Update: My NCCL version is 2.5.7. After setting the following env variable, the deadlock disappeared, but I got a negative loss...
@ZwwWayne @hellock When a batch contains only background images, the returned losses of Cascade R-CNN are missing some keys. After I add this row in
I also tried filter_empty_gt=False but it does not seem to work. Hope you can get a better score on the leaderboard. @panjianning
Maybe it has something to do with this issue and this line
I can add negative samples to training, and the training dataset certainly becomes larger, but the mAP seems just so-so. @panjianning
I only meet the issue when I use distributed training. It gives me a better score in the first round, so I always set it to false in the second round.
@hellock
, and modified some code in RandomFlip.__call__ as follows:
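(The original snippet did not survive in this thread. Below is a minimal, hypothetical sketch of the kind of guard that avoids the "need at least one array to stack" error on background-only images; the `bbox_flip` helper and its signature are illustrative, not mmdetection's exact code or the author's actual patch.)

```python
import numpy as np

# Hypothetical sketch, not the original patch: skip flipping when the image
# has no ground-truth boxes, so downstream stacking never sees an empty list.
def bbox_flip(bboxes, img_shape):
    """Horizontally flip bboxes of shape (n, 4); n may be 0 for background images."""
    if bboxes.size == 0:
        return bboxes  # nothing to flip, return the empty (0, 4) array as-is
    flipped = bboxes.copy()
    w = img_shape[1]
    flipped[..., 0] = w - bboxes[..., 2]
    flipped[..., 2] = w - bboxes[..., 0]
    return flipped
```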
The modification solved the "need at least one array to stack" error, but the training still crashes with the message: "RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one."
When some batch has no gt, the 'loss_bbox' and 'loss_mask' entries in log_vars may not exist on some GPU, but the other GPUs that do have gts still perform all_reduce on 'loss_bbox' and 'loss_mask' across all GPUs; this mismatch makes the GPUs with gts wait permanently...
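To illustrate the hang (a simplified sketch, not mmdetection's exact loss-parsing code): dist.all_reduce is a collective call, so every rank must invoke it the same number of times in the same order. A rank whose batch has no gt boxes and is missing 'loss_bbox' skips that call, while the other ranks block inside it forever.

```python
import torch
import torch.distributed as dist

def reduce_log_vars(log_vars):
    """Average per-key scalar losses across ranks.

    If the set of keys differs between ranks (e.g. one rank has no
    'loss_bbox'), the ranks that do have the key block inside all_reduce
    waiting for the missing participant.
    """
    world_size = dist.get_world_size()
    reduced = {}
    for key, value in log_vars.items():
        tensor = value.clone()
        dist.all_reduce(tensor)  # collective call: every rank must join
        reduced[key] = tensor / world_size
    return reduced
```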
The above modification is just a temporary expedient, because the code in train.py is shared by all kinds of detectors. I think this should be fixed by modifying the detector classes so that they return zero losses when there are no gts in the batch; performing all_reduce over different process groups should also work (I think...).
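A hypothetical sketch of the "return zeros when there are no gts" idea (the function name and signature are illustrative): keep every loss key present on every rank, and keep the zero loss connected to the prediction tensor so DDP's reduction still sees gradients for that branch.

```python
import torch

def bbox_loss_or_zero(bbox_pred, bbox_targets, loss_fn):
    """Return the usual bbox loss, or a graph-connected zero if there are no gts."""
    if bbox_targets.numel() == 0:
        # Multiplying by 0 keeps the loss attached to bbox_pred, so backward
        # still visits this branch and the distributed reduction stays in sync.
        return bbox_pred.sum() * 0.0
    return loss_fn(bbox_pred, bbox_targets)
```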
So will it mess up the backpropagation?
I got a worse score with this modification.
This should be fixed by #2280. |
Environment:
My config file: cascade_rcnn_dconv_c3-c5_r50_fpn_1x.py, with only the ann path modified.
When filter_empty_gt=False (20% of my images are background only), there are not any error messages, but the training process is always waiting..., so I have to keyboard-interrupt it. When filter_empty_gt=True, everything is ok.
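For reference, filter_empty_gt is toggled in the dataset part of the config; a minimal sketch with placeholder paths (not the author's actual config):

```python
# Sketch of the relevant dataset config section; paths are placeholders and
# train_pipeline is assumed to be defined earlier in the config file.
data = dict(
    train=dict(
        type='CocoDataset',
        ann_file='data/coco/annotations/instances_train.json',  # placeholder
        img_prefix='data/coco/train/',                           # placeholder
        filter_empty_gt=False,  # keep background-only images in training
        pipeline=train_pipeline,
    ),
)
```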