This issue was moved to a discussion.
🌟💡 YOLOv5 Study: batch size #2377
Comments
@glenn-jocher Maybe the improvement is not significant when training for a large number of epochs. I ran experiments with batch sizes of 32 and 48 and got better results with the larger batch size. I trained for 50 epochs, and this happened on multiple datasets.
@abhiagwl4262 we always recommend training at the largest batch size possible. The above results don't indicate higher performance at higher batch sizes, so the reason is not better performance but faster training and better resource utilization. Multi-GPU may add another angle to the story though: larger batch sizes there may contribute to better results, at least in early training, since the batchnorm stats are split among your CUDA devices.
@glenn-jocher Is a large batch size good even for a very small dataset, e.g. 200 images per class?
@abhiagwl4262 maybe, as long as you maintain a similar number of iterations. For very small datasets this may require significantly increasing training epochs, i.e. to several thousand, or until you observe overfitting.
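As a rough illustration of "a similar number of iterations", the sketch below counts optimizer iterations for a given dataset size, batch size, and epoch count, and then estimates how many epochs a larger batch size would need to match that count. The function names and the example numbers (200 images, 300 epochs) are assumptions for illustration only, and gradient accumulation is ignored.

```python
import math


def optimizer_iterations(num_images: int, batch_size: int, epochs: int) -> int:
    """Number of batches processed over the whole run (one iteration per batch)."""
    batches_per_epoch = math.ceil(num_images / batch_size)
    return batches_per_epoch * epochs


def epochs_for_iterations(num_images: int, batch_size: int, target_iterations: int) -> int:
    """Epochs needed at `batch_size` to reach at least `target_iterations` batches."""
    batches_per_epoch = math.ceil(num_images / batch_size)
    return math.ceil(target_iterations / batches_per_epoch)


if __name__ == "__main__":
    # Hypothetical small dataset: 200 images, baseline of 300 epochs at batch size 16.
    baseline = optimizer_iterations(num_images=200, batch_size=16, epochs=300)
    needed = epochs_for_iterations(num_images=200, batch_size=64, target_iterations=baseline)
    print(f"baseline iterations: {baseline}, epochs needed at batch size 64: {needed}")
```

With these assumed numbers, moving from batch size 16 to 64 roughly triples the epochs needed to see the same number of batches, which is why very small datasets can end up needing far more epochs at large batch sizes.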
Hey, good thing to study. But I should note that results with SyncBN are not reproducible for me. I trained the YOLO m model on 8 Tesla A100 GPUs with batch size 256, because DDP only supports the gloo backend, and GPU 0 was loaded 50% more than the others (CUDA 11). It would be good to compare SyncBN with regular BN training.
@cszer thanks for the comments! Yes, a --sync study would be interesting as well. What are your observations with and without --sync? Excess CUDA device 0 memory usage was previously related to too-large batch sizes on device 0 when testing, but this bug was fixed on February 6th as part of PR #2148. If your results are from before that, you may want to update your code and see if the problem has been fixed.
1-2 mAP@0.5:0.95 lower on COCO
@cszer oh wow, that's a significant difference. Do you mean that you see a drop of -1 or -2 mAP on COCO when not using
@glenn-jocher One very strange observation: I am able to run batch size 48 on a single GPU, but not batch size 64 even on 2 GPUs. Is there a bug in the multi-GPU implementation?
@abhiagwl4262 if you believe you have a reproducible problem, please raise a new issue using the 🐛 Bug Report template, providing screenshots and a minimum reproducible example to help us better understand and diagnose your problem. Thank you!
Study 🤔
I did a quick study to examine the effect of varying batch size on YOLOv5 training. The study trained YOLOv5s on COCO for 300 epochs with `--batch-size` at 8 different values: `[16, 20, 32, 40, 64, 80, 96, 128]`.
We've tried to make the training code batch-size agnostic, so that users get similar results at any batch size. This means users on an 11 GB 2080 Ti should be able to produce the same results as users on a 24 GB 3090 or a 40 GB A100, with smaller GPUs simply using smaller batch sizes.
We do this by scaling loss with batch size, and also by scaling weight decay with batch size. At batch sizes smaller than 64 we accumulate loss before optimizing, and at batch sizes above 64 we optimize after every batch.
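The scaling described above can be sketched roughly as follows. This is a minimal illustration, not the repository's training code: the nominal batch size of 64 is taken from the threshold mentioned in the text, the rounding rule for the number of accumulation steps is an assumption, and the function names and the base weight decay value are hypothetical.

```python
NOMINAL_BATCH_SIZE = 64  # assumed reference batch size (the "64" threshold in the text)


def accumulation_steps(batch_size: int, nominal: int = NOMINAL_BATCH_SIZE) -> int:
    """Batches to accumulate before each optimizer step (assumed rounding rule)."""
    return max(round(nominal / batch_size), 1)


def scaled_weight_decay(base_weight_decay: float, batch_size: int,
                        nominal: int = NOMINAL_BATCH_SIZE) -> float:
    """Scale weight decay by the effective batch size relative to the nominal one."""
    effective_batch_size = batch_size * accumulation_steps(batch_size, nominal)
    return base_weight_decay * effective_batch_size / nominal


if __name__ == "__main__":
    base_wd = 0.0005  # hypothetical base weight decay, for illustration only
    for bs in [16, 20, 32, 40, 64, 80, 96, 128]:
        steps = accumulation_steps(bs)
        wd = scaled_weight_decay(base_wd, bs)
        print(f"batch size {bs:>3}: accumulate {steps} batch(es), "
              f"effective batch size {bs * steps:>3}, weight decay {wd:.6f}")
```

Under these assumptions, small batch sizes are brought close to an effective batch size of 64 through accumulation, while batch sizes of 64 and above step the optimizer after every batch and get a proportionally larger weight decay.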
Results 😃
Initial results vary significantly with batch size, but final results are nearly identical (good!).
Closeup of mAP@0.5:0.95:
One oddity that stood out is val objectness loss, which did vary with batch size. I'm not sure why, as val-box and val-cls did not vary much, and neither did the 3 train losses. I don't know what this means or whether there's any cause for concern (or room for improvement).