This issue was moved to a discussion.

🌟💡 YOLOv5 Study: batch size #2377

Closed
glenn-jocher opened this issue Mar 5, 2021 · 10 comments
@glenn-jocher
Member

glenn-jocher commented Mar 5, 2021

Study 🤔

I did a quick study to examine the effect of varying batch size on YOLOv5 training. The study trained YOLOv5s on COCO for 300 epochs with --batch-size at 8 different values: [16, 20, 32, 40, 64, 80, 96, 128].

We've tried to make the train code batch-size agnostic, so that users get similar results at any batch size. This means users on an 11 GB 2080 Ti should be able to produce the same results as users on a 24 GB 3090 or a 40 GB A100, with smaller GPUs simply using smaller batch sizes.

We do this by scaling loss with batch size, and also by scaling weight decay with batch size. At batch sizes smaller than 64 we accumulate over several batches before optimizing, and at batch sizes of 64 and above we optimize after every batch.
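
The accumulation and weight decay scaling described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's actual code: the nominal batch size of 64 is taken from the threshold mentioned above, the function names are hypothetical, and the base weight decay of 0.0005 is an arbitrary example value. Loss scaling is not shown.

```python
NOMINAL_BATCH_SIZE = 64  # assumption: the threshold of 64 mentioned in the post


def accumulation_steps(batch_size, nominal=NOMINAL_BATCH_SIZE):
    """Batches to accumulate before one optimizer step (hypothetical helper)."""
    return max(round(nominal / batch_size), 1)


def scaled_weight_decay(base_weight_decay, batch_size, nominal=NOMINAL_BATCH_SIZE):
    """Scale weight decay by the effective batch size relative to the nominal one."""
    effective_batch = batch_size * accumulation_steps(batch_size, nominal)
    return base_weight_decay * effective_batch / nominal


base_wd = 0.0005  # arbitrary example value, not taken from the post
for bs in [16, 20, 32, 40, 64, 80, 96, 128]:
    steps = accumulation_steps(bs)
    print(
        f"batch {bs:>3}: accumulate {steps} batch(es), "
        f"effective batch {bs * steps:>3}, "
        f"weight decay {scaled_weight_decay(base_wd, bs):.6f}"
    )
```

Under these assumptions, batch sizes 16 and 32 accumulate 4 and 2 batches respectively, so each optimizer step sees an effective batch of 64, while batch sizes of 64 and above step after every batch.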

Results 😃

Initial results vary significantly with batch size, but final results are nearly identical (good!).
[Image: Screen Shot 2021-03-05 at 1 22 03 PM]

Closeup of mAP@0.5:0.95:
[Image: Screen Shot 2021-03-05 at 1 27 33 PM]

One oddity that stood out is val objectness loss, which did vary with batch size. I'm not sure why, as val-box and val-cls did not vary much, and neither did the 3 train losses. I don't know what this means or if there's any cause for concern (or room for improvement).
[Image: Screen Shot 2021-03-05 at 1 27 21 PM]

@glenn-jocher added the documentation label and removed the question label on Mar 5, 2021
@glenn-jocher self-assigned this on Mar 5, 2021
@abhiagwl4262

@glenn-jocher Maybe we don't see a significant improvement when we train for a large number of epochs. I ran experiments with batch sizes of 32 and 48 and got better results with the larger batch size. I trained for 50 epochs, and this happened on multiple datasets.

@glenn-jocher
Member Author

@abhiagwl4262 we always recommend you train on the largest batch-size possible, not so much for better performance, as the above results don't indicate higher performance with higher batch size, but certainly for faster training and better resource utilization.

Multi-GPU may add another angle to the above story though, as larger batch sizes there may help contribute to better results, at least in early training, since the batchnorm stats are split there among your CUDA devices.
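
To make the multi-GPU point concrete, here is a small sketch of how many images each device's BatchNorm layers would see per step when the total batch is divided evenly across devices and statistics are not synchronized. The even split and the helper name are assumptions for illustration only.

```python
def per_device_batch(total_batch_size, num_devices):
    """Images per device per step, assuming an even split (hypothetical helper)."""
    if total_batch_size % num_devices != 0:
        raise ValueError("total batch size must be divisible by the number of devices")
    return total_batch_size // num_devices


for total in (64, 256):
    for devices in (1, 2, 8):
        print(
            f"total batch {total:>3} on {devices} device(s): "
            f"{per_device_batch(total, devices)} images per device"
        )
```

With a total batch of 64 on 8 devices, each device computes its BatchNorm statistics from only 8 images, which is why a larger total batch may matter more in that setting.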

@abhiagwl4262

@glenn-jocher Is a large batch size good even for a very small dataset, e.g. 200 images per class?

@glenn-jocher
Member Author

@abhiagwl4262 maybe, as long as you maintain a similar number of iterations. For very small datasets this may require significantly increasing training epochs, i.e. to several thousand, or until you observe overfitting.
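
As a rough worked example of keeping the iteration count similar, the sketch below computes how many epochs a larger batch size needs to match the number of optimizer iterations of a smaller one. The dataset size (200 images in total), the batch sizes, and the helper names are hypothetical, and the last partial batch of each epoch is assumed to be kept.

```python
import math


def iterations(num_images, batch_size, epochs):
    """Total batches processed, counting the last partial batch of each epoch."""
    return math.ceil(num_images / batch_size) * epochs


def epochs_to_match(target_iterations, num_images, batch_size):
    """Epochs needed at this batch size to reach at least target_iterations."""
    batches_per_epoch = math.ceil(num_images / batch_size)
    return math.ceil(target_iterations / batches_per_epoch)


num_images = 200  # hypothetical small dataset
baseline = iterations(num_images, batch_size=16, epochs=300)
needed = epochs_to_match(baseline, num_images, batch_size=64)
print(f"baseline iterations at batch 16 for 300 epochs: {baseline}")
print(f"epochs needed at batch 64 to match: {needed}")
```

In this hypothetical case, 300 epochs at batch size 16 give 3900 iterations, and matching that at batch size 64 takes 975 epochs.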

@cszer

cszer commented Mar 8, 2021

Hey, good thing to study. But I should note that results with SyncBN are not reproducible for me. I trained a YOLOv5m model on 8 Tesla A100 GPUs with batch size 256, because DDP only supports the gloo backend and GPU 0 was loaded 50% more than the others (CUDA 11). It would be good to compare SyncBN with regular BN training.

@glenn-jocher
Member Author

glenn-jocher commented Mar 8, 2021

@cszer thanks for the comments! Yes, a --sync-bn study would be interesting as well. What are your observations with and without --sync-bn?

Excess CUDA device 0 memory usage was previously related to too-large batch sizes on device 0 when testing, but this bug was fixed on February 6th as part of PR #2148. If your results are from before that then you may want to update your code and see if the problem has been fixed.

@cszer

cszer commented Mar 9, 2021

1-2 mAP@0.5:0.95 lower on COCO

@glenn-jocher
Member Author

@cszer oh wow, that's a significant difference. Do you mean that you see a drop of 1 or 2 mAP on COCO when not using --sync-bn on an 8x A100 YOLOv5m training at --batch 256? That's much larger than I would have expected. Did you train for 300 epochs?

@abhiagwl4262

@glenn-jocher One very strange observation: I am able to run batch size 48 on a single GPU, but I am not able to run batch size 64 even on 2 GPUs. Is there a bug in the multi-GPU implementation?

@glenn-jocher
Member Author

@abhiagwl4262 if you believe you have a reproducible problem, please raise a new issue using the 🐛 Bug Report template, providing screenshots and a minimum reproducible example to help us better understand and diagnose your problem. Thank you!
