
What is the effective batch size? #131

Closed
timsoraro opened this issue Mar 11, 2020 · 7 comments

Comments

@timsoraro

timsoraro commented Mar 11, 2020

Hello, and thank you for your contribution.

I noticed there are both:

train_micro_batch_size_per_gpu
train_batch_size

I'm not from an ML background; I just want to know how many data examples each GPU works with on every iteration of the for loop below, so I can set it to maximize my GPUs' memory usage and performance.

for i, batch in enumerate(trainloader):

How can I know that? And how can I set the above variables differently so as to change the effective batch size?

Let's say I have 240,000 data examples. When I do print(len(trainloader)), it accurately shows 30,000 (240,000 / 8 GPUs). But since my train_batch_size is 32, I would imagine there to be around 937 steps for one epoch, yet the above for loop still iterates 30,000 times.
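To make my arithmetic concrete, here is the step count I expected (a plain Python sketch of my possibly mistaken mental math; the 937 comes from assuming the effective batch is 32 per GPU across 8 GPUs):

```python
# Sketch of my expectation -- not DeepSpeed code, just the arithmetic.
num_examples = 240_000
num_gpus = 8

# len(trainloader) on each rank prints 30,000:
per_rank = num_examples // num_gpus
print(per_rank)  # 30000

# With a batch size of 32 per GPU, I expected roughly 937 steps per epoch:
expected_steps = num_examples // (32 * num_gpus)
print(expected_steps)  # 937
```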

@tjruwase
Contributor

tjruwase commented Mar 11, 2020

Thanks for using DeepSpeed.

The answer to your question is train_micro_batch_size_per_gpu, i.e., the number of data samples each GPU processes in each loop iteration. Since you are not from an ML background, let me further explain the relationship with train_batch_size and gradient_accumulation_steps.

gradient_accumulation_steps is the number of loop iterations before the model is updated (e.g., by calling model.step()), while train_batch_size is the total number of examples processed across all GPUs before the model is updated.
Thus, train_batch_size = #GPUs * train_micro_batch_size_per_gpu * gradient_accumulation_steps
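In plain Python, that relationship is just a product (a sketch for illustration, not DeepSpeed code):

```python
def effective_batch_size(num_gpus, micro_batch_per_gpu, grad_accum_steps):
    """train_batch_size = #GPUs * train_micro_batch_size_per_gpu * gradient_accumulation_steps"""
    return num_gpus * micro_batch_per_gpu * grad_accum_steps

# 8 GPUs, 32 samples per GPU per iteration, update every iteration:
print(effective_batch_size(8, 32, 1))  # 256
```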

For your specific example, let's assume the model is updated every loop iteration, i.e., gradient_accumulation_steps = 1. Yes, it would take around 937 steps to complete one epoch over the entire 240,000 data examples. However, this is how the parameters match up:

train_micro_batch_size_per_gpu = 32
gradient_accumulation_steps = 1
train_batch_size = 32 * 8 * 1 = 256
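For reference, a sketch of how those values would appear together in the DeepSpeed JSON config (other fields from your own setup omitted):

```json
{
  "train_batch_size": 256,
  "train_micro_batch_size_per_gpu": 32,
  "gradient_accumulation_steps": 1
}
```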

Hope that helps.

@timsoraro
Author

timsoraro commented Mar 12, 2020

Thank you very much for your detailed explanation; I get it now. But still, when I do:

print(len(trainloader))

It always prints 60,000 (240,000 / 4 GPUs), whereas I expect it to print 15,000 when train_micro_batch_size_per_gpu is 4 and train_batch_size is 16.
I checked, and changing the batch size does take effect on training (each step takes more time).

@timsoraro
Author

@tjruwase Can you please shed some light?

@tjruwase
Contributor

As far as I know, print(len(trainloader)) returns the total number of data samples in the rank, irrespective of batch size configuration. It is not an indicator of the number of training steps (or loop iterations) needed to process all the data examples.

@timsoraro
Author

Okay, so would I be right to assume that on every training step I'm processing train_micro_batch_size_per_gpu * #GPUs examples (assuming gradient_accumulation_steps is 1)?
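For my setup (4 GPUs, train_micro_batch_size_per_gpu = 4), that works out to (a quick sanity check, numbers from my own run):

```python
num_gpus = 4
train_micro_batch_size_per_gpu = 4
gradient_accumulation_steps = 1

# Examples processed across all GPUs per training step:
examples_per_step = num_gpus * train_micro_batch_size_per_gpu * gradient_accumulation_steps
print(examples_per_step)  # 16
```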

@tjruwase
Contributor

That is correct.

@timsoraro
Author

Thank you very much!
