What is the effective batch size? #131
Hello, and thank you for your contribution.
I noticed there are both train_batch_size and train_micro_batch_size_per_gpu settings in the configuration.
I'm not from an ML background; I just want to know how many data examples each GPU works with on every iteration of my training for loop, so I can set these to maximize my GPUs' memory and performance.
How can I know that? And how can I set the above variables differently so as to change the effective batch size?
Let's say I have 240,000 data examples. When I do print(len(trainloader)), it accurately shows 30,000 (240,000 / 8 GPUs). But since my train_batch_size is 32, I would imagine there to be around 937 steps for one epoch, yet my training for loop still walks over 30,000.

Comments
Thanks for using DeepSpeed. The answer to your question is train_micro_batch_size_per_gpu, i.e., the number of data samples each GPU processes in each loop iteration.
Since you are not from an ML background, let me further explain the relationship with train_batch_size and gradient_accumulation_steps. gradient_accumulation_steps is how many loop iterations are run before the model is updated, e.g., by calling model.step(), while train_batch_size is the total number of examples processed across all GPUs before the model is updated.
For your specific example, let's assume the model is updated on every loop iteration, i.e., gradient_accumulation_steps = 1. Yes, it would take around 937 steps to complete one epoch over the entire 240,000 data examples. This is how the parameters match up: train_micro_batch_size_per_gpu = 32, gradient_accumulation_steps = 1, and train_batch_size = 32 * 1 * 8 GPUs = 256, so one epoch is 240,000 / 256 ≈ 937 steps.
Hope that helps.
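To make the relationship concrete, here is a minimal sketch using the numbers from this thread (8 GPUs and gradient_accumulation_steps = 1 are the assumptions made above; the dictionary keys are the standard DeepSpeed config names):

```python
# Sketch of how DeepSpeed's batch-size settings relate, using the numbers
# discussed in this thread (8 GPUs, no gradient accumulation).
num_gpus = 8

ds_config = {
    "train_micro_batch_size_per_gpu": 32,   # samples per GPU per loop iteration
    "gradient_accumulation_steps": 1,       # loop iterations per model update
    "train_batch_size": 32 * 1 * num_gpus,  # micro_batch * grad_accum * num_gpus = 256
}

dataset_size = 240_000
steps_per_epoch = dataset_size / ds_config["train_batch_size"]
print(steps_per_epoch)  # 937.5, i.e. ~937 model updates per epoch
```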
Thank you very much for your detailed explanation, I get it now. But still, when I do print(len(trainloader)) it always prints 60,000 (240,000 / 4 GPUs), where I expect it to print 15,000 when train_micro_batch_size_per_gpu is 4 and train_batch_size is 16.
@tjruwase Can you please shed some light?
As far as I know, print(len(trainloader)) returns the total number of data samples in the rank, irrespective of the batch size configuration. It is not an indicator of the number of training steps (or loop iterations) needed to process all the data examples.
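In other words, the step count has to be derived from the per-rank sample count and the batch settings. A rough sketch with the values from this thread (4 GPUs, train_micro_batch_size_per_gpu = 4; gradient_accumulation_steps = 1 is an assumption):

```python
# Rough sketch: deriving loop iterations and model updates per epoch from the
# per-rank sample count reported above (gradient_accumulation_steps assumed = 1).
samples_per_rank = 60_000                # what len(trainloader) printed per GPU
train_micro_batch_size_per_gpu = 4       # samples per GPU per loop iteration
gradient_accumulation_steps = 1          # assumption, as in the earlier example

loop_iterations_per_epoch = samples_per_rank // train_micro_batch_size_per_gpu
model_updates_per_epoch = loop_iterations_per_epoch // gradient_accumulation_steps

print(loop_iterations_per_epoch)   # 15000 -- the number expected above
print(model_updates_per_epoch)     # 15000 when there is no gradient accumulation
```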
Okay, so would I be right to assume that on every training step I'm running train_micro_batch_size_per_gpu * number of GPUs examples (assuming gradient_accumulation_steps is 1)?
That is correct.
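For the configuration above, that is train_micro_batch_size_per_gpu * number of GPUs = 4 * 4 = 16 examples per step, which matches train_batch_size = 16.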
Thank you very much!