Fix Large-BS Dataloader Bug #835

Quentin-Anthony · 2023-03-14T18:49:12Z

Currently, if train_iters * seqlen * gradient_accumulation_steps * micro_batch_size * world_size > 2147483647 (the int32 maximum), our dataloader's sample_idx overflows, leading to this cryptic error:

  File "torch/utils/data/_utils/collate.py", line 141, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [2049] at entry 0 and [198705720] at entry 16

This happens once the torch dataloader reaches the overflowed sample_idx. Simply storing sample_idx as np.int64 instead of np.int32 would do the job, but would waste memory and disk on runs that don't need it. I added a simple switch between the default int32 and new int64 dataset builders and tested that it works.
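For illustration, here is a minimal sketch of the overflow condition described above. The helper names and the run configuration are hypothetical and not taken from the gpt-neox code; the sketch only shows how a value above the int32 maximum silently wraps when cast down to np.int32.

```python
import numpy as np

INT32_MAX = np.iinfo(np.int32).max  # 2147483647


def total_tokens(train_iters, seqlen, gradient_accumulation_steps,
                 micro_batch_size, world_size):
    # Hypothetical helper: the product from the description above.
    # Python ints are arbitrary precision, so this cannot overflow here.
    return (train_iters * seqlen * gradient_accumulation_steps
            * micro_batch_size * world_size)


def overflows_int32(n):
    # True when n cannot be represented as a signed 32-bit integer.
    return n > INT32_MAX


# Hypothetical run configuration, chosen only so the product exceeds INT32_MAX.
n = total_tokens(train_iters=10_000, seqlen=2048,
                 gradient_accumulation_steps=4,
                 micro_batch_size=8, world_size=8)

# Casting an int64 value down to int32 wraps modulo 2**32 without raising.
wrapped = np.array([n], dtype=np.int64).astype(np.int32)[0]

print(n, overflows_int32(n), int(wrapped))
```

The wrapped value is a valid-looking but wrong index, which is consistent with an error that only surfaces later, when batches are collated, rather than at the point of overflow.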

Quentin-Anthony · 2023-03-15T04:58:57Z

I went the easy route and just created two separate dataset building functions (build_sample_idx_int32 and build_sample_idx_int64), and now switch between them depending on whether the number of samples would overflow int32.

Now we won't waste memory with int64 for the majority of runs that won't overflow with int32. Tested with both overflowing and non-overflowing cases without issue.
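A rough sketch of the switching idea follows. It is not the PR's actual build_sample_idx_int32 / build_sample_idx_int64 code: pick_sample_idx_dtype and build_sample_idx_sketch are hypothetical names, and the toy index (one starting token offset per sample) is an assumption made purely to show the dtype selection.

```python
import numpy as np


def pick_sample_idx_dtype(max_value):
    # Hypothetical helper: use int32 when the largest stored value fits,
    # otherwise fall back to int64.
    return np.int32 if max_value <= np.iinfo(np.int32).max else np.int64


def build_sample_idx_sketch(num_samples, seq_length):
    # Toy stand-in for a sample index: entry i is the token offset at
    # which sample i starts. The real builders store more than this.
    dtype = pick_sample_idx_dtype(num_samples * seq_length)
    # Compute in int64 first, then cast to the chosen (safe) dtype.
    offsets = np.arange(num_samples + 1, dtype=np.int64) * seq_length
    return offsets.astype(dtype)


small = build_sample_idx_sketch(num_samples=4, seq_length=2048)
large = build_sample_idx_sketch(num_samples=3, seq_length=2**31)

print(small.dtype, large.dtype)
```

The int64 array uses twice the bytes per element, which is why the narrower dtype stays the default whenever the values fit.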

Prototype fix of large-bs dataloader

7d682df

Quentin-Anthony requested a review from a team as a code owner March 14, 2023 18:49

Quentin-Anthony requested review from StellaAthena and ShivanshuPurohit March 14, 2023 18:49

Quentin-Anthony marked this pull request as draft March 14, 2023 18:49

github-actions and others added 3 commits March 14, 2023 18:49

Update NeoXArgs docs automatically

3268af3

Allow the dataset builder to choose int32 or int64 at runtime

2d2eecd

Merge branch 'large_bs_dataloader' of https://github.com/EleutherAI/g…

5e2675e

…pt-neox into large_bs_dataloader

Quentin-Anthony marked this pull request as ready for review March 15, 2023 04:56

Merge branch 'main' into large_bs_dataloader

a7e3ac3

StellaAthena previously approved these changes Mar 16, 2023

Update NeoXArgs docs automatically

e25ded7

StellaAthena merged commit 91b72d9 into main Mar 16, 2023

github-actions bot dismissed StellaAthena’s stale review via e25ded7 March 16, 2023 04:03

Quentin-Anthony mentioned this pull request Mar 21, 2023

Data loader error when using example Enron training data #715

Closed

This was referenced Mar 29, 2023

Adding data to continue training failed. #860

Closed

Negative document indices caused by 64 bit integer stored in a 32 bit integer array. #493

Closed

Quentin-Anthony mentioned this pull request May 10, 2023

RuntimeError: stack expects each tensor to be equal size #929

Closed

StellaAthena deleted the large_bs_dataloader branch August 8, 2023 19:36

StellaAthena restored the large_bs_dataloader branch August 8, 2023 19:36

StellaAthena deleted the large_bs_dataloader branch October 13, 2023 14:54

Quentin-Anthony mentioned this pull request Nov 5, 2023

[BUG] Dataloader Overflow Errors Zyphra/Megatron-LM#3

Open
