
[BUG] Dataloader Overflow Errors #3

Open
Quentin-Anthony opened this issue Nov 5, 2023 · 1 comment
Labels
wontfix This will not be worked on

Comments

@Quentin-Anthony
Collaborator

@eric-weiss-zyphra discovered that upstream Megatron-LM still uses the old dataloader scheme (unlike gpt-neox), which leads to overflow errors like:

File "torch/utils/data/_utils/collate.py", line 141, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [2049] at entry 0 and [198705720] at entry 16

I created a solution for this a while back in EleutherAI/gpt-neox#835, which we should apply to Megatron-LM, verify that it works, and then contribute back upstream.
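For context, the traceback arises because PyTorch's `default_collate` stacks per-sample tensors with `torch.stack`, which requires every sample in the batch to have an identical shape, so a single overflowed sample poisons the whole batch. A stdlib-only sketch of that invariant (`validate_batch` and `expected_len` are hypothetical names for illustration, not Megatron-LM or PyTorch APIs):

```python
# Sketch of the size invariant torch.stack enforces inside default_collate:
# every sample in a batch must have the same length.

def validate_batch(batch, expected_len=2049):
    """Raise a descriptive error if any sample's length deviates from
    expected_len, mimicking the size check torch.stack performs."""
    for i, sample in enumerate(batch):
        if len(sample) != expected_len:
            raise RuntimeError(
                f"stack expects each tensor to be equal size, but got "
                f"[{expected_len}] at entry 0 and [{len(sample)}] at entry {i}"
            )
    return batch

# A healthy batch passes.
validate_batch([[0] * 2049 for _ in range(4)])

# One mis-sized sample (as in the traceback's entry 16) fails the whole batch.
try:
    validate_batch([[0] * 2049] * 16 + [[0] * 2050])
except RuntimeError as exc:
    print(exc)
```

In the reported error the offending entry's size (198705720) is far beyond any sequence length, consistent with a corrupt or overflowed sample index rather than a merely long sequence.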

@Quentin-Anthony
Collaborator Author

We haven't been able to reproduce this in a while, so deprioritizing for now.

@Quentin-Anthony Quentin-Anthony added the wontfix This will not be worked on label Nov 11, 2023