
[BUG] Dataloader Overflow Errors #3

Open
Quentin-Anthony opened this issue Nov 5, 2023 · 1 comment
Labels
wontfix This will not be worked on

Comments

@Quentin-Anthony
Collaborator

@eric-weiss-zyphra discovered that upstream Megatron-LM still uses the old dataloader scheme (unlike gpt-neox), which leads to overflow errors like:

File "torch/utils/data/_utils/collate.py", line 141, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [2049] at entry 0 and [198705720] at entry 16

I created a solution for this a while back in EleutherAI/gpt-neox#835, which we should apply to Megatron-LM, verify that it works, and then contribute back upstream.
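For context, the traceback arises because PyTorch's `default_collate` stacks per-sample tensors with `torch.stack`, which requires every sample in the batch to have an identical shape, so a single overflowed sample poisons the whole batch. A stdlib-only sketch of that invariant (`validate_batch` and `expected_len` are hypothetical names for illustration, not Megatron-LM or PyTorch APIs):

```python
# Sketch of the size invariant torch.stack enforces inside default_collate:
# every sample in a batch must have the same length.

def validate_batch(batch, expected_len=2049):
    """Raise a descriptive error if any sample's length deviates from
    expected_len, mimicking the size check torch.stack performs."""
    for i, sample in enumerate(batch):
        if len(sample) != expected_len:
            raise RuntimeError(
                f"stack expects each tensor to be equal size, but got "
                f"[{expected_len}] at entry 0 and [{len(sample)}] at entry {i}"
            )
    return batch

# A healthy batch passes.
validate_batch([[0] * 2049 for _ in range(4)])

# One mis-sized sample (as in the traceback's entry 16) fails the whole batch.
try:
    validate_batch([[0] * 2049] * 16 + [[0] * 2050])
except RuntimeError as exc:
    print(exc)
```

In the reported error the offending entry's size (198705720) is far beyond any sequence length, consistent with a corrupt or overflowed sample index rather than a merely long sequence.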

@Quentin-Anthony
Collaborator Author

We haven't been able to reproduce this in a while, so deprioritizing for now.

@Quentin-Anthony Quentin-Anthony added the wontfix This will not be worked on label Nov 11, 2023