[BUG][0.6.7] garbage output for multi-gpu with tutorial #2113
Comments
This is also reproducible on the GPT-J-6B model if you simply switch to it.
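For context, switching the tutorial script to GPT-J only changes the model string passed to the Hugging Face pipeline. This is a sketch rather than a confirmed repro command, and the exact model id is my assumption:

```python
from transformers import pipeline

# Same setup as the inference tutorial, but with the GPT-J checkpoint
# swapped in (model id assumed to be EleutherAI/gpt-j-6B).
generator = pipeline("text-generation",
                     model="EleutherAI/gpt-j-6B",  # instead of EleutherAI/gpt-neo-2.7B
                     device=0)
```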
I am also seeing this, but not with every model. I do see it when using the tutorial model as well, though.
Thank you for reporting this! I've verified we can repro this on our side as well, but only when using >1 GPUs. There's a gap currently in our CI tests for multi-GPU and certain models. We'll be fixing this as soon as possible.
Hi @zcrypt0 and @lanking520, sorry for the delay! I just pushed a fix for this. Could you please try it and see if the issue is fixed?
I installed from your PR commit. With the … In fact, with that model, I see the issue even when using … I also tested the script that @lanking520 posted and I get the following error:
I double checked by reverting the DeepSpeed installation to master and the test script still gives that error, so it's possible it's something in my environment, although other models seem to work.
@zcrypt0 I think this must be related to some issue with your CUDA driver/library, since you did not even pass the first phase of creating a cuBLAS handle. Could you please try reinstalling them?
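As a quick way to check that point (my own suggestion, not something posted in this thread): the first matmul on each device forces CUDA/cuBLAS initialization, so a failure here would point at the driver/library install rather than DeepSpeed.

```python
import torch

# Minimal CUDA/cuBLAS sanity check: the first matmul on each device
# creates a cuBLAS handle, so an error here indicates a broken
# CUDA driver/library install rather than a DeepSpeed bug.
for i in range(torch.cuda.device_count()):
    a = torch.randn(256, 256, device=f"cuda:{i}")
    b = torch.randn(256, 256, device=f"cuda:{i}")
    print(f"cuda:{i} OK, matmul sum = {(a @ b).sum().item():.3f}")
```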
@RezaYazdaniAminabadi I am going to test out this script on a set of Ampere GPUs and see how it goes. EDIT: I installed from master and ran the script on 2x A100s. This was the output.
I was also getting junk output following the tutorial. I can confirm that after building DeepSpeed from master the issue seems resolved for GPT-Neo 2.7B. I am, however, having another issue with regard to memory usage. Even when I specify torch.half (or torch.float16), the model seems to use the full VRAM on both GPUs. For example, running GPT-J on dual 3090s leads to OOM issues with usage over 24GB on each. Also, and perhaps I am misunderstanding the use of this tool, but isn't the VRAM usage supposed to be split over the multiple GPUs? So I would expect roughly 6-7GB usage per GPU rather than 24GB each. I give more details in #2227.
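For reference, this is roughly how I would expect fp16 plus tensor-parallel sharding to be requested (a sketch assuming the tutorial-style setup; the GPT-J model id, variable names, and memory printout are mine, not from this thread):

```python
import os

import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

# Load the model on this rank's GPU (GPT-J model id assumed).
pipe = pipeline("text-generation", model="EleutherAI/gpt-j-6B", device=local_rank)

# dtype=torch.half asks the injected kernels to run in fp16;
# mp_size=world_size asks DeepSpeed to shard the weights across ranks.
pipe.model = deepspeed.init_inference(pipe.model,
                                      mp_size=world_size,
                                      dtype=torch.half,
                                      replace_with_kernel_inject=True)

# Rough per-rank check: with tensor parallelism the allocated weight
# memory should be roughly model_size / world_size per GPU.
gib = torch.cuda.memory_allocated(local_rank) / 2**30
print(f"rank {local_rank}: {gib:.1f} GiB allocated")
```

Note that in this setup the pipeline call places the full fp32 checkpoint on each rank before init_inference shards and casts it, so peak memory per GPU can sit far above the steady-state sharded footprint; that may be related to what #2227 describes.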
Closing; the original issue is resolved and the new issue has been moved to #2227.
Describe the bug
When running with 2 GPUs, I started to see garbage output generated.
To Reproduce
I am running on a 2-GPU instance with V100s; this is also reproducible using A100s.
Just follow this example: https://www.deepspeed.ai/tutorials/inference-tutorial/
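For completeness, the linked tutorial script is roughly the following (reconstructed from the tutorial page; treat it as a sketch rather than the exact script run here), launched with `deepspeed --num_gpus 2 infer.py`:

```python
import os

import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

# Build a standard Hugging Face pipeline on this rank's GPU.
generator = pipeline("text-generation",
                     model="EleutherAI/gpt-neo-2.7B",
                     device=local_rank)

# Wrap the model with DeepSpeed inference: tensor-parallel across
# world_size GPUs, with the optimized kernels injected.
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.float,
                                           replace_with_kernel_inject=True)

output = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(output)
```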
Expected behavior
Output should just be normal generated text.
ds_report output
Screenshots
System info (please complete the following information):