[BUG] Cannot run DeepSpeed with transformers on NVIDIA Tesla T4 GPU #2001
Comments
Hi @lanking520, I just tried all your repro steps above and was not able to repro the stack trace. Can you confirm what |
Also, just to double check, you can run fine if you remove |
@jeffra Here is the pip list
After commenting the
Do you think this might be related to NVLink support?
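To help narrow down whether the failure depends on the GPU model, it may be useful to record exactly which devices PyTorch sees on each machine. The following is a minimal sketch and not part of the original report: `describe_gpus` is a hypothetical helper name, and the code assumes PyTorch may or may not be installed, returning an empty list when it is missing or CUDA is unavailable.

```python
def describe_gpus():
    """Return a list of dicts describing the CUDA devices PyTorch can see.

    Returns an empty list if PyTorch is not installed or CUDA is unavailable.
    """
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []

    devices = []
    for index in range(torch.cuda.device_count()):
        # Compute capability is reported as a (major, minor) tuple.
        major, minor = torch.cuda.get_device_capability(index)
        devices.append({
            "index": index,
            "name": torch.cuda.get_device_name(index),
            "compute_capability": f"{major}.{minor}",
        })
    return devices


if __name__ == "__main__":
    for device in describe_gpus():
        print(device)
```

On a Tesla T4 the compute capability should be reported as 7.5, so attaching this output alongside the pip list makes it easier to compare setups.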
@jeffra I just tested on an NVIDIA V100 machine with 4 GPUs and the code works fine, which matches what you tested. So I think we can narrow the issue down to NVIDIA Tesla T4 GPU compatibility. Some interesting findings: Output on single GPU:
Output on 4 GPUs doesn't look quite right:
Thanks for trying this on multiple GPUs. I will try this on my end to see what the issue is, and I will let you know once I fix it. Best,
I'm observing the same behavior on an A100-SXM4-40GB system. Using the same steps to reproduce as @lanking520, the single-GPU run works but the multi-GPU run produces garbage output.
I've tried running different versions of torch and cuda without any change in behavior. Update |
@lanking520, I think there are two separate issues in this thread.
Let's focus this issue on (1) and handle (2) in #2113. I finally have access to a T4 and I am still unable to reproduce the original issue. I've tried the same torch/deepspeed/transformers versions that you originally reported and also the latest version of each. I am also using the same docker container and setup as you. Can you confirm that the original issue is still reproducible?
Here's the entire log of the run using your code snippet as well: https://gist.github.com/jeffra/b6966e155a57ec388444e13dd8b66402 The generated text is:
Describe the bug
Cannot run DeepSpeed with transformers on Ubuntu 20.04 with a single GPU. GPU: NVIDIA T4
Error:
To Reproduce
Create container
Inside container
Code to run
Grab the code from tutorial: https://www.deepspeed.ai/tutorials/inference-tutorial/#:~:text=DeepSpeed%2DInference%20introduces%20several%20features,to%20reduce%20latency%20for%20inference.
run
ds_report output
Please run
ds_report
to give us details about your setup.
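When collecting the report from inside a container, it can be convenient to capture the `ds_report` output from a script so it can be pasted into the issue. This is a minimal sketch, assuming `ds_report` is on `PATH` once DeepSpeed is installed; `capture_ds_report` is a hypothetical helper name and not part of DeepSpeed.

```python
import shutil
import subprocess


def capture_ds_report():
    """Run ds_report and return its combined stdout/stderr as a string.

    Returns None if the ds_report executable is not found on PATH.
    """
    executable = shutil.which("ds_report")
    if executable is None:
        return None
    completed = subprocess.run(
        [executable],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so warnings are kept in order
        text=True,
        check=False,  # keep the output even if the command exits non-zero
    )
    return completed.stdout


if __name__ == "__main__":
    report = capture_ds_report()
    if report is None:
        print("ds_report not found; is DeepSpeed installed in this environment?")
    else:
        print(report)
```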