[Bug]: vLLM 0.5.3 is getting stuck at LLAMA 3.1 405B FP8 model loading #6700
Comments
Is the model downloaded?
@youkaichao The model is downloaded to disk.
Try following https://docs.vllm.ai/en/latest/getting_started/debugging.html? The output you posted is quite limited, and I can tell nothing from it.
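For reference, the switches that guide suggests can be set before vLLM is imported. A minimal sketch, assuming the environment variables documented in that guide (names may vary by vLLM version):

```python
# Enable verbose debugging output before importing vLLM, so the settings
# apply to every worker process. Variable names follow the linked guide;
# treat them as version-dependent.
import os

os.environ["VLLM_LOGGING_LEVEL"] = "DEBUG"  # verbose vLLM logs
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"    # synchronous CUDA calls, clearer tracebacks
os.environ["NCCL_DEBUG"] = "TRACE"          # detailed NCCL communication logs
os.environ["VLLM_TRACE_FUNCTION"] = "1"     # log every vLLM function call (very verbose)

from vllm import LLM  # import only after the environment is set
```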
@youkaichao The engine is just stuck at that point; it did not even start loading the model. So the problem is happening in the P2P detection mechanism.
Please report the detailed commands you ran, and turn on as much output as possible.
Anywhere you save the model. Run it 10 times and there is a chance you will hit this issue. The machine is stuck at P2P detection.
After a successful run, the P2P detection result should be stored in a file, so P2P detection will not be triggered the next time.
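A sketch of how to inspect or clear that cached result. The cache directory and file-name pattern below are assumptions based on vLLM's behavior at the time, not confirmed in this thread:

```python
# Inspect (or delete, to force re-detection) the cached P2P probe result.
# The path and glob pattern are assumptions; adjust for your vLLM version.
from pathlib import Path

cache_dir = Path.home() / ".cache" / "vllm"
for cache_file in cache_dir.glob("gpu_p2p_access_cache*.json"):
    print(cache_file)
    print(cache_file.read_text())
    # cache_file.unlink()  # uncomment to re-run P2P detection on next startup
```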
Some more logs. @youkaichao This is where it is stuck.
Stuck at this line for 20 minutes.
Follow https://docs.vllm.ai/en/latest/getting_started/debugging.html to see where it is stuck, then? It is difficult to tell the root cause from the log; sometimes it can be caused by a hardware/driver issue.
For those of you who encounter this: P2P communication between GPUs, when enabled, can be fast and efficient, but if there are hardware/driver issues it can be error-prone.
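One way to avoid that code path (my reading of the suggestion, not the commenter's exact words) is to disable vLLM's custom all-reduce, which is what triggers the P2P probe. `disable_custom_all_reduce` is an existing engine argument; the model name below is a placeholder:

```python
# Workaround sketch: fall back to NCCL all-reduce instead of the custom
# P2P-based kernel. "<your-model>" is a placeholder, not a real model ID.
from vllm import LLM

llm = LLM(
    model="<your-model>",
    tensor_parallel_size=8,
    disable_custom_all_reduce=True,  # skip the custom all-reduce / P2P path
)
```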
Reopening the issue, given we found it is repeatable.
So the issue is that we are encountering a race condition in the ZeroMQ module. There is a chance the sender/receiver end up in a deadlocked state. This is randomly reproducible on P5.48xlarge (H100 x 8) or G6.12xlarge (L4 x 4), with or without custom all-reduce. We are working on a workaround to resolve the issue.
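To illustrate the class of bug being described (this is not vLLM's actual code), here is a minimal pyzmq sketch of how a sender/receiver race can end in an apparent deadlock: with PUB/SUB sockets, a message published before the subscriber finishes connecting is silently dropped, so a receiver that blocks on recv() would wait forever:

```python
# Illustrative only: a ZeroMQ "slow joiner" race. The message sent before
# the subscriber connects is dropped, so a blocking recv() would hang.
import zmq

ctx = zmq.Context()

pub = ctx.socket(zmq.PUB)
pub.bind("inproc://demo")

sub = ctx.socket(zmq.SUB)
sub.setsockopt(zmq.SUBSCRIBE, b"")

pub.send(b"ready")            # race: no subscriber connected yet -> dropped
sub.connect("inproc://demo")

sub.setsockopt(zmq.RCVTIMEO, 1000)  # use a timeout so this demo does not hang
try:
    print(sub.recv())
except zmq.Again:
    print("recv timed out: message lost; a blocking receiver would deadlock")

pub.close()
sub.close()
ctx.term()
```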
Your current environment
🐛 Describe the bug
With bare minimum configs (the exact configuration was not included in this capture; a sketch follows).
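A hypothetical bare-minimum reproduction; the model ID and tensor_parallel_size are assumptions based on the issue title and the 8x H100 machine mentioned above, not the reporter's exact config:

```python
# Hypothetical minimal repro; model ID and parallelism are assumptions
# (Llama 3.1 405B FP8 across 8 GPUs), not the reporter's exact config.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
    tensor_parallel_size=8,
)
print(llm.generate("Hello")[0].outputs[0].text)
```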