vLLM running on a Ray Cluster Hanging on Initializing #2826
Comments
Hi, let me know if this solves your issue as well :)
Thanks for the idea. I tried it, but it didn't work for me: same hanging issue. However, I went off for dinner and came back to this message:
Which I think is a new error message compared to the thread I linked, but googling didn't give me any great insight into fixing it.
Which is always a little impressive, to be honest...
I am experiencing the same at the moment. For me, it happens with GPTQ quantisation with tp=4. I have tried the following settings / combinations of settings without any luck: NCCL_P2P_DISABLE=1, and the latest vLLM compiled from source. It hangs at approx. 12,995 MB of VRAM on each card across 4x 3090s, with a Llama 2 70B model. It finally hung at this after approx. 1 hour:
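For reference, settings like the one above have to be in the environment before the engine process initializes NCCL. A minimal sketch of applying them in-process (the value here is simply the one tried in this thread, not a confirmed fix; the NCCL_DEBUG line is my own addition for diagnosing where the hang occurs, not something the commenter used):

```python
import os

# Must run before anything initializes CUDA/NCCL (e.g. before importing vLLM);
# exporting these in the shell that launches the process is the safer option.
os.environ["NCCL_P2P_DISABLE"] = "1"  # disable peer-to-peer GPU transfers
os.environ["NCCL_DEBUG"] = "INFO"     # verbose NCCL logs to help spot where it hangs

print(os.environ["NCCL_P2P_DISABLE"])  # prints: 1
```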
I tried the same options as above, and used Ray, but it did not help. What did work was using a GPTQ model; it seems that only AWQ models hang (I only tried those two on multi-GPU).
I have the same issue. Did you find a solution, @ffolkes1911?
I have a similar issue, but it eventually works after about 40 minutes. I have described the details in #2959.
Hey @WoosuKwon. I just cloned the repo, built it, started Ray on two machines, and then initiated vLLM with a tensor parallel size of 4. The result is that vLLM hangs and never moves past "Initializing an LLM engine with config: ...". While that PR no doubt fixed some problem, it doesn't appear to have fixed this one: using a Ray cluster across two different machines results in vLLM hanging and not starting.
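For context, the two-machine setup described above typically looks like the following (a hedged sketch; the head-node address, port, and model name are placeholders, not values taken from this thread):

```shell
# On the head node (placeholder port; 6379 is Ray's default):
ray start --head --port=6379

# On the second machine, pointing at the head node:
ray start --address='HEAD_NODE_IP:6379'

# Then launch from the head node; with a tensor parallel size that spans
# GPUs on both machines, vLLM schedules its workers through Ray.
python -c "
from vllm import LLM
llm = LLM(model='MODEL_NAME', tensor_parallel_size=4)  # placeholder model
print(llm.generate('Hello'))
"
```

In the failure mode reported here, the final step prints the "Initializing an LLM engine with config: ..." line and then never returns.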
Thanks very much. |
Thank you very much, I have fixed the problem. The issue is that I have multiple network cards, so I use the
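The exact setting is cut off in the comment above. On multi-NIC machines, the NCCL and Gloo interface variables are the ones commonly involved — this is an assumption on my part, not something the commenter confirmed, and `eth0` is a placeholder interface name:

```shell
# Pin NCCL (and Gloo) to one interface on a multi-NIC machine.
# "eth0" is a placeholder -- substitute the interface that routes between nodes.
export NCCL_SOCKET_IFNAME=eth0
export GLOO_SOCKET_IFNAME=eth0
echo "$NCCL_SOCKET_IFNAME"
```

These need to be set in the environment of every Ray node before the vLLM workers start.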
Try the `ray stop` command; it worked for me.
How did you fix this? I get the same issue with vLLM version 0.3.3 on 2x A100 cards. Thanks in advance.
It isn't clear what is at fault here, whether it be vLLM or Ray.
There is a thread on the Ray forums that outlines the issue; it is 16 days old and has no replies:
https://discuss.ray.io/t/running-vllm-script-on-multi-node-cluster/13533
Quoting from that thread, since it is identical for me:
I have exactly this same problem. The thread details the other points: that "ray status" seems to show the nodes working and communicating, and that it stays like this for an age, then eventually crashes with some error messages. Everything in that thread is identical to what is happening for me.
Unfortunately, the Ray forums probably don't want to engage because it is vLLM, and I am concerned that vLLM won't want to engage because it is Ray...