Distributed inference on multi machine (error Invalid peer device id) #2795
I am experiencing the same issue trying to load a model across two EC2 instances. I get the same traceback as well. I'm running the latest versions of both vLLM and Ray with Python 3.11.
I too have encountered this issue. I would add that the Ray cluster is connected and a "View the dashboard at ..." message appears, but I then receive the same 'AssertionError: Invalid peer device id'. I noticed the line above it was "if not torch.cuda.can_device_access_peer(rank, i):", so I spun up a Python interpreter and tried that call out:
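A sketch of that kind of probe (not the commenter's exact snippet; it assumes PyTorch with at least one visible CUDA device):

```python
# Sketch (not the original snippet): probe peer access with the same torch
# call referenced above, torch.cuda.can_device_access_peer.
import torch

visible = torch.cuda.device_count()
print(f"GPUs visible to this process: {visible}")

# Pairwise peer-access check over the devices this process can actually see.
for src in range(visible):
    for dst in range(visible):
        if src != dst:
            ok = torch.cuda.can_device_access_peer(src, dst)
            print(f"GPU {src} -> GPU {dst}: {'P2P OK' if ok else 'no P2P'}")

# Passing a device index outside the visible range (e.g. when
# CUDA_VISIBLE_DEVICES hides all but one GPU) raises the same
# "AssertionError: Invalid peer device id" seen in the traceback.
try:
    torch.cuda.can_device_access_peer(0, visible)
except AssertionError as exc:
    print("Reproduced:", exc)
```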
Hopefully that will help someone narrow down what the issue is?
It looks like downgrading vLLM to v0.2.7 gets past the invalid peer device id error.
Thanks @umarbutler, I also downgraded and got past the device ID issue. Now I am connecting to the cluster, but it hangs at "Initializing an LLM engine" for the longest time, and eventually, some 20 minutes later, I get this message:
I then have to kill all the processes as it simply won't stop. This is attempting to run two machines on the same network using ray start --head on one and ray start --address=... on the other.
The cluster status shows the nodes, all connected; I can ping between the machines and the ports are open. I don't know why it is failing at this point.
@Kaotic3 I didn't encounter that issue after downgrading; however, I did run into an issue where my Ray nodes were not connecting to one another, I think because I had also downgraded Ray, assuming the latest version would cause compatibility issues with version 0.2.7 of vLLM. As it turns out, upgrading to the latest version of Ray solved that, so perhaps you could try that? As for the commands you're running, those are the same commands I'm running, so I'm not sure what the issue could be.
Thanks @umarbutler, I also downgraded and it now runs successfully. Regarding @Kaotic3's issue: I run "NCCL_SOCKET_IFNAME=eth0 python test.py" instead of "python test.py". I hope it will help you.
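For completeness, the same environment variable can be set inside the driver script instead of on the command line. This is only a sketch: it assumes eth0 is the interface the nodes use to reach each other, and it only affects the process it runs in.

```python
# Sketch only: the in-script equivalent of prefixing the command with
# NCCL_SOCKET_IFNAME=eth0. Assumes eth0 carries the node-to-node traffic,
# and only affects this (driver) process, so it must run before vLLM/NCCL
# are initialized.
import os

os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")
os.environ.setdefault("NCCL_DEBUG", "INFO")  # optional: verbose NCCL logs

from vllm import LLM  # import after the environment variables are set

# "vinai/PhoGPT-4B" is just the model mentioned elsewhere in this thread;
# substitute your own.
llm = LLM(model="vinai/PhoGPT-4B", tensor_parallel_size=2)
```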
@bieenr This issue should remain open; even if closing it was not a mistake, we still need it fixed in subsequent versions. Clearly, the fact that it works on the previous version indicates that a bug was introduced on the vLLM side.
This problem is due to the custom all-reduce kernel, which is now disabled by default. It should not persist in v0.3.1.
@WoosukKwon It's not a bug in the kernel itself; the P2P check just cannot run properly when CUDA_VISIBLE_DEVICES is set. It will be fixed by #2760.
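Until that fix ships, one possible workaround, assuming the installed vLLM version exposes the disable_custom_all_reduce engine argument, is to turn the custom all-reduce path off explicitly so the peer-access check is skipped:

```python
# Sketch of a possible workaround, assuming this vLLM version exposes the
# disable_custom_all_reduce engine argument (added alongside the custom
# all-reduce kernel). With it set, the custom all-reduce path and its
# peer-access check are skipped and plain NCCL all-reduce is used.
from vllm import LLM

llm = LLM(
    model="vinai/PhoGPT-4B",        # placeholder: the model named in the report below
    tensor_parallel_size=2,
    disable_custom_all_reduce=True,
)
```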
I'm a newbie, and I'm running the example at https://docs.vllm.ai/en/latest/serving/distributed_serving.html locally with two machines, each with an RTX 3090 GPU. I changed tensor_parallel_size to 2 and the model to "vinai/PhoGPT-4B", as in the sketch below.
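A minimal sketch of what the modified main.py would look like, following the tensor-parallel usage from those docs (the prompt and sampling settings here are placeholders):

```python
# main.py: sketch of the docs example with the changes described above
# (tensor_parallel_size=2 and the vinai/PhoGPT-4B model).
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is"]  # placeholder prompt
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# tensor_parallel_size=2 shards the model across the two GPUs in the Ray cluster.
llm = LLM(model="vinai/PhoGPT-4B", tensor_parallel_size=2)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```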
On the head node, I run:
NCCL_SOCKET_IFNAME=eth0 NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=0 ray start --head
On the other node, I run:
NCCL_SOCKET_IFNAME=eth0 NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=0 ray start --address='10.0.0.1'
Then, on the head node, when I run the example code (python main.py), I get the following error: