Distributed inference on multi machine (error Invalid peer device id) #2795
I am experiencing the same issue trying to load a model across two EC2 instances. I get the same traceback as well. I'm running the latest versions of both vLLM and Ray with Python 3.11.
I too have encountered this issue. I would add that the Ray cluster is connected and a "View the dashboard at ..." message appears, but I then receive the same 'AssertionError: Invalid peer device id'. I noticed the line above it was "if not torch.cuda.can_device_access_peer(rank, i):", so I spun up a Python interpreter and tried that call out:
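A sketch of that kind of probe (not the commenter's exact snippet; it assumes PyTorch with at least one visible CUDA device):

```python
# Sketch (not the original snippet): probe peer access with the same torch
# call referenced above, torch.cuda.can_device_access_peer.
import torch

visible = torch.cuda.device_count()
print(f"GPUs visible to this process: {visible}")

# Pairwise peer-access check over the devices this process can actually see.
for src in range(visible):
    for dst in range(visible):
        if src != dst:
            ok = torch.cuda.can_device_access_peer(src, dst)
            print(f"GPU {src} -> GPU {dst}: {'P2P OK' if ok else 'no P2P'}")

# Passing a device index outside the visible range (e.g. when
# CUDA_VISIBLE_DEVICES hides all but one GPU) raises the same
# "AssertionError: Invalid peer device id" seen in the traceback.
try:
    torch.cuda.can_device_access_peer(0, visible)
except AssertionError as exc:
    print("Reproduced:", exc)
```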
Hopefully that will help someone narrow down what the issue is?
It looks like downgrading vLLM to v0.2.7 gets past the invalid peer device id error.
Thanks @umarbutler, I also downgraded and got past the device ID issue. Now I am connecting to the cluster, but it hangs at "Initializing an LLM engine" for the longest time, and eventually, some 20 minutes later, I get this message:
I then have to kill all the processes as it simply won't stop. This is attempting to run two machines on the same network using ray start --head on one and ray start --address=... on the other.
The cluster status shows the nodes, all connected; I can ping between the machines and the ports are open. I don't know why it is failing at this point.
@Kaotic3 I didn't encounter that issue after downgrading; however, I did run into an issue where my Ray nodes were not connecting to one another, I think because I had also downgraded Ray, assuming the latest version would cause compatibility issues with version 0.2.7 of vLLM. As it turns out, upgrading to the latest version of Ray solved that, so perhaps you could try that? As for the commands you're running, those are the same commands I'm running, so I'm not sure what the issue could be.
Thanks @umarbutler, I also downgraded and it now runs successfully. Regarding @Kaotic3's issue: I run "NCCL_SOCKET_IFNAME=eth0 python test.py" instead of "python test.py". I hope it will help you.
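For completeness, the same environment variable can be set inside the driver script instead of on the command line. This is only a sketch: it assumes eth0 is the interface the nodes use to reach each other, and it only affects the process it runs in.

```python
# Sketch only: the in-script equivalent of prefixing the command with
# NCCL_SOCKET_IFNAME=eth0. Assumes eth0 carries the node-to-node traffic,
# and only affects this (driver) process, so it must run before vLLM/NCCL
# are initialized.
import os

os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")
os.environ.setdefault("NCCL_DEBUG", "INFO")  # optional: verbose NCCL logs

from vllm import LLM  # import after the environment variables are set

# "vinai/PhoGPT-4B" is just the model mentioned elsewhere in this thread;
# substitute your own.
llm = LLM(model="vinai/PhoGPT-4B", tensor_parallel_size=2)
```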
@bieenr This issue should remain open; even if closing it was not a mistake, we still need it fixed in subsequent versions. Clearly, the fact that it works on the previous version indicates that a bug was introduced on the vLLM side.
This problem is due to the custom all-reduce kernel, which is now disabled by default. It should not persist in v0.3.1.
@WoosukKwon It's not a bug in the kernel itself; the P2P check just cannot run properly when CUDA_VISIBLE_DEVICES is set. It will be fixed by #2760.
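Until that fix ships, one possible workaround, assuming the installed vLLM version exposes the disable_custom_all_reduce engine argument, is to turn the custom all-reduce path off explicitly so the peer-access check is skipped:

```python
# Sketch of a possible workaround, assuming this vLLM version exposes the
# disable_custom_all_reduce engine argument (added alongside the custom
# all-reduce kernel). With it set, the custom all-reduce path and its
# peer-access check are skipped and plain NCCL all-reduce is used.
from vllm import LLM

llm = LLM(
    model="vinai/PhoGPT-4B",        # placeholder: the model named in the report below
    tensor_parallel_size=2,
    disable_custom_all_reduce=True,
)
```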
I'm a newbie, and I'm running the example at https://docs.vllm.ai/en/latest/serving/distributed_serving.html locally with two machines, each with an RTX 3090 GPU. I changed tensor_parallel_size to 2 and the model to "vinai/PhoGPT-4B", as in the sketch below.
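A minimal sketch of what the modified main.py would look like, following the tensor-parallel usage from those docs (the prompt and sampling settings here are placeholders):

```python
# main.py: sketch of the docs example with the changes described above
# (tensor_parallel_size=2 and the vinai/PhoGPT-4B model).
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is"]  # placeholder prompt
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# tensor_parallel_size=2 shards the model across the two GPUs in the Ray cluster.
llm = LLM(model="vinai/PhoGPT-4B", tensor_parallel_size=2)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```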
On the head node, I run:
NCCL_SOCKET_IFNAME=eth0 NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=0 ray start --head
On the other node, I run:
NCCL_SOCKET_IFNAME=eth0 NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=0 ray start --address='10.0.0.1'
Then, on the head node, when I run the example code (python main.py), I get the following error: