[Core][Distributed] improve p2p access check #4992

youkaichao · 2024-05-22T21:52:28Z

Done

youkaichao · 2024-05-22T22:58:39Z

Previously, we used the following check to test actual P2P access in case the CUDA driver is broken:

# code partly borrowed from
# https://github.com/turboderp/exllamav2/blob/1c67f97f3d2a968605a9c31ab791a05c85bb7879/exllamav2/compat.py#L10
# License: MIT
def _can_actually_p2p(idx_a, idx_b):
    dev_i = f"cuda:{idx_a}"
    dev_j = f"cuda:{idx_b}"
    a = torch.randn(5, device=dev_i) + 123.0
    b = a.to(dev_j)
    c = b.to(dev_i)
    return torch.all(a == c).cpu().item()

However, PyTorch now somehow works around the bug, so this check always returns True, regardless of whether P2P is actually available:

import torch
torch.cuda.can_device_access_peer(0, 1) # False
_can_actually_p2p(0, 1) # True

This was reported in #4770 (comment).
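
To see on a given machine which device pairs show this disagreement, a small helper along the following lines could be used. This is only a sketch: `find_p2p_mismatches` is a hypothetical name, not part of vLLM or PyTorch, and the two checks are passed in as callables so the helper itself does not need a GPU.

```python
from itertools import permutations
from typing import Callable, List, Tuple


# Hypothetical helper for illustration; not part of vLLM or PyTorch.
def find_p2p_mismatches(
    num_devices: int,
    driver_check: Callable[[int, int], bool],
    copy_check: Callable[[int, int], bool],
) -> List[Tuple[int, int]]:
    """Return ordered device pairs where the copy-based check passes
    although the driver reports that peer access is unavailable."""
    mismatches = []
    for src, dst in permutations(range(num_devices), 2):
        if copy_check(src, dst) and not driver_check(src, dst):
            mismatches.append((src, dst))
    return mismatches
```

On a multi-GPU machine, this could be called as `find_p2p_mismatches(torch.cuda.device_count(), torch.cuda.can_device_access_peer, _can_actually_p2p)`. Any pair it returns is one where the round-trip copy check cannot be trusted as evidence of P2P access.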

youkaichao · 2024-05-22T23:00:04Z

cc @hanzhi713

youkaichao · 2024-05-23T06:57:29Z

@WoosukKwon ready for review

WoosukKwon

LGTM! Thanks for the PR! Left some minor comments.

vllm/distributed/device_communicators/custom_all_reduce.py

vllm/distributed/device_communicators/custom_all_reduce_utils.py

youkaichao · 2024-05-29T06:23:03Z

@WoosukKwon thanks for the very detailed review!

youkaichao · 2024-05-29T07:21:23Z

Since we still don't have CI machines with P2P capability, I tested this PR locally.

cc @simon-mo for NVLink machines.

youkaichao added 6 commits May 22, 2024 14:51

move files

2a54ffb

add files

31d215c

fix format

dc06672

fix import

b7ed666

add verbose comments

51e1f59

enforce CUDA_VISIBLE_DEVICES

fad32de

youkaichao changed the title ~~[WIP][Core][Distributed] improve p2p access check~~ [Core][Distributed] improve p2p access check May 22, 2024

update comments

20324bb

youkaichao requested a review from WoosukKwon May 23, 2024 06:57

youkaichao added 2 commits May 23, 2024 09:51

add nv forum link

c56b38e

Merge branch 'main' into p2p_check

9c7b8b8

WoosukKwon self-assigned this May 28, 2024

WoosukKwon approved these changes May 29, 2024


youkaichao added 6 commits May 28, 2024 23:01

Merge remote-tracking branch 'origin' into p2p_check

f31939d

add type annotation

1c21915

only allow export one function

770e508

fix format

b730c80

use strict assert

89f6bd3

change init

34900d3

fix fork machines

4047484

youkaichao enabled auto-merge (squash) May 29, 2024 07:21

Merge remote-tracking branch 'origin' into p2p_check

dab94e8

youkaichao merged commit 594392d into vllm-project:main May 29, 2024
64 checks passed

blinkbear pushed a commit to blinkbear/vllm that referenced this pull request May 29, 2024

[Core][Distributed] improve p2p access check (vllm-project#4992)

c859d82

youkaichao deleted the p2p_check branch May 29, 2024 15:16

dtrifiro pushed a commit to opendatahub-io/vllm that referenced this pull request May 31, 2024

[Core][Distributed] improve p2p access check (vllm-project#4992)

6bdfb4f

robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request Jun 8, 2024

[Core][Distributed] improve p2p access check (vllm-project#4992)

5bde5ba

joerunde pushed a commit to joerunde/vllm that referenced this pull request Jun 17, 2024

[Core][Distributed] improve p2p access check (vllm-project#4992)

b484450

robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request Jul 14, 2024

[Core][Distributed] improve p2p access check (vllm-project#4992)

3986c3e

Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024

[Core][Distributed] improve p2p access check (vllm-project#4992)

aa414c5
