-
-
Notifications
You must be signed in to change notification settings - Fork 4.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core][Distributed] improve p2p access check #4992
Conversation
Previously, we use the following check for actual p2p access in case cuda driver is broken: # code partly borrowed from
# https://github.com/turboderp/exllamav2/blob/1c67f97f3d2a968605a9c31ab791a05c85bb7879/exllamav2/compat.py#L10
# License: MIT
def _can_actually_p2p(idx_a, idx_b):
dev_i = f"cuda:{idx_a}"
dev_j = f"cuda:{idx_b}"
a = torch.randn(5, device=dev_i) + 123.0
b = a.to(dev_j)
c = b.to(dev_i)
return torch.all(a == c).cpu().item() However, pytorch somehow fixes the bug, and it will always return import torch
torch.cuda.can_device_access_peer(0, 1) # False
_can_actually_p2p(0, 1) # True This is reported in #4770 (comment) . |
cc @hanzhi713 |
@WoosukKwon ready for review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for the PR! Left some minor comments.
vllm/distributed/device_communicators/custom_all_reduce_utils.py
Outdated
Show resolved
Hide resolved
vllm/distributed/device_communicators/custom_all_reduce_utils.py
Outdated
Show resolved
Hide resolved
vllm/distributed/device_communicators/custom_all_reduce_utils.py
Outdated
Show resolved
Hide resolved
vllm/distributed/device_communicators/custom_all_reduce_utils.py
Outdated
Show resolved
Hide resolved
vllm/distributed/device_communicators/custom_all_reduce_utils.py
Outdated
Show resolved
Hide resolved
vllm/distributed/device_communicators/custom_all_reduce_utils.py
Outdated
Show resolved
Hide resolved
@WoosukKwon thanks for the very detailed review! |
Since we still don't have ci machines with p2p capability, I tested this PR locally. cc @simon-mo for nvlink machines. |
Done