[Core][Distributed] refactor custom allreduce to support multiple tp groups #4754
Conversation
@hanzhi713 Could you please also take a look if you have time?
@youkaichao Thanks for submitting the PR and many thanks for walking through the PR offline. Please check out my comments, most of which are style issues.
from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator
_TP_PYNCCL_COMMUNICATOR = PyNcclCommunicator(
    group=_TP_CPU_GROUP,
    device=_LOCAL_RANK,
)

# Initialize a custom fast all-reduce implementation.
if _ENABLE_CUSTOM_ALL_REDUCE:
    from vllm.distributed.device_communicators.custom_all_reduce import (
Why do we need lazy import here?
The circular import: `vllm/distributed/__init__.py` imports `parallel_state` and `communication_op`. If `parallel_state` imports from `vllm.distributed.device_communicators.custom_all_reduce` at the top level, this creates a circular import, because `custom_all_reduce` imports `get_tensor_model_parallel_cpu_group` from `parallel_state`.
Therefore, either `parallel_state` or `custom_all_reduce` has to use a lazy import to break the cycle. I use the lazy import in `parallel_state`, to be consistent with how we import `pynccl`.
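For readers following along, here is a minimal sketch of the lazy-import pattern being described. The module layout mirrors the diff above, but the helper name `init_custom_all_reduce` and the exact constructor signature are illustrative assumptions, not the actual vLLM code.

```python
# parallel_state.py (sketch): custom_all_reduce is imported lazily because it
# imports get_tensor_model_parallel_cpu_group from parallel_state at its own
# top level, which would otherwise close the import cycle.
_CA_HANDLE = None  # set during distributed initialization


def init_custom_all_reduce(cpu_group, local_rank):
    """Create the custom all-reduce communicator for the current TP group.

    Hypothetical helper for illustration; the real initialization lives in
    the module-level setup code shown in the diff above.
    """
    global _CA_HANDLE
    # Deferred import: by the time this runs, parallel_state is fully loaded,
    # so custom_all_reduce's top-level import back into parallel_state
    # resolves without a circular-import error.
    from vllm.distributed.device_communicators.custom_all_reduce import (
        CustomAllreduce)
    _CA_HANDLE = CustomAllreduce(group=cpu_group, device=local_rank)
```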
for sz in test_sizes:
    for dtype in [torch.float32, torch.float16, torch.bfloat16]:
nit: Maybe we can use `dtype` as an input parameter so that the dtypes can be tested separately?
Quite difficult. The input of `test_custom_allreduce` is coupled with the input of `multi_process_tensor_parallel`, which is used elsewhere. In other words, we cannot change the input parameters within just `tests/distributed/test_custom_all_reduce.py` 🤣
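To make the coupling concrete, here is a self-contained sketch of the pattern being described. `multi_process_tensor_parallel` below is a hypothetical stand-in with an assumed signature; the real test utility in vLLM may differ.

```python
import torch
import torch.multiprocessing as mp

TEST_SIZES = [512, 4096]


def multi_process_tensor_parallel(tp_size, test_target):
    # Hypothetical runner: every test_target it drives must accept the same
    # fixed argument list (tp_size, rank, distributed_init_port).
    procs = []
    for rank in range(tp_size):
        p = mp.Process(target=test_target, args=(tp_size, rank, 29500))
        p.start()
        procs.append(p)
    for p in procs:
        p.join()


def test_custom_allreduce(tp_size, rank, distributed_init_port):
    # Because the shared runner fixes this signature, `dtype` cannot be added
    # as a parameter for this test alone; it is looped over inside the body.
    for sz in TEST_SIZES:
        for dtype in [torch.float32, torch.float16, torch.bfloat16]:
            _ = torch.ones(sz, dtype=dtype)  # placeholder for the real check


if __name__ == "__main__":
    multi_process_tensor_parallel(2, test_custom_allreduce)
```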
Your email has been received; I will reply as soon as possible!
LGTM! Thanks for addressing my comments!
LGTM
Base branch was modified
Previously, custom allreduce was attached to a module and bound only to the world group.
With this PR, it is correctly bound to the TP group.
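As a rough sketch of what the new binding looks like (mirroring the diff excerpt earlier in the thread; the variable name `_TP_CA_COMMUNICATOR` and the `CustomAllreduce` constructor arguments are assumptions for illustration):

```python
# Sketch: communicators are created against the current rank's TP group, so
# when there are multiple TP groups (e.g. combined with pipeline parallelism)
# each group gets its own handle instead of sharing one bound to the world
# group.
from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator

_TP_PYNCCL_COMMUNICATOR = PyNcclCommunicator(
    group=_TP_CPU_GROUP,   # CPU process group of this rank's TP group
    device=_LOCAL_RANK,    # local CUDA device for this rank
)

if _ENABLE_CUSTOM_ALL_REDUCE:
    # Lazy import, per the discussion above.
    from vllm.distributed.device_communicators.custom_all_reduce import (
        CustomAllreduce)
    _TP_CA_COMMUNICATOR = CustomAllreduce(
        group=_TP_CPU_GROUP,
        device=_LOCAL_RANK,
    )
```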