[Core][Distributed] refactor custom allreduce to support multiple tp groups #4754
Conversation
@hanzhi713 Could you please also take a look if you have time?
@youkaichao Thanks for submitting the PR and many thanks for walking through the PR offline. Please check out my comments, most of which are style issues.
from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator
_TP_PYNCCL_COMMUNICATOR = PyNcclCommunicator(
    group=_TP_CPU_GROUP,
    device=_LOCAL_RANK,
)

# Initialize a custom fast all-reduce implementation.
if _ENABLE_CUSTOM_ALL_REDUCE:
    from vllm.distributed.device_communicators.custom_all_reduce import (
Why do we need lazy import here?
The circular import: `vllm/distributed/__init__.py` imports `parallel_state` and `communication_op`. If `parallel_state` imports from `vllm.distributed.device_communicators.custom_all_reduce` at the top level, this creates a circular import, because `custom_all_reduce` imports `get_tensor_model_parallel_cpu_group` from `parallel_state`.
Therefore, either `parallel_state` or `custom_all_reduce` has to use a lazy import to break the cycle. I use the lazy import in `parallel_state`, to be consistent with how we import `pynccl`.
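For readers following along, here is a minimal sketch of the lazy-import pattern being described. The module layout mirrors the diff above, but the helper name `init_custom_all_reduce` and the exact constructor signature are illustrative assumptions, not the actual vLLM code.

```python
# parallel_state.py (sketch): custom_all_reduce is imported lazily because it
# imports get_tensor_model_parallel_cpu_group from parallel_state at its own
# top level, which would otherwise close the import cycle.
_CA_HANDLE = None  # set during distributed initialization


def init_custom_all_reduce(cpu_group, local_rank):
    """Create the custom all-reduce communicator for the current TP group.

    Hypothetical helper for illustration; the real initialization lives in
    the module-level setup code shown in the diff above.
    """
    global _CA_HANDLE
    # Deferred import: by the time this runs, parallel_state is fully loaded,
    # so custom_all_reduce's top-level import back into parallel_state
    # resolves without a circular-import error.
    from vllm.distributed.device_communicators.custom_all_reduce import (
        CustomAllreduce)
    _CA_HANDLE = CustomAllreduce(group=cpu_group, device=local_rank)
```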
for sz in test_sizes:
    for dtype in [torch.float32, torch.float16, torch.bfloat16]:
nit: Maybe we can use `dtype` as an input parameter so that the dtypes can be tested separately?
Quite difficult. The input of `test_custom_allreduce` is coupled with the input of `multi_process_tensor_parallel`, which is used elsewhere. In other words, we cannot change the input parameters within just `tests/distributed/test_custom_all_reduce.py` 🤣
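To make the coupling concrete, here is a self-contained sketch of the pattern being described. `multi_process_tensor_parallel` below is a hypothetical stand-in with an assumed signature; the real test utility in vLLM may differ.

```python
import torch
import torch.multiprocessing as mp

TEST_SIZES = [512, 4096]


def multi_process_tensor_parallel(tp_size, test_target):
    # Hypothetical runner: every test_target it drives must accept the same
    # fixed argument list (tp_size, rank, distributed_init_port).
    procs = []
    for rank in range(tp_size):
        p = mp.Process(target=test_target, args=(tp_size, rank, 29500))
        p.start()
        procs.append(p)
    for p in procs:
        p.join()


def test_custom_allreduce(tp_size, rank, distributed_init_port):
    # Because the shared runner fixes this signature, `dtype` cannot be added
    # as a parameter for this test alone; it is looped over inside the body.
    for sz in TEST_SIZES:
        for dtype in [torch.float32, torch.float16, torch.bfloat16]:
            _ = torch.ones(sz, dtype=dtype)  # placeholder for the real check


if __name__ == "__main__":
    multi_process_tensor_parallel(2, test_custom_allreduce)
```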
Your email has been received; I will reply as soon as possible!
LGTM! Thanks for addressing my comments!
LGTM
Base branch was modified
Previously, custom allreduce was attached to a module and bound only to the world group.
With this PR, it is correctly bound to the TP group.
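As a rough sketch of what the new binding looks like (mirroring the diff excerpt earlier in the thread; the variable name `_TP_CA_COMMUNICATOR` and the `CustomAllreduce` constructor arguments are assumptions for illustration):

```python
# Sketch: communicators are created against the current rank's TP group, so
# when there are multiple TP groups (e.g. combined with pipeline parallelism)
# each group gets its own handle instead of sharing one bound to the world
# group.
from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator

_TP_PYNCCL_COMMUNICATOR = PyNcclCommunicator(
    group=_TP_CPU_GROUP,   # CPU process group of this rank's TP group
    device=_LOCAL_RANK,    # local CUDA device for this rank
)

if _ENABLE_CUSTOM_ALL_REDUCE:
    # Lazy import, per the discussion above.
    from vllm.distributed.device_communicators.custom_all_reduce import (
        CustomAllreduce)
    _TP_CA_COMMUNICATOR = CustomAllreduce(
        group=_TP_CPU_GROUP,
        device=_LOCAL_RANK,
    )
```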