[Core] separate distributed_init from worker #3904

Merged: 13 commits into vllm-project:main on Apr 9, 2024

Conversation

youkaichao
Member

Currently, distributed_init is coupled with the worker: it requires a parallel_config and only works with the GPU backend (nccl).

This PR is the first step in refactoring distributed_init to make it general (it only depends on the CPU and does not need any vLLM-internal types).

This way, distributed_init can be used standalone, e.g. in tests or on devices other than GPUs. As a demonstration, we also use it in the CPU backend.

Going further, I plan to move vllm.model_executor.parallel_utils to vllm.parallel_utils, because it should have nothing to do with model_executor.

@youkaichao youkaichao requested a review from zhuohan123 April 7, 2024 23:20
@youkaichao
Member Author

With this refactor, we have the following process groups (a usage sketch follows the list):

  • global process group, with the gloo backend, usable in any situation; initialized by vllm.model_executor.parallel_utils.parallel_state.init_distributed_environment
  • tensor parallel process group, with a user-specified backend (default nccl), used for tensor-parallel communication; initialized by vllm.model_executor.parallel_utils.parallel_state.ensure_model_parallel_initialized with a backend option
  • pipeline parallel process group, currently initialized together with the tensor parallel process group
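
As a rough illustration (not code from this PR), a test or a CPU-only backend could bring these groups up along the following lines; the exact argument names and defaults below are assumptions for illustration only:

# sketch.py (illustration only; argument names are assumptions)
from vllm.model_executor.parallel_utils.parallel_state import (
    init_distributed_environment, ensure_model_parallel_initialized)

# Global process group: gloo backend, CPU-only, so it works on any device.
init_distributed_environment(world_size=1, rank=0,
                             distributed_init_method="tcp://127.0.0.1:29500")

# Tensor/pipeline parallel groups: backend is user-specified (default nccl);
# gloo is chosen here so the sketch also runs on a CPU-only machine.
ensure_model_parallel_initialized(tensor_model_parallel_size=1,
                                  pipeline_model_parallel_size=1,
                                  backend="gloo")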

@youkaichao youkaichao requested a review from WoosukKwon April 8, 2024 01:39
@zhuohan123 zhuohan123 self-assigned this Apr 8, 2024
@youkaichao
Member Author

What caused confusion here is the advanced usage of torch.distributed.

I asked the PyTorch team and got the following very helpful guidance:

  • Inside every process, dist.init_process_group should be called only once.
  • After that, processes can form new groups via dist.new_group, with ranks or a backend different from those used in dist.init_process_group.
  • The nccl backend needs more care: it requires a proper torch.cuda.set_device(dist.get_rank()) and cannot be used for collectives among processes sharing a single GPU. gloo is more general and more stable, and works for both CPU and GPU tensors, but is slower than nccl in the multi-GPU setting (nccl's designed use case).

Here is a basic example:

# test.py
import torch
import torch.distributed as dist
dist.init_process_group(backend='gloo')
group0 = dist.group.WORLD
data = torch.ones((5, 5, 5))
dist.all_reduce(data)
# prints 4.0: gloo all-reduces the CPU tensor across the 4 processes
print(f"{data.mean().item()}")
if dist.get_rank() in [0, 1]:
    group1 = dist.new_group(ranks=[0, 1], backend="nccl")
    torch.cuda.set_device(dist.get_rank())
else:
    group1 = dist.new_group(ranks=[2, 3], backend="gloo")
data1 = torch.ones((5, 5, 5)).cuda()
# prints 2.0
# for ranks 0 and 1, the allreduce is done by nccl and requires the torch.cuda.set_device call above
# for ranks 2 and 3, the allreduce is done by gloo; no device setup is needed, and both CPU and GPU tensors work
dist.all_reduce(data1, group=group1)
print(f"{data1.mean().cpu().item()}")

Running this with torchrun --nproc-per-node 4 test.py, we get:

4.0
4.0
4.0
4.0
2.0
2.0
2.0
2.0

@youkaichao
Member Author

In fact, even when we call dist.init_process_group(backend='nccl'), the nccl communicator is not created until the first GPU communication happens:

# test.py
import torch
import torch.distributed as dist
dist.init_process_group(backend='nccl')  # no NCCL communicator is created yet
torch.cuda.set_device(dist.get_rank())
a = torch.ones((5, 5, 5)).cuda()
print("before broadcast")
dist.broadcast(a, src=0)  # the NCCL communicator is initialized lazily here
print(f"{a.mean().cpu().item()}")

Run with export NCCL_DEBUG=TRACE; torchrun --nproc-per-node 2 test.py:

--omitted--
before broadcast
before broadcast
--omitted--
flaminio:2341839:2341901 [0] NCCL INFO comm 0x6cc3ce0 rank 0 nranks 2 cudaDev 0 busId 6000 commId 0x82b0333d5875b736 - Init COMPLETE
flaminio:2341840:2341902 [1] NCCL INFO comm 0x85fbb30 rank 1 nranks 2 cudaDev 1 busId 7000 commId 0x82b0333d5875b736 - Init COMPLETE
1.0
1.0
flaminio:2341839:2341907 [0] NCCL INFO [Service thread] Connection closed by localRank 0
flaminio:2341840:2341908 [1] NCCL INFO [Service thread] Connection closed by localRank 1
flaminio:2341839:2341839 [0] NCCL INFO comm 0x6cc3ce0 rank 0 nranks 2 cudaDev 0 busId 6000 - Abort COMPLETE
flaminio:2341840:2341840 [1] NCCL INFO comm 0x85fbb30 rank 1 nranks 2 cudaDev 1 busId 7000 - Abort COMPLETE

@zhuohan123 (Member) left a comment

Please change based on our offline discussion. Specifically, make sure the default torch communication still uses the device-specific communicator; then we can have a separate communication group just for CPU (gloo) communication. A sketch of this layout follows.
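
A minimal sketch of that layout (illustration only, not code from this PR), assuming a GPU machine launched with torchrun: the default group uses the device-specific nccl backend, while an extra gloo group handles CPU-side communication.

# sketch.py (illustration only, not part of the PR)
import torch
import torch.distributed as dist

# Default group: device-specific backend, so a plain dist.all_reduce on GPU
# tensors goes through nccl.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank())

# Separate gloo group covering the same ranks, used only for CPU tensors
# (e.g. small control messages or metadata).
cpu_group = dist.new_group(ranks=list(range(dist.get_world_size())),
                           backend="gloo")

gpu_data = torch.ones(4).cuda()
dist.all_reduce(gpu_data)                   # nccl, default group

cpu_data = torch.ones(4)
dist.all_reduce(cpu_data, group=cpu_group)  # gloo, CPU group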

Review threads (outdated, resolved): vllm/model_executor/parallel_utils/communication_op.py, vllm/worker/worker.py
@youkaichao
Member Author

The discussion was inspiring; indeed, the user experience is better if a simple call to torch.distributed.all_reduce uses the device-specific communication backend.

Refactored according to the discussion. @zhuohan123 PTAL.

@youkaichao youkaichao requested a review from zhuohan123 April 8, 2024 23:20
@zhuohan123 (Member) left a comment

LGTM! Left some small comments.

Review threads (outdated, resolved): vllm/model_executor/parallel_utils/parallel_state.py, vllm/worker/worker.py
@youkaichao youkaichao enabled auto-merge (squash) April 9, 2024 08:06
@youkaichao youkaichao merged commit 6d592eb into vllm-project:main Apr 9, 2024
35 checks passed
@youkaichao youkaichao deleted the standalone_distributed_init branch April 9, 2024 15:03