pipeline parallel support in the future? #387

Closed
irasin opened this issue Jul 7, 2023 · 14 comments

Comments

@irasin
Contributor

irasin commented Jul 7, 2023

I wonder, will you support pipeline parallelism in the future? If the answer is yes, maybe the whole system needs to be redesigned?

@KimmiShi

Marking to follow. Is pipeline parallelism more efficient than tensor parallelism for inference?

@irasin
Contributor Author

irasin commented Jul 10, 2023

@KimmiShi, it depends on the GPU environment you use. Tensor parallelism can reduce the total end-to-end latency since each GEMM in the model becomes smaller than in the full-model version, but it requires the communication cost to be small enough.
For devices that support NVLink, I do think tensor parallelism is more efficient than pipeline parallelism.
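A rough back-of-envelope model of that tradeoff (illustrative only; the 2 ms of per-layer compute and 32 MB all-reduce payload are made-up assumptions, not measurements):

    # Toy per-layer latency model under tensor parallelism: compute shrinks
    # with the TP degree, but every layer pays an all-reduce whose cost
    # depends on interconnect bandwidth.
    def layer_latency_ms(compute_ms, tp, allreduce_bytes, bandwidth_gb_s):
        comm_ms = 0.0 if tp == 1 else allreduce_bytes / (bandwidth_gb_s * 1e9) * 1e3
        return compute_ms / tp + comm_ms

    for link, bw in [("NVLink ~300 GB/s", 300), ("PCIe ~25 GB/s", 25)]:
        print(link, [round(layer_latency_ms(2.0, tp, 32e6, bw), 2) for tp in (1, 2, 4, 8)])
    # With these made-up numbers the all-reduce is cheap over NVLink, so TP keeps
    # helping; over PCIe the communication term can dominate. Real results depend
    # on the actual compute and message sizes (see the TP8-vs-TP4 discussion below).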

@KimmiShi

@KimmiShi, it depends on the GPU environment you use. Tensor parallelism can reduce the total end-to-end latency since each GEMM in the model becomes smaller than in the full-model version

Thanks, the e2e latency point of view is interesting.

@esaliya
Contributor

esaliya commented Jul 28, 2023

parallel_state.py shows pipeline groups being created, but is pipeline scheduling not supported yet?

@irasin
Contributor Author

irasin commented Oct 8, 2023

Is there any progress on pipeline parallelism now?

@irasin
Contributor Author

irasin commented Nov 21, 2023

Hi, @WoosukKwon
I have added blocking-style pipeline parallelism for LLaMA in my personal fork: https://github.com/irasin/vllm/tree/support_pp

Supporting a model with pipeline parallelism requires the following changes:

  1. The weight-loading and forward functions of each model need to support different pipeline stages (see the sketch after this comment).
  2. The worker needs to determine its input and output according to the pipeline stage.

But there is currently a big problem: in a forward step with, say, three pipeline stages, the workers of stage 1 and stage 2 block until the worker of stage 3 completes inference, which wastes a lot of time.

Can you take a look at the code and give some comments?
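For readers skimming the thread, point 1 above roughly amounts to each stage building (and loading weights for) only its own slice of decoder layers. A minimal sketch with hypothetical names, not the actual code from the fork:

    import torch.nn as nn

    def build_stage_layers(layer_factories, pp_rank, pp_size):
        # Keep only the contiguous slice of layers owned by this pipeline stage;
        # the first stage would also own the embedding and the last the LM head.
        n = len(layer_factories)
        per_stage = (n + pp_size - 1) // pp_size
        start = pp_rank * per_stage
        end = min(start + per_stage, n)
        return nn.ModuleList(factory() for factory in layer_factories[start:end])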

@learninmou
Contributor

learninmou commented Nov 24, 2023

Can vLLM support pipeline parallelism across multiple nodes?

@irasin
Contributor Author

irasin commented Nov 24, 2023

Can vLLM support pipeline parallelism across multiple nodes?

Hi, @learninmou,
I'm not familiar with Ray's multi-node support, but I think it should be easy to add multi-node, multi-device TP/PP support.
A common practice for very large models is to use inter-node PP and intra-node TP.

@lapp0

lapp0 commented Dec 12, 2023

Could someone please help me understand what is missing for pipeline parallelism? There is apparently dead code in parallel_state.py, which is blocked by an exception in config.py.

https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/parallel_utils/parallel_state.py

        pipeline_model_parallel_size: number of GPUs used for pipeline model
            parallelism.

    Let's say we have a total of 8 GPUs denoted by g0 ... g7 and we
    use 2 GPUs to parallelize the model tensor, and 4 GPUs to parallelize
    the model pipeline. The present function will
    create 4 tensor model-parallel groups and 2 pipeline model-parallel groups:
        4 tensor model-parallel groups:
            [g0, g1], [g2, g3], [g4, g5], [g6, g7]
        2 pipeline model-parallel groups:
            [g0, g2, g4, g6], [g1, g3, g5, g7]
    Note that for efficiency, the caller should make sure adjacent ranks
    are on the same DGX box. For example if we are using 2 DGX-1 boxes
    with a total of 16 GPUs, rank 0 to 7 belong to the first box and
    ranks 8 to 15 belong to the second box.

"Pipeline parallelism is not supported yet.")

        if self.pipeline_parallel_size > 1:
            raise NotImplementedError(
                "Pipeline parallelism is not supported yet.")

@lapp0

lapp0 commented Dec 13, 2023

huggingface/transformers#13690

Now, while adding TP is relatively easy, adding PP is very complex in the current state of HF models because they include many features that interfere with implementing PP - due to the requirements:

  • for the model to be nn.Sequential and
  • inputs/outputs to be simple tensors with the first dimension of batch size.

So to implement PP we will most likely have to fork each model, strip the features that are unnecessary for scalability, and only then be able to implement PP.

https://huggingface.co/docs/transformers/v4.15.0/parallelism

Pipeline Parallel (PP) is almost identical to a naive MP, but it solves the GPU idling problem, by chunking the incoming batch into micro-batches and artificially creating a pipeline, which allows different GPUs to concurrently participate in the computation process.
...

Problems with traditional Pipeline API solutions:

  • have to modify the model quite heavily, because Pipeline requires one to rewrite the normal flow of modules into a nn.Sequential sequence of the same, which may require changes to the design of the model.
  • currently the Pipeline API is very restricted. If you had a bunch of python variables being passed in the very first stage of the Pipeline, you will have to find a way around it. Currently, the pipeline interface requires either a single Tensor or a tuple of Tensors as the only input and output. These tensors must have a batch size as the very first dimension, since pipeline is going to chunk the mini batch into micro-batches. Possible improvements are being discussed here [wip] [Pipe] supporting None and non-Tensors in forward's input/output pytorch/pytorch#50693
  • conditional control flow at the level of pipe stages is not possible - e.g., Encoder-Decoder models like T5 require special workarounds to handle a conditional encoder stage.
  • have to arrange each layer so that the output of one model becomes an input to the other model.
    ...
    🤗 Transformers status: as of this writing none of the models supports full-PP. GPT2 and T5 models have naive PP support. The main obstacle is being unable to convert the models to nn.Sequential and have all the inputs to be Tensors. This is because currently the models include many features that make the conversion very complicated, and will need to be removed to accomplish that.
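To make the nn.Sequential and batch-first constraints quoted above concrete, here is a toy example (plain PyTorch, unrelated to any specific pipeline framework):

    import torch
    import torch.nn as nn

    # The pipeline APIs expect the model expressed as a plain sequence of layers...
    model = nn.Sequential(
        nn.Linear(16, 32),
        nn.ReLU(),
        nn.Linear(32, 8),
    )

    # ...and batch-first tensors, because the engine chunks the mini-batch
    # into micro-batches along dim 0 to keep all stages busy.
    batch = torch.randn(4, 16)
    micro_batches = batch.chunk(2, dim=0)
    outputs = torch.cat([model(mb) for mb in micro_batches], dim=0)
    print(outputs.shape)  # torch.Size([4, 8])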

Other approaches:

DeepSpeed, Varuna and SageMaker use the concept of an Interleaved Pipeline

Additionally, TensorRT-LLM has a pipeline parallel implementation (for their C++ backend).

@Lvjinhong

Lvjinhong commented Dec 27, 2023

Hi, @WoosukKwon I have added blocking-style pipeline parallelism for LLaMA in my personal fork: https://github.com/irasin/vllm/tree/support_pp

Supporting a model with pipeline parallelism requires the following changes:

  1. The weight-loading and forward functions of each model need to support different pipeline stages.
  2. The worker needs to determine its input and output according to the pipeline stage.

But there is currently a big problem: in a forward step with, say, three pipeline stages, the workers of stage 1 and stage 2 block until the worker of stage 3 completes inference, which wastes a lot of time.

Can you take a look at the code and give some comments?

I wanted to inquire about the current state of your personal fork. Is it functioning correctly at the moment? Have the issues you encountered been resolved? Additionally, I'm curious whether you've run any tests to assess the actual effectiveness of pipeline parallelism.

For your information, my setup consists of 8 A800 PCIe GPUs, and I am running the LLaMA 70B model.

Additionally, in my tests with tensor parallelism, I observed that throughput is higher with eight GPUs than with four. This outcome puzzles me, since the communication cost over PCIe is generally quite high, and I believe pipeline parallelism would be more efficient for my needs.

@irasin
Contributor Author

irasin commented Dec 27, 2023

@Lvjinhong, the PP works only for LLaMA because I haven't had time to adapt the other models.
My implementation uses blocking PP across the workers, so the performance is worse than TP.

For your setup, it's possible that TP8 performs better than TP4: with a larger TP size, the GEMMs on each device are smaller, and the time saved on compute outweighs the extra all-reduce time, so the final latency is lower.

I don't recommend using PP here, since my original goal was for cases where the number of devices is odd, e.g. 3 GPUs, which cannot run TP.

@lapp0

lapp0 commented Dec 28, 2023

Did a bit more digging for some more reference pipeline parallel implementations, and tried to interpret how each works.

The deepspeed option seems much cleaner and more generic to me.

Deepspeed (there are a few examples using deepspeed.pipe.PipelineModule)

Method

  • deepspeed's PipelineModule automatically manages data flows
  • individual transformer layers are specified using LayerSpecs

Docs: https://deepspeed.readthedocs.io/en/latest/pipeline.html

Modules to be parallelized with pipeline parallelism.

The key constraint that enables pipeline parallelism is the representation of the forward pass as a sequence of layers and the enforcement of a simple interface between them. The forward pass is implicitly defined by the module layers. The key assumption is that the output of each layer can be directly fed as input to the next, like a torch.nn.Sequential.

class deepspeed.pipe.LayerSpec(typename, *module_args, **module_kwargs)
Building block for specifying pipeline-parallel modules.

LayerSpec stores the type information and parameters for each stage in a PipelineModule. For example:

    nn.Sequential(
        torch.nn.Linear(self.in_dim, self.hidden_dim, bias=False),
        torch.nn.Linear(self.hidden_dim, self.out_dim)
    )

becomes

    layer_specs = [
        LayerSpec(torch.nn.Linear, self.in_dim, self.hidden_dim, bias=False),
        LayerSpec(torch.nn.Linear, self.hidden_dim, self.out_dim)
    ]
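For completeness, the specs are then handed to a PipelineModule split across stages, roughly like this (a sketch based on the DeepSpeed docs; it assumes the distributed backend has already been initialized, and the exact arguments may vary by version):

    from deepspeed.pipe import PipelineModule

    # Wrap the LayerSpec list above into a 2-stage pipeline; DeepSpeed then
    # handles micro-batching and inter-stage communication.
    model = PipelineModule(layers=layer_specs, num_stages=2)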

Alternatively there is Together.AI's OpenChatKit

Method:

  • Splits the batch into micro-batches sent to the GPU designated by pp_rank.
  • Designates CUDA streams torch_recv_stream / torch_send_stream, which allows queueing up new workloads asynchronously.
  • Organizes the pipeline via pre_node_rank and post_node_rank (see the sketch below).
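The stage-to-stage hand-off in that scheme boils down to point-to-point sends and receives of activations, roughly as in this simplified sketch (plain torch.distributed, with the CUDA-stream handling omitted):

    import torch.distributed as dist

    def run_stage(stage_layers, hidden, pp_rank, pp_size, pre_node_rank, post_node_rank):
        # Every stage except the first receives activations from its predecessor.
        if pp_rank > 0:
            dist.recv(hidden, src=pre_node_rank)
        hidden = stage_layers(hidden)
        # Every stage except the last forwards activations to its successor.
        if pp_rank < pp_size - 1:
            dist.send(hidden, dst=post_node_rank)
        return hidden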

@hmellor
Collaborator

hmellor commented Aug 28, 2024

Pipeline parallelism is supported now: https://docs.vllm.ai/en/latest/serving/distributed_serving.html
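For example, combining intra-node tensor parallelism with inter-node pipeline parallelism looks roughly like this (a sketch only; the model name is illustrative, and the exact parameters and entry points supported depend on your vLLM version, so check the linked docs):

    from vllm import LLM

    llm = LLM(
        model="meta-llama/Llama-2-70b-hf",  # illustrative model choice
        tensor_parallel_size=4,             # TP within each node
        pipeline_parallel_size=2,           # PP across nodes
    )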

@hmellor hmellor closed this as completed Aug 28, 2024