pipeline parallel support in the future? #387

Closed
irasin opened this issue Jul 7, 2023 · 14 comments

Comments

@irasin
Contributor

irasin commented Jul 7, 2023

I wonder, will you support pipeline parallelism in the future? If the answer is yes, maybe the whole system needs to be redesigned?

@KimmiShi

Marking to follow. Is pipeline parallelism more efficient than tensor parallelism for inference?

@irasin
Contributor Author

irasin commented Jul 10, 2023

@KimmiShi, it depends on the GPU environment you use. Tensor parallelism can reduce the total end-to-end latency since each GEMM in the model becomes smaller than in the full-model version, but it requires the communication cost to be small enough.
For devices that support NVLink, I do think tensor parallelism is more efficient than pipeline parallelism.
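A rough back-of-envelope model of that tradeoff (illustrative only; the 2 ms of per-layer compute and 32 MB all-reduce payload are made-up assumptions, not measurements):

    # Toy per-layer latency model under tensor parallelism: compute shrinks
    # with the TP degree, but every layer pays an all-reduce whose cost
    # depends on interconnect bandwidth.
    def layer_latency_ms(compute_ms, tp, allreduce_bytes, bandwidth_gb_s):
        comm_ms = 0.0 if tp == 1 else allreduce_bytes / (bandwidth_gb_s * 1e9) * 1e3
        return compute_ms / tp + comm_ms

    for link, bw in [("NVLink ~300 GB/s", 300), ("PCIe ~25 GB/s", 25)]:
        print(link, [round(layer_latency_ms(2.0, tp, 32e6, bw), 2) for tp in (1, 2, 4, 8)])
    # With these made-up numbers the all-reduce is cheap over NVLink, so TP keeps
    # helping; over PCIe the communication term can dominate. Real results depend
    # on the actual compute and message sizes (see the TP8-vs-TP4 discussion below).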

@KimmiShi

@KimmiShi, it depends on the GPU environment you use. Tensor parallelism can reduce the total end-to-end latency since each GEMM in the model becomes smaller than in the full-model version

Thanks, the e2e latency point of view is interesting.

@esaliya
Contributor

esaliya commented Jul 28, 2023

parallel_state.py shows pipeline groups being created, but is pipeline scheduling not supported yet?

@irasin
Contributor Author

irasin commented Oct 8, 2023

Is there any progress on pipeline parallelism now?

@irasin
Contributor Author

irasin commented Nov 21, 2023

Hi, @WoosukKwon
I have added blocking-style pipeline parallelism for LLaMA in my personal fork: https://github.com/irasin/vllm/tree/support_pp

Supporting a model with pipeline parallelism requires the following changes:

  1. The weight-loading and forward functions of each model need to support different pipeline stages (see the sketch after this comment).
  2. The worker needs to determine its input and output according to the pipeline stage.

But there is currently a big problem: in a forward step with, say, three pipeline stages, the workers of stage 1 and stage 2 block until the worker of stage 3 completes inference, which wastes a lot of time.

Can you take a look at the code and give some comments?
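For readers skimming the thread, point 1 above roughly amounts to each stage building (and loading weights for) only its own slice of decoder layers. A minimal sketch with hypothetical names, not the actual code from the fork:

    import torch.nn as nn

    def build_stage_layers(layer_factories, pp_rank, pp_size):
        # Keep only the contiguous slice of layers owned by this pipeline stage;
        # the first stage would also own the embedding and the last the LM head.
        n = len(layer_factories)
        per_stage = (n + pp_size - 1) // pp_size
        start = pp_rank * per_stage
        end = min(start + per_stage, n)
        return nn.ModuleList(factory() for factory in layer_factories[start:end])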

@learninmou
Contributor

learninmou commented Nov 24, 2023

Can vLLM support pipeline parallelism across multiple nodes?

@irasin
Contributor Author

irasin commented Nov 24, 2023

Can vLLM support pipeline parallelism across multiple nodes?

Hi, @learninmou,
I'm not familiar with Ray's multi-node support, but I think it should be easy to add multi-node, multi-device TP/PP support.
A common practice for very large models is to use inter-node PP and intra-node TP.

@lapp0

lapp0 commented Dec 12, 2023

Could someone please help me understand what is missing for pipeline parallelism? There is apparently dead code in parallel_state.py, which is blocked by an exception in config.py.

https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/parallel_utils/parallel_state.py

        pipeline_model_parallel_size: number of GPUs used for pipeline model
            parallelism.

    Let's say we have a total of 8 GPUs denoted by g0 ... g7 and we
    use 2 GPUs to parallelize the model tensor, and 4 GPUs to parallelize
    the model pipeline. The present function will
    create 4 tensor model-parallel groups and 2 pipeline model-parallel groups:
        4 tensor model-parallel groups:
            [g0, g1], [g2, g3], [g4, g5], [g6, g7]
        2 pipeline model-parallel groups:
            [g0, g2, g4, g6], [g1, g3, g5, g7]
    Note that for efficiency, the caller should make sure adjacent ranks
    are on the same DGX box. For example if we are using 2 DGX-1 boxes
    with a total of 16 GPUs, rank 0 to 7 belong to the first box and
    ranks 8 to 15 belong to the second box.

"Pipeline parallelism is not supported yet.")

        if self.pipeline_parallel_size > 1:
            raise NotImplementedError(
                "Pipeline parallelism is not supported yet.")

@lapp0

lapp0 commented Dec 13, 2023

huggingface/transformers#13690

Now, while adding TP is relatively easy, adding PP is very complex in the current state of HF models because they include many features that interfere with implementing PP - due to the requirements:

  • for the model to be nn.Sequential and
  • inputs/outputs to be simple tensors with the first dimension of batch size.

So to implement PP we will most likely have to fork each model, strip the features that are unnecessary for scalability, and only then be able to implement PP.

https://huggingface.co/docs/transformers/v4.15.0/parallelism

Pipeline Parallel (PP) is almost identical to a naive MP, but it solves the GPU idling problem, by chunking the incoming batch into micro-batches and artificially creating a pipeline, which allows different GPUs to concurrently participate in the computation process.
...

Problems with traditional Pipeline API solutions:

  • have to modify the model quite heavily, because Pipeline requires one to rewrite the normal flow of modules into a nn.Sequential sequence of the same, which may require changes to the design of the model.
  • currently the Pipeline API is very restricted. If you had a bunch of python variables being passed in the very first stage of the Pipeline, you will have to find a way around it. Currently, the pipeline interface requires either a single Tensor or a tuple of Tensors as the only input and output. These tensors must have a batch size as the very first dimension, since pipeline is going to chunk the mini batch into micro-batches. Possible improvements are being discussed here [wip] [Pipe] supporting None and non-Tensors in forward's input/output pytorch/pytorch#50693
  • conditional control flow at the level of pipe stages is not possible - e.g., Encoder-Decoder models like T5 require special workarounds to handle a conditional encoder stage.
  • have to arrange each layer so that the output of one model becomes an input to the other model.
    ...
    🤗 Transformers status: as of this writing none of the models supports full-PP. GPT2 and T5 models have naive PP support. The main obstacle is being unable to convert the models to nn.Sequential and have all the inputs to be Tensors. This is because currently the models include many features that make the conversion very complicated, and will need to be removed to accomplish that.
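To make the nn.Sequential and batch-first constraints quoted above concrete, here is a toy example (plain PyTorch, unrelated to any specific pipeline framework):

    import torch
    import torch.nn as nn

    # The pipeline APIs expect the model expressed as a plain sequence of layers...
    model = nn.Sequential(
        nn.Linear(16, 32),
        nn.ReLU(),
        nn.Linear(32, 8),
    )

    # ...and batch-first tensors, because the engine chunks the mini-batch
    # into micro-batches along dim 0 to keep all stages busy.
    batch = torch.randn(4, 16)
    micro_batches = batch.chunk(2, dim=0)
    outputs = torch.cat([model(mb) for mb in micro_batches], dim=0)
    print(outputs.shape)  # torch.Size([4, 8])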

Other approaches:

DeepSpeed, Varuna and SageMaker use the concept of an Interleaved Pipeline

Additionally, TensorRT-LLM has a pipeline parallel implementation (for their C++ backend).

@Lvjinhong

Lvjinhong commented Dec 27, 2023

Hi, @WoosukKwon I have added blocking-style pipeline parallelism for LLaMA in my personal fork: https://github.com/irasin/vllm/tree/support_pp

Supporting a model with pipeline parallelism requires the following changes:

  1. The weight-loading and forward functions of each model need to support different pipeline stages.
  2. The worker needs to determine its input and output according to the pipeline stage.

But there is currently a big problem: in a forward step with, say, three pipeline stages, the workers of stage 1 and stage 2 block until the worker of stage 3 completes inference, which wastes a lot of time.

Can you take a look at the code and give some comments?

I wanted to inquire about the current state of your personal fork. Is it functioning correctly at the moment? Have the issues you encountered been resolved? Additionally, I'm curious whether you've run any tests to assess the actual effectiveness of pipeline parallelism.

For your information, my setup consists of 8 A800 PCIe GPUs, and I am running the LLaMA 70B model.

Additionally, in my tests with tensor parallelism, I observed that throughput is higher with eight GPUs than with four. This outcome puzzles me, since the communication cost over PCIe is generally quite high, and I believe pipeline parallelism would be more efficient for my needs.

@irasin
Contributor Author

irasin commented Dec 27, 2023

@Lvjinhong, the PP works only for LLaMA because I haven't had time to adapt the other models.
My implementation uses blocking PP across the workers, so the performance is worse than TP.

For your setup, it's possible that TP8 performs better than TP4: with a larger TP size, the GEMMs on each device are smaller, and the time saved on compute outweighs the extra all-reduce time, so the final latency is lower.

I don't recommend using PP here, since my original goal was for cases where the number of devices is odd, e.g. 3 GPUs, which cannot run TP.

@lapp0

lapp0 commented Dec 28, 2023

Did a bit more digging for some more reference pipeline parallel implementations, and tried to interpret how each works.

The deepspeed option seems much cleaner and more generic to me.

Deepspeed (there are a few examples using deepspeed.pipe.PipelineModule)

Method

  • deepspeed's PipelineModule automatically manages data flows
  • individual transformer layers are specified using LayerSpecs

Docs: https://deepspeed.readthedocs.io/en/latest/pipeline.html

Modules to be parallelized with pipeline parallelism.

The key constraint that enables pipeline parallelism is the representation of the forward pass as a sequence of layers and the enforcement of a simple interface between them. The forward pass is implicitly defined by the module layers. The key assumption is that the output of each layer can be directly fed as input to the next, like a torch.nn.Sequential.

class deepspeed.pipe.LayerSpec(typename, *module_args, **module_kwargs)
Building block for specifying pipeline-parallel modules.

LayerSpec stores the type information and parameters for each stage in a PipelineModule. For example:

    nn.Sequential(
        torch.nn.Linear(self.in_dim, self.hidden_dim, bias=False),
        torch.nn.Linear(self.hidden_dim, self.out_dim)
    )

becomes

    layer_specs = [
        LayerSpec(torch.nn.Linear, self.in_dim, self.hidden_dim, bias=False),
        LayerSpec(torch.nn.Linear, self.hidden_dim, self.out_dim)
    ]
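For completeness, the specs are then handed to a PipelineModule split across stages, roughly like this (a sketch based on the DeepSpeed docs; it assumes the distributed backend has already been initialized, and the exact arguments may vary by version):

    from deepspeed.pipe import PipelineModule

    # Wrap the LayerSpec list above into a 2-stage pipeline; DeepSpeed then
    # handles micro-batching and inter-stage communication.
    model = PipelineModule(layers=layer_specs, num_stages=2)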

Alternatively there is Together.AI's OpenChatKit

Method:

  • Splits the batch into micro-batches sent to the GPU designated by pp_rank.
  • Designates CUDA streams torch_recv_stream / torch_send_stream, which allows queueing up new workloads asynchronously.
  • Organizes the pipeline via pre_node_rank and post_node_rank (see the sketch below).
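The stage-to-stage hand-off in that scheme boils down to point-to-point sends and receives of activations, roughly as in this simplified sketch (plain torch.distributed, with the CUDA-stream handling omitted):

    import torch.distributed as dist

    def run_stage(stage_layers, hidden, pp_rank, pp_size, pre_node_rank, post_node_rank):
        # Every stage except the first receives activations from its predecessor.
        if pp_rank > 0:
            dist.recv(hidden, src=pre_node_rank)
        hidden = stage_layers(hidden)
        # Every stage except the last forwards activations to its successor.
        if pp_rank < pp_size - 1:
            dist.send(hidden, dst=post_node_rank)
        return hidden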

@hmellor
Collaborator

hmellor commented Aug 28, 2024

Pipeline parallelism is supported now: https://docs.vllm.ai/en/latest/serving/distributed_serving.html
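For example, combining intra-node tensor parallelism with inter-node pipeline parallelism looks roughly like this (a sketch only; the model name is illustrative, and the exact parameters and entry points supported depend on your vLLM version, so check the linked docs):

    from vllm import LLM

    llm = LLM(
        model="meta-llama/Llama-2-70b-hf",  # illustrative model choice
        tensor_parallel_size=4,             # TP within each node
        pipeline_parallel_size=2,           # PP across nodes
    )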

@hmellor hmellor closed this as completed Aug 28, 2024