pipeline parallel support in the future? #387
Marking this thread. Is pipeline parallelism more efficient than tensor parallelism for inference?
@KimmiShi, it depends on the GPU environment you use. Tensor parallelism can reduce the total end-to-end latency because each GEMM becomes smaller than in the full-model version, but only if the communication cost stays small enough.
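A back-of-envelope sketch of this trade-off may help (my own illustration, not vLLM code): tensor parallelism shrinks each GEMM by the TP degree but adds an all-reduce per layer, so the win depends on how the two costs compare. The timing numbers below are hypothetical placeholders.

```python
def layer_latency_ms(gemm_ms: float, allreduce_ms: float, tp: int) -> float:
    """Rough per-layer latency model: compute shrinks ~linearly with the TP
    degree, while the all-reduce cost does not."""
    return gemm_ms / tp + allreduce_ms

# Example: a 4 ms GEMM with a fast NVLink all-reduce (0.3 ms) vs. a slow
# PCIe all-reduce (2.5 ms), both at TP=4.
print(layer_latency_ms(4.0, 0.3, 4))   # ~1.3 ms -> TP clearly helps
print(layer_latency_ms(4.0, 2.5, 4))   # ~3.5 ms -> most of the gain is eaten by comms
```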
Thanks, the e2e latency point of view is interesting.
Is there any progress on pipeline parallelism now?
Hi @WoosukKwon, supporting a model with pipeline parallelism requires the following changes.

But there is currently a big problem: in a forward step, if we have three pipeline stages, the workers of stage 1 and stage 2 have to block until the workers of stage 3 complete the inference, which wastes a lot of time. Can you take a look at the code and give some comments?
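A minimal sketch of the idle time being described (my own illustration, not the code under review): with a single batch in flight through a synchronous three-stage pipeline, each stage is busy only part of the time, and splitting the batch into micro-batches is the usual way to shrink that bubble. The function and numbers below are illustrative assumptions.

```python
NUM_STAGES = 3

def utilization(num_microbatches: int) -> float:
    """Fraction of time each stage is busy in a simple synchronous pipeline:
    each stage does `num_microbatches` steps of work, but the whole schedule
    spans (num_microbatches + NUM_STAGES - 1) steps end to end."""
    return num_microbatches / (num_microbatches + NUM_STAGES - 1)

print(utilization(1))   # ~0.33 -> stages 1 and 2 mostly wait on stage 3
print(utilization(8))   # ~0.80 -> micro-batching fills most of the bubble
```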
Can vLLM support pipeline parallelism with multiple nodes?
Hi, @learninmou, |
Could someone please help me understand what is missing for pipeline parallelism? It apparently has dead code in:

Line 340 in cb3f30c

Some related references:
huggingface/transformers#13690
https://huggingface.co/docs/transformers/v4.15.0/parallelism
Additionally, TensorRT-LLM has a pipeline parallel implementation (for their C++ backend).
I wanted to inquire about the current state of your personal fork. Is it functioning correctly at the moment? Have the issues you encountered been resolved? I'm also curious whether you've run any tests to assess the actual effectiveness of pipeline parallelism. For your information, my setup consists of 8 A800 PCIe GPUs running the Llama 70B model. In my tensor-parallelism tests, I observed higher throughput with eight GPUs than with four. This outcome puzzles me, since communication cost over PCIe is generally quite high.
@Lvjinhong, the PP works only for Llama, because I have not had time to adapt the other models. For your setup, it's plausible that TP=8 performs better than TP=4: with a larger TP size, the GEMM on each device becomes smaller, and if the time saved by shrinking the GEMMs exceeds the time added by the all-reduce, the end-to-end latency goes down. I do not recommend using PP here, since my original goal was the case where the device count is odd (e.g. 3 GPUs), which cannot run TP.
Did a bit more digging for some more reference pipeline parallel implementations, and tried to interpret how each works. The DeepSpeed option seems much cleaner and more generic to me. DeepSpeed (there are a few examples using …)
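For readers following the DeepSpeed pointer above, here is a minimal sketch of how DeepSpeed expresses a model as pipeline stages via its `PipelineModule` API. The layer sizes, stage count, and config values are placeholder assumptions, the engine is training-oriented (the relevant part for this thread is the stage partitioning and micro-batch scheduling), and the script must be started with the `deepspeed` launcher on 2+ GPUs rather than run directly.

```python
# Illustrative DeepSpeed pipeline-parallel sketch (placeholder sizes/config).
# Launch with e.g.: deepspeed --num_gpus 2 pipe_demo.py
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule, LayerSpec

# Describe the model as an ordered list of layers; DeepSpeed partitions this
# list across `num_stages` pipeline stages.
layers = [
    LayerSpec(nn.Linear, 1024, 4096),
    LayerSpec(nn.ReLU),
    LayerSpec(nn.Linear, 4096, 1024),
    LayerSpec(nn.Linear, 1024, 10),
]
net = PipelineModule(layers=layers, num_stages=2, loss_fn=nn.CrossEntropyLoss())

engine, _, _, _ = deepspeed.initialize(
    model=net,
    model_parameters=[p for p in net.parameters() if p.requires_grad],
    config={"train_batch_size": 8, "train_micro_batch_size_per_gpu": 2},
)
# engine.train_batch(data_iter=...) then runs the micro-batched pipeline schedule.
```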
Pipeline parallelism is supported now: https://docs.vllm.ai/en/latest/serving/distributed_serving.html
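To make that pointer concrete, here is a minimal sketch of enabling pipeline parallelism through vLLM's offline `LLM` API. The model name and parallel degrees are placeholders, and the exact options supported by your vLLM version (and whether a specific distributed backend is required) should be checked against the linked distributed-serving docs.

```python
# Sketch: tensor parallelism within a node combined with pipeline parallelism
# across GPUs/nodes. Values are placeholders; verify against the linked docs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # placeholder model
    tensor_parallel_size=4,      # shard each layer across 4 GPUs
    pipeline_parallel_size=2,    # split the layer stack into 2 stages
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```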
I wonder: will you support pipeline parallelism in the future? If the answer is yes, maybe the whole system needs to be redesigned?