Add distributed model executor abstraction #3191
Conversation
Thanks @zhuohan123, this is the kind of thing I had in mind in this comment #2898 (comment)! If you like, I can rework #2898 to plug into your abstraction.
Yes, #2898 is the exact next PR I'm thinking about after this one. This PR is still WIP and I might change things here and there. Let me ping you once it is finalized :)
@zhuohan123 sounds great... yeah, I meant once you were finished with this, no rush at all!
USE_RAY_COMPILED_DAG = bool(os.getenv("VLLM_USE_RAY_COMPILED_DAG", 0))

class RayDistributedModelExecutor:
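A side note on the flag parsing above (an observation, not part of the original diff): `os.getenv` returns the raw string whenever the variable is set, so `bool(...)` is truthy for any non-empty value, including `"0"`:

```python
import os

# Setting the variable to "0" still yields True, because bool() is
# applied to the non-empty string "0", not to the integer 0.
os.environ["VLLM_USE_RAY_COMPILED_DAG"] = "0"
print(bool(os.getenv("VLLM_USE_RAY_COMPILED_DAG", 0)))  # True

# A stricter parse compares against explicit truthy values instead:
use_dag = os.getenv("VLLM_USE_RAY_COMPILED_DAG", "0").lower() in ("1", "true")
print(use_dag)  # False
```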
suggest adding a common abstract class for different model executors, so that they all implement the same public API.
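For illustration, such a shared base class might look roughly like this (a minimal sketch; the method names are assumptions based on the executor responsibilities discussed in this PR, not the actual vLLM API):

```python
from abc import ABC, abstractmethod
from typing import List


class ModelExecutorBase(ABC):
    """Hypothetical common interface for all model executors."""

    @abstractmethod
    def init_workers(self) -> None:
        """Create and initialize the worker(s) for this backend."""

    @abstractmethod
    def execute_model(self, seq_group_metadata_list: List) -> List:
        """Run one step of the model and return the sampled outputs."""

    @abstractmethod
    def check_health(self) -> None:
        """Raise an error if any worker has become unreachable."""
```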
@Yard1 Just FYI, one change that I made in this PR is that I moved
I think it should be fine
@zhuohan123 Awesome! Thanks for the great work! This refactor substantially cleans up the current system architecture while providing better extensibility 😄.
Overall, I'm happy with the PR and only have small concerns:
- I don't feel `RayDistributedExecutor` and `SingleGPUModelRunner` are good names. I'd propose `RayGPURunner` (or `RayGPUExecutor`) and `GPURunner` instead. WDYT?
- As a result of the refactoring, there is some duplicated code between `RayDistributedExecutor` and `SingleGPUModelRunner`. Can we reduce the duplication?
Please check out my review for more details.
tests/models/test_marlin.py (outdated)
Note: GPTQ and Marlin do not have bitwise correctness.
As a result, in this test, we just confirm that the top selected tokens of the
Just wondering: is our formatter not able to catch this kind of trailing whitespace?
Yeah seems like this is the case. cc @simon-mo
@zhuohan123 this looks great thanks! Related to @WoosukKwon's comment above though, it feels like there's a fair amount of duplication of logic between the implementations which would need to be updated in multiple places any time it changes. Especially when also thinking about how to rework the multiprocessing abstraction from #2898. I think that could be addressed with another abstraction layer beneath your ModelExecutor one covering a subset of the implementations - specifically all of the current GPU-process based ones, but not neuron. WDYT? I'd be happy to show what I mean in another branch. I agree with @Yard1 that it would be good to include an abstract base class.
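To make the proposed layering concrete, here is a rough sketch of the hierarchy @njhill seems to describe (class names and the shared method are hypothetical):

```python
from abc import ABC


class ModelExecutorBase(ABC):
    """Top of the hierarchy: one subtree per backend family."""


class GPUProcessExecutorBase(ModelExecutorBase):
    """Intermediate layer holding logic common to all GPU-process based
    executors (single-GPU, Ray, and a future multiprocessing backend
    from #2898), so changes land in one place."""

    def _init_cache(self) -> None:
        # Example of shared logic: GPU KV-cache profiling/allocation
        # that would otherwise be duplicated in every GPU executor.
        ...


class SingleGPUModelExecutor(GPUProcessExecutorBase):
    """Runs the model in the current process on a single GPU."""


class RayDistributedModelExecutor(GPUProcessExecutorBase):
    """Coordinates multiple GPU worker processes via Ray."""


class NeuronModelExecutor(ModelExecutorBase):
    """Neuron backend: not GPU-process based, so it bypasses the
    intermediate layer entirely."""
```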
@njhill @WoosukKwon Regarding the duplicated code: I think I have tried my best to pull out the shared code between the two executors. How about we merge this PR first, and then see whether we can further reduce the duplicated logic?
@WoosukKwon This PR is ready for review.
LGTM! Thanks for addressing my review!
Hi, I don't have the context of this PR, but how can we enable the `SingleGPUModelRunner` path?
SUMMARY:
* upstream merge (sync) up to `54be8a0`

## NOTES
- Updated ruff configs had line limits. Had to clean up a lot of files manually. I think `./format.sh` runs yapf and ruff only on the `nm-vllm/vllm` directory whereas our automation runs on everything in `nm-vllm`, so it was a bit tricky for me to catch why the automation was failing. cc @varun-sundar-rabindranath please review the benchmark directory in detail.

### Primary upstream changes
#### Kernels
- [`batched_rotary_embedding`](vllm-project@7e9bd08)
- [`gelu_tanh_and_mul`]()
#### Core
- [`LLMEngine` refactor](vllm-project#3191) <<< adds a new layer of abstraction to vLLM. **All should look at this.**

TEST PLAN:
- nightly automation
By default, if you use 1 GPU you will go to that path.
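For reference, a minimal usage sketch that should exercise the single-GPU path (assuming the public `LLM` entrypoint of this vLLM version; the executor choice happens inside the engine):

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size=1 means no Ray is needed, so the engine takes
# the single-GPU executor path described in this PR.
llm = LLM(model="facebook/opt-125m", tensor_parallel_size=1)
outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```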
This PR pulls out the distributed worker manager part of `LLMEngine` into a new set of classes, namely `ModelExecutor`s. This can benefit us by separating the code for different hardware backends, as well as enabling support of single-box distributed execution without Ray. Specifically, this PR implements 4 types of model executors:

- `SingleGPUModelExecutor`: previous code path when not using Ray.
- `SingleGPUModelExecutorAsync`: `SingleGPUModelExecutor` + several async function calls.
- `RayDistributedModelExecutor`: previous distributed implementation with Ray.
- `RayDistributedModelExecutorAsync`: `RayDistributedModelExecutor` + several async function calls.

TODOs after this PR:

- Rename `model_executor` -> `model`.
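A hypothetical sketch of how the engine might pick between these four executors (the dispatch function and flag names are illustrative; the real `LLMEngine` wiring differs):

```python
# Stubs standing in for the four executor types listed above.
class SingleGPUModelExecutor: ...
class SingleGPUModelExecutorAsync(SingleGPUModelExecutor): ...
class RayDistributedModelExecutor: ...
class RayDistributedModelExecutorAsync(RayDistributedModelExecutor): ...


def select_executor_cls(worker_use_ray: bool, use_async: bool):
    """Illustrative dispatch over the four executor types from this PR."""
    if worker_use_ray:
        return (RayDistributedModelExecutorAsync if use_async
                else RayDistributedModelExecutor)
    return (SingleGPUModelExecutorAsync if use_async
            else SingleGPUModelExecutor)
```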