Add distributed model executor abstraction #3191
Conversation
Thanks @zhuohan123, this is the kind of thing I had in mind in this comment #2898 (comment)! If you like, I can rework #2898 to plug into your abstraction.
Yes, #2898 is the exact next PR I'm thinking about after this one. This PR is still WIP and I might change things here and there. Let me ping you once it is finalized :)
@zhuohan123 sounds great... yeah, I meant once you were finished with this, no rush at all!
USE_RAY_COMPILED_DAG = bool(os.getenv("VLLM_USE_RAY_COMPILED_DAG", 0))

class RayDistributedModelExecutor:
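A side note on the flag parsing above (an observation, not part of the original diff): `os.getenv` returns the raw string whenever the variable is set, so `bool(...)` is truthy for any non-empty value, including `"0"`:

```python
import os

# Setting the variable to "0" still yields True, because bool() is
# applied to the non-empty string "0", not to the integer 0.
os.environ["VLLM_USE_RAY_COMPILED_DAG"] = "0"
print(bool(os.getenv("VLLM_USE_RAY_COMPILED_DAG", 0)))  # True

# A stricter parse compares against explicit truthy values instead:
use_dag = os.getenv("VLLM_USE_RAY_COMPILED_DAG", "0").lower() in ("1", "true")
print(use_dag)  # False
```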
suggest adding a common abstract class for different model executors, so that they all implement the same public API.
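For illustration, such a shared base class might look roughly like this (a minimal sketch; the method names are assumptions based on the executor responsibilities discussed in this PR, not the actual vLLM API):

```python
from abc import ABC, abstractmethod
from typing import List


class ModelExecutorBase(ABC):
    """Hypothetical common interface for all model executors."""

    @abstractmethod
    def init_workers(self) -> None:
        """Create and initialize the worker(s) for this backend."""

    @abstractmethod
    def execute_model(self, seq_group_metadata_list: List) -> List:
        """Run one step of the model and return the sampled outputs."""

    @abstractmethod
    def check_health(self) -> None:
        """Raise an error if any worker has become unreachable."""
```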
@Yard1 Just FYI, one change that I made in this PR is that I moved
I think it should be fine
@zhuohan123 Awesome! Thanks for the great work! This refactor substantially cleans up the current system architecture while providing better extensibility 😄.
Overall, I'm happy with the PR and only have small concerns:
- I don't feel `RayDistributedExecutor` and `SingleGPUModelRunner` are good names. I'd propose `RayGPURunner` (or `RayGPUExecutor`) and `GPURunner` instead. WDYT?
- As a result of the refactoring, there is some duplicated code between `RayDistributedExecutor` and `SingleGPUModelRunner`. Can we reduce the duplication?
Please check out my review for more details.
tests/models/test_marlin.py (outdated)
Note: GPTQ and Marlin do not have bitwise correctness.
As a result, in this test, we just confirm that the top selected tokens of the
Just wondering: is our formatter not able to catch this kind of trailing whitespace?
Yeah seems like this is the case. cc @simon-mo
@zhuohan123 this looks great thanks! Related to @WoosukKwon's comment above though, it feels like there's a fair amount of duplication of logic between the implementations which would need to be updated in multiple places any time it changes. Especially when also thinking about how to rework the multiprocessing abstraction from #2898. I think that could be addressed with another abstraction layer beneath your ModelExecutor one covering a subset of the implementations - specifically all of the current GPU-process based ones, but not neuron. WDYT? I'd be happy to show what I mean in another branch. I agree with @Yard1 that it would be good to include an abstract base class.
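To make the proposed layering concrete, here is a rough sketch of the hierarchy @njhill seems to describe (class names and the shared method are hypothetical):

```python
from abc import ABC


class ModelExecutorBase(ABC):
    """Top of the hierarchy: one subtree per backend family."""


class GPUProcessExecutorBase(ModelExecutorBase):
    """Intermediate layer holding logic common to all GPU-process based
    executors (single-GPU, Ray, and a future multiprocessing backend
    from #2898), so changes land in one place."""

    def _init_cache(self) -> None:
        # Example of shared logic: GPU KV-cache profiling/allocation
        # that would otherwise be duplicated in every GPU executor.
        ...


class SingleGPUModelExecutor(GPUProcessExecutorBase):
    """Runs the model in the current process on a single GPU."""


class RayDistributedModelExecutor(GPUProcessExecutorBase):
    """Coordinates multiple GPU worker processes via Ray."""


class NeuronModelExecutor(ModelExecutorBase):
    """Neuron backend: not GPU-process based, so it bypasses the
    intermediate layer entirely."""
```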
@njhill @WoosukKwon Regarding the duplicated code: I think I have tried my best to pull out the shared code between the two executors. How about we merge this PR first, and then see whether we can further reduce the duplicated logic?
@WoosukKwon This PR is ready for review.
LGTM! Thanks for addressing my review!
Hi, I don't have the context of this PR, but how can we enable the `SingleGPUModelRunner` path?
SUMMARY:
* upstream merge (sync) up to `54be8a0`

## NOTES
- Updated ruff configs had line limits. Had to clean up a lot of files manually. I think `./format.sh` runs yapf and ruff only on the `nm-vllm/vllm` directory whereas our automation runs on everything in `nm-vllm`, so it was a bit tricky for me to catch why the automation was failing. cc @varun-sundar-rabindranath please review the benchmark directory in detail.

### Primary upstream changes
#### Kernels
- [`batched_rotary_embedding`](vllm-project@7e9bd08)
- [`gelu_tanh_and_mul`]()
#### Core
- [`LLMEngine` refactor](vllm-project#3191) <<< adds a new layer of abstraction to vLLM. **All should look at this.**

TEST PLAN:
- nightly automation
By default, if you use 1 GPU you will go to that path.
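For reference, a minimal usage sketch that should exercise the single-GPU path (assuming the public `LLM` entrypoint of this vLLM version; the executor choice happens inside the engine):

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size=1 means no Ray is needed, so the engine takes
# the single-GPU executor path described in this PR.
llm = LLM(model="facebook/opt-125m", tensor_parallel_size=1)
outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```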
This PR pulls out the distributed worker manager part of `LLMEngine` into a new set of classes, namely `ModelExecutor`s. This can benefit us by separating the code for different hardware backends, as well as enabling support of single-box distributed execution without Ray. Specifically, this PR implements 4 types of model executors:

- `SingleGPUModelExecutor`: previous code path when not using Ray.
- `SingleGPUModelExecutorAsync`: `SingleGPUModelExecutor` + several async function calls.
- `RayDistributedModelExecutor`: previous distributed implementation with Ray.
- `RayDistributedModelExecutorAsync`: `RayDistributedModelExecutor` + several async function calls.

TODOs after this PR:

- Rename `model_executor` -> `model`.
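A hypothetical sketch of how the engine might pick between these four executors (the dispatch function and flag names are illustrative; the real `LLMEngine` wiring differs):

```python
# Stubs standing in for the four executor types listed above.
class SingleGPUModelExecutor: ...
class SingleGPUModelExecutorAsync(SingleGPUModelExecutor): ...
class RayDistributedModelExecutor: ...
class RayDistributedModelExecutorAsync(RayDistributedModelExecutor): ...


def select_executor_cls(worker_use_ray: bool, use_async: bool):
    """Illustrative dispatch over the four executor types from this PR."""
    if worker_use_ray:
        return (RayDistributedModelExecutorAsync if use_async
                else RayDistributedModelExecutor)
    return (SingleGPUModelExecutorAsync if use_async
            else SingleGPUModelExecutor)
```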