
Add distributed model executor abstraction #3191

Merged · 19 commits · Mar 11, 2024
Conversation

@zhuohan123 (Member) commented Mar 5, 2024:

This PR pulls the distributed worker-management part of LLMEngine out into a new set of classes, the ModelExecutors. This separates the code for different hardware backends and enables single-box distributed execution without Ray. Specifically, this PR implements 4 types of model executors (see the sketch after the list):

  1. SingleGPUModelExecutor: the previous code path when not using Ray.
  2. SingleGPUModelExecutorAsync: SingleGPUModelExecutor plus several async function calls.
  3. RayDistributedModelExecutor: the previous distributed implementation with Ray.
  4. RayDistributedModelExecutorAsync: RayDistributedModelExecutor plus several async function calls.
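
For a rough illustration of how LLMEngine can pick an executor under this abstraction, see the hypothetical sketch below. Only the executor class names and the `vllm/executor` module paths come from this PR's file list; the selection logic, the `single_gpu_executor` module name, and the use of `worker_use_ray` are assumptions.

```python
# Hypothetical sketch (not the PR's actual code): how LLMEngine might
# select a model executor class based on the parallel configuration.
from vllm.config import ParallelConfig

def get_model_executor_cls(parallel_config: ParallelConfig, use_async: bool):
    if parallel_config.worker_use_ray:
        # Distributed execution through Ray workers.
        from vllm.executor.ray_distributed_executor import (
            RayDistributedModelExecutor, RayDistributedModelExecutorAsync)
        return (RayDistributedModelExecutorAsync if use_async
                else RayDistributedModelExecutor)
    # Single-GPU execution in the driver process.
    from vllm.executor.single_gpu_executor import SingleGPUModelExecutor
    from vllm.executor.single_gpu_executor_async import (
        SingleGPUModelExecutorAsync)
    return SingleGPUModelExecutorAsync if use_async else SingleGPUModelExecutor
```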

TODOs after this PR

@njhill (Member) commented Mar 5, 2024:

Thanks @zhuohan123, this is the kind of thing I had in mind in this comment #2898 (comment)! If you like I can rework #2898 to plug into your abstraction.

@zhuohan123 (Member, Author) commented:

> Thanks @zhuohan123, this is the kind of thing I had in mind in this comment #2898 (comment)! If you like I can rework #2898 to plug into your abstraction.

Yes, #2898 is the exact next PR I'm thinking about after this PR. This PR is still WIP and I might change things here and there. Let me ping you once this PR is finalized :)

@njhill (Member) commented Mar 6, 2024:

@zhuohan123 sounds great... yeah I meant once you were finished with this, no rush at all!

@zhuohan123 zhuohan123 changed the title [WIP] Add distributed model executor abstraction Add distributed model executor abstraction Mar 6, 2024
@zhuohan123 zhuohan123 requested review from Yard1 and WoosukKwon March 6, 2024 09:14
USE_RAY_COMPILED_DAG = bool(os.getenv("VLLM_USE_RAY_COMPILED_DAG", 0))
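# Note: os.getenv returns a string when the variable is set, so any
# non-empty value (even "0") makes this flag evaluate to True.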


class RayDistributedModelExecutor:
@Yard1 (Collaborator) commented Mar 6, 2024:

Suggest adding a common abstract class for the different model executors, so that they all implement the same public API.
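
For illustration, such a common abstract class might look roughly like the sketch below. This is hypothetical: the method set and signatures are invented for this example, not taken from the PR.

```python
# Hypothetical sketch of a common abstract base class for model executors.
from abc import ABC, abstractmethod
from typing import Any, Dict, List

class ModelExecutorBase(ABC):
    """Public API that every model executor (single-GPU, Ray, ...) implements."""

    @abstractmethod
    def execute_model(self, seq_group_metadata_list: List[Any],
                      blocks_to_swap_in: Dict[int, int],
                      blocks_to_swap_out: Dict[int, int],
                      blocks_to_copy: Dict[int, List[int]]) -> Any:
        """Run one scheduling step on the worker(s) and return the output."""
        ...

    @abstractmethod
    def check_health(self) -> None:
        """Raise an exception if the executor's workers are unhealthy."""
        ...
```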

@zhuohan123 (Member, Author) commented:

@Yard1 Just FYI, one change that I made in this PR is that I moved PlacementGroup into the ParallelConfig. Please let me know if this is not a good idea.
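
Roughly, that change reads like the sketch below. Only the fact that placement_group moves into ParallelConfig comes from the comment above; the other fields and defaults are illustrative.

```python
# Sketch: ParallelConfig carries the Ray placement group directly.
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    from ray.util.placement_group import PlacementGroup

class ParallelConfig:
    def __init__(self,
                 pipeline_parallel_size: int = 1,
                 tensor_parallel_size: int = 1,
                 worker_use_ray: bool = False,
                 placement_group: Optional["PlacementGroup"] = None) -> None:
        self.pipeline_parallel_size = pipeline_parallel_size
        self.tensor_parallel_size = tensor_parallel_size
        self.worker_use_ray = worker_use_ray
        # Previously passed around separately; now part of the config object.
        self.placement_group = placement_group
```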

@Yard1 (Collaborator) commented Mar 7, 2024:

I think it should be fine

@zhuohan123 zhuohan123 requested a review from cadedaniel March 8, 2024 01:14
@WoosukKwon (Collaborator) left a review:

@zhuohan123 Awesome! Thanks for the great work! This refactor substantially cleans up the current system architecture while providing better extensibility 😄.

Overall, I'm happy with the PR and only have small concerns:

  1. I don't feel RayDistributedExecutor and SingleGPUModelRunner are good names. I'd propose RayGPURunner (or RayGPUExecutor) and GPURunner instead. WDYT?
  2. As a result of the refactoring, there is some duplicated code between RayDistributedExecutor and SingleGPUModelRunner. Can we reduce the duplication?

Please check out my review for more details.

- Note: GPTQ and Marlin do not have bitwise correctness. 
- As a result, in this test, we just confirm that the top selected tokens of the 
+ Note: GPTQ and Marlin do not have bitwise correctness.
+ As a result, in this test, we just confirm that the top selected tokens of the
A collaborator commented:

Just wondering: is our formatter not able to catch this kind of trailing whitespace?

@zhuohan123 (Member, Author) replied:

Yeah seems like this is the case. cc @simon-mo

Outdated review threads (all resolved):
vllm/engine/llm_engine.py (3)
vllm/executor/single_gpu_executor_async.py (2)
vllm/executor/utils.py (1)
vllm/executor/ray_distributed_executor.py (3)
@njhill (Member) commented Mar 9, 2024:

@zhuohan123 this looks great thanks! Related to @WoosukKwon's comment above though, it feels like there's a fair amount of duplication of logic between the implementations which would need to be updated in multiple places any time it changes.

Especially when also thinking about how to rework the multiprocessing abstraction from #2898.

I think that could be addressed with another abstraction layer beneath your ModelExecutor one covering a subset of the implementations - specifically all of the current GPU-process based ones, but not neuron. WDYT? I'd be happy to show what I mean in another branch.

I agree with @Yard1 that it would be good to include an abstract base class.
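
For illustration, the intermediate layer described above could look something like the sketch below. It is hypothetical, under the assumption of a worker-RPC-style interface; none of these names are from the PR or from #2898.

```python
# Hypothetical sketch of a shared base for GPU-process-based executors.
# Subclasses only decide *how* to reach their workers (in-process, Ray,
# or multiprocessing as in #2898); the surrounding logic lives here once.
from abc import ABC, abstractmethod
from typing import Any

class GPUProcessExecutorBase(ABC):

    def initialize_cache(self, num_gpu_blocks: int, num_cpu_blocks: int) -> None:
        # Shared logic, written once instead of duplicated per executor.
        self._run_workers("init_cache_engine",
                          num_gpu_blocks=num_gpu_blocks,
                          num_cpu_blocks=num_cpu_blocks)

    @abstractmethod
    def _run_workers(self, method: str, *args: Any, **kwargs: Any) -> Any:
        """Invoke `method` on every worker; transport is subclass-specific."""
        ...

class InProcessExecutor(GPUProcessExecutorBase):
    def __init__(self, worker: Any) -> None:
        self.worker = worker

    def _run_workers(self, method: str, *args: Any, **kwargs: Any) -> Any:
        # Single-GPU case: call the method directly on the one local worker.
        return getattr(self.worker, method)(*args, **kwargs)
```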

@zhuohan123 (Member, Author) commented:

@njhill @WoosukKwon Regarding the duplicated code: I think I have tried my best to pull out the code shared between the two executors. How about we merge this PR first, and then see whether we can further reduce the duplicated logic?

@zhuohan123 (Member, Author) commented:

@WoosukKwon This PR is ready for review.

@zhuohan123 zhuohan123 requested a review from WoosukKwon March 10, 2024 07:18
@WoosukKwon (Collaborator) left a review:

LGTM! Thanks for addressing my review!

@zhuohan123 zhuohan123 merged commit 4c92270 into main Mar 11, 2024
24 checks passed
starmpcc pushed a commit to starmpcc/vllm that referenced this pull request Mar 14, 2024
@binarycrayon commented:

Hi, I don't have the context of this PR, but how can we enable the SingleGPUModelRunner path?

robertgshaw2-neuralmagic added a commit to neuralmagic/nm-vllm that referenced this pull request Mar 15, 2024
SUMMARY:
* upstream merge (sync) up to `54be8a0`

## NOTES

- The updated ruff configs added line-length limits, so a lot of files had to be cleaned up manually. I think `./format.sh` runs yapf and ruff only on the `nm-vllm/vllm` directory whereas our automation runs on everything in `nm-vllm`, so it was a bit tricky for me to catch why the automation was failing. cc @varun-sundar-rabindranath please review the benchmark directory in detail

### Primary upstream changes:

#### Kernels
- [`batched_rotary_embedding`](vllm-project@7e9bd08)
- `gelu_tanh_and_mul`

#### Core
- [`LLMEngine` refactor](vllm-project#3191) <<< adds a new layer of abstraction to vLLM. **All should look at this**

TEST PLAN:
- nightly automation

@zhuohan123 (Member, Author) replied:

> Hi, I don't have the context of this PR, but how can we enable the SingleGPUModelRunner path?

By default if you use 1 GPU you will go to that path.
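
For example (the model name below is just a placeholder):

```python
# With a single GPU and no tensor parallelism requested, the engine takes
# the single-GPU executor path automatically; no flag is needed.
from vllm import LLM

llm = LLM(model="facebook/opt-125m")  # tensor_parallel_size defaults to 1
outputs = llm.generate("Hello, my name is")
print(outputs[0].outputs[0].text)
```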

dtransposed pushed a commit to afeldman-nm/vllm that referenced this pull request Mar 26, 2024
@zhuohan123 zhuohan123 deleted the add-executor-abstraction branch April 26, 2024 00:27
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024