Migrate `logits` computation and gather to `model_runner` #3233
Conversation
Hi @esmeetu, important update. Regarding this comment, #3183 (comment): I found the bug. It will be useful for this PR. Please add this change while making the PR for scaling logits: e04e56d. This is needed because, in the multi-GPU setting, logits are `None` for all workers except the driver.
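For context on the multi-GPU behavior mentioned above, here is a minimal sketch (not vLLM's actual helper; the function name `gather_logits` is illustrative) of how a tensor-parallel gather leaves logits as `None` on every rank except the driver:

```python
from typing import Optional

import torch
import torch.distributed as dist


def gather_logits(logits: torch.Tensor, dst: int = 0) -> Optional[torch.Tensor]:
    """Gather vocab-parallel logit shards onto the driver rank.

    Every rank participates in the gather, but only the destination
    rank receives the full tensor; all other ranks get None, so any
    downstream code (e.g. sampling) must handle that case.
    """
    world_size = dist.get_world_size()
    if world_size == 1:
        return logits
    # Only the destination rank allocates receive buffers.
    gather_list = (
        [torch.empty_like(logits) for _ in range(world_size)]
        if dist.get_rank() == dst else None
    )
    dist.gather(logits, gather_list=gather_list, dst=dst)
    if dist.get_rank() != dst:
        return None  # non-driver ranks: logits are None from here on
    # Shards are split along the vocab dimension; stitch them back together.
    return torch.cat(gather_list, dim=-1)
```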
While I agree with this change in principle, I think it's important to ensure we have an API that can support different use cases. For example, I would suggest making the logit generation a layer (or some other sort of abstraction at the model level). The fact that it has now been moved out of the model into the model runner makes the code harder to understand, especially considering the sampler remains a layer. In other words, I would just suggest adding a new logit generator layer to the models (or the sampler, though the models would be better: the output of a model should be logits, IMO) and not putting that logic inside the model runner.
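For illustration, a minimal sketch of the kind of layer being suggested (the class shape and the `scale` hook are assumptions for this sketch, not the exact interface the PR ends up with):

```python
import torch
import torch.nn as nn


class LogitsProcessor(nn.Module):
    """Compute logits from hidden states as an explicit model-level layer.

    Keeping this as a layer (rather than logic buried in model_runner)
    means the model's output is still "logits", and per-model quirks
    such as logit scaling (see #3183) stay next to the model definition.
    """

    def __init__(self, vocab_size: int, scale: float = 1.0) -> None:
        super().__init__()
        self.vocab_size = vocab_size
        self.scale = scale  # e.g. Jais-style logit scaling

    def forward(self, lm_head_weight: torch.Tensor,
                hidden_states: torch.Tensor) -> torch.Tensor:
        # Project hidden states onto the (possibly padded) vocabulary.
        logits = hidden_states @ lm_head_weight.t()
        logits = logits * self.scale
        # Drop any padding added to the vocab for parallelism alignment.
        return logits[..., : self.vocab_size]
```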
@Yard1 @zhuohan123 I redesigned this PR to make the logits processor an individual layer. PTAL!
This needs changes to work with the LoRA path.
@Yard1 All CI passed; please review this again. cc @zhuohan123
Thanks for the fix! This looks very good. Much better than what we had before :) Can you merge the PR after fixing the merge conflicts?
Merged upstream/main:
- [Misc] Bump up transformers to v4.39.0 & Remove StarCoder2Config (vllm-project#3551)
- [Misc][Log] Add log for tokenizer length not equal to vocabulary size (vllm-project#3500)
- [🚀 Ready to be merged] Added support for Jais models (vllm-project#3183)
- Fix 1D query issue from `_prune_hidden_states` (vllm-project#3539)
- [PREFIX CACHING FOLLOW UP] OrderedDict-based evictor (vllm-project#3431)
- [BugFix] Hot fix in setup.py for neuron build (vllm-project#3537)
- Migrate `logits` computation and gather to `model_runner` (vllm-project#3233)
- [1/n][Chunked Prefill] Refactor input query shapes (vllm-project#3236)
- [1/n] Triton sampling kernel (vllm-project#3186)
- [Bugfix] Fix ROCm support in CMakeLists.txt (vllm-project#3534)
This might replace "Fix vocab_size inconsistency for sampler" #2398 (refer to "[Misc][Log] Add log for tokenizer length not equal to vocabulary size" #3500).

Description
This PR migrates `logits` computation and gather to `model_runner`. This change makes `Sampler` simple and clean.

Furthermore, I want to remove the `sample` method from model files like `llama.py`, because `model` and `sample` are at different stages and should be decoupled from each other. But per this comment on PR #3183: #3183 (comment), that model will scale `logits` in the sampler, so I will keep the `sample` method as it is. This PR will better support #3183 model integration as well.

Pipeline:
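A hedged sketch of the intended flow after this change; `compute_logits` and `sampler` are the names I would expect on the model here, but treat the exact signatures as assumptions:

```python
import torch


def execute_model_sketch(model, input_ids: torch.Tensor,
                         positions: torch.Tensor, kv_caches,
                         sampling_metadata):
    """Sketch of the post-PR pipeline inside model_runner."""
    # 1. The model forward pass now stops at hidden states.
    hidden_states = model(input_ids, positions, kv_caches)
    # 2. Logits computation + tensor-parallel gather live in their own
    #    step, outside the Sampler (the logits processor layer).
    logits = model.compute_logits(hidden_states, sampling_metadata)
    # 3. After the gather, only the driver rank holds logits (see the
    #    multi-GPU note above); other ranks skip sampling entirely.
    if logits is None:
        return None
    # 4. The Sampler consumes ready-made logits, keeping it simple and clean.
    return model.sampler(logits, sampling_metadata)
```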
TODO
- [ ] Fix #2398 ("Fix vocab_size inconsistency for sampler")