Migrate `logits` computation and gather to `model_runner` #3233
Conversation
Hi @esmeetu, important update. Regarding this comment, #3183 (comment): I found the bug. It will be useful for this PR. Please add this change while making the PR for scaling logits: e04e56d. This is needed because, in the multi-GPU setting, logits are `None` for all workers except the driver.
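For context on the multi-GPU behavior mentioned above, here is a minimal sketch (not vLLM's actual helper; the function name `gather_logits` is illustrative) of how a tensor-parallel gather leaves logits as `None` on every rank except the driver:

```python
from typing import Optional

import torch
import torch.distributed as dist


def gather_logits(logits: torch.Tensor, dst: int = 0) -> Optional[torch.Tensor]:
    """Gather vocab-parallel logit shards onto the driver rank.

    Every rank participates in the gather, but only the destination
    rank receives the full tensor; all other ranks get None, so any
    downstream code (e.g. sampling) must handle that case.
    """
    world_size = dist.get_world_size()
    if world_size == 1:
        return logits
    # Only the destination rank allocates receive buffers.
    gather_list = (
        [torch.empty_like(logits) for _ in range(world_size)]
        if dist.get_rank() == dst else None
    )
    dist.gather(logits, gather_list=gather_list, dst=dst)
    if dist.get_rank() != dst:
        return None  # non-driver ranks: logits are None from here on
    # Shards are split along the vocab dimension; stitch them back together.
    return torch.cat(gather_list, dim=-1)
```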
While I agree with this change in principle, I think it's important to ensure we have an API that can support different use cases. For example, I would suggest making the logit generation a layer (or some other sort of abstraction at the model level). The fact that it has now been moved out of the model into the model runner makes the code harder to understand, especially considering the sampler remains a layer. In other words, I would just suggest adding a new logit generator layer to the models (or the sampler, though the models would be better: the output of a model should be logits, IMO) and not putting that logic inside the model runner.
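For illustration, a minimal sketch of the kind of layer being suggested (the class shape and the `scale` hook are assumptions for this sketch, not the exact interface the PR ends up with):

```python
import torch
import torch.nn as nn


class LogitsProcessor(nn.Module):
    """Compute logits from hidden states as an explicit model-level layer.

    Keeping this as a layer (rather than logic buried in model_runner)
    means the model's output is still "logits", and per-model quirks
    such as logit scaling (see #3183) stay next to the model definition.
    """

    def __init__(self, vocab_size: int, scale: float = 1.0) -> None:
        super().__init__()
        self.vocab_size = vocab_size
        self.scale = scale  # e.g. Jais-style logit scaling

    def forward(self, lm_head_weight: torch.Tensor,
                hidden_states: torch.Tensor) -> torch.Tensor:
        # Project hidden states onto the (possibly padded) vocabulary.
        logits = hidden_states @ lm_head_weight.t()
        logits = logits * self.scale
        # Drop any padding added to the vocab for parallelism alignment.
        return logits[..., : self.vocab_size]
```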
@Yard1 @zhuohan123 I redesigned this PR to make the logits processor an individual layer. PTAL!
This needs changes to work with the LoRA path.
@Yard1 All CI passed; please review this again. cc @zhuohan123
Thanks for the fix! This looks very good. Much better than what we had before :) Can you merge the PR after fixing the merge conflicts?
Merged upstream/main:
- [Misc] Bump up transformers to v4.39.0 & Remove StarCoder2Config (vllm-project#3551)
- [Misc][Log] Add log for tokenizer length not equal to vocabulary size (vllm-project#3500)
- [🚀 Ready to be merged] Added support for Jais models (vllm-project#3183)
- Fix 1D query issue from `_prune_hidden_states` (vllm-project#3539)
- [PREFIX CACHING FOLLOW UP] OrderedDict-based evictor (vllm-project#3431)
- [BugFix] Hot fix in setup.py for neuron build (vllm-project#3537)
- Migrate `logits` computation and gather to `model_runner` (vllm-project#3233)
- [1/n][Chunked Prefill] Refactor input query shapes (vllm-project#3236)
- [1/n] Triton sampling kernel (vllm-project#3186)
- [Bugfix] Fix ROCm support in CMakeLists.txt (vllm-project#3534)
This might replace "Fix vocab_size inconsistency for sampler" #2398 (refer to "[Misc][Log] Add log for tokenizer length not equal to vocabulary size" #3500).

Description
This PR migrates `logits` computation and gather to `model_runner`. This change makes `Sampler` simple and clean.

Furthermore, I want to remove the `sample` method from model files like `llama.py`, because `model` and `sample` are at different stages and should be decoupled from each other. But per this comment on PR #3183: #3183 (comment), that model will scale `logits` in the sampler, so I will keep the `sample` method as it is. This PR will better support #3183 model integration as well.

Pipeline:
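A hedged sketch of the intended flow after this change; `compute_logits` and `sampler` are the names I would expect on the model here, but treat the exact signatures as assumptions:

```python
import torch


def execute_model_sketch(model, input_ids: torch.Tensor,
                         positions: torch.Tensor, kv_caches,
                         sampling_metadata):
    """Sketch of the post-PR pipeline inside model_runner."""
    # 1. The model forward pass now stops at hidden states.
    hidden_states = model(input_ids, positions, kv_caches)
    # 2. Logits computation + tensor-parallel gather live in their own
    #    step, outside the Sampler (the logits processor layer).
    logits = model.compute_logits(hidden_states, sampling_metadata)
    # 3. After the gather, only the driver rank holds logits (see the
    #    multi-GPU note above); other ranks skip sampling entirely.
    if logits is None:
        return None
    # 4. The Sampler consumes ready-made logits, keeping it simple and clean.
    return model.sampler(logits, sampling_metadata)
```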
TODO
- [ ] Fix #2398 ("Fix vocab_size inconsistency for sampler")