[Usage]: Is it possible to generate without detokenizing? #3635
Comments
I think this can be a good idea. Are you thinking about offline evaluation using the …? Thoughts about this? @Yard1 @zhuohan123
In the use case we ran into right now we're using the API server (but mostly because we need multiple instances on different GPUs, and we didn't see a way to pin …)
I'd also been thinking about this recently. I think it would be nice to have some kind of …
@simon-mo If this is something that would be useful to others, I'd be happy to help work on a PR for this! Would this mainly be a matter of skipping the two calls to …?
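To sketch the shape of the change being discussed, the idea is to gate the incremental detokenization step on a per-request flag. The names below are simplified, illustrative stand-ins, not actual vLLM internals:

```python
# Simplified stand-in for the idea being discussed; these names are
# illustrative only and are not actual vLLM internals.
from dataclasses import dataclass, field
from typing import List


@dataclass
class MySamplingParams:
    detokenize: bool = True  # per-request opt-out of producing output text


@dataclass
class MySequenceOutput:
    token_ids: List[int] = field(default_factory=list)
    text: str = ""


def append_token(output: MySequenceOutput, token_id: int,
                 params: MySamplingParams, tokenizer) -> None:
    """Record a newly sampled token, detokenizing only when requested."""
    output.token_ids.append(token_id)
    if params.detokenize:
        # This is the CPU-heavy step the issue wants to be able to skip.
        output.text += tokenizer.decode([token_id])
```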
The initialization also needs updating. I tried to resolve the issue in #3647. It would be nice if someone more familiar with the code base could make this implementation cleaner.
@GeauxEric please feel free to open a PR so it's easier to get feedback.
Just chiming in: this is something I'd be interested in as well.
Your current environment
How would you like to use vllm
We have a use case where we only need the output tokens, not the detokenized text. This also happens to use a very small model, and as far as I can tell performance is limited by CPU, not GPU: we see 100% CPU utilisation but only around 10% GPU utilisation (something similar has been observed even with medium-sized models, see #1375).
We haven't done detailed profiling, but one obvious optimisation would be to skip detokenisation, i.e. return only the token_ids and not the output text. Is there a way to do this? I haven't found anything, so I assume the answer is "No" out of the box, but we also don't mind making changes to the vllm source for this if it were just a matter of commenting out a line or two.
Thanks so much!
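To make the ask concrete, here is a minimal sketch of the kind of call we have in mind. The `detokenize` switch on `SamplingParams` is the piece in question; treat it as hypothetical if your installed vLLM version does not expose it:

```python
# Sketch only: the detokenize flag on SamplingParams is the piece we are
# asking about; treat it as hypothetical if your installed vLLM lacks it.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # stand-in for the small model we use

params = SamplingParams(
    max_tokens=64,
    detokenize=False,  # skip building output text, keep only token ids
)

for request_output in llm.generate(["Hello, my name is"], params):
    completion = request_output.outputs[0]
    # token_ids is all we need; completion.text would go unused downstream.
    print(completion.token_ids)
```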