[Usage]: Generate specified number of tokens for each request individually #3650
Comments
This is possible by setting the ignore_eos and max_tokens sampling parameters. If you don't want EOS tokens appearing in the output, the min_tokens parameter was added recently and will be in an upcoming release.
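As a rough illustration of that suggestion, here is a minimal offline sketch that forces a fixed output length by combining max_tokens with ignore_eos (with min_tokens as an alternative on versions that already ship it). The model name and token count are placeholders, not values from this issue.

```python
from vllm import LLM, SamplingParams

# Placeholder model; swap in whichever model you are actually serving.
llm = LLM(model="facebook/opt-125m")

# max_tokens caps the output length; ignore_eos keeps generation going even
# if the model emits an EOS token, so exactly max_tokens tokens are produced.
params = SamplingParams(max_tokens=100, ignore_eos=True)

# On versions that already include min_tokens, you could instead let EOS end
# generation normally but guarantee a minimum length:
# params = SamplingParams(max_tokens=100, min_tokens=100)

outputs = llm.generate(["Write a short story about a robot."], params)
for out in outputs:
    print(len(out.outputs[0].token_ids), out.outputs[0].text)
```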
Thanks for your reply. I've tried the sampling parameters, but they seem to act as a global setting that applies to all requests (e.g., generate 100 tokens for every request). Is there a way to specify the number of generated tokens for each request individually?
In the meantime, the online API supports multiple independent requests with different parameters; vLLM performs batching under the hood.
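To make that concrete, here is a hedged sketch against vLLM's OpenAI-compatible server, where each request carries its own max_tokens and concurrent requests are batched server-side. The base URL, model name, token counts, and the extra_body pass-through for ignore_eos are assumptions for illustration, not details taken from this thread.

```python
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server is running locally, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


def complete(prompt: str, n_tokens: int) -> str:
    # Each request carries its own max_tokens value.
    resp = client.completions.create(
        model="facebook/opt-125m",
        prompt=prompt,
        max_tokens=n_tokens,
        # ignore_eos is a vLLM-specific extension; passing it via extra_body
        # is an assumption to check against your server version.
        extra_body={"ignore_eos": True},
    )
    return resp.choices[0].text


# Send the requests concurrently; the server batches them under the hood.
jobs = [("Prompt one", 100), ("Prompt two", 200), ("Prompt three", 300)]
with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
    for text in pool.map(lambda args: complete(*args), jobs):
        print(text)
```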
Hello @oximi123, is there a build where we can test min_tokens? The version on PyPI does not have it yet. I wanted to try it and see if it works. Is there any way we can test it earlier?
Your current environment
vLLM with Python 3.9, Ubuntu 20
How would you like to use vllm
How can I specify the number of generated tokens for each request individually, in both online serving mode and offline batching mode? For example, with three requests, generate 100 tokens for request 1, 200 for request 2, and 300 for request 3. In both offline and online mode, the three requests should be processed in one batch, and each should return its specified number of tokens.
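For the offline batching half of this question, the sketch below passes one SamplingParams object per prompt to LLM.generate, which processes the prompts in a single batched call. The list-of-SamplingParams form and the model name are assumptions to verify against your installed vLLM version.

```python
from vllm import LLM, SamplingParams

# Placeholder model for illustration.
llm = LLM(model="facebook/opt-125m")

prompts = [
    "Prompt for request 1",
    "Prompt for request 2",
    "Prompt for request 3",
]

# One SamplingParams per prompt: 100, 200, and 300 tokens respectively.
# ignore_eos forces generation to continue all the way to max_tokens.
per_request_params = [
    SamplingParams(max_tokens=n, ignore_eos=True) for n in (100, 200, 300)
]

# Recent vLLM versions accept a list of SamplingParams aligned with the
# prompt list; if yours does not, calling generate once per prompt (or using
# the engine's add_request) gives the same per-request control.
outputs = llm.generate(prompts, per_request_params)
for out in outputs:
    print(len(out.outputs[0].token_ids), out.outputs[0].text[:60])
```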