Added logits processor API to sampling params #1469
This seems to be a similar PR: |
My team and I really really want this!! |
Benchmarked the RegexParser and it works great: over 1000 examples in offline inference, it is 40% slower than running without it (this will of course vary with the type of regex you're forcing). The next phase will be acceleration when the logits are simply a selection, like guidance did in the past. |
Hi @noamgat, thank you for this amazing contribution. The team (+@zhuohan123 @WoosukKwon @LiuXiaoxuanPKU) discussed a bit about this PR, and we think it's very promising. Can you help address the following:
Thank you again for this commit. We are really looking forward to bringing this to vLLM. |
Thanks for the feedback! Replies here:
|
Thank you for the response. We will accept the PR pending one question above. |
Co-authored-by: Simon Mo <[email protected]>
Nice catch! It was a bug I introduced during the PR process, while making a small modification to reduce the footprint when no logits processors are present. Confirmed and updated. |
When will the logits processor feature be released? |
Is it possible to use this PR to implement more complex methods such as contrastive decoding, etc.? |
Does anyone know how the logits processor functions are passed to other workers when using Ray? I understand that the "driver" worker where the sampling happens is in fact another thread within the main vLLM process, so there is probably no problem there. However, because SamplingParams are passed to all workers (as part of SequenceGroupMetadata), would Ray copy lots of data and pass it around (without using it later) if the processor references it? (The case of local copying of the logits processor was also addressed in #3099, but I don't think this applies to Ray.) |
This PR adds a new optional parameter `logits_processors` to SamplingParams. The idea (which exists in Hugging Face Transformers, llama.cpp and other inference engines) is to allow custom code to modify the logits after the model generates them and before they are sampled from.
This opens integration possibilities with a lot of solutions, such as LM Format Enforcer (my library), Guidance, JsonFormer and Outlines.
For example, this allows limiting vLLM to only generate outputs that conform to a specific JSON Schema or regular expression.
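As a rough illustration of the idea (not the code in this PR), a processor of this kind could look like the sketch below. It assumes a processor is a callable that receives the token ids generated so far plus the logits, and returns adjusted logits. Plain Python lists stand in for tensors so the sketch has no dependencies, and every name in it (`make_allowed_tokens_processor`, `apply_processors`, `greedy_pick`) is hypothetical.

```python
import math
from typing import Callable, List, Sequence

# Assumed signature, for illustration only:
# (token ids generated so far, logits over the vocabulary) -> adjusted logits
LogitsProcessor = Callable[[Sequence[int], List[float]], List[float]]


def make_allowed_tokens_processor(allowed_token_ids: Sequence[int]) -> LogitsProcessor:
    """Build a processor that masks every token outside the allowed set."""
    allowed = set(allowed_token_ids)

    def processor(generated_token_ids: Sequence[int], logits: List[float]) -> List[float]:
        # Disallowed tokens get -inf, so they can never be selected.
        return [
            score if token_id in allowed else -math.inf
            for token_id, score in enumerate(logits)
        ]

    return processor


def apply_processors(
    processors: Sequence[LogitsProcessor],
    generated_token_ids: Sequence[int],
    logits: List[float],
) -> List[float]:
    """Run the processors in order; an empty list leaves the logits untouched."""
    for processor in processors:
        logits = processor(generated_token_ids, logits)
    return logits


def greedy_pick(logits: List[float]) -> int:
    """Return the index of the highest-scoring token."""
    return max(range(len(logits)), key=logits.__getitem__)


# Toy vocabulary of 4 tokens where only tokens 1 and 3 are allowed.
masked = apply_processors(
    [make_allowed_tokens_processor([1, 3])], [], [5.0, 1.0, 4.0, 2.0]
)
next_token = greedy_pick(masked)
```

A format-enforcing library would compute the allowed set from the tokens generated so far (for example, from the current state of a JSON Schema or regex parser) instead of using a fixed list.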
The design principles behind this PR were to have as little impact as possible on vLLM itself (so no extra dependencies, no runtime penalty if the option is not used) and to be as consistent as possible with other inference engines.
LM Format Enforcer already has an example notebook on integrating with vLLM that shows the benefits. However, it relies on monkey patching because vLLM lacks such an API, which makes it less robust for production use.
I added a test to check the integration, and format.sh did not add any new remarks.