
Added logits processor API to sampling params #1469

Merged (6 commits) on Nov 3, 2023

Conversation

@noamgat (Contributor) commented Oct 25, 2023

This PR adds a new optional parameter logits_processors to SamplingParams.

The idea (which exists in huggingface transformers, llama.cpp, and other inference engines) is to let custom code modify the logit scores after the model produces them and before tokens are sampled from them.

This opens integration possibilities with a lot of solutions, such as LM Format Enforcer (my library), Guidance, JsonFormer and Outlines.

For example, this allows limiting vLLM to only generate outputs that conform to a specific JSON Schema or regular expression.
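
Below is a minimal sketch (not code from this PR) of how the parameter is used; the model name and the banned token ids are placeholders. Each processor is a callable that receives the token ids generated so far for a sequence together with the logits for the next token, and returns the (possibly modified) logits.

```python
# Minimal usage sketch of logits_processors; model name and token ids are placeholders.
from typing import List

import torch
from vllm import LLM, SamplingParams

BANNED_TOKEN_IDS = [42, 1337]  # hypothetical token ids to suppress

def ban_tokens(previous_token_ids: List[int], logits: torch.Tensor) -> torch.Tensor:
    # Setting a logit to -inf makes the corresponding token impossible to sample.
    logits[BANNED_TOKEN_IDS] = float("-inf")
    return logits

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(max_tokens=32, logits_processors=[ban_tokens])
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```

Masking logits to -inf like this is the basic building block that constrained-decoding libraries (JSON Schema, regex, grammars) build on.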

The design principles behind this PR were to have as little impact as possible on vLLM itself (so no extra dependencies, no runtime penalty if the option is not used) and to be as consistent as possible with other inference engines.

LM Format Enforcer already has an example notebook on integrating with vLLM that shows the benefits; however, it relies on monkey patching due to the lack of an API, which makes it less robust for production use.

I added a test to check the integration, and format.sh did not add any new remarks.

@noamgat (Contributor, Author) commented Oct 31, 2023

This seems to be a similar PR:
#535
(But this one is ready to merge because it was written against the most up-to-date vLLM.)

@c3-adam commented Oct 31, 2023

My team and I really really want this!!

@flexorRegev commented

Benchmarked the RegexParser: works great. Over 1000 examples in offline inference it is about 40% slower than running without it (this will of course vary with the type of regex you are enforcing).
This is super useful for a lot of things, and the integration is very simple.
Great job @noamgat!

The next phase will be acceleration for the cases where the logits step is simply a selection, as Guidance did in the past.

@simon-mo (Collaborator) commented Nov 1, 2023

Hi @noamgat, thank you for this amazing contribution. The team (+@zhuohan123 @WoosukKwon @LiuXiaoxuanPKU) discussed this PR a bit, and we think it's very promising. Can you help address the following:

  • Zero-cost abstraction: how can we ensure that inference is not slowed down when there are no logits processors? Can we skip the entire for loop when no processor is present?
  • Potential for batching: is it possible to make the logits processor accept a batch? In particular, we are interested in reducing the performance penalty in the common case of a single logits processor shared by all requests.
  • Documentation: can you add documentation and examples for this feature? Stressing the performance penalty would also help guide our users.
  • Error handling: currently, if no token is available, the request fails with an assertion error. Is it possible to fail more gracefully with a custom error, somehow allowing other requests to continue?

Thank you again for this contribution. We are really looking forward to bringing this to vLLM.

@noamgat (Contributor, Author) commented Nov 1, 2023

> (Quoting @simon-mo's questions above.)

Thanks for the feedback!

Replies here:

  • Zero-cost abstraction: I believe this is already the case. _apply_logits_processors() only loops over the sequence groups and checks whether a processor exists. No buffers are copied if none exist; the modifications happen in place.
  • Batching: the contract I went with mimics the design decisions of llama.cpp and huggingface transformers: the API allows the logits processing to depend on the tokens that were generated in previous steps, so each sequence can receive different processing (see the sketch after this list). If the caller wants to do something simpler (for example, disabling a single token), they can, and the performance won't be very different.
  • Documentation: there is no inherent performance penalty. The 40% slowdown that @flexorRegev mentioned comes from the processing time inside LM Format Enforcer's logits processor, not from the vLLM pipeline integration. After this PR is approved, I will update my library (LM Format Enforcer) to use the new API and submit integration examples to the vLLM documentation, so users can use this API to generate outputs that conform to a JSON Schema or regular expression.
  • Error handling: this is how other inference engines behave as well. Is there a way to fail only one (or some) of the requests in the minibatch?
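
To make the per-sequence contract concrete, here is a small sketch (not vLLM internals); EOS_TOKEN_ID and MAX_NEW_TOKENS are placeholder values. Because the processor receives the token ids already generated for its sequence, it can make stateful decisions, such as forcing EOS once a length budget is exhausted:

```python
# Sketch of a stateful per-sequence processor; constants are placeholders.
from typing import List

import torch

EOS_TOKEN_ID = 2
MAX_NEW_TOKENS = 16

def force_eos_after_budget(previous_token_ids: List[int], logits: torch.Tensor) -> torch.Tensor:
    if len(previous_token_ids) >= MAX_NEW_TOKENS:
        # Once the budget is exhausted, keep only EOS sampleable.
        mask = torch.full_like(logits, float("-inf"))
        mask[EOS_TOKEN_ID] = 0.0
        return logits + mask
    return logits
```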

(Review comment on vllm/model_executor/layers/sampler.py: outdated, resolved)
@simon-mo (Collaborator) commented Nov 3, 2023

Thank you for the response. We will accept the PR pending one question above.

@noamgat (Contributor, Author) commented Nov 3, 2023

Nice catch! It was a bug I introduced during the PR process while making a small modification to reduce the footprint when no logits processors are present. Confirmed and updated.

@simon-mo merged commit 555bdcc into vllm-project:main on Nov 3, 2023 (2 checks passed)
@Cppowboy commented Nov 9, 2023

When will the logits processor feature be released?

@veltz1 commented Dec 13, 2023

Is it possible to use this PR to implement more complex methods such as contrastive decoding?

@mmoskal (Contributor) commented May 11, 2024

Does anyone know how the logits processor functions are passed to the other workers when using Ray? I understand that the "driver" worker where the sampling happens is in fact another thread within the main vLLM process, so there is probably no problem there. However, because SamplingParams are passed to all workers (as part of SequenceGroupMetadata), would Ray end up copying lots of data if the processor references it, passing it around without ever using it?

(The case of copying logits processors locally was also addressed in #3099, but I don't think that applies to Ray.)
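
For illustration only (this is not vLLM or Ray internals, and MaskedProcessor is a hypothetical class): a processor object that carries large state makes any by-value serialization of the SamplingParams holding it correspondingly large, which is the cost the question above is about.

```python
# Self-contained illustration of the serialization concern; not library code.
import pickle

import torch

class MaskedProcessor:
    def __init__(self, allowed_token_ids):
        # Potentially large precomputed structure carried by the processor.
        self.allowed_token_ids = list(allowed_token_ids)

    def __call__(self, previous_token_ids, logits):
        mask = torch.full_like(logits, float("-inf"))
        mask[self.allowed_token_ids] = 0.0
        return logits + mask

proc = MaskedProcessor(range(100_000))
payload = pickle.dumps(proc)
# The payload grows with the captured state, which is what would be copied
# if the processor were shipped to remote workers by value.
print(f"pickled processor size: {len(payload)} bytes")
```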
