Attempt to pipe logit_bias to sampler's embedding_bias #1279
Conversation
I think I'm passing an array of arrays of logit biases, so that will probably need to be changed.
Please rebase this branch onto current main, then I will build and test this. I also want to get LMQL working with vLLM, because performance is not good with any of the other backends.
Sure thing. I'll rebase tomorrow.
No idea how this got closed. Reopened and rebased. Finally got it built too!
Looks good, added minor comments.
The client must know the vocabulary and the `vocab_size` in order to pass a `logit_bias` array that works with the loaded model. Right now the client has to load the tokenizer corresponding to the model loaded into the vLLM server to achieve this. That makes clients more complex, and there is a risk of using the wrong tokenizer for the model, which leads to errors.
I suggest exposing two new REST API calls:
- `vocabulary`: Returns the vocabulary (array of strings). The `vocab_size` is the length of this array. It may also return the special tokens, stop tokens, start/end of conversation tokens, etc.
- `tokenize`: Accepts `text` and returns an array of integers (token values).
This way no tokenizer needs to be built on the client side and there is no way to use the wrong vocabulary. The client can build the `logit_bias` based on the information returned (the client must acquire the vocabulary once on initialization).
Having these would allow for a pretty straightforward test case and easier integration with LMQL.
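For illustration only, a minimal sketch of what the two proposed endpoints could look like, assuming a FastAPI app and a HuggingFace tokenizer loaded by the server; the route names, response shapes, and the `MODEL_NAME` placeholder are hypothetical, not existing vLLM API:

```python
# Hypothetical sketch of the proposed /vocabulary and /tokenize endpoints.
# MODEL_NAME is a placeholder; the real server would reuse its own tokenizer.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer

MODEL_NAME = "my-org/my-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

app = FastAPI()

class TokenizeRequest(BaseModel):
    text: str

@app.get("/vocabulary")
async def vocabulary():
    # Return tokens ordered by token id, so index i corresponds to token id i.
    vocab = tokenizer.get_vocab()  # dict: token string -> token id
    tokens = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]
    return {
        "vocab_size": len(tokens),
        "tokens": tokens,
        "special_tokens": tokenizer.all_special_tokens,
    }

@app.post("/tokenize")
async def tokenize(request: TokenizeRequest):
    return {"token_ids": tokenizer.encode(request.text)}
```

A client would then fetch `/vocabulary` once at startup and build its `logit_bias` array from the returned token order.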
@@ -92,6 +93,7 @@ def __init__(
         stop_token_ids: Optional[List[int]] = None,
         ignore_eos: bool = False,
         max_tokens: int = 16,
+        logit_bias: float = [],
Proper type is: Optional[List[float]]
@@ -122,6 +124,7 @@ def __init__(
         self.max_tokens = max_tokens
         self.logprobs = logprobs
         self.prompt_logprobs = prompt_logprobs
+        self.logit_bias = logit_bias
Define type here as well
Also add `self.logit_bias` to `__repr__` below.
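Putting the typing suggestions above together, the relevant parts of `SamplingParams` might end up looking roughly like this (a sketch with most existing fields omitted, not the PR's actual code):

```python
from typing import List, Optional

class SamplingParams:
    def __init__(
        self,
        max_tokens: int = 16,
        # Suggested annotation; defaulting to None also avoids Python's
        # shared-mutable-default pitfall with [].
        logit_bias: Optional[List[float]] = None,
    ) -> None:
        self.max_tokens = max_tokens
        self.logit_bias: Optional[List[float]] = logit_bias

    def __repr__(self) -> str:
        # logit_bias included alongside the other fields (most omitted here).
        return (f"SamplingParams(max_tokens={self.max_tokens}, "
                f"logit_bias={self.logit_bias})")
```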
        logit_biases: any = []
        for seq_group in input_metadata.seq_groups:
            set_ids, sampling_params = seq_group
            logit_biases += [sampling_params.logit_bias]
Add validation of the size of the `logit_bias` array received from `sampling_params`. Having an explicit error with a clear explanation here is better than ending up with a cryptic PyTorch error message while trying to add arrays of mismatching sizes later. (It cannot be validated inside `SamplingParams`, because the right size is not known there.)
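A small helper along these lines could produce that explicit error (a sketch only; `vocab_size` is assumed to be available wherever the sampler builds the bias tensor):

```python
from typing import List, Optional

def validate_logit_bias(logit_bias: Optional[List[float]], vocab_size: int) -> None:
    """Fail fast with a clear message instead of a cryptic shape mismatch later."""
    if logit_bias and len(logit_bias) != vocab_size:
        raise ValueError(
            f"logit_bias has length {len(logit_bias)}, but the model's "
            f"vocabulary size is {vocab_size}; the bias must cover every token.")
```

It would be called once per `sampling_params` inside the loop shown above, before the biases are stacked into a tensor.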
@viktor-ferenczi I've been trying to test this locally, but it doesn't seem like the logit_bias parameter is actually working :( Were you able to see the logit_bias take effect?
@viktor-ferenczi I may be missing something, but why do they need to know the vocab size? Isn't logit_bias a sparse mapping of token ids to the bias to apply?
I did not know it is a sparse mapping; I expected it to be an array of floats of length `vocab_size`.
Anyway, the client must be able to retrieve the vocabulary of the currently loaded model for the solution to be fully usable. Otherwise the client must instantiate the tokenizer of the exact same model loaded into the vLLM server just to get the vocabulary, which means a lot more dependencies and room for error, not to mention the added loading time.
Looks like on OpenAI's API it's a map of token_ids to bias values: https://platform.openai.com/docs/api-reference/chat/create#chat-create-logit_bias
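If that OpenAI-style sparse form ({token_id: bias}) were adopted, expanding it into a dense tensor that can simply be added to the logits could look roughly like this (an illustrative sketch, not the PR's code):

```python
from typing import Dict

import torch

def dense_bias_from_sparse(logit_bias: Dict[int, float],
                           vocab_size: int) -> torch.Tensor:
    """Expand a sparse {token_id: bias} mapping into a dense [vocab_size] tensor."""
    bias = torch.zeros(vocab_size)
    for token_id, value in logit_bias.items():
        if not 0 <= token_id < vocab_size:
            raise ValueError(f"token id {token_id} is outside the vocabulary "
                             f"(vocab_size={vocab_size})")
        bias[token_id] = value
    return bias

# e.g. logits += dense_bias_from_sparse({50256: -100.0}, vocab_size=50257)
```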
@viktor-ferenczi @benbot Did you guys come to a consensus on how to go about implementing this? Loading the tokenizer vocab will be tricky, but then again this feature is meant for rather advanced use cases, and we could just leave it to the API consumer to figure out the mapping between token_ids and tokens (that's what OpenAI did anyway).
Depending on your use case, using a grammar would be a much cleaner alternative to trying to manipulate logits; it also moves the problem of handling tokens entirely into the server. See #2105: Add grammars
Covered by #3027
`logit_bias` is an important feature of the OpenAI API that vLLM seems to have implemented but not exposed in the actual API. This is me taking a crack at exposing that functionality.
For the life of me, I can't get my CUDA versions to all agree to build this locally, so while I try to do that I'm opening the PR for others to try out.
Should resolve: #379
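Assuming the parameter ends up exposed on the OpenAI-compatible completions endpoint the same way OpenAI does it (a sparse token_id -> bias map), client usage might look like this (a sketch; the server URL, model name, and token id are placeholders):

```python
import requests

# Placeholders: adjust to your vLLM server and model.
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "my-org/my-model",
        "prompt": "The capital of France is",
        "max_tokens": 16,
        # Sparse map of token id -> bias, as in OpenAI's API (range -100 to 100).
        "logit_bias": {"12345": -100},
    },
)
print(response.json())
```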