Push logprob generation to LLMEngine #3065
Conversation
Co-authored-by: Avnish Narayan <[email protected]>
LGTM!
vllm/sequence.py
                and self.logprobs == other.logprobs)
        equal = (self.parent_seq_id == other.parent_seq_id
                 and self.output_token == other.output_token)
        log_probs_equal = ((len(other.logprobs) == len(self.logprobs))
Is it better to move this to Logprobs's __eq__?
good call!
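For illustration, a minimal sketch of what moving the comparison into Logprob's __eq__ could look like (the field names and tolerance here are assumptions, not the exact vllm/sequence.py definition):

```python
# Hypothetical sketch only; the fields `logprob` and `decoded_token` are assumed.
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Logprob:
    logprob: float
    decoded_token: Optional[str] = None

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, Logprob):
            return NotImplemented
        # Compare the numeric value only; the decoded text may legitimately
        # differ depending on the surrounding tokens during detokenization.
        return math.isclose(self.logprob, other.logprob)
```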
I pointed out an issue w.r.t. LogProbs in a previous PR:
Maybe this PR is a good point to merge/address that in vLLM?
@AguirreNicolas IIUC, considering that change would mainly need to be implemented in the OpenAI server, I think it should be independent of this PR.
@esmeetu I had to add some extra logic, ptal again
Thanks @Yard1! I'd realized that something like this was needed while making changes to use a threadpool for tokenization (per #2879 (comment)). I'll wait until this is merged before opening the PR for that.
# We need to do it here, because if there are exceptions in
# the result_generator, it needs to be sent as the FIRST
# response (by the try...except).
Why does the error need to be the first response? This would also delay the first response until after the first token is generated (which could include any queuing time, I think).
I remember the openai client package was not able to handle errors unless they were the first thing that came out of the endpoint. I think the current version may be more robust, though. Will see if the test can still pass with the previous layout.
I think the slight delay in response is fine; it will not affect e2e time.
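A rough sketch of the pattern being discussed, assuming a FastAPI-style server; the helper name `to_chunk` and the error shape are illustrative, not vLLM's actual implementation:

```python
# Illustrative only: surface generator errors as the FIRST response, then stream.
from fastapi.responses import JSONResponse, StreamingResponse


async def stream_or_error(result_generator, to_chunk):
    """Await the first result inside try/except so an engine error becomes a
    plain error response instead of a broken event stream."""
    try:
        first_output = await result_generator.__anext__()
    except StopAsyncIteration:
        return JSONResponse({"error": "empty generation"}, status_code=500)
    except Exception as exc:
        return JSONResponse({"error": str(exc)}, status_code=400)

    async def stream():
        yield to_chunk(first_output)
        async for output in result_generator:
            yield to_chunk(output)

    return StreamingResponse(stream(), media_type="text/event-stream")
```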
for token_id, sample_logprob in logprobs.items():
    if (sample_logprob.decoded_token is None and token_id != -1):
        all_input_ids_with_logprob = all_input_ids[:-1] + [token_id]
        _, new_text, prefix_offset, read_offset = detokenize_incrementally(
Without knowing the OpenAI behaviour, IMHO it would be more appropriate here to use convert_ids_to_tokens and include the explicit/atomic token strings. Otherwise the text may not line up with the token.
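As a concrete illustration of the difference (a sketch assuming a byte-level BPE tokenizer such as GPT-2, via the Hugging Face transformers API):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok.encode("Hello world")

# Atomic token strings keep BPE artifacts such as the leading "Ġ" space marker:
print(tok.convert_ids_to_tokens(ids))  # ['Hello', 'Ġworld']

# Context-aware (incremental) detokenization yields the text users actually see:
print(tok.decode(ids))  # 'Hello world'
```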
Actually, the OpenAI behavior is exactly that - the logprob token text depends on the previous tokens and is not constant.
@Yard1 ok, thanks! I'll take a closer look at this.
@Yard1 LGTM! Could you merge the latest branch and pass the CI?
vllm/engine/arg_utils.py
@@ -30,6 +30,7 @@ class EngineArgs:
    max_num_batched_tokens: Optional[int] = None
    max_num_seqs: int = 256
    max_paddings: int = 256
    max_log_probs: int = 5
Is max_logprobs a better name than this? And we could add a comment explaining why the default value is 5 (from the OpenAI API Reference?).
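A hedged sketch of the naming/comment suggestion (the surrounding fields are copied from the diff above; the rename and comment are illustrative, not the merged code):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EngineArgs:
    max_num_batched_tokens: Optional[int] = None
    max_num_seqs: int = 256
    max_paddings: int = 256
    # Default of 5 mirrors the maximum `logprobs` value documented in the
    # OpenAI completions API reference.
    max_logprobs: int = 5
```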
@Yard1 I realized that this is an API-breaking change for anyone consuming logprobs via the engine API (it actually broke our integration). I'm not sure what the project's stance on this is w.r.t. semantic versioning, but at minimum I guess it should be highlighted in the 0.3.4 release notes.
@Yard1 This also breaks https://github.com/EleutherAI/lm-evaluation-harness -- we should either fix the harness or roll back the API change :)
We should fix the harness.
Sounds good to me -- can you make a PR for it?
Co-authored-by: Avnish Narayan <[email protected]>
Co-authored-by: Avnish Narayan <[email protected]>
This PR moves the logprob detokenization logic from the OpenAI server into the LLMEngine, allowing for consistent output between the two. It is also a first step towards making the OpenAI server more lightweight by pushing some of its responsibilities down into the engine.
It also ensures the logprob tokens are detokenized with the previous tokens in mind (same as generated tokens), which makes them more accurate.
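For reference, a hedged usage sketch of requesting logprobs through the engine-level Python API (parameter names follow vLLM's public SamplingParams; the exact shape of the returned logprob entries may differ):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(max_tokens=8, logprobs=5)

outputs = llm.generate(["Hello, my name is"], params)
for completion in outputs[0].outputs:
    # With this change, each logprob entry is detokenized by the engine itself,
    # so the decoded text is consistent with the generated tokens.
    print(completion.logprobs)
```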