Push logprob generation to LLMEngine #3065
Conversation
Co-authored-by: Avnish Narayan <[email protected]>
LGTM!
vllm/sequence.py
                and self.logprobs == other.logprobs)
        equal = (self.parent_seq_id == other.parent_seq_id
                 and self.output_token == other.output_token)
        log_probs_equal = ((len(other.logprobs) == len(self.logprobs))
Is it better to move this to Logprobs's __eq__?
good call!
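For illustration, a minimal sketch of what moving the comparison into Logprob's __eq__ could look like (the field names and tolerance here are assumptions, not the exact vllm/sequence.py definition):

```python
# Hypothetical sketch only; the fields `logprob` and `decoded_token` are assumed.
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Logprob:
    logprob: float
    decoded_token: Optional[str] = None

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, Logprob):
            return NotImplemented
        # Compare the numeric value only; the decoded text may legitimately
        # differ depending on the surrounding tokens during detokenization.
        return math.isclose(self.logprob, other.logprob)
```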
I pointed out an issue w.r.t. LogProbs in a previous PR:
Maybe this PR is a good point to merge/address that in vLLM?
@AguirreNicolas IIUC, considering that change would mainly need to be implemented in the OpenAI server, I think it should be independent of this PR.
@esmeetu I had to add some extra logic, ptal again
Thanks @Yard1! I'd realized that something like this was needed while making changes to use a threadpool for tokenization (per #2879 (comment)). I'll wait until this is merged before opening the PR for that.
# We need to do it here, because if there are exceptions in
# the result_generator, it needs to be sent as the FIRST
# response (by the try...except).
Why does the error need to be the first response? This would also delay the first response until after the first token is generated (which could include any queuing time, I think).
I remember the openai client package was not able to handle errors unless they were the first thing that came out of the endpoint. I think the current version may be more robust, though. Will see if the test can still pass with the previous layout.
I think the slight delay in response is fine; it will not affect e2e time.
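A rough sketch of the pattern being discussed, assuming a FastAPI-style server; the helper name `to_chunk` and the error shape are illustrative, not vLLM's actual implementation:

```python
# Illustrative only: surface generator errors as the FIRST response, then stream.
from fastapi.responses import JSONResponse, StreamingResponse


async def stream_or_error(result_generator, to_chunk):
    """Await the first result inside try/except so an engine error becomes a
    plain error response instead of a broken event stream."""
    try:
        first_output = await result_generator.__anext__()
    except StopAsyncIteration:
        return JSONResponse({"error": "empty generation"}, status_code=500)
    except Exception as exc:
        return JSONResponse({"error": str(exc)}, status_code=400)

    async def stream():
        yield to_chunk(first_output)
        async for output in result_generator:
            yield to_chunk(output)

    return StreamingResponse(stream(), media_type="text/event-stream")
```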
for token_id, sample_logprob in logprobs.items():
    if (sample_logprob.decoded_token is None and token_id != -1):
        all_input_ids_with_logprob = all_input_ids[:-1] + [token_id]
        _, new_text, prefix_offset, read_offset = detokenize_incrementally(
Without knowing the OpenAI behaviour, IMHO it would be more appropriate here to use convert_ids_to_tokens and include the explicit/atomic token strings. Otherwise the text may not line up with the token.
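As a concrete illustration of the difference (a sketch assuming a byte-level BPE tokenizer such as GPT-2, via the Hugging Face transformers API):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok.encode("Hello world")

# Atomic token strings keep BPE artifacts such as the leading "Ġ" space marker:
print(tok.convert_ids_to_tokens(ids))  # ['Hello', 'Ġworld']

# Context-aware (incremental) detokenization yields the text users actually see:
print(tok.decode(ids))  # 'Hello world'
```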
Actually, the OpenAI behavior is exactly that - the logprob token text depends on the previous tokens and is not constant.
@Yard1 ok, thanks! I'll take a closer look at this.
@Yard1 LGTM! Could you merge the latest branch and pass the CI?
vllm/engine/arg_utils.py
@@ -30,6 +30,7 @@ class EngineArgs:
    max_num_batched_tokens: Optional[int] = None
    max_num_seqs: int = 256
    max_paddings: int = 256
    max_log_probs: int = 5
Is max_logprobs a better name than this? And we could add a comment explaining why the default value is 5 (from the OpenAI API Reference?).
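A hedged sketch of the naming/comment suggestion (the surrounding fields are copied from the diff above; the rename and comment are illustrative, not the merged code):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EngineArgs:
    max_num_batched_tokens: Optional[int] = None
    max_num_seqs: int = 256
    max_paddings: int = 256
    # Default of 5 mirrors the maximum `logprobs` value documented in the
    # OpenAI completions API reference.
    max_logprobs: int = 5
```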
@Yard1 I realized that this is an API-breaking change for anyone consuming logprobs via the engine API (it actually broke our integration). I'm not sure what the project's stance on this is w.r.t. semantic versioning, but at minimum I guess it should be highlighted in the 0.3.4 release notes.
@Yard1 This also breaks https://github.com/EleutherAI/lm-evaluation-harness -- we should either fix the harness or roll back the API change :)
We should fix the harness.
Sounds good to me -- can you make a PR for it?
Co-authored-by: Avnish Narayan <[email protected]>
Co-authored-by: Avnish Narayan <[email protected]>
This PR moves the logprob detokenization logic from the OpenAI server into the LLMEngine, allowing for consistent output between the two. It is also a first step towards making the OpenAI server more lightweight by pushing some of its responsibilities down into the engine.
It also ensures the logprob tokens are detokenized with the previous tokens in mind (same as generated tokens), which makes them more accurate.
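For reference, a hedged usage sketch of requesting logprobs through the engine-level Python API (parameter names follow vLLM's public SamplingParams; the exact shape of the returned logprob entries may differ):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(max_tokens=8, logprobs=5)

outputs = llm.generate(["Hello, my name is"], params)
for completion in outputs[0].outputs:
    # With this change, each logprob entry is detokenized by the engine itself,
    # so the decoded text is consistent with the generated tokens.
    print(completion.logprobs)
```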