[Frontend] Add Usage data in each chunk for chat_serving. #6540 #6652
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these:
Looks generally good - I suggested some small simplifications to the logic. I haven't reviewed the new tests yet due to the large diff.
        or res.outputs[i].finish_reason
        is not None):
I think this response contains only the role (i.e., no tokens have been generated), so I don't think it is necessary to check the stopping criterion at this stage.
good point, I'll change it
if (request.stream_options
        and request.stream_options.include_usage):
    chunk.usage = None
    if (request.stream_options.continuous_usage_stats
            or res.outputs[i].finish_reason is not None):
        prompt_tokens = len(res.prompt_token_ids)
        completion_tokens = 0
        usage = UsageInfo(
            prompt_tokens=prompt_tokens,
            completion_tokens=completion_tokens,
            total_tokens=prompt_tokens + completion_tokens,
        )
        if request.stream_options.continuous_usage_stats:
            chunk.usage = usage
        else:
            chunk.usage = None
Perhaps this could just be simplified to:
if (request.stream_options
        and request.stream_options.include_usage):
    if request.stream_options.continuous_usage_stats:
        prompt_tokens = len(res.prompt_token_ids)
        usage = UsageInfo(
            prompt_tokens=prompt_tokens,
            completion_tokens=0,
            total_tokens=prompt_tokens,
        )
        chunk.usage = usage
    else:
        chunk.usage = None
if (request.stream_options.continuous_usage_stats
        or res.outputs[i].finish_reason is not None):
    prompt_tokens = len(res.prompt_token_ids)
    completion_tokens = len(res.outputs[i].token_ids)
    usage = UsageInfo(
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        total_tokens=prompt_tokens + completion_tokens,
    )
    if request.stream_options.continuous_usage_stats:
        chunk.usage = usage
    else:
        chunk.usage = None
Same comments apply here as above: (a) I don't think we need to check the finish reason since we are just echoing the prompt and (b) we can simplify the if/else logic in the same way as above.
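For illustration, here is a self-contained sketch of the simplified per-chunk usage logic that comment suggests for the echo path. The function and parameter names below are stand-ins for the real request/response objects in this PR, not code from it:

```python
from typing import Optional

# Hypothetical stand-alone sketch of the suggested simplification for the
# prompt-echo branch; the parameters stand in for the real
# request.stream_options flags and RequestOutput token id lists.
def usage_for_echo_chunk(include_usage: bool,
                         continuous_usage_stats: bool,
                         prompt_token_ids: list,
                         output_token_ids: list) -> Optional[dict]:
    if not include_usage:
        return None  # usage was not requested at all
    if continuous_usage_stats:
        prompt_tokens = len(prompt_token_ids)
        completion_tokens = len(output_token_ids)
        return {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        }
    return None  # usage stays null until the final usage chunk
```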
if (request.stream_options.continuous_usage_stats):
    prompt_tokens = len(res.prompt_token_ids)
    completion_tokens = len(output.token_ids)
    usage = UsageInfo(
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        total_tokens=prompt_tokens + completion_tokens,
    )
if request.stream_options.continuous_usage_stats:
    chunk.usage = usage
else:
    chunk.usage = None
Can't we just have:
if request.stream_options.continuous_usage_stats:
    prompt_tokens = len(res.prompt_token_ids)
    completion_tokens = len(output.token_ids)
    usage = UsageInfo(
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        total_tokens=prompt_tokens + completion_tokens,
    )
    chunk.usage = usage
else:
    chunk.usage = None
if (request.stream_options.continuous_usage_stats):
    prompt_tokens = len(res.prompt_token_ids)
    completion_tokens = len(output.token_ids)
    usage = UsageInfo(
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        total_tokens=prompt_tokens + completion_tokens,
    )
if request.stream_options.continuous_usage_stats:
    chunk.usage = usage
else:
    chunk.usage = None
same as above here
good point
LGTM
/ready
The CI failure does not look related to these changes. I think this PR looks good. cc @simon-mo
Thanks for contributing this! @yecohn
Great news 🥳
This change does not conform to the OAI spec:
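For reference, my reading of the documented OpenAI include_usage behavior, sketched below with plain dicts and made-up token counts: usage should be populated only on one extra final chunk (whose choices list is empty) streamed before [DONE], while every earlier chunk carries usage: null.

```python
# Rough sketch (plain dicts, made-up numbers) of a stream that follows the
# documented OpenAI behavior for stream_options={"include_usage": true}.
conforming_stream = [
    # content chunks: the usage field is present but null
    {
        "object": "chat.completion.chunk",
        "choices": [{"index": 0, "delta": {"content": "Hello"}}],
        "usage": None,
    },
    # one extra final chunk before [DONE]: empty choices, populated usage
    {
        "object": "chat.completion.chunk",
        "choices": [],
        "usage": {
            "prompt_tokens": 11,
            "completion_tokens": 42,
            "total_tokens": 53,
        },
    },
]
```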
PR for issue #6540.
I added some logic to return usage data in each chunk when the flag stream_options.continuous_usage_stats is set.
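A quick way to try the flag against a running server; this is only an illustrative sketch, and the URL and model name below are placeholders, not values from this PR:

```python
# Illustrative client sketch: stream a chat completion from a locally running
# OpenAI-compatible vLLM server and print the per-chunk usage data.
import json

import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",  # placeholder URL
    json={
        "model": "my-model",  # placeholder model name
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": True,
        "stream_options": {
            "include_usage": True,
            "continuous_usage_stats": True,  # flag added by this PR
        },
    },
    stream=True,
)

for line in response.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":
        break
    chunk = json.loads(payload)
    # With continuous_usage_stats, each chunk should carry usage data.
    print(chunk.get("usage"))
```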