[Frontend] Add Usage data in each chunk for chat_serving. #6540 #6652
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these:
Looks generally good - I suggested some small simplifications to the logic. I haven't reviewed the new tests yet due to the large diff.
        or res.outputs[i].finish_reason
        is not None):
I think this response contains only the role (i.e., no tokens have been generated), so I don't think it is necessary to check the stopping criterion at this stage.
good point, I'll change it
if (request.stream_options
        and request.stream_options.include_usage):
    chunk.usage = None
    if (request.stream_options.continuous_usage_stats
            or res.outputs[i].finish_reason is not None):
        prompt_tokens = len(res.prompt_token_ids)
        completion_tokens = 0
        usage = UsageInfo(
            prompt_tokens=prompt_tokens,
            completion_tokens=completion_tokens,
            total_tokens=prompt_tokens + completion_tokens,
        )
        if request.stream_options.continuous_usage_stats:
            chunk.usage = usage
        else:
            chunk.usage = None
Perhaps this could just be simplified to:
if (request.stream_options
        and request.stream_options.include_usage):
    if request.stream_options.continuous_usage_stats:
        prompt_tokens = len(res.prompt_token_ids)
        usage = UsageInfo(
            prompt_tokens=prompt_tokens,
            completion_tokens=0,
            total_tokens=prompt_tokens,
        )
        chunk.usage = usage
    else:
        chunk.usage = None
if (request.stream_options.continuous_usage_stats
        or res.outputs[i].finish_reason is not None):
    prompt_tokens = len(res.prompt_token_ids)
    completion_tokens = len(res.outputs[i].token_ids)
    usage = UsageInfo(
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        total_tokens=prompt_tokens + completion_tokens,
    )
    if request.stream_options.continuous_usage_stats:
        chunk.usage = usage
    else:
        chunk.usage = None
Same comments apply here as above: (a) I don't think we need to check the finish reason since we are just echoing the prompt and (b) we can simplify the if/else logic in the same way as above.
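For illustration, here is a self-contained sketch of the simplified per-chunk usage logic that comment suggests for the echo path. The function and parameter names below are stand-ins for the real request/response objects in this PR, not code from it:

```python
from typing import Optional

# Hypothetical stand-alone sketch of the suggested simplification for the
# prompt-echo branch; the parameters stand in for the real
# request.stream_options flags and RequestOutput token id lists.
def usage_for_echo_chunk(include_usage: bool,
                         continuous_usage_stats: bool,
                         prompt_token_ids: list,
                         output_token_ids: list) -> Optional[dict]:
    if not include_usage:
        return None  # usage was not requested at all
    if continuous_usage_stats:
        prompt_tokens = len(prompt_token_ids)
        completion_tokens = len(output_token_ids)
        return {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        }
    return None  # usage stays null until the final usage chunk
```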
if (request.stream_options.continuous_usage_stats):
    prompt_tokens = len(res.prompt_token_ids)
    completion_tokens = len(output.token_ids)
    usage = UsageInfo(
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        total_tokens=prompt_tokens + completion_tokens,
    )
if request.stream_options.continuous_usage_stats:
    chunk.usage = usage
else:
    chunk.usage = None
Can't we just have:
if request.stream_options.continuous_usage_stats:
    prompt_tokens = len(res.prompt_token_ids)
    completion_tokens = len(output.token_ids)
    usage = UsageInfo(
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        total_tokens=prompt_tokens + completion_tokens,
    )
    chunk.usage = usage
else:
    chunk.usage = None
if (request.stream_options.continuous_usage_stats):
    prompt_tokens = len(res.prompt_token_ids)
    completion_tokens = len(output.token_ids)
    usage = UsageInfo(
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        total_tokens=prompt_tokens + completion_tokens,
    )
if request.stream_options.continuous_usage_stats:
    chunk.usage = usage
else:
    chunk.usage = None
same as above here
good point
LGTM
/ready
The CI failure does not look related to these changes. I think this PR looks good. cc @simon-mo
Thanks for contributing this! @yecohn
Great news 🥳
This change does not conform to the OAI spec:
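For reference, my reading of the documented OpenAI include_usage behavior, sketched below with plain dicts and made-up token counts: usage should be populated only on one extra final chunk (whose choices list is empty) streamed before [DONE], while every earlier chunk carries usage: null.

```python
# Rough sketch (plain dicts, made-up numbers) of a stream that follows the
# documented OpenAI behavior for stream_options={"include_usage": true}.
conforming_stream = [
    # content chunks: the usage field is present but null
    {
        "object": "chat.completion.chunk",
        "choices": [{"index": 0, "delta": {"content": "Hello"}}],
        "usage": None,
    },
    # one extra final chunk before [DONE]: empty choices, populated usage
    {
        "object": "chat.completion.chunk",
        "choices": [],
        "usage": {
            "prompt_tokens": 11,
            "completion_tokens": 42,
            "total_tokens": 53,
        },
    },
]
```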
PR for issue #6540.
I added some logic to return usage data in each chunk when the flag stream_options.continuous_usage_stats is set.
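A quick way to try the flag against a running server; this is only an illustrative sketch, and the URL and model name below are placeholders, not values from this PR:

```python
# Illustrative client sketch: stream a chat completion from a locally running
# OpenAI-compatible vLLM server and print the per-chunk usage data.
import json

import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",  # placeholder URL
    json={
        "model": "my-model",  # placeholder model name
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": True,
        "stream_options": {
            "include_usage": True,
            "continuous_usage_stats": True,  # flag added by this PR
        },
    },
    stream=True,
)

for line in response.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":
        break
    chunk = json.loads(payload)
    # With continuous_usage_stats, each chunk should carry usage data.
    print(chunk.get("usage"))
```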