
[Bugfix] fix output parsing error for trtllm backend #4137

Merged
2 commits merged into vllm-project:main on Apr 17, 2024

Conversation

@elinx elinx (Contributor) commented Apr 17, 2024

Fix #4136

@ywang96 ywang96 (Member) left a comment

Thank you for testing and the fix! Do you mind attaching the output from running the fixed version?

@@ -149,7 +150,6 @@ async def async_request_trt_llm(
 most_recent_timestamp = timestamp

 output.latency = most_recent_timestamp - st
-output.generated_text = json.loads(data)["text_output"]
@ywang96 ywang96 (Member) commented Apr 17, 2024

I remember Triton + TRT had an issue where data['text_output'] contained the cumulative output text instead of the delta, hence in this version I initially only took the last chunk to retrieve generated_text.

Could you verify if this is still the case? (and either way, this should have been data['text_output'] not json.loads(data)["text_output"], so thanks for catching this!)

@elinx elinx (Contributor, Author) replied

Looks like it has been fixed; the output I got looks like this:

data: {"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":0.0,"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"a"}

data: {"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"city"}

data: {"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"in"}

data: {"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"China"}

data: {"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"."}

data: {"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"It"}

data: {"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"is"}

@elinx elinx (Contributor, Author) commented Apr 17, 2024

> Thank you for testing and the fix! Do you mind attaching the output from running the fixed version?

The final output is something like:

============ Serving Benchmark Result ============
Successful requests:                     1000     
Benchmark duration (s):                  355.37   
Total input tokens:                      248339   
Total generated tokens:                  260518   
Request throughput (req/s):              2.81     
Input token throughput (tok/s):          698.81   
Output token throughput (tok/s):         733.08   
---------------Time to First Token----------------
Mean TTFT (ms):                          213579.15
Median TTFT (ms):                        213038.47
P99 TTFT (ms):                           322949.50
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          55.34    
Median TPOT (ms):                        55.59    
P99 TPOT (ms):                           77.75    
==================================================
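
As a side note, here is a rough sketch of how metrics of this kind are derived from per-request timestamps: TTFT is the first-token latency, and TPOT divides the remaining decode time by the remaining output tokens. This only illustrates the definitions behind the table above, not the benchmark script itself; the function and field names are assumptions.

import statistics

def summarize(requests, duration_s):
    # requests: list of dicts with start, first_token_time, end (seconds),
    # plus input_tokens and output_tokens counts.
    ttft_ms = [(r["first_token_time"] - r["start"]) * 1000 for r in requests]
    # TPOT excludes the first token: decode time spread over remaining tokens.
    tpot_ms = [
        (r["end"] - r["first_token_time"]) * 1000 / max(r["output_tokens"] - 1, 1)
        for r in requests
    ]
    return {
        "request_throughput_req_s": len(requests) / duration_s,
        "input_tok_s": sum(r["input_tokens"] for r in requests) / duration_s,
        "output_tok_s": sum(r["output_tokens"] for r in requests) / duration_s,
        "mean_ttft_ms": statistics.mean(ttft_ms),
        "median_ttft_ms": statistics.median(ttft_ms),
        "mean_tpot_ms": statistics.mean(tpot_ms),
        "median_tpot_ms": statistics.median(tpot_ms),
    }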

@ywang96 ywang96 enabled auto-merge (squash) April 17, 2024 10:19
@ywang96 ywang96 merged commit fe3b5bb into vllm-project:main Apr 17, 2024
46 checks passed
robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request Apr 21, 2024
z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request Apr 22, 2024
robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request Apr 26, 2024
alexeykondrat pushed a commit to alexeykondrat/ci-vllm that referenced this pull request May 1, 2024
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024
Development

Successfully merging this pull request may close these issues.

[Bug]: benchmark trtllm failed