
docs: add conclusion #1340

Merged: 1 commit merged into main from tmp2 on Sep 5, 2024

Conversation

@zhyncs (Member) commented Sep 5, 2024

Motivation

Modifications

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@zhyncs zhyncs merged commit 62f15ee into main Sep 5, 2024
1 check passed
@zhyncs zhyncs deleted the tmp2 branch September 5, 2024 18:25
@robertgshaw2-neuralmagic

@zhyncs - QQ: how is ITL measured, and how does it compare to TPOT?

I thought these were the same concept: https://docs.nvidia.com/nim/benchmarking/llm/latest/metrics.html#inter-token-latency-itl

@zhyncs (Member, Author) commented Sep 5, 2024

Hi @robertgshaw2-neuralmagic
For example, Engine A returns 1 token per second, for a total of 50 tokens. Engine B returns 10 tokens at once every 10 seconds, also for a total of 50 tokens. In this case, the TPOT for A is 1 second, and the ITL is also 1 second. The TPOT for B is likewise 1 second, but the ITL is 10 seconds.

Here, ITL is defined as "the average latency between streaming chunks," not "the average latency between two tokens," to account for the effect of multi-token streaming. It is somewhat counterintuitive, which is why it causes confusion. We kept this naming because it is adapted from the original vLLM benchmark scripts; I agree a better name would be "Inter-Chunk Latency". Returning tokens in large chunks like this introduces choppiness during online serving, leading to a degraded user experience. cc @ywang96
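
To make the arithmetic above concrete, here is a minimal sketch (hypothetical code, not the actual vLLM/SGLang benchmark scripts) that computes TPOT as total decode time divided by output tokens and ITL as the average gap between streaming chunks, using the 50-token timelines of Engines A and B from the example:

```python
def tpot(total_decode_time, num_output_tokens):
    """Time Per Output Token: total decode time averaged over all output tokens."""
    return total_decode_time / num_output_tokens

def itl(chunk_arrival_times):
    """ITL as used here: average gap between consecutive streaming chunks,
    regardless of how many tokens each chunk carries."""
    gaps = [later - earlier for earlier, later
            in zip(chunk_arrival_times, chunk_arrival_times[1:])]
    return sum(gaps) / len(gaps)

# Engine A: one 1-token chunk per second, 50 tokens total (chunks at t = 1, 2, ..., 50 s).
a_chunks = [float(t) for t in range(1, 51)]
# Engine B: one 10-token chunk every 10 seconds, 50 tokens total (chunks at t = 10, 20, ..., 50 s).
b_chunks = [float(t) for t in range(10, 51, 10)]

print(tpot(50.0, 50), itl(a_chunks))  # 1.0 1.0  -> TPOT 1 s, ITL 1 s
print(tpot(50.0, 50), itl(b_chunks))  # 1.0 10.0 -> TPOT 1 s, ITL 10 s
```

The two engines are indistinguishable by TPOT, while ITL exposes the 10-second gaps between Engine B's chunks.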

@zhyncs (Member, Author) commented Sep 6, 2024

Hi @robertgshaw2-neuralmagic

Through honest and clear communication between the SGLang Team and the vLLM Team, we have reached a consensus. Thank you very much! We have summarized the following lessons learned.

Common Mistakes in Benchmarking LLM Inference Engines

  • Incomplete reporting of optimization trade-offs
  • Misleading plots from inappropriate scaling of the figure's y-axis
  • Biased hyperparameter tuning

We hope that in the future both the SGLang Team and the vLLM Team can continue to improve in terms of functionality, performance, usability, and scalability. Cheers!

ref
https://x.com/lmsys_oss/status/1832133545655202288
https://docs.google.com/document/d/1fEaaIQoRQLbevRu3pj1_ReOklHkoeE7ELqZJ3pnW-K4
vllm-project/vllm-project.github.io@2103980

@robertgshaw2-neuralmagic

@zhyncs thanks! I agree the choppiness is not good. We will resolve this in v0.6.1. Thanks for calling it out, as it helps us to improve.

@icavanyu commented Sep 8, 2024

> (quoting @zhyncs's explanation above of TPOT vs. ITL for Engines A and B)

Just curious: even if Engine B returns tokens at a coarser granularity, the difference would be eliminated if the frontend UI decides to simulate the same user experience as Engine A, i.e. the frontend renders one token per second even when it has 10 tokens in hand.

The remaining difference would be the time to the first decoding chunk: Engine A is better for the first chunk.

In conclusion, I do not see a significant difference between ITL (TBT) and TPOT if we apply this frontend simulation trick.

Are there any other reasons why we care so much about TBT stability?
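
For illustration, here is a minimal sketch of the client-side smoothing idea described above (hypothetical helper and timeline, not an actual frontend): the UI buffers Engine B's 10-token chunks and re-renders them at one token per second, but cannot show a token before the chunk containing it has arrived.

```python
def smoothed_display_times(chunk_arrivals, tokens_per_chunk, render_interval=1.0):
    """Earliest time each token can be shown when the client paces output at
    one token per render_interval but may not display a token before its
    chunk has actually arrived."""
    display_times = []
    next_slot = 0.0
    for arrival in chunk_arrivals:
        for _ in range(tokens_per_chunk):
            next_slot = max(arrival, next_slot + render_interval)
            display_times.append(next_slot)
    return display_times

# Engine B: 10-token chunks arriving at t = 10, 20, ..., 50 s (50 tokens total).
smoothed_b = smoothed_display_times([10.0, 20.0, 30.0, 40.0, 50.0], 10)
print(smoothed_b[:3], smoothed_b[-3:])  # [10.0, 11.0, 12.0] [57.0, 58.0, 59.0]
```

In this hypothetical timeline the cadence looks like Engine A's, but Engine A would have shown token i at roughly t = i seconds, so the smoothed stream starts later (10 s vs. 1 s for the first token) and finishes later (59 s vs. 50 s), consistent with the observation above that Engine A is better for the first chunk.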

@zhyncs (Member, Author) commented Sep 8, 2024

@icavanyu Nope, the simulation trick does not work for this situation.

@icavanyu commented Sep 8, 2024

> @icavanyu Nope, the simulation trick does not work for this situation.

Thanks for your response. Could you explain more about the difference from the end-user's perspective?
