
docs: add conclusion #1340

Merged: 1 commit merged into main from tmp2 on Sep 5, 2024

Conversation

@zhyncs (Member) commented Sep 5, 2024

Motivation

Modifications

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@zhyncs zhyncs merged commit 62f15ee into main Sep 5, 2024
1 check passed
@zhyncs zhyncs deleted the tmp2 branch September 5, 2024 18:25
@robertgshaw2-neuralmagic

@zhyncs - QQ: how is ITL measured, and how does it compare to TPOT?

I thought these were the same concept: https://docs.nvidia.com/nim/benchmarking/llm/latest/metrics.html#inter-token-latency-itl

@zhyncs (Member, Author) commented Sep 5, 2024

Hi @robertgshaw2-neuralmagic
For example, Engine A returns 1 token per second, for a total of 50 tokens. Engine B returns 10 tokens at once every 10 seconds, also for a total of 50 tokens. In this case, the TPOT for A is 1 second, and the ITL is also 1 second. The TPOT for B is likewise 1 second, but the ITL is 10 seconds.

Here, ITL is defined as "the average latency between streaming chunks," not "the average latency between two tokens," to account for the effect of multi-token streaming. It is somewhat counterintuitive, which is why it causes confusion. We kept this naming because it is adapted from the original vLLM benchmark scripts; I agree a better name would be "Inter-Chunk Latency". Returning tokens in large chunks like this introduces choppiness during online serving, leading to a degraded user experience. cc @ywang96
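
To make the arithmetic above concrete, here is a minimal sketch (hypothetical code, not the actual vLLM/SGLang benchmark scripts) that computes TPOT as total decode time divided by output tokens and ITL as the average gap between streaming chunks, using the 50-token timelines of Engines A and B from the example:

```python
def tpot(total_decode_time, num_output_tokens):
    """Time Per Output Token: total decode time averaged over all output tokens."""
    return total_decode_time / num_output_tokens

def itl(chunk_arrival_times):
    """ITL as used here: average gap between consecutive streaming chunks,
    regardless of how many tokens each chunk carries."""
    gaps = [later - earlier for earlier, later
            in zip(chunk_arrival_times, chunk_arrival_times[1:])]
    return sum(gaps) / len(gaps)

# Engine A: one 1-token chunk per second, 50 tokens total (chunks at t = 1, 2, ..., 50 s).
a_chunks = [float(t) for t in range(1, 51)]
# Engine B: one 10-token chunk every 10 seconds, 50 tokens total (chunks at t = 10, 20, ..., 50 s).
b_chunks = [float(t) for t in range(10, 51, 10)]

print(tpot(50.0, 50), itl(a_chunks))  # 1.0 1.0  -> TPOT 1 s, ITL 1 s
print(tpot(50.0, 50), itl(b_chunks))  # 1.0 10.0 -> TPOT 1 s, ITL 10 s
```

The two engines are indistinguishable by TPOT, while ITL exposes the 10-second gaps between Engine B's chunks.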

@zhyncs (Member, Author) commented Sep 6, 2024

Hi @robertgshaw2-neuralmagic

Through honest and clear communication between the SGLang Team and the vLLM Team, we have reached a consensus. Thank you very much! We have summarized the following lessons learned.

Common Mistakes in Benchmarking LLM Inference Engines

  • Incomplete reporting of optimization trade-offs
  • Misleading plots from inappropriate scaling of the figure's y-axis
  • Biased hyperparameter tuning

We hope that in the future both the SGLang Team and the vLLM Team can continue to improve in terms of functionality, performance, usability, and scalability. Cheers!

ref
https://x.com/lmsys_oss/status/1832133545655202288
https://docs.google.com/document/d/1fEaaIQoRQLbevRu3pj1_ReOklHkoeE7ELqZJ3pnW-K4
vllm-project/vllm-project.github.io@2103980

@robertgshaw2-neuralmagic

@zhyncs thanks! I agree the choppiness is not good. We will resolve this in v0.6.1. Thanks for calling it out, as it helps us to improve.

@icavanyu commented Sep 8, 2024

> (quoting @zhyncs's explanation above of TPOT vs. ITL for Engines A and B)

Just curious: even if Engine B returns tokens at a coarser granularity, the difference would be eliminated if the frontend UI decides to simulate the same user experience as Engine A, i.e. the frontend renders one token per second even when it has 10 tokens in hand.

The remaining difference would be the time to the first decoding chunk: Engine A is better for the first chunk.

In conclusion, I do not see a significant difference between ITL (TBT) and TPOT if we apply this frontend simulation trick.

Are there any other reasons why we care so much about TBT stability?
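
For illustration, here is a minimal sketch of the client-side smoothing idea described above (hypothetical helper and timeline, not an actual frontend): the UI buffers Engine B's 10-token chunks and re-renders them at one token per second, but cannot show a token before the chunk containing it has arrived.

```python
def smoothed_display_times(chunk_arrivals, tokens_per_chunk, render_interval=1.0):
    """Earliest time each token can be shown when the client paces output at
    one token per render_interval but may not display a token before its
    chunk has actually arrived."""
    display_times = []
    next_slot = 0.0
    for arrival in chunk_arrivals:
        for _ in range(tokens_per_chunk):
            next_slot = max(arrival, next_slot + render_interval)
            display_times.append(next_slot)
    return display_times

# Engine B: 10-token chunks arriving at t = 10, 20, ..., 50 s (50 tokens total).
smoothed_b = smoothed_display_times([10.0, 20.0, 30.0, 40.0, 50.0], 10)
print(smoothed_b[:3], smoothed_b[-3:])  # [10.0, 11.0, 12.0] [57.0, 58.0, 59.0]
```

In this hypothetical timeline the cadence looks like Engine A's, but Engine A would have shown token i at roughly t = i seconds, so the smoothed stream starts later (10 s vs. 1 s for the first token) and finishes later (59 s vs. 50 s), consistent with the observation above that Engine A is better for the first chunk.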

@zhyncs (Member, Author) commented Sep 8, 2024

@icavanyu Nope, the simulation trick does not work for this situation.

@icavanyu commented Sep 8, 2024

> @icavanyu Nope, the simulation trick does not work for this situation.

Thanks for your response. Could you explain more about the difference from the end-user's perspective?
