docs: add conclusion #1340
Conversation
@zhyncs - QQ: how is ITL measured, and how does it differ from TPOT? I thought these were the same concept: https://docs.nvidia.com/nim/benchmarking/llm/latest/metrics.html#inter-token-latency-itl
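For readers following the thread, here is a minimal sketch of the distinction under the conventional definitions (TPOT as an average over the decode phase, ITL as the per-token gaps). The function names and timestamps below are illustrative assumptions, not taken from this PR or the linked docs.

```python
# Hypothetical illustration: computing TPOT and ITL from the arrival
# timestamps of streamed output tokens for a single request.

def tpot(first_token_time: float, last_token_time: float, num_output_tokens: int) -> float:
    """Time per output token: decode time averaged over all output tokens
    after the first one. One number per request, so bursty token delivery
    is averaged away."""
    return (last_token_time - first_token_time) / (num_output_tokens - 1)

def itls(token_times: list[float]) -> list[float]:
    """Inter-token latencies: the gap between each pair of consecutive
    streamed tokens. A distribution per request, so bursts show up as
    near-zero gaps followed by large spikes."""
    return [t1 - t0 for t0, t1 in zip(token_times, token_times[1:])]

# Example: one engine emits 5 tokens smoothly, another emits them in a burst.
# Both have the same TPOT, but very different ITL tails.
smooth = [0.00, 0.02, 0.04, 0.06, 0.08]
bursty = [0.00, 0.001, 0.002, 0.003, 0.08]
assert abs(tpot(smooth[0], smooth[-1], 5) - tpot(bursty[0], bursty[-1], 5)) < 1e-9
print(max(itls(smooth)), max(itls(bursty)))  # ~0.02 vs ~0.077
```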
Hi @robertgshaw2-neuralmagic
Through honest and clear communication between the SGLang Team and the vLLM Team, we have reached a consensus. Thank you very much! We have summarized the lessons learned in Common Mistakes in Benchmarking LLM Inference Engines.
We hope that, going forward, both the SGLang Team and the vLLM Team can continue to improve in functionality, performance, usability, and scalability. Cheers! ref
@zhyncs thanks! I agree the choppiness is not good. We will resolve this in v0.6.1. Thanks for calling it out, as it helps us to improve.
Just curious: even if Engine B returns tokens at a coarser granularity, wouldn't such differences be eliminated if the frontend UI decides to simulate the same user experience as Engine A, i.e. the frontend renders one token per second even if it already has 10 tokens in hand? In conclusion, I did not see a significant difference between ITL (TBT) and TPOT if we apply this frontend simulation trick. Is there any other reason why we care so much about TBT stability?
@icavanyu No, the simulation trick does not work in this situation.
Thanks for your response. Could you explain more about the difference from the end user's perspective?
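To make the end-user difference concrete, here is a minimal, hypothetical simulation of the "render buffered tokens at a fixed cadence" trick discussed above; the function and the timestamps are assumptions for illustration, not measurements from either engine. The point is that smoothing only masks burstiness while the render buffer never runs dry; a generation stall longer than the buffered backlog still reaches the user as a visible pause.

```python
# Hypothetical sketch: a frontend that shows one token every
# `render_interval` seconds, but never before the token has actually
# arrived from the engine.

def render_gaps(arrival_times: list[float], render_interval: float) -> list[float]:
    """Return the gaps the user actually observes between rendered tokens."""
    shown = []
    next_slot = 0.0
    for t in arrival_times:
        show_at = max(t, next_slot)        # wait for the token AND the cadence
        shown.append(show_at)
        next_slot = show_at + render_interval
    return [b - a for a, b in zip(shown, shown[1:])]

# Tokens arrive in two bursts separated by a 0.5 s stall.
arrivals = [0.00, 0.01, 0.02, 0.03, 0.53, 0.54, 0.55]
print(render_gaps(arrivals, render_interval=0.05))
# -> [0.05, 0.05, 0.05, 0.38, 0.05, 0.05]
# The 0.5 s stall is only partially absorbed by the buffered tokens, so the
# user still sees a long pause despite the smoothing, even though TPOT over
# the whole request looks fine.
```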