ci: bench: support sse and fix prompt processing time / server: add tokens usage in stream OAI response #6495
Conversation
It's hard for me to review this because I'm not familiar with the technology. Which part of the benchmark requires SSE, and is there a way to avoid it? The proposed approach is not ideal since the benchmark will now depend on an external package / extension, while we want […]
Yes, sorry, I merged without waiting for your feedback. I was asking whether we can depend on […].

I need SSE to get the exact prompt processing time from the client side: I measure it as the time to emit the first token, but that requires streaming to be enabled, and k6 does not support it. As it is only for server benchmarking purposes, I think it is OK? I am supporting it on my own…, or we can move it to […].

I think it would be difficult to implement the benchmark without a proper tool, but if you prefer, we can try to replace k6 with plain old Python or move to Gatling. Up to you; I will take the time if you want to go in the right direction.
We can work with this for now, since we are not going to require users/devs to run the benchmarks on their own, but we should look for alternatives to simplify it.
I see. I can understand that it is more accurate to measure from the client side, but it might be simpler to check whether there is any significant difference between the client-side measured speed and the server-side reported speed. If there isn't a big difference (which is what we expect), then we can stick with the simpler solution of using the server-reported speed for now and avoid the SSE requirement.
Understood. Actually, there is a difference:

`{"i":391,"req":{"p95":31109.26,"avg":11862.55},"pp":{"p95":734.81,"avg":146.57,"0":827.86},"tg":{"p95":24.75,"avg":23.72,"0":17.74}}`

Here `"0"` is the average from Prometheus, so it is probably better to remove all this :)

BTW, it looks like the T4 node is down; can you please have a look?
Restarted - should be running now.
ci: bench: support sse and fix prompt processing time / server: add tokens usage in stream OAI response (ggerganov#6495)

* ci: bench: support sse and fix prompt processing time; server: add tokens usage in stream mode
* ci: bench: README.md EOL
* ci: bench: remove total pp and tg as it is not accurate
* ci: bench: fix case when there is no token generated
* ci: bench: change to the 95 percentile for pp and tg as it is closer to what the server exports in metrics
* ci: bench: fix finish reason rate
Motivation
In the context of:
The prompt processing (pp) per-second metric was not accurate because streaming was not enabled, so the measured time also included token generation. For example, if a 500-token prompt is processed in 0.5 s and 100 tokens are then generated over 4 s, a non-streaming measurement attributes the full 4.5 s to prompt processing, reporting roughly 111 t/s instead of 1000 t/s.
SSE will not be supported in k6 core: the Grafana Labs team proposed introducing a dedicated xk6 extension repository instead, so I wrote the xk6-sse extension (I would appreciate it if you could also take a look).
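To make the measurement concrete, here is a minimal sketch of a k6 script that records the time to emit the first token over SSE, assuming the xk6-sse extension is built into the k6 binary. The endpoint URL, request payload, and metric name are illustrative, not the exact script from this PR:

```javascript
import sse from 'k6/x/sse'
import { Trend } from 'k6/metrics'

// Client-side prompt processing time, approximated by the time to
// emit the first token (metric name is illustrative).
const timeToFirstToken = new Trend('time_to_first_token', true)

export default function () {
    const startTime = Date.now()
    let firstTokenSeen = false

    const params = {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        // Streaming must be enabled so prompt processing can be
        // separated from token generation.
        body: JSON.stringify({
            messages: [{ role: 'user', content: 'Hello' }],
            stream: true,
            max_tokens: 64,
        }),
    }

    // sse.open blocks until the stream is closed; the callback receives
    // each server-sent event (endpoint URL assumed here).
    sse.open('http://localhost:8080/v1/chat/completions', params, function (client) {
        client.on('event', function (event) {
            if (!firstTokenSeen) {
                firstTokenSeen = true
                // Time to emit the first token ~ prompt processing time.
                timeToFirstToken.add(Date.now() - startTime)
            }
            // OAI-style streams terminate with a [DONE] sentinel.
            if (event.data === '[DONE]') {
                client.close()
            }
        })
        client.on('error', (e) => console.error('SSE error:', e.error()))
    })
}
```

k6 then reports avg and p(95) for this trend in its end-of-test summary, which is what the percentile-based pp/tg metrics in the commits above refer to.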
Changes
`utils.hpp` now includes the `usage` field in the last chunk. It is not OAI compatible, but I think it does not hurt, and it is useful for the client to retrieve the number of prompt tokens according to the tokenizer; see the sketch below.
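For illustration, a final streamed chunk carrying the added `usage` field might look like the parsed object below; the exact field layout emitted by `utils.hpp` is an assumption here, modeled on the OAI response shape:

```javascript
// Hypothetical last SSE chunk with the new `usage` field
// (field layout assumed, not copied from utils.hpp).
const lastChunk = JSON.parse(
    '{"choices":[{"index":0,"delta":{},"finish_reason":"stop"}],' +
    '"usage":{"prompt_tokens":512,"completion_tokens":128,"total_tokens":640}}'
)

// The client can now read the tokenizer-accurate prompt size and derive
// pp per second as prompt_tokens / time_to_first_token.
const promptTokens = lastChunk.usage.prompt_tokens
console.log(`prompt tokens: ${promptTokens}`)
```

Tests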