
ci: bench: support sse and fix prompt processing time / server: add tokens usage in stream OAI response #6495

Merged

phymbert merged 6 commits into master from hp/server/bench/sse on Apr 6, 2024

Conversation

@phymbert (Collaborator) commented Apr 4, 2024

Motivation

In the context of:

The prompt processing (pp) per second was not accurate because streaming was not enabled, so the measurement also included token generation time.

SSE will not be supported in k6 core: the Grafana Labs team proposed to introduce a dedicated xk6 extension repository, so I introduced the xk6-sse extension (I would appreciate it if you could also have a look).
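For reference, building a k6 binary that bundles the extension follows the usual xk6 flow; a minimal sketch, assuming Go and the xk6 builder are installed and using the github.com/phymbert/xk6-sse module path:

```sh
# Install the xk6 builder, then produce a k6 binary that bundles the SSE extension.
go install go.k6.io/xk6/cmd/xk6@latest
xk6 build --with github.com/phymbert/xk6-sse

# The resulting ./k6 binary can run bench scripts that import the k6/x/sse module.
./k6 run script.js
```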

Changes

  • utils.hpp now includes the usage field in the last chunk; this is not OAI compatible, but I think it does no harm, and it is useful for the client to retrieve the number of prompt tokens according to the tokenizer (see the sketch after this list).
  • Add guidance to build the xk6 extension and run the bench
  • Fix PR bot comment
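For illustration, the final streamed chunk carrying the added usage field might look like the sketch below; the field names follow the OAI usage object, but the surrounding fields and all values here are made up:

```json
data: {"choices":[{"delta":{},"finish_reason":"stop","index":0}],"model":"llama-2-7b","object":"chat.completion.chunk","usage":{"completion_tokens":128,"prompt_tokens":512,"total_tokens":640}}
```

The bench client can read prompt_tokens back from this chunk to compute prompt processing speed using the server tokenizer's own count.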

Tests

server: add tokens usage in stream mode
@phymbert merged commit 75cd4c7 into master on Apr 6, 2024
66 of 70 checks passed
@phymbert deleted the hp/server/bench/sse branch on April 6, 2024 at 03:40
@ggerganov (Owner) commented

It's hard for me to review this because I'm not familiar with the technology.

Which part of the benchmark requires SSE, and is there a way to avoid it? The proposed approach is not ideal, since the benchmark will now depend on an external package / extension, while we want llama.cpp to be as self-contained as possible. We already depend on k6, which is also not very desirable, but for these types of benchmarks it is sort of OK. In general, the long-term goal should be to reduce such dependencies as much as possible.

@phymbert (Collaborator, Author) commented Apr 8, 2024

Yes, sorry, I merged without waiting for your feedback. I was asking you whether we could depend on xk6-sse, then I deleted my comment and merged; my apologies.

I need SSE to get the exact prompt processing time from the client side: I measure it as the Time to Emit First Token, which requires streaming to be enabled, and k6 does not support streaming.
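To illustrate, a minimal k6 script using the xk6-sse extension to measure the Time to Emit First Token could look like the sketch below; the endpoint, payload, and metric name are assumptions for illustration, not the actual bench script:

```js
import sse from 'k6/x/sse';
import { Trend } from 'k6/metrics';

// Custom trend metric: elapsed time from request start to the first streamed token.
const timeToEmitFirstToken = new Trend('time_to_emit_first_token', true);

export default function () {
    const start = Date.now();
    let gotFirstToken = false;

    // Hypothetical local server endpoint and payload; stream must be enabled so
    // that prompt processing can be separated from token generation client side.
    sse.open('http://localhost:8080/v1/chat/completions', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
            messages: [{ role: 'user', content: 'Hello' }],
            stream: true,
        }),
    }, function (client) {
        client.on('event', function (event) {
            if (event.data === '[DONE]') {
                client.close();
                return;
            }
            if (!gotFirstToken) {
                // The first chunk arrives once the prompt has been processed,
                // so this delta approximates pp time as seen by the client.
                timeToEmitFirstToken.add(Date.now() - start);
                gotFirstToken = true;
            }
        });
    });
}
```

Sampling the clock at the first streamed chunk is what separates prompt processing from token generation on the client side; without streaming, the two phases collapse into a single request duration.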

As it is only for server benchmarking purposes, I think it is OK? I am maintaining it on my own... Alternatively, we could move it to examples/server/bench/xk6-sse, but I have the feeling it does not belong in this repo.

I think it would be difficult to implement the benchmark without a proper tool, but if you prefer, we can try to replace k6 with plain old Python, or move to Gatling.

Up to you; I will take the time to go in the right direction if you require it.

@ggerganov (Owner) commented

As it is only for server benchmarking purposes, I think it is OK?

We can work with this for now, since we are not going to require users/devs to run the benchmarks on their own, but we should look for alternatives to simplify this.

I need SSE to get the exact prompt processing time from the client side

I see. I understand that it is more accurate to measure from the client side, but it might be simpler to check whether there is any significant difference between the client-side measured speed and the server-side reported speed. If there isn't a big difference (which is what we expect), then we can stick with the simpler solution of using the server-reported speed for now and avoid the SSE requirement.

@phymbert (Collaborator, Author) commented Apr 8, 2024

Understood. Actually, there is a difference:

{"i":391,"req":{"p95":31109.26,"avg":11862.55},"pp":{"p95":734.81,"avg":146.57,"0":827.86},"tg":{"p95":24.75,"avg":23.72,"0":17.74}}

"0" is the average from prometheus /metrics endpoint and other values comes from k6 metrics. Maybe because on the client side, pp ends after the first token, so it includes 1 tg.

So probably better to remove all this :)

BTW it looks like the T4 node is down:
https://github.com/ggerganov/llama.cpp/actions/workflows/bench.yml?query=is%3Aqueued

Can you please have a look?

@ggerganov (Owner) commented

Restarted - should be running now

tybalex pushed a commit to rubra-ai/tools.cpp that referenced this pull request on Apr 17, 2024

ci: bench: support sse and fix prompt processing time / server: add tokens usage in stream OAI response (ggerganov#6495)

* ci: bench: support sse and fix prompt processing time
server: add tokens usage in stream mode

* ci: bench: README.md EOL

* ci: bench: remove total pp and tg as it is not accurate

* ci: bench: fix case when there is no token generated

* ci: bench: change to the 95 percentile for pp and tg as it is closer to what the server exports in metrics

* ci: bench: fix finish reason rate