
server: bench: continuous performance testing #6233

Closed
12 of 16 tasks
phymbert opened this issue Mar 22, 2024 · 19 comments
Assignees
Labels
enhancement (New feature or request), need feedback (Testing and feedback with results are needed), performance (Speed related topics), server/webui, stale

Comments

@phymbert
Collaborator

phymbert commented Mar 22, 2024

Motivation

llama.cpp is under active development: new LLM papers are implemented quickly (for the better) and backend/device
optimizations are continuously added.

All these factors have an impact on server performance, especially on the following metrics:

  1. latency: pp (prompt processing) + tg (token generation) per request
  2. server latency: total pp+tg per second across all requests with continuous batching
  3. concurrency: how many concurrent requests/users the server can handle in parallel
  4. VRAM usage
  5. RAM usage
  6. GPU usage
  7. CPU usage

It is important to monitor and control the impact of codebase evolution on these metrics. For example:

[prompt_tokens_seconds time series graph]

Since #5941 we have a server bench framework; we can now trigger it based on different events:

  1. scheduled on master branch
  2. on PR pushes

The approach should be reproducible: use the same hardware architecture, and the same model sizes and quants.

It would be nice to follow performance changes on a time series graph, as is done in Apache Lucene.

Proposed approach

The bench will run on a T4 GPU node in Azure Cloud:

  • Standard_NC4as_T4_v3
  • Ubuntu 20.04.1
  • 4 vCPU
  • 28GB RAM
  • 1 NVIDIA Tesla T4
  • 16GB VRAM
  • /dev/sdb, 256GB standard SSD, mounted at /
  • /dev/sda, 1TB premium SSD, mounted at /mnt

It will run as a GitHub self-hosted runner with Prometheus installed.

A GitHub workflow will (see the sketch after this list):

  1. build the server target using the cmake Release build type and LLAMA_CUDA with the native CUDA architecture
  2. for each set of bench parameters:
  3. start the server
  4. configure Prometheus scraping on the server instance
  5. wait for the server to start
  6. build the relevant dataset for the test
  7. start the performance test scenario using the right dataset
  8. export the results to JSON
  9. download the Prometheus metrics graph
  10. plot the results into time series images
  11. add a comment in the PR with the metrics results images
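
For illustration only, the per-scenario part of this workflow could be driven by a small Python script along these lines. This is a minimal sketch, not the final implementation: the server binary path, port, health endpoint and k6 invocation are assumptions.

```python
# Hypothetical sketch of the per-scenario bench orchestration (not the final implementation).
import json
import subprocess
import time
import urllib.request

def start_server(args):
    """Start the llama.cpp server (binary path assumed) and return the process handle."""
    return subprocess.Popen(["./server"] + args)

def wait_for_server(url="http://localhost:8080/health", timeout=300):
    """Poll the health endpoint until the server answers or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            time.sleep(2)
    raise TimeoutError("server did not start in time")

def run_scenario(scenario):
    """Start the server, run the k6 load test scenario, return the exported JSON summary."""
    server = start_server(scenario["server_args"])
    try:
        wait_for_server()
        subprocess.run(
            ["k6", "run", scenario["script"],
             "--duration", scenario["duration"],
             "--vus", str(scenario["users"]),
             "--summary-export", "results.json"],
            check=True,
        )
        with open("results.json") as f:
            return json.load(f)
    finally:
        server.terminate()
        server.wait()
```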

Technical consideration

One important aspect of this configuration is to make it easy to add more nodes in the future.
If we see that it works and is useful, we can find ways to add more hardware in order to collect metrics for different cases.
All the code used must be stored in the examples/server/bench folder.

GitHub Self-Hosted runner security

Self-hosted runner security:

Warning: We recommend that you only use self-hosted runners with private repositories. This is because forks of your public repository can potentially run dangerous code on your self-hosted runner machine by creating a pull request that executes the code in a workflow.

By design, we will be using just-in-time runners:

  1. with ggml-ci in a docker container, loop looking for new workflow jobs waiting for the host GPU series type label
  2. create a configuration for a just-in-time runner with this label
  3. start a rootless docker container with the nvidia docker runtime and the JIT configuration token
  4. start the GitHub runner within the container
  5. wait for the container to exit
  6. restart the loop

As GitHub checks can only be run by collaborators and the job runs in a non-root docker container, I think we are safe.
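
A rough sketch of that loop, assuming the GitHub REST endpoint for generating a just-in-time runner configuration and a pre-built runner image; the token, repository, image name and label are placeholders, not the actual setup.

```python
# Rough sketch of the JIT runner loop; token, repo, image and label are placeholders.
import subprocess
import requests

GITHUB_TOKEN = "ghp_..."          # PAT with actions scope (placeholder)
REPO = "ggerganov/llama.cpp"
LABEL = "Standard_NC4as_T4_v3"    # host GPU series type label

def create_jit_config() -> str:
    """Ask GitHub for a single-use (just-in-time) runner configuration."""
    r = requests.post(
        f"https://api.github.com/repos/{REPO}/actions/runners/generate-jitconfig",
        headers={"Authorization": f"Bearer {GITHUB_TOKEN}",
                 "Accept": "application/vnd.github+json"},
        json={"name": "t4-bench-jit", "runner_group_id": 1, "labels": [LABEL]},
    )
    r.raise_for_status()
    return r.json()["encoded_jit_config"]

while True:
    jit_config = create_jit_config()
    # Rootless docker container using the NVIDIA container runtime;
    # the runner exits after a single job, then the loop restarts.
    subprocess.run(
        ["docker", "run", "--rm", "--gpus", "all",
         "ghcr.io/example/github-runner:latest",   # hypothetical runner image
         "./run.sh", "--jitconfig", jit_config],
        check=True,
    )
```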

Server scenario parameters matrix

| scenario | duration | users | hf-repo | hf-file | model-alias | model-size | model-type | ngl | parallel | ctx-size | batch-size | ubatch-size | n-predict | grp-attn-n | grp-attn-w | embeddings | CUDA_VISIBLE_DEVICES | SERVER_BENCH_N_PROMPTS | SERVER_BENCH_MAX_PROMPT_TOKENS | SERVER_BENCH_MAX_CONTEXT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| completions | 10m | 8 | TODO | TODO | phi2 | 3B | F16 | 33 | 8 | 16384 | 2048 | 256 | 2048 | 1 | 512 | false | 0 | 1000 | 1024 | 1024 |
| completions | 10m | 8 | ggml-org/models | phi-2/ggml-model-q4_0.gguf | phi2 | 3B | MOSTLY_Q4_K_M | 33 | 8 | 16384 | 2048 | 256 | 2048 | 1 | 512 | false | 0 | 1000 | 1024 | 1024 |
| embeddings | 5m | 8 | ggml-org/models | bert-bge-large/ggml-model-f16.gguf | bert-bge-large | ? | F16 | TODO | 8 | 16384 | 4096 | 4096 | NA | NA | NA | true | 0 | 1000 | 4096 | NA |

In addition, the following parameters will be used (an illustrative invocation is sketched after this list):

  • --log-disable: no need to have a log file
  • --metrics: to allow Prometheus metrics scraping
  • --cont-batching: probably needs to be enabled by default (server: enable --cont-batching by default #6229)
  • --threads 1: we will test only with all layers offloaded to the GPU
  • --threads-batch 1: we will test only with all layers offloaded to the GPU
  • --model ggml-model.gguf: as we can now download anything from HF
  • --defrag-thold 0.1
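
For illustration, these flags combined with the phi2 Q4_K_M row of the scenario matrix could translate into a server invocation along these lines (a sketch; the binary path and port are assumptions):

```python
# Illustrative argument list for the phi2 Q4_K_M completions scenario (values taken from the matrix above).
server_args = [
    "./server",                      # assumed server binary path
    "--model", "ggml-model.gguf",
    "--n-gpu-layers", "33",
    "--parallel", "8",
    "--ctx-size", "16384",
    "--batch-size", "2048",
    "--ubatch-size", "256",
    "--n-predict", "2048",
    "--defrag-thold", "0.1",
    "--threads", "1",
    "--threads-batch", "1",
    "--cont-batching",
    "--metrics",
    "--log-disable",
    "--port", "8080",                # assumed port
]
```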

Only the OAI Chat completions endpoint with streaming enabled will be tested for completions.

Dataset consideration

  1. the dataset must contain system, assistant and user prompts (in order to test chat template overhead, if any)
  2. randomness must not be used to select prompts; running the test twice must output almost the same metrics
  3. it must be possible to select prompts so that they fit in the KV cache (or not), using the parameters listed
    in bench/README.md (see the sketch after this list):
    • SERVER_BENCH_N_PROMPTS: total prompts to select in the benchmark
    • SERVER_BENCH_MAX_PROMPT_TOKENS: maximum prompt tokens to filter out in the dataset
    • SERVER_BENCH_MAX_CONTEXT: maximum context size of the completions request to filter out in the dataset (prompt +
      predicted tokens)
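
A sketch of how these constraints could be applied with deterministic (non-random) selection; count_tokens and the ShareGPT-style field names are hypothetical stand-ins, not the actual bench code.

```python
# Sketch of deterministic prompt selection under the SERVER_BENCH_* constraints.
import json

N_PROMPTS = 1000            # SERVER_BENCH_N_PROMPTS
MAX_PROMPT_TOKENS = 1024    # SERVER_BENCH_MAX_PROMPT_TOKENS
MAX_CONTEXT = 1024          # SERVER_BENCH_MAX_CONTEXT (prompt + predicted tokens)
N_PREDICT = 512             # illustrative per-request prediction budget

def count_tokens(text: str) -> int:
    # Placeholder: a real implementation would use the model tokenizer.
    return len(text.split())

def select_prompts(dataset_path: str) -> list[str]:
    with open(dataset_path) as f:
        conversations = json.load(f)
    selected = []
    for conv in conversations:           # keep dataset order: no randomness
        prompt = conv["conversations"][0]["value"]
        n_prompt = count_tokens(prompt)
        if n_prompt > MAX_PROMPT_TOKENS:
            continue
        if n_prompt + N_PREDICT > MAX_CONTEXT:
            continue
        selected.append(prompt)
        if len(selected) == N_PROMPTS:
            break
    return selected
```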

Selected dataset:

| scenario | dataset | comment |
| --- | --- | --- |
| completions | ShareGPT_Vicuna_unfiltered | taken from vLLM to have a baseline |
| embeddings | IMDB Data | suggested by @ngxson, looks good for embeddings |

Tasks

@phymbert phymbert added the enhancement, performance and server/webui labels Mar 22, 2024
@phymbert phymbert self-assigned this Mar 22, 2024
@phymbert
Collaborator Author

@ggerganov @ngxson @slaren I'd appreciate your early feedback on the approach before I start implementing too much.

@Azeirah
Contributor

Azeirah commented Mar 22, 2024

This is honestly so cool, I think it'd be a very worthwhile investment to track performance changes for a small set of select hardware over time. I think we'll be seeing that some small changes affect performance in unexpected ways (both positive and negative).

Only one thing I am wondering right now: do these servers run on some kind of shared hardware? It's incredibly important that everything on the system is in the exact same clean slate whenever a test is run.

For example, if it's on shared hardware it's possible certain caches are suboptimal, whereas in the opposite case, if the same hardware is run 5x in a row, will the second run be a lot faster due to all sorts of arcane kernel, filesystem, SSD and driver caches, etc.?

I believe I saw a presentation by a C++ benchmarking expert that they'd developed a script that can reset all this arcane and hidden shared state/caches affecting benchmarking in one go. I'll go look and see if I can find it.

@slaren
Collaborator

slaren commented Mar 22, 2024

Looks good, it would be nice to have other parameters in the matrix such as different values of -ngl, but that's not important right now.

@ggerganov
Owner

Only one thing I am wondering right now, do these servers run on some kind of shared hardware?

All tests will be running on dedicated Azure nodes (thanks @aigrant) that will do just this benchmark. We are starting with a single T4 node and if this works out, we will add more

@ngxson
Collaborator

ngxson commented Mar 22, 2024

Cool idea, it will be very useful to keep track of llama.cpp's performance compared to "pure" GPU alternatives like TensorRT or exllama.

A GitHub workflow will:

One thing I think we need to consider though: the proposal here seems to be based on the idea of having a "manager" machine and a "runner" machine, which will not be the case when using a self-hosted runner. You can imagine that GitHub simply sends out SSH commands to the self-hosted runner, so there will be only one machine involved.

Because of that, Prometheus may not really fit the usage (because everything runs on the same machine). Also, I think we can first start with something more basic than Prometheus, for example just a simple Python script that collects metrics every X seconds. My idea here is to avoid having an external dependency from the beginning; we should only add it when we feel it is absolutely needed.

Personally, at my company I often have to work with self-hosted GitLab and self-hosted runners, so I think I can help with setup scripts if needed.

Figuring out how to properly set up the self-hosted runner is also a big task I think; let's focus more on that for now.
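
For reference, the "simple Python script that collects metrics every X seconds" could look something like this minimal sketch; it assumes the server runs with --metrics and exposes the Prometheus-format /metrics endpoint mentioned above, and the URL and interval are placeholders.

```python
# Minimal metrics poller: scrape the server's /metrics endpoint every X seconds into a CSV.
import csv
import time
import urllib.request

METRICS_URL = "http://localhost:8080/metrics"   # exposed when the server runs with --metrics
INTERVAL_S = 5

def scrape() -> dict[str, float]:
    """Parse the Prometheus text format into a flat {metric_name: value} dict."""
    with urllib.request.urlopen(METRICS_URL) as resp:
        text = resp.read().decode()
    values = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        name, _, value = line.partition(" ")
        try:
            values[name] = float(value)
        except ValueError:
            pass
    return values

with open("metrics.csv", "w", newline="") as f:
    writer = None
    while True:
        sample = scrape()
        sample["timestamp"] = time.time()
        if writer is None:
            writer = csv.DictWriter(f, fieldnames=sorted(sample), extrasaction="ignore")
            writer.writeheader()
        writer.writerow(sample)
        f.flush()
        time.sleep(INTERVAL_S)
```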

@Azeirah
Contributor

Azeirah commented Mar 22, 2024

Only one thing I am wondering right now, do these servers run on some kind of shared hardware?

All tests will be running on dedicated Azure nodes (thanks @aigrant) that will do just this benchmark. We are starting with a single T4 node and if this works out, we will add more

Yes I understand, but is it a bare metal server that is completely isolated? Or is it sharing resources on one huge server?

Either way, it doesn't matter much what exactly it's running on. My point is that any hidden arcane state needs to be reset before running any benchmark script.

@ngxson
Collaborator

ngxson commented Mar 22, 2024

Yes I understand, but is it a bare metal server that is completely isolated?

Servers with T4 GPU are usually "shared CPU but dedicated GPU". I believe that's also the case with other GPU like A100 or A10G, but not sure if it's also the same with H100 or not.

@ngxson
Collaborator

ngxson commented Mar 22, 2024

My point is that any hidden arcane state needs to be reset before running any benchmark script.

At my company we have GitLab runners plugged into docker on each machine, so in the end each CI run is isolated (multiple CI runs can even run in parallel). Even when the CI fails for some reason, the resources are automatically cleaned up by docker. I believe that the GitHub runner "agent" has the same function.

Edit: but yeah, sometimes it's better to simply reset the machine (maybe via a snapshot), especially when benchmarking. We can look into this in the future.

@phymbert
Collaborator Author

Servers with T4 GPU are usually "shared CPU but dedicated GPU". I believe that's also the case with other GPU like A100 or A10G, but not sure if it's also the same with H100 or not.

Yes, AFAIK NVIDIA GPU virtualization does not exist on Azure (yet?); it is only possible to fraction GPUs, but this is not our case. There is a solution from the vendor, and I have also had good feedback with run.ai's fractional GPU sharing for Kubernetes.

@Azeirah In this proposal, all layers will be offloaded to the GPU and only one test will run at a time per runner, so I believe we will not suffer too much from hypervisor throttling.

@phymbert
Collaborator Author

phymbert commented Mar 22, 2024

@ggerganov We need to keep this in mind:

Warning: We recommend that you only use self-hosted runners with private repositories. This is because forks of your public repository can potentially run dangerous code on your self-hosted runner machine by creating a pull request that executes the code in a workflow.

see Self-hosted runner security

So by design we will be using just-in-time runners and ideally the workflow should be started only by Collaborators.

Rest assured, I will test all this on my private fork first.

EDIT: solution proposed in the summary

@phymbert
Collaborator Author

phymbert commented Mar 24, 2024

@ggerganov what about the defragmentation threshold for the baseline? Without it, I see a lot of: update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 1024

With --defrag-thold 0.8, it does not look better.

@ggerganov
Owner

ggerganov commented Mar 25, 2024

The thold should be 5-10% (e.g. --defrag-thold 0.1)

If you are getting that error, it means your --context is too small.
It should be equal to (num slots) * (max prompt + max predict) in order to fit the worst-case scenario.
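
As a quick worked example of that sizing rule (the numbers are illustrative: 8 slots with up to 1024 prompt and 1024 predicted tokens per request):

```python
# Worked example of ctx-size = (num slots) * (max prompt + max predict); numbers are illustrative.
n_slots     = 8      # --parallel
max_prompt  = 1024   # longest prompt allowed in the dataset
max_predict = 1024   # longest completion per request

ctx_size = n_slots * (max_prompt + max_predict)
print(ctx_size)  # 16384 -> pass as --ctx-size so the worst case fits in the KV cache
```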

@phymbert
Collaborator Author

First workflow ready to receive feedback:

Based on this, we can modify duration, all parameters, comment template, frequency, etc...

If you agree with the approach, I can later on continue to add models or embeddings.

@phymbert
Collaborator Author

phymbert commented Apr 1, 2024

Hello everyone,

The workflow has been deployed for one week now, and some concerns have been identified:

|  | #5021 | #6367 | #6387 | #6403 | #6408 | #6412 | #6413 | #6414 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| iterations | 264 | 481 | 498 | 534 | 504 | 518 | 504 | 516 |
| req duration | 18372.67 | 9814.5 | 9403.71 | 8767.58 | 9274.74 | 9059.66 | 9304.35 | 9070.17 |
| total pp | 105.34 | 190.52 | 198.39 | 205.85 | 200.65 | 201.57 | 199.63 | 201.14 |
| total tg | 219.35 | 128.99 | 128.22 | 130.33 | 129.75 | 128.18 | 129.67 | 128.61 |
| /metrics pp | 302.09 | 713.28 | 704.14 | 727.66 | 721.63 | 661.12 | 645.38 | 638.66 |
| /metrics tg | 0.24 | 17.81 | 17.65 | 18.05 | 17.92 | 17.74 | 17.68 | 18.11 |

@Azeirah @ngxson Any idea what could cause the discrepancies? Maybe the virtualization has an impact on
performance after all, at least on the k6 client side.

@ggerganov In which direction do you want to go further? Add an A100 test :) ? Add embeddings? Other models, MoE-like?

Thanks for your feedback

@phymbert phymbert added the need feedback label Apr 1, 2024
@ggerganov
Owner

Regarding the PR comment with benchmark information: I find it a little bit distracting since it pops up in all PRs, even those unrelated to speed. I think it would be better to implement the long-term plot that you suggested, so that at some point we would be able to see the performance metrics as a function of time.

Variations: are we using the same random seed for all runs? AFAICT from bench.py this is not the case and it might improve the reproducibility of the metrics.

We should add F16 and Q8_0 benchmarks for Phi-2
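
Regarding the fixed random seed suggestion, a small sketch of what deterministic prompt sampling could look like (hypothetical; bench.py itself is not shown here):

```python
# Deterministic prompt sampling with a fixed seed, so two runs pick the same prompts.
import random

def sample_prompts(prompts: list[str], n: int, seed: int = 42) -> list[str]:
    rng = random.Random(seed)          # fixed seed -> reproducible selection
    return rng.sample(prompts, k=min(n, len(prompts)))
```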

@ngxson
Collaborator

ngxson commented Apr 1, 2024

Seems interesting. I'm currently limited to working from a mobile phone, so I can't have a look right now. I'll try when I can.

@phymbert
Collaborator Author

@ggerganov the node seems to be down. Maybe we should configure the runner as a service?
Also, note that I did not forget to revert xk6-sse; I will do it in a couple of days.

@ggerganov
Owner

Hm, not sure why it was down - restarted it again. A service could be useful

@github-actions github-actions bot added stale and removed stale labels May 15, 2024
@github-actions github-actions bot added the stale label Jun 18, 2024
Contributor

github-actions bot commented Jul 3, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Jul 3, 2024