batched : add bench tool #3545
Conversation
Played a bit with the MMQ kernel parameters on V100 (ref #3479):

./bin/batched-bench /mnt/llama.cpp/models/open-llama/7B-v2/ggml-model-q4_0.gguf 4800 0 100 1 50 100 1,2,3,4,5,6,7,8,16,32,64

The default settings give the following performance:
Applying the following patch results in more than 3x faster TG speed, but slower PP speed for Q4_0 at low batch sizes:

diff --git a/ggml-cuda.cu b/ggml-cuda.cu
index 654d363..32eee8b 100644
--- a/ggml-cuda.cu
+++ b/ggml-cuda.cu
@@ -3552,9 +3552,9 @@ static __device__ __forceinline__ void mul_mat_q(
#define MMQ_X_Q4_0_RDNA1 64
#define MMQ_Y_Q4_0_RDNA1 64
#define NWARPS_Q4_0_RDNA1 8
-#define MMQ_X_Q4_0_AMPERE 64
-#define MMQ_Y_Q4_0_AMPERE 128
-#define NWARPS_Q4_0_AMPERE 4
+#define MMQ_X_Q4_0_AMPERE 8
+#define MMQ_Y_Q4_0_AMPERE 32
+#define NWARPS_Q4_0_AMPERE 8
#define MMQ_X_Q4_0_PASCAL 64
#define MMQ_Y_Q4_0_PASCAL 64
#define NWARPS_Q4_0_PASCAL 8
It looks like just by tuning the MMQ constants, we can achieve significant boosts for batched decoding with quantized models using the existing kernels. Ideally, this should work optimally without modifications, but it is not obvious how to achieve this in a generic way. Similar observations apply to Mac (#3524).

cc @slaren @JohannesGaessler to bring this to your attention
For the mmq defines, the "x" dimension is the dimension that varies with batch size. So the optimal way to tune the kernels would be to determine the optimal values for the y tile size and the number of warps for each x tile size. In practice, it would be sufficient to determine the optimal values only for x tile sizes that are powers of 2. Then at runtime you could just select the kernel with the optimal parameters for a given batch size.

The downside of this (apart from the effort needed to determine the optimal values, possibly multiple times) is that the compile time and binary size will increase with each additional batch size considered. So it may make sense to add a compile option that only compiles a single tile size to speed up compilation for development.
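A rough sketch of what this runtime selection could look like, purely for illustration. All names (mmq_config, mul_mat_q_tile, launch_mul_mat_q) and all tile values are hypothetical placeholders, not the actual ggml-cuda.cu API; the real values would have to come from benchmarking (e.g. with batched-bench):

```cuda
// Hypothetical sketch of runtime selection between precompiled tile sizes.
#include <cuda_runtime.h>

struct mmq_config {
    int mmq_x;   // tile size along the batch ("x") dimension
    int mmq_y;   // tile size along the weight-row ("y") dimension
    int nwarps;  // warps per block
};

// one entry per power-of-2 batch-size bucket: <=8, <=16, <=32, <=64, >64
// (made-up values - they would need to be determined by benchmarking)
static const mmq_config MMQ_CONFIGS[] = {
    {  8,  32, 8 },  // small batches (similar to the patch above)
    { 16,  64, 8 },
    { 32,  64, 4 },
    { 64, 128, 4 },
    { 64, 128, 4 },  // large batches (the current Ampere defaults)
};

template <int mmq_x, int mmq_y, int nwarps>
__global__ void mul_mat_q_tile(const void * vx, const void * vy, float * dst,
                               int ncols_x, int nrows_x, int ncols_y) {
    // placeholder body - in llama.cpp this would be the existing mul_mat_q kernel
}

static void launch_mul_mat_q(const void * vx, const void * vy, float * dst,
                             int ncols_x, int nrows_x, int ncols_y, cudaStream_t stream) {
    // pick a bucket from the batch size (ncols_y)
    int bucket = 0;
    for (int b = 8; b < ncols_y && bucket < 4; b *= 2) {
        bucket++;
    }
    const mmq_config cfg = MMQ_CONFIGS[bucket];

    const dim3 block_nums((nrows_x + cfg.mmq_y - 1)/cfg.mmq_y, (ncols_y + cfg.mmq_x - 1)/cfg.mmq_x, 1);
    const dim3 block_dims(32, cfg.nwarps, 1);

    // each bucket maps to one template instantiation compiled ahead of time
    // (the template arguments must stay in sync with MMQ_CONFIGS)
    switch (bucket) {
        case 0:  mul_mat_q_tile< 8,  32, 8><<<block_nums, block_dims, 0, stream>>>(vx, vy, dst, ncols_x, nrows_x, ncols_y); break;
        case 1:  mul_mat_q_tile<16,  64, 8><<<block_nums, block_dims, 0, stream>>>(vx, vy, dst, ncols_x, nrows_x, ncols_y); break;
        case 2:  mul_mat_q_tile<32,  64, 4><<<block_nums, block_dims, 0, stream>>>(vx, vy, dst, ncols_x, nrows_x, ncols_y); break;
        case 3:  mul_mat_q_tile<64, 128, 4><<<block_nums, block_dims, 0, stream>>>(vx, vy, dst, ncols_x, nrows_x, ncols_y); break;
        default: mul_mat_q_tile<64, 128, 4><<<block_nums, block_dims, 0, stream>>>(vx, vy, dst, ncols_x, nrows_x, ncols_y); break;
    }
}
```

As noted above, each case in the switch is a separate template instantiation, so compile time and binary size grow with the number of buckets.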
Yup, it's quite tricky. The optimal values likely depend not only on the batch size but also on the other dimension, so for different model sizes we would need different sets of optimal parameters. At least this is what my investigation of the Metal kernels showed. The way I think about this is that we have 2 sets of kernels:

- matrix-vector (mv) kernels, optimal at small batch sizes
- matrix-matrix (mm) kernels, optimal at larger batch sizes

Increasing the batch size, we transition from the first kernel being optimal to the second kernel being optimal. On top of this, the second kernel needs adjustments depending on the compute size. In the Metal implementation, we have a more general matrix-vector kernel that can be applied for batches > 1. However, I wish we had a way to avoid any kind of manual adjustment and always get the best performance for any batch size and model size. But it is not clear to me atm how to do this
I don't know what you did for the Metal kernels, but for CUDA the optimal values for the y tile size and the number of warps should not depend on model size. As long as the weight matrix has a number of rows divisible by the y tile size, the only difference should be the size of the CUDA grid, i.e. how many tiles need to be worked on. At most you should get a small effect if the grid size is too small and leaves some of the streaming multiprocessors idle, but I very much do not expect this to make much of a difference, if any.

The CUDA tile sizes need to be known at compile time for good performance, so it is fundamentally impossible to readjust the kernels at runtime. At most, I think we could write a script that repeatedly compiles llama.cpp with different tile sizes and benchmarks the performance.
What I meant is that the break-even point where the 2 kernels become equally performant depends on the model size. To illustrate: with any choice of tile sizes for the matrix-matrix kernel, its TG speed at batch == 2 will always be slower than the TG speed of the single-batch matrix-vector kernel at batch == 1. I.e. below a certain batch size, the mv kernel will dominate over the mm kernel, and this break-even batch size I think depends on the model size.

Anyway, still thinking about this and sharing a few thoughts - I'm not 100% sure about these conclusions yet. Might try the script idea at some point and provide a way to pass configuration to the backends to use a certain set of kernels based on the config.
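One possible shape for the "pass configuration to the backends" idea, sketched under stated assumptions: the break-even batch size becomes a per-model/per-GPU value measured offline (e.g. by a script driving batched-bench) and the backend dispatches between the mv and mm paths based on it. All names here are hypothetical, not existing llama.cpp code:

```cuda
// Illustrative only: a configurable mv/mm break-even point instead of a hard-coded one.
struct backend_mulmat_config {
    int mv_max_batch = 1; // hypothetical: largest batch size for which the matrix-vector
                          // path is still faster than the matrix-matrix path on this
                          // model/GPU combination (measured offline)
};

static void mul_mat_dispatch(int n_batch, const backend_mulmat_config & cfg /*, tensors... */) {
    if (n_batch <= cfg.mv_max_batch) {
        // launch the matrix-vector (mv) kernel path
    } else {
        // launch the tiled matrix-matrix (mm) kernel path, possibly itself selecting
        // tile sizes based on n_batch as sketched earlier in this thread
    }
}
```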
Inspired by this blog post, implemented a tool to generate similar stats for llama.cpp:

- PP - prompt tokens per batch
- TG - generated tokens per batch
- B - number of batches
- N_KV - required KV cache size
- T_PP - prompt processing time (i.e. time to first token)
- S_PP - prompt processing speed ((B*PP)/T_PP or PP/T_PP)
- T_TG - time to generate all batches
- S_TG - text generation speed ((B*TG)/T_TG)
- T - total time
- S - total speed (i.e. all tokens / total time)

There are 2 modes of operation:
- prompt not shared - each batch has a separate prompt of size PP (i.e. N_KV = B*(PP + TG))
- prompt is shared - there is a common prompt of size PP used by all batches (i.e. N_KV = PP + B*TG)
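To make the definitions concrete, here is a small illustrative helper (not part of the tool) that evaluates the formulas above; it assumes times are in seconds, that T = T_PP + T_TG, and that the PP/T_PP form of S_PP applies in the shared-prompt case:

```cuda
// Illustrative helpers (not part of batched-bench) evaluating the formulas above.
struct bench_run {
    int    PP, TG, B;     // prompt tokens/batch, generated tokens/batch, number of batches
    double T_PP, T_TG;    // prompt processing and text generation times, in seconds
    bool   shared_prompt; // whether the prompt is shared across batches
};

static int    n_kv (const bench_run & r) { return r.shared_prompt ? r.PP + r.B*r.TG : r.B*(r.PP + r.TG); }
static double s_pp (const bench_run & r) { return (r.shared_prompt ? r.PP : r.B*r.PP) / r.T_PP; } // t/s (PP/T_PP form assumed when shared)
static double s_tg (const bench_run & r) { return (double) r.B*r.TG / r.T_TG; }                   // t/s
static double s_tot(const bench_run & r) { return n_kv(r) / (r.T_PP + r.T_TG); }                  // t/s, assuming T = T_PP + T_TG

// e.g. PP = 100, TG = 100, B = 8:
//   prompt not shared: N_KV = 8*(100 + 100) = 1600
//   prompt is shared:  N_KV = 100 + 8*100  = 900
```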
LLaMA 7B, F16, N_KV_MAX = 16384 (8GB), M2 Ultra, prompt not shared
LLaMA 7B, F16, N_KV_MAX = 16384 (8GB), M2 Ultra, prompt is shared
LLaMA 7B, Q8_0, N_KV_MAX = 16384 (8GB), M2 Ultra, prompt not shared
# LLaMA 7B F16, V100, no prompt sharing
./bin/batched-bench /mnt/llama.cpp/models/open-llama/7B-v2/ggml-model-f16.gguf 4800 0 100 0 50 100 1,2,3,4,5,6,7,8,16,32,64