
cuda : tweak mm stride to double perf on P40 + GTX 970 #4233

Closed
cebtenzzre wants to merge 3 commits from the ceb/perf-faster-multigpu branch

Conversation

cebtenzzre (Collaborator) commented Nov 27, 2023

This simple change more than doubles the prompt processing speed when I am using both of my GPUs. All results are with -DLLAMA_CUDA_FORCE_MMQ=ON because I do not have tensor cores.

| GPU | Model | Test | t/s master | t/s PR | Speedup |
| --- | --- | --- | --- | --- | --- |
| GTX 970 + P40 | 7b q4_0 | pp512 | 217.28 ± 0.004 | 458.18 ± 0.050 | 2.11 |
| GTX 970 + P40 | 13b q4_k_s | pp512 | 111.45 ± 0.001 | 246.14 ± 0.040 | 2.21 |

Can someone with more than one GPU that supports DP4A (compute capability >= 6.1) test with -DLLAMA_CUDA_FORCE_MMQ=ON and -DLLAMA_CUDA_FORCE_MMQ=OFF (assuming tensor cores are available), in case I need to make this change conditional? Unfortunately, I have only one multi-GPU configuration available to test.

Setting MUL_MAT_SRC1_COL_STRIDE to 1024 is also sufficient; if anyone experiences a performance regression with this change, could they test that value too?
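For context, the change boils down to raising this column-stride constant in ggml-cuda.cu. A minimal sketch using the 1024 value mentioned above (the exact value and any condition in the actual commits may differ):

// sketch only: a larger stride means src1 is split into fewer column batches,
// so the weight matrices are traversed fewer times per prompt
#define MUL_MAT_SRC1_COL_STRIDE 1024   // master's default is smaller (128, per the discussion below)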

cc @Ph0rk0z since you have access to dual P40s.

ref #3814

cebtenzzre force-pushed the ceb/perf-faster-multigpu branch from 9584527 to 12fb1c5 on November 27, 2023 03:52
ziedbha (Contributor) commented Nov 27, 2023

Just curious, do you happen to know how many CUDA streams end up being used in either model config? Or is there an Nsight Systems report you'd be willing to share? It would be nice to see how the performance gain relates to the difference in stream usage.
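A minimal sketch of how such a report could be captured with the Nsight Systems CLI and llama-bench (the model path here is only a placeholder):

nsys profile --stats=true -o mmq-stride ./llama-bench -m ./models/7b-q4_0.gguf -p 512 -n 0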

slaren (Collaborator) commented Nov 27, 2023

Since @JohannesGaessler wrote that code and he uses P40s, I am surprised that the value of MUL_MAT_SRC1_COL_STRIDE is not already tuned for the P40. Maybe this is only an improvement with the GTX 970?

slaren (Collaborator) commented Nov 27, 2023

This PR (12fb1c5):

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | pp 512 | 2316.59 ± 203.46 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 128 | 123.07 ± 1.78 |

build: 12fb1c5 (1570)

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | pp 512 | 3644.91 ± 155.33 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 128 | 121.13 ± 3.64 |

build: 12fb1c5 (1570)

Master:

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | pp 512 | 2365.14 ± 138.48 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 128 | 127.23 ± 1.78 |

build: f3b2698 (1570)

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | pp 512 | 3760.56 ± 80.43 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 128 | 119.70 ± 0.92 |

build: f3b2698 (1570)

Ph0rk0z commented Nov 27, 2023

I swear I tweaked this before and saw no benefit, but I can try it again. I am also setting

set(LLAMA_CUDA_MMV_Y "2" CACHE STRING "llama: y block size for mmv CUDA kernels")

and that gave a speedup for both the 3090 and the P40.
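(The same settings can be passed at configure time instead of editing the cached value; a minimal sketch, assuming a cuBLAS-enabled build of roughly this vintage:)

cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_CUDA_FORCE_MMQ=ON -DLLAMA_CUDA_MMV_Y=2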

I will give it a go on 2 GPUs on the same CPU.

OK, I found my old tests for this and ran some new ones. I have tried strides of 128, 256, and 512 before, and each time it resulted in slower prompt processing. Granted, one GPU is now in an x8 slot, but I don't think that makes much difference except for top speed. To maximize the effect I tested by summarizing a long text and watching what prompt processing does, on a 70B rather than a 7B, so this reflects practical use.

Results match my previous tests: a small performance loss.

1024 stride

llama_print_timings:        load time =    5336.42 ms
llama_print_timings:      sample time =       0.55 ms /     1 runs   (    0.55 ms per token,  1808.32 tokens per second)
llama_print_timings: prompt eval time =   22014.82 ms /  1921 tokens (   11.46 ms per token,    87.26 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   22029.54 ms


llama_print_timings:        load time =    5336.42 ms
llama_print_timings:      sample time =     109.10 ms /   200 runs   (    0.55 ms per token,  1833.15 tokens per second)
llama_print_timings: prompt eval time =   21290.08 ms /  1912 tokens (   11.13 ms per token,    89.81 tokens per second)
llama_print_timings:        eval time =   29319.90 ms /   199 runs   (  147.34 ms per token,     6.79 tokens per second)
llama_print_timings:       total time =   51257.25 ms

128 stride

llama_print_timings:        load time =    4679.82 ms
llama_print_timings:      sample time =       0.56 ms /     1 runs   (    0.56 ms per token,  1798.56 tokens per second)
llama_print_timings: prompt eval time =   19336.00 ms /  1920 tokens (   10.07 ms per token,    99.30 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   19352.98 ms


llama_print_timings:        load time =    4679.82 ms
llama_print_timings:      sample time =       4.98 ms /     9 runs   (    0.55 ms per token,  1807.96 tokens per second)
llama_print_timings: prompt eval time =   19211.39 ms /  1913 tokens (   10.04 ms per token,    99.58 tokens per second)
llama_print_timings:        eval time =    1164.75 ms /     8 runs   (  145.59 ms per token,     6.87 tokens per second)
llama_print_timings:       total time =   20422.92 ms

4096 stride

llama_print_timings:        load time =    5322.09 ms
llama_print_timings:      sample time =      20.57 ms /    36 runs   (    0.57 ms per token,  1750.29 tokens per second)
llama_print_timings: prompt eval time =   21994.82 ms /  1921 tokens (   11.45 ms per token,    87.34 tokens per second)
llama_print_timings:        eval time =    5116.03 ms /    35 runs   (  146.17 ms per token,     6.84 tokens per second)





cebtenzzre (Collaborator, Author) commented Nov 27, 2023

I updated the PR such that it should have no effect unless you use -DLLAMA_CUDA_FORCE_MMQ=ON with a card that doesn't support DP4A. I see no significant performance difference with just my GTX 970, only when I am using both cards.
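A rough sketch of the kind of check described above, with hypothetical identifier names (this is not necessarily the PR's exact code):

// assumption: compute capabilities are stored as integers such as 520 / 610 / 860
static bool any_gpu_without_dp4a(const int * compute_capabilities, int device_count) {
    for (int id = 0; id < device_count; ++id) {
        if (compute_capabilities[id] < 610) {  // DP4A requires compute capability >= 6.1
            return true;
        }
    }
    return false;
}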

Ph0rk0z commented Nov 27, 2023

I always force MMQ because the performance of the tensor-core kernel is worse on both Pascal and Ampere, and I am not doing batch inference. So if you force the longer stride for MMQ, it will reduce performance.

This was 2x P40, by the way, not single-GPU inference. I don't know why it went up for you with the 970 + P40.

@JohannesGaessler
Copy link
Collaborator

JohannesGaessler commented Nov 27, 2023

MMQ needs compute capability 6.1 or higher; otherwise the __dp4a instruction is not present. I did not consider the case of one of the GPUs having a compute capability < 6.1 when optimizing the performance via the stride. Such GPUs cannot use MMQ and should fall back to dequantizing the weight matrix and running cuBLAS. Looking at the code, it seems that the dequantized weight matrix is not being cached, so a stride of 128 would imply that the weight matrix is dequantized four times for a batch size of 512 (512 / 128 = 4), leading to poor performance. That would also explain why, for FP16 on Volta or higher, the high stride value yields better performance.
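For illustration only (this is not the ggml-cuda implementation): __dp4a is a hardware dot-product instruction that requires compute capability 6.1, so anything compiled for older GPUs needs a scalar fallback along these lines:

// sketch: 4-way int8 dot product with accumulate, guarded by compute capability
static __device__ __forceinline__ int dot4_i8(int a, int b, int c) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 610
    return __dp4a(a, b, c);                       // single hardware instruction
#else
    const int8_t * a8 = (const int8_t *) &a;      // reinterpret as 4 signed bytes
    const int8_t * b8 = (const int8_t *) &b;
    return c + a8[0]*b8[0] + a8[1]*b8[1] + a8[2]*b8[2] + a8[3]*b8[3];
#endif
}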

cebtenzzre (Collaborator, Author) commented

This PR doesn't seem to be relevant anymore - on master with pp512 I'm getting about 654 t/s with my GTX 970 and Tesla P40, with or without this change.

cebtenzzre closed this Mar 16, 2024