
cuda : tweak mm stride to double perf on P40 + GTX 970 #4233

Closed
cebtenzzre wants to merge 3 commits from the ceb/perf-faster-multigpu branch

Conversation

cebtenzzre (Collaborator) commented Nov 27, 2023

This simple change more than doubles the prompt processing speed when I am using both of my GPUs. All results are with -DLLAMA_CUDA_FORCE_MMQ=ON because I do not have tensor cores.

| GPU | Model | Test | t/s master | t/s PR | Speedup |
| --- | --- | --- | --- | --- | --- |
| GTX 970 + P40 | 7b q4_0 | pp512 | 217.28 ± 0.004 | 458.18 ± 0.050 | 2.11 |
| GTX 970 + P40 | 13b q4_k_s | pp512 | 111.45 ± 0.001 | 246.14 ± 0.040 | 2.21 |

Can someone with more than one GPU that supports DP4A (compute capability >= 6.1) test with -DLLAMA_CUDA_FORCE_MMQ=ON and -DLLAMA_CUDA_FORCE_MMQ=OFF (assuming tensor cores are available), in case I need to make this change conditional? Unfortunately, I have only one multi-GPU configuration available to test.

Setting MUL_MAT_SRC1_COL_STRIDE to 1024 is also sufficient; if anyone experiences a performance regression with this change, could they test that value too?
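For context, the change boils down to raising this column-stride constant in ggml-cuda.cu. A minimal sketch using the 1024 value mentioned above (the exact value and any condition in the actual commits may differ):

// sketch only: a larger stride means src1 is split into fewer column batches,
// so the weight matrices are traversed fewer times per prompt
#define MUL_MAT_SRC1_COL_STRIDE 1024   // master's default is smaller (128, per the discussion below)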

cc @Ph0rk0z since you have access to dual P40s.

ref #3814

cebtenzzre force-pushed the ceb/perf-faster-multigpu branch from 9584527 to 12fb1c5 on November 27, 2023 03:52
ziedbha (Contributor) commented Nov 27, 2023

Just curious, do you happen to know how many CUDA streams end up being used in either model config? Or is there an Nsight Systems report you'd be willing to share? It would be nice to see how the performance gain relates to the difference in stream usage.
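A minimal sketch of how such a report could be captured with the Nsight Systems CLI and llama-bench (the model path here is only a placeholder):

nsys profile --stats=true -o mmq-stride ./llama-bench -m ./models/7b-q4_0.gguf -p 512 -n 0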

slaren (Collaborator) commented Nov 27, 2023

Since @JohannesGaessler wrote that code and he uses P40s, I am surprised that the value of MUL_MAT_SRC1_COL_STRIDE is not already tuned for the P40. Maybe this is only an improvement with the GTX 970?

slaren (Collaborator) commented Nov 27, 2023

This PR (12fb1c5):

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | pp 512 | 2316.59 ± 203.46 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 128 | 123.07 ± 1.78 |

build: 12fb1c5 (1570)

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | pp 512 | 3644.91 ± 155.33 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 128 | 121.13 ± 3.64 |

build: 12fb1c5 (1570)

Master:

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | pp 512 | 2365.14 ± 138.48 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 128 | 127.23 ± 1.78 |

build: f3b2698 (1570)

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | pp 512 | 3760.56 ± 80.43 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 128 | 119.70 ± 0.92 |

build: f3b2698 (1570)

Ph0rk0z commented Nov 27, 2023

I swear I tweaked this before and saw no benefit, but I can try it again. I am also setting

set(LLAMA_CUDA_MMV_Y "2" CACHE STRING "llama: y block size for mmv CUDA kernels")

and that gave a speedup for both the 3090 and the P40.
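(The same settings can be passed at configure time instead of editing the cached value; a minimal sketch, assuming a cuBLAS-enabled build of roughly this vintage:)

cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_CUDA_FORCE_MMQ=ON -DLLAMA_CUDA_MMV_Y=2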

I will give it a go on 2 GPUs on the same CPU.

OK, I found my old tests for this and ran some new ones. I have tried strides of 128, 256, and 512 before, and each time it resulted in slower prompt processing. Granted, one GPU is now in an x8 slot, but I don't think that makes much difference except for top speed. To maximize the effect I tested by summarizing a long text and watching what prompt processing does, on a 70B rather than a 7B, so this reflects practical use.

Results match my previous tests: a small performance loss.

1024 stride

llama_print_timings:        load time =    5336.42 ms
llama_print_timings:      sample time =       0.55 ms /     1 runs   (    0.55 ms per token,  1808.32 tokens per second)
llama_print_timings: prompt eval time =   22014.82 ms /  1921 tokens (   11.46 ms per token,    87.26 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   22029.54 ms


llama_print_timings:        load time =    5336.42 ms
llama_print_timings:      sample time =     109.10 ms /   200 runs   (    0.55 ms per token,  1833.15 tokens per second)
llama_print_timings: prompt eval time =   21290.08 ms /  1912 tokens (   11.13 ms per token,    89.81 tokens per second)
llama_print_timings:        eval time =   29319.90 ms /   199 runs   (  147.34 ms per token,     6.79 tokens per second)
llama_print_timings:       total time =   51257.25 ms

128 stride

llama_print_timings:        load time =    4679.82 ms
llama_print_timings:      sample time =       0.56 ms /     1 runs   (    0.56 ms per token,  1798.56 tokens per second)
llama_print_timings: prompt eval time =   19336.00 ms /  1920 tokens (   10.07 ms per token,    99.30 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   19352.98 ms


llama_print_timings:        load time =    4679.82 ms
llama_print_timings:      sample time =       4.98 ms /     9 runs   (    0.55 ms per token,  1807.96 tokens per second)
llama_print_timings: prompt eval time =   19211.39 ms /  1913 tokens (   10.04 ms per token,    99.58 tokens per second)
llama_print_timings:        eval time =    1164.75 ms /     8 runs   (  145.59 ms per token,     6.87 tokens per second)
llama_print_timings:       total time =   20422.92 ms

4096 stride

llama_print_timings:        load time =    5322.09 ms
llama_print_timings:      sample time =      20.57 ms /    36 runs   (    0.57 ms per token,  1750.29 tokens per second)
llama_print_timings: prompt eval time =   21994.82 ms /  1921 tokens (   11.45 ms per token,    87.34 tokens per second)
llama_print_timings:        eval time =    5116.03 ms /    35 runs   (  146.17 ms per token,     6.84 tokens per second)





cebtenzzre (Collaborator, Author) commented Nov 27, 2023

I updated the PR such that it should have no effect unless you use -DLLAMA_CUDA_FORCE_MMQ=ON with a card that doesn't support DP4A. I see no significant performance difference with just my GTX 970, only when I am using both cards.
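A rough sketch of the kind of check described above, with hypothetical identifier names (this is not necessarily the PR's exact code):

// assumption: compute capabilities are stored as integers such as 520 / 610 / 860
static bool any_gpu_without_dp4a(const int * compute_capabilities, int device_count) {
    for (int id = 0; id < device_count; ++id) {
        if (compute_capabilities[id] < 610) {  // DP4A requires compute capability >= 6.1
            return true;
        }
    }
    return false;
}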

Ph0rk0z commented Nov 27, 2023

I always force MMQ because the performance of the tensor-core kernel is worse on both Pascal and Ampere, and I am not doing batch inference. So if you force the longer stride for MMQ, it will reduce performance.

This was 2x P40, by the way, not single-GPU inference. I don't know why it went up for you with the 970 + P40.

@JohannesGaessler
Copy link
Collaborator

JohannesGaessler commented Nov 27, 2023

MMQ needs compute capability 6.1 or higher; otherwise the __dp4a instruction is not present. I did not consider the case of one of the GPUs having a compute capability < 6.1 when optimizing the performance via the stride. Such GPUs cannot use MMQ and should fall back to dequantizing the weight matrix and running cuBLAS. Looking at the code, it seems that the dequantized weight matrix is not being cached, so a stride of 128 would imply that the weight matrix is dequantized four times for a batch size of 512 (512 / 128 = 4), leading to poor performance. That would also explain why, for FP16 on Volta or higher, the high stride value yields better performance.
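For illustration only (this is not the ggml-cuda implementation): __dp4a is a hardware dot-product instruction that requires compute capability 6.1, so anything compiled for older GPUs needs a scalar fallback along these lines:

// sketch: 4-way int8 dot product with accumulate, guarded by compute capability
static __device__ __forceinline__ int dot4_i8(int a, int b, int c) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 610
    return __dp4a(a, b, c);                       // single hardware instruction
#else
    const int8_t * a8 = (const int8_t *) &a;      // reinterpret as 4 signed bytes
    const int8_t * b8 = (const int8_t *) &b;
    return c + a8[0]*b8[0] + a8[1]*b8[1] + a8[2]*b8[2] + a8[3]*b8[3];
#endif
}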

cebtenzzre (Collaborator, Author) commented

This PR doesn't seem to be relevant anymore - on master with pp512 I'm getting about 654 t/s with my GTX 970 and Tesla P40, with or without this change.

cebtenzzre closed this Mar 16, 2024