cuda : tweak mm stride to double perf on P40 + GTX 970 #4233
Conversation
Force-pushed from 9584527 to 12fb1c5
Just curious, do you happen to know how many CUDA streams end up being used in either model config? Or maybe an Nsight Systems report you'd be willing to share? Would be nice to look at the perf gain from the difference in stream usage.
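For what it's worth, here is a back-of-the-envelope sketch (not the actual ggml-cuda dispatch code) of how the column stride bounds the number of chunks a prompt batch is split into; the assumption that each in-flight chunk maps to its own stream is mine:

```cpp
#include <cstdio>
#include <initializer_list>

// Back-of-the-envelope: src1 (the prompt batch) is processed in chunks of
// MUL_MAT_SRC1_COL_STRIDE columns, and separate chunks can land on separate
// CUDA streams. The exact dispatch logic in ggml-cuda.cu may differ.
int main() {
    const int ne11 = 512; // batch columns, as in a pp512 benchmark
    for (const int stride : {128, 1024}) {
        const int chunks = (ne11 + stride - 1) / stride;
        std::printf("stride %4d -> %d chunk(s)\n", stride, chunks);
    }
    return 0;
}
```

Under that assumption, a pp512 batch goes from four chunks per device at a stride of 128 to a single chunk at 1024, i.e. fewer streams and less cross-device synchronization per matrix multiplication.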
Since @JohannesGaessler wrote that code and he uses P40s, I am surprised that the value of `MUL_MAT_SRC1_COL_STRIDE` is not already optimal for them.
This PR (build 12fb1c5, 1570):
  ggml_init_cublas: GGML_CUDA_FORCE_MMQ: yes
  ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no

Master (build f3b2698, 1570):
  ggml_init_cublas: GGML_CUDA_FORCE_MMQ: yes
  ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
I swear I tweaked this before and had no benefit, but I can try it again. I am also setting another value, and that gave a speedup on both the 3090 and the P40. I will give it a go on two GPUs on the same CPU.

OK, I found my old tests for this and did some new ones. I have tried strides of 128, 256, and 512 before, and each time it resulted in slower prompt processing. Granted, one GPU is now in an x8 slot, but I don't think that makes much difference except for top speed. To maximize the effect I tried summarizing a long text and watching what prompt processing does, on a 70B rather than a 7B, so practical use. The results match my previous tests: a small performance loss.
I updated the PR such that it should have no effect unless you use `LLAMA_CUDA_FORCE_MMQ`.
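Presumably that means gating the wider stride behind the MMQ build flag. A minimal sketch of one way that could look, assuming the CMake option `LLAMA_CUDA_FORCE_MMQ` defines `GGML_CUDA_FORCE_MMQ` for the CUDA sources and that the master value is 128 (both assumptions, not the PR's actual diff):

```cpp
// Hypothetical sketch: only widen the src1 column stride when MMQ is
// forced at build time; otherwise keep the master behavior.
#ifdef GGML_CUDA_FORCE_MMQ
#define MUL_MAT_SRC1_COL_STRIDE 1024
#else
#define MUL_MAT_SRC1_COL_STRIDE  128
#endif
```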
I always force MMQ because the performance of the tensor core kernel is worse on both Pascal and Ampere, and I am not doing batch inference. So if you are forcing the longer stride whenever MMQ is on, it will reduce performance for me. This was 2x P40, btw, not single-GPU inference. Dunno why it went up for you on the 970 + P40.
MMQ needs compute capability 6.1 or higher, otherwise the `__dp4a` instruction it relies on is not available.
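For context, `__dp4a` is the packed 8-bit dot-product intrinsic that MMQ's integer kernels build on, and it only exists on compute capability 6.1+. A minimal sketch of the usual guard pattern, not the actual ggml-cuda kernel code:

```cuda
__device__ int dot8(int a, int b, int c) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 610
    // Single-instruction 4-way int8 multiply-accumulate (CC >= 6.1).
    return __dp4a(a, b, c);
#else
    // Fallback for older GPUs such as the GTX 970 (CC 5.2): unpack and sum.
    const char4 va = *reinterpret_cast<const char4 *>(&a);
    const char4 vb = *reinterpret_cast<const char4 *>(&b);
    return c + va.x*vb.x + va.y*vb.y + va.z*vb.z + va.w*vb.w;
#endif
}
```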
This PR doesn't seem to be relevant anymore: on master with pp512 I get about 654 t/s with my GTX 970 and Tesla P40, with or without this change.
This simple change more than doubles the prompt processing speed when I am using both of my GPUs. All results are with `-DLLAMA_CUDA_FORCE_MMQ=ON` because I do not have tensor cores.

Can someone with more than one GPU that supports DP4A (compute capability >= 6.1) test with `-DLLAMA_CUDA_FORCE_MMQ=ON` and `-DLLAMA_CUDA_FORCE_MMQ=OFF` (assuming tensor cores are available), in case I need to make this change conditional? Unfortunately, I have only one multi-GPU configuration available to test.

Setting `MUL_MAT_SRC1_COL_STRIDE` to 1024 is sufficient; if anyone experiences a performance regression with this change, could they test that value too?
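For anyone reproducing this, a sketch of the unconditional version of the change, assuming the constant lives in ggml-cuda.cu and its master value is 128 (an assumption): any stride at or above the batch width yields a single chunk, and the default n_batch is 512, so 1024 already covers it.

```cpp
// Sketch: widen the stride so a whole default batch (n_batch = 512) is
// processed as one chunk per device.
#define MUL_MAT_SRC1_COL_STRIDE 1024 // assumed master value: 128
```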
cc @Ph0rk0z since you have access to dual P40s.
ref #3814