
cuda : improve multi-GPU performance using cuBLAS #3814

Closed · wants to merge 3 commits

Conversation

@ggerganov (Owner) commented Oct 27, 2023

ref #3479

Not sure if this has any positive effect. Looking for feedback

LLAMA_CUBLAS=1 make -j && CUDA_VISIBLE_DEVICES=0,1 ./batched-bench models/codellama-7b/ggml-model-f16.gguf 14592 0 99 1 100 128 1,2,3,4,5,6,7,8,16,32,64

@ggerganov added the "need feedback" label (Testing and feedback with results are needed) on Oct 27, 2023
@JohannesGaessler (Collaborator) left a comment

I don't understand the purpose of this PR. I would very much expect either FP16 or FP32 to be optimal, regardless of whether or not the tensor is split. As I said before, I would expect the best implementation to be converting the hidden state to FP16 on the main GPU and then distributing that.

@slaren when you implemented the use of FP16 cuBLAS GEMM, did you consider the case of >1 GPU? Because to me the code looks like for multiple GPUs FP32 is still being used on master.
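
For reference, a minimal sketch of the "convert the hidden state to FP16 on the main GPU and then distribute it" idea described above (not the PR's code; all names, such as f32_to_f16 and broadcast_hidden_state_f16, are hypothetical):

#include <cstdint>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Sketch only: F32 -> F16 conversion on the main device, then a peer-to-peer
// broadcast of the converted hidden state to every other device.
__global__ void f32_to_f16(const float * src, half * dst, int64_t n) {
    const int64_t i = blockIdx.x * (int64_t) blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = __float2half(src[i]);
    }
}

static void broadcast_hidden_state_f16(
        const float * src1_f32_main,  // F32 hidden state on the main device
        half ** src1_f16,             // per-device F16 buffers
        int64_t n_elements, int main_device, int n_devices,
        cudaStream_t stream_main) {
    cudaSetDevice(main_device);
    const int block = 256;
    const int grid  = (int) ((n_elements + block - 1) / block);
    f32_to_f16<<<grid, block, 0, stream_main>>>(src1_f32_main, src1_f16[main_device], n_elements);

    for (int id = 0; id < n_devices; ++id) {
        if (id == main_device) {
            continue;
        }
        // copy the already-converted F16 data instead of F32, halving the transfer size
        cudaMemcpyPeerAsync(src1_f16[id], id,
                            src1_f16[main_device], main_device,
                            n_elements*sizeof(half), stream_main);
    }
}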

Comment on lines +464 to +469
#ifdef GGML_CUDA_FORCE_MMQ
#define MUL_MAT_SRC1_COL_STRIDE 128
#else
// with tensor cores, we copy the entire hidden state to the devices in one go
#define MUL_MAT_SRC1_COL_STRIDE 4096
#endif
@JohannesGaessler (Collaborator):

This needs a check for compute capability/AMD.
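
For illustration, one possible shape for such a guard — a sketch only, assuming ggml-cuda.cu's existing g_compute_capabilities[] table, the CC_VOLTA constant, and the GGML_USE_HIPBLAS define, and not the reviewer's proposed code:

// Sketch only: fall back to the small stride on AMD (hipBLAS), when MMQ is
// forced, or on pre-Volta GPUs without tensor cores; otherwise copy the
// hidden state to the devices in one large chunk.
static int mul_mat_src1_col_stride(int device_id) {
#if defined(GGML_USE_HIPBLAS) || defined(GGML_CUDA_FORCE_MMQ)
    (void) device_id;
    return 128;
#else
    return g_compute_capabilities[device_id] >= CC_VOLTA ? 4096 : 128;
#endif
}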

@ggerganov (Owner, Author):

The reason to do it this way is that, on the main device, the F16 -> F32 conversion of dst needs to take the ldc padding into account. So instead of reimplementing to_fp32_cuda to support a stride, I opted for this version, which takes advantage of cublasGemmEx's dst stride support.
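
For context, a rough sketch of the cublasGemmEx call shape this relies on — F16 inputs, F32 output written directly with a padded leading dimension — using hypothetical names and cuBLAS's column-major convention (not the PR's exact code):

#include <cublas_v2.h>
#include <cuda_fp16.h>

// Sketch only: multiply F16 inputs and let cuBLAS write the F32 result with a
// leading dimension (ldc_padded >= m) that already accounts for the padding,
// so no separate strided F16 -> F32 conversion of dst is needed.
static void gemm_f16_in_f32_out(cublasHandle_t handle,
                                const half * d_A, const half * d_B, float * d_C,
                                int m, int n, int k, int ldc_padded) {
    const float alpha = 1.0f;
    const float beta  = 0.0f;
    cublasGemmEx(handle,
                 CUBLAS_OP_T, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 d_A, CUDA_R_16F, k,          // A stored k x m, transposed
                 d_B, CUDA_R_16F, k,          // B stored k x n
                 &beta,
                 d_C, CUDA_R_32F, ldc_padded, // dst stride handled by cuBLAS
                 CUBLAS_COMPUTE_32F,
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}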

As I said before, I would expect the best implementation to be converting the hidden state to FP16 on the main GPU and then distributing that.

Yes, but I have limited access to multi-GPU systems and can only try simple things like the change in this PR. Implementing F16 hidden-state distribution would require more effort and resources. The long-term goal is to keep all results in F16 anyway, which will likely resolve this problem once we start supporting that.

@slaren (Collaborator) commented Oct 28, 2023

@slaren when you implemented the use of FP16 cuBLAS GEMM, did you consider the case of >1 GPU? Because to me the code looks like for multiple GPUs FP32 is still being used on master.

I explicitly disabled the FP16 mat mul when using multiple GPUs. This is what this PR should fix.

@cebtenzzre (Collaborator) commented Oct 28, 2023

This PR makes prompt processing much faster when using both my Tesla P40 and GTX 970.

GPU | Model | Test | t/s master | t/s PR | Speedup
P40 + GTX 970 | 7b q4_0 | pp512 | 156.88 | 251.06 | 1.60
P40 + GTX 970 | 13b q4_k_s | pp512 | 86.15 | 147.80 | 1.72

I ran some benchmarks with batched-bench on a 7b q4_0 model with mmq disabled (since my GTX 970 is too old).

master @ 2f9ec7e

PP  TG  B  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)  T(s)  S(t/s)
100 128 1 228 0.613 163.20 6.688 19.14 7.301 31.23
100 128 2 456 1.317 151.85 26.535 9.65 27.852 16.37
100 128 3 684 1.702 176.25 31.764 12.09 33.466 20.44
100 128 4 912 2.536 157.73 33.957 15.08 36.493 24.99
100 128 5 1140 3.259 153.44 44.803 14.28 48.061 23.72
100 128 6 1368 4.198 142.93 48.590 15.81 52.788 25.91
100 128 7 1596 4.568 153.24 60.622 14.78 65.190 24.48
100 128 8 1824 5.487 145.81 65.627 15.60 71.113 25.65
PP  TG  B  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)  T(s)  S(t/s)
100 16 16 1856 13.547 118.11 16.235 15.77 29.782 62.32
100 16 32 3712 36.387 87.94 18.733 27.33 55.121 67.34
100 16 64 7424 110.974 57.67 36.845 27.79 147.819 50.22

PR

PP  TG  B  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)  T(s)  S(t/s)
100 128 1 228 0.648 154.44 6.683 19.15 7.331 31.10
100 128 2 456 0.799 250.21 26.532 9.65 27.331 16.68
100 128 3 684 1.052 285.05 31.740 12.10 32.793 20.86
100 128 4 912 1.536 260.37 33.947 15.08 35.484 25.70
100 128 5 1140 2.031 246.15 44.774 14.29 46.806 24.36
100 128 6 1368 2.955 203.02 48.563 15.81 51.518 26.55
100 128 7 1596 3.105 225.41 60.617 14.78 63.723 25.05
100 128 8 1824 3.653 219.01 65.630 15.60 69.283 26.33
PP  TG  B  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)  T(s)  S(t/s)
100 16 16 1856 9.851 162.42 16.238 15.77 26.089 71.14
100 16 32 3712 28.956 110.51 18.728 27.34 47.684 77.85
100 16 64 7424 95.973 66.69 36.847 27.79 132.820 55.90

@Ph0rk0z commented Oct 29, 2023

How can this work on P40 at all since it has no tensor cores?

@cebtenzzre (Collaborator):

How can this work on P40 at all since it has no tensor cores?

The MUL_MAT_SRC1_COL_STRIDE change applies to all GPUs. Johannes says that needs a compute capability check, but I'm not sure if he means for performance reasons or for correctness. The perplexity seems fine.

@ggerganov (Owner, Author):

@cebtenzzre If you apply this patch to the current PR:

diff --git a/ggml-cuda.cu b/ggml-cuda.cu
index 9c7cf357..8e27a76d 100644
--- a/ggml-cuda.cu
+++ b/ggml-cuda.cu
@@ -6366,7 +6366,7 @@ inline void ggml_cuda_op_mul_mat_cublas(
     // ldc == nrows of the matrix that cuBLAS writes into
     const int ldc = dst->backend == GGML_BACKEND_GPU && id == g_main_device ? ne0 : row_diff;
 
-    const bool is_split = row_diff != src0->ne[1];
+    const bool is_split = false;
 
     if (compute_capability >= CC_VOLTA && (src0->type == GGML_TYPE_F16 || ggml_is_quantized(src0->type)) && ggml_is_contiguous(src0)) {
         // convert src0 and src1 to fp16, multiply as fp16, convert dst to fp32

and run on single GPU (P40 or GTX970) without GGML_CUDA_FORCE_MMQ, how does it compare to the PP speed on master with GGML_CUDA_FORCE_MMQ?

@lxrite commented Oct 30, 2023

Tested on two A10 GPUs. It seems that both S_PP and S_TG have improved.

Master 2 GPUs 6e08281

CUDA_VISIBLE_DEVICES=0,1 ./batched-bench Mistral-7B-Instruct-v0.1/ggml-model-f16.gguf 14592 0 99 1 100 128 1,2,3,4,5,6,7,8,16,32,64
PP  TG  B  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)  T(s)  S(t/s)
100 128 1 228 0.261 383.16 3.471 36.87 3.732 61.09
100 128 2 456 0.537 372.30 11.854 21.60 12.391 36.80
100 128 3 684 0.857 349.99 12.185 31.51 13.043 52.44
100 128 4 912 0.973 411.09 12.503 40.95 13.476 67.68
100 128 5 1140 0.992 504.14 12.654 50.58 13.645 83.55
100 128 6 1368 1.474 407.08 12.793 60.03 14.267 95.89
100 128 7 1596 1.499 467.04 13.330 67.21 14.829 107.63
100 128 8 1824 1.791 446.78 13.665 74.94 15.456 118.02
100 128 16 3648 3.404 470.03 16.306 125.60 19.710 185.08
100 128 32 7296 6.885 464.78 20.632 198.53 27.517 265.15
100 128 64 14592 14.339 446.34 37.060 221.05 51.399 283.90

PR 2 GPUs

CUDA_VISIBLE_DEVICES=0,1 ./batched-bench Mistral-7B-Instruct-v0.1/ggml-model-f16.gguf 14592 0 99 1 100 128 1,2,3,4,5,6,7,8,16,32,64
PP  TG  B  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)  T(s)  S(t/s)
100 128 1 228 0.200 499.45 3.604 35.52 3.804 59.94
100 128 2 456 0.495 404.17 4.172 61.36 4.667 97.71
100 128 3 684 0.758 395.83 4.570 84.02 5.328 128.38
100 128 4 912 0.800 500.22 4.759 107.60 5.558 164.08
100 128 5 1140 0.985 507.70 5.145 124.38 6.130 185.96
100 128 6 1368 1.220 491.65 5.089 150.93 6.309 216.83
100 128 7 1596 1.426 490.96 5.588 160.35 7.014 227.56
100 128 8 1824 1.471 543.69 5.952 172.05 7.423 245.71
100 128 16 3648 3.113 513.94 8.284 247.23 11.397 320.08
100 128 32 7296 6.227 513.91 13.038 314.15 19.265 378.72
100 128 64 14592 13.352 479.34 28.825 284.20 42.176 345.98

But it's still slower than on 1 GPU.

Master 1 GPU 6e08281

CUDA_VISIBLE_DEVICES=0 ./batched-bench Mistral-7B-Instruct-v0.1/ggml-model-f16.gguf 14592 0 99 1 100 128 1,2,3,4,5,6,7,8,16,32,64
PP  TG  B  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)  T(s)  S(t/s)
100 128 1 228 0.056 1770.95 4.246 30.15 4.302 53.00
100 128 2 456 0.085 2363.90 4.528 56.54 4.612 98.87
100 128 3 684 0.123 2432.50 4.564 84.14 4.687 145.93
100 128 4 912 0.169 2365.58 4.646 110.20 4.815 189.41
100 128 5 1140 0.236 2121.89 4.716 135.71 4.952 230.23
100 128 6 1368 0.253 2375.43 4.809 159.71 5.061 270.29
100 128 7 1596 0.258 2715.96 5.048 177.50 5.306 300.81
100 128 8 1824 0.290 2756.22 5.268 194.39 5.558 328.18
100 128 16 3648 0.663 2414.86 6.575 311.50 7.237 504.07
100 128 32 7296 1.572 2035.24 8.468 483.72 10.040 726.70
100 128 64 14592 4.656 1374.63 18.453 443.93 23.109 631.44

@JohannesGaessler (Collaborator):

@cebtenzzre program correctness should not depend on MUL_MAT_SRC1_COL_STRIDE. Which OS did you run the test on?

@Ph0rk0z commented Oct 30, 2023

I tried dual P40s with this PR + patch and saw a regression in token generation speed (down to 5.5 t/s), with no real improvement in prompt processing or anything else. That is without forcing MMV on; with it, it behaves like master.

Dual 3090s regressed from 18.93 t/s down to 17.0 t/s and crashed on long context, as I never took the patch off.

@esmeetu commented Oct 30, 2023

Tested on dual T4s.

CUDA_VISIBLE_DEVICES=0,1 ./batched-bench ./CodeLlama-7b-hf/ggml-model-f16.gguf 14592 0 99 1 100 128 1,2,3,4,5,6,7,8,16,32,64

master:

main: n_kv_max = 14592, is_pp_shared = 0, n_gpu_layers = 99, mmq = 1

PP  TG  B  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)  T(s)  S(t/s)
100 128 1 228 0.386 259.01 4.130 30.99 4.516 50.49
100 128 2 456 0.680 294.17 18.867 13.57 19.546 23.33
100 128 3 684 0.909 330.00 19.402 19.79 20.311 33.68
100 128 4 912 1.217 328.81 19.750 25.92 20.966 43.50
100 128 5 1140 1.447 345.50 20.803 30.77 22.250 51.24
100 128 6 1368 1.839 326.33 21.261 36.12 23.100 59.22
100 128 7 1596 2.068 338.42 22.096 40.55 24.164 66.05
100 128 8 1824 2.338 342.11 22.206 46.11 24.545 74.31
100 128 16 3648 4.676 342.19 24.963 82.04 29.639 123.08
100 128 32 7296 9.761 327.85 28.503 143.70 38.264 190.68
100 128 64 14592 21.758 294.15 53.753 152.40 75.511 193.24

llama_print_timings: load time = 9256.74 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 298990.17 ms / 33632 tokens ( 8.89 ms per token, 112.49 tokens per second)
llama_print_timings: eval time = 4129.63 ms / 128 runs ( 32.26 ms per token, 31.00 tokens per second)
llama_print_timings: total time = 312069.12 ms

PR:

main: n_kv_max = 14592, is_pp_shared = 0, n_gpu_layers = 99, mmq = 1

PP  TG  B  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)  T(s)  S(t/s)
100 128 1 228 0.176 567.34 4.146 30.88 4.322 52.75
100 128 2 456 0.277 720.99 4.489 57.03 4.767 95.67
100 128 3 684 0.379 791.61 4.774 80.43 5.153 132.73
100 128 4 912 0.494 809.32 4.972 102.97 5.467 166.83
100 128 5 1140 0.592 845.26 5.495 116.47 6.087 187.30
100 128 6 1368 0.755 794.75 5.686 135.07 6.441 212.39
100 128 7 1596 0.826 847.94 6.000 149.33 6.826 233.82
100 128 8 1824 0.935 855.37 6.223 164.55 7.158 254.81
100 128 16 3648 1.986 805.63 10.519 194.70 12.505 291.72
100 128 32 7296 4.569 700.37 15.295 267.80 19.864 367.29
100 128 64 14592 11.696 547.21 38.641 212.00 50.337 289.89

llama_print_timings: load time = 8965.64 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 124935.16 ms / 33632 tokens ( 3.71 ms per token, 269.20 tokens per second)
llama_print_timings: eval time = 4145.57 ms / 128 runs ( 32.39 ms per token, 30.88 tokens per second)
llama_print_timings: total time = 137892.88 ms

@cebtenzzre (Collaborator):

GPU | Model | Test | t/s master+MMQ | t/s PR+patch | Speedup
P40 | 7b q4_0 | pp512 | 336.40 | 336.98 | no change
P40 | 13b q4_k_s | pp512 | 203.36 | 204.21 | no change

FWIW, using the GTX 970 disables mmq due to lack of dp4a, and I cannot fit a 7b q4_0 on it since it has only 4GB of VRAM. But as of 2f9ec7e, there is no clear way to disable mmq in order to make a fair comparison.

cebtenzzre added a commit to cebtenzzre/llama.cpp referencing this pull request on Nov 27, 2023: "This part makes multi-GPU faster for me with no clear drawbacks."
@ggerganov closed this on Jan 13, 2024