
cuda : improve multi-GPU performance using cuBLAS #3814

Closed · wants to merge 3 commits

Conversation

@ggerganov (Owner) commented Oct 27, 2023

ref #3479

Not sure if this has any positive effect. Looking for feedback

LLAMA_CUBLAS=1 make -j && CUDA_VISIBLE_DEVICES=0,1 ./batched-bench models/codellama-7b/ggml-model-f16.gguf 14592 0 99 1 100 128 1,2,3,4,5,6,7,8,16,32,64

@ggerganov added the "need feedback" label (Testing and feedback with results are needed) on Oct 27, 2023
@JohannesGaessler (Collaborator) left a comment

I don't understand the purpose of this PR. I would very much expect either FP16 or FP32 to be optimal, regardless of whether or not the tensor is split. As I said before, I would expect the best implementation to be converting the hidden state to FP16 on the main GPU and then distributing that.

@slaren when you implemented the use of FP16 cuBLAS GEMM, did you consider the case of >1 GPU? Because to me the code looks like for multiple GPUs FP32 is still being used on master.
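
For reference, a minimal sketch of the "convert the hidden state to FP16 on the main GPU and then distribute it" idea described above (not the PR's code; all names, such as f32_to_f16 and broadcast_hidden_state_f16, are hypothetical):

#include <cstdint>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Sketch only: F32 -> F16 conversion on the main device, then a peer-to-peer
// broadcast of the converted hidden state to every other device.
__global__ void f32_to_f16(const float * src, half * dst, int64_t n) {
    const int64_t i = blockIdx.x * (int64_t) blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = __float2half(src[i]);
    }
}

static void broadcast_hidden_state_f16(
        const float * src1_f32_main,  // F32 hidden state on the main device
        half ** src1_f16,             // per-device F16 buffers
        int64_t n_elements, int main_device, int n_devices,
        cudaStream_t stream_main) {
    cudaSetDevice(main_device);
    const int block = 256;
    const int grid  = (int) ((n_elements + block - 1) / block);
    f32_to_f16<<<grid, block, 0, stream_main>>>(src1_f32_main, src1_f16[main_device], n_elements);

    for (int id = 0; id < n_devices; ++id) {
        if (id == main_device) {
            continue;
        }
        // copy the already-converted F16 data instead of F32, halving the transfer size
        cudaMemcpyPeerAsync(src1_f16[id], id,
                            src1_f16[main_device], main_device,
                            n_elements*sizeof(half), stream_main);
    }
}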

Comment on lines +464 to +469
#ifdef GGML_CUDA_FORCE_MMQ
#define MUL_MAT_SRC1_COL_STRIDE 128
#else
// with tensor cores, we copy the entire hidden state to the devices in one go
#define MUL_MAT_SRC1_COL_STRIDE 4096
#endif
@JohannesGaessler (Collaborator):

This needs a check for compute capability/AMD.
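
For illustration, one possible shape for such a guard — a sketch only, assuming ggml-cuda.cu's existing g_compute_capabilities[] table, the CC_VOLTA constant, and the GGML_USE_HIPBLAS define, and not the reviewer's proposed code:

// Sketch only: fall back to the small stride on AMD (hipBLAS), when MMQ is
// forced, or on pre-Volta GPUs without tensor cores; otherwise copy the
// hidden state to the devices in one large chunk.
static int mul_mat_src1_col_stride(int device_id) {
#if defined(GGML_USE_HIPBLAS) || defined(GGML_CUDA_FORCE_MMQ)
    (void) device_id;
    return 128;
#else
    return g_compute_capabilities[device_id] >= CC_VOLTA ? 4096 : 128;
#endif
}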

@ggerganov (Owner, Author):

The reason to do it this way is that, on the main device, the F16 -> F32 conversion of dst needs to take the ldc padding into account. So instead of reimplementing to_fp32_cuda to support a stride, I opted for this version, which takes advantage of cublasGemmEx's dst stride support.
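
For context, a rough sketch of the cublasGemmEx call shape this relies on — F16 inputs, F32 output written directly with a padded leading dimension — using hypothetical names and cuBLAS's column-major convention (not the PR's exact code):

#include <cublas_v2.h>
#include <cuda_fp16.h>

// Sketch only: multiply F16 inputs and let cuBLAS write the F32 result with a
// leading dimension (ldc_padded >= m) that already accounts for the padding,
// so no separate strided F16 -> F32 conversion of dst is needed.
static void gemm_f16_in_f32_out(cublasHandle_t handle,
                                const half * d_A, const half * d_B, float * d_C,
                                int m, int n, int k, int ldc_padded) {
    const float alpha = 1.0f;
    const float beta  = 0.0f;
    cublasGemmEx(handle,
                 CUBLAS_OP_T, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 d_A, CUDA_R_16F, k,          // A stored k x m, transposed
                 d_B, CUDA_R_16F, k,          // B stored k x n
                 &beta,
                 d_C, CUDA_R_32F, ldc_padded, // dst stride handled by cuBLAS
                 CUBLAS_COMPUTE_32F,
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}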

As I said before, I would expect the best implementation to be converting the hidden state to FP16 on the main GPU and then distributing that.

Yes, but I have limited access to multi-GPU systems and can only try simple things like the change in this PR. Implementing F16 hidden-state distribution would require more effort and resources. The long-term goal is to keep all results in F16 anyway, which will likely resolve this problem once we start supporting that.

@slaren (Collaborator) commented Oct 28, 2023

@slaren when you implemented the use of FP16 cuBLAS GEMM, did you consider the case of >1 GPU? Because to me the code looks like for multiple GPUs FP32 is still being used on master.

I explicitly disabled the FP16 mat mul when using multiple GPUs. This is what this PR should fix.

@cebtenzzre (Collaborator) commented Oct 28, 2023

This PR makes prompt processing much faster when using both my Tesla P40 and GTX 970.

GPU | Model | Test | t/s master | t/s PR | Speedup
P40 + GTX 970 | 7b q4_0 | pp512 | 156.88 | 251.06 | 1.60
P40 + GTX 970 | 13b q4_k_s | pp512 | 86.15 | 147.80 | 1.72

I ran some benchmarks with batched-bench on a 7b q4_0 model with mmq disabled (since my GTX 970 is too old).

master @ 2f9ec7e

PP  TG  B  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)  T(s)  S(t/s)
100 128 1 228 0.613 163.20 6.688 19.14 7.301 31.23
100 128 2 456 1.317 151.85 26.535 9.65 27.852 16.37
100 128 3 684 1.702 176.25 31.764 12.09 33.466 20.44
100 128 4 912 2.536 157.73 33.957 15.08 36.493 24.99
100 128 5 1140 3.259 153.44 44.803 14.28 48.061 23.72
100 128 6 1368 4.198 142.93 48.590 15.81 52.788 25.91
100 128 7 1596 4.568 153.24 60.622 14.78 65.190 24.48
100 128 8 1824 5.487 145.81 65.627 15.60 71.113 25.65
PP  TG  B  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)  T(s)  S(t/s)
100 16 16 1856 13.547 118.11 16.235 15.77 29.782 62.32
100 16 32 3712 36.387 87.94 18.733 27.33 55.121 67.34
100 16 64 7424 110.974 57.67 36.845 27.79 147.819 50.22

PR

PP  TG  B  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)  T(s)  S(t/s)
100 128 1 228 0.648 154.44 6.683 19.15 7.331 31.10
100 128 2 456 0.799 250.21 26.532 9.65 27.331 16.68
100 128 3 684 1.052 285.05 31.740 12.10 32.793 20.86
100 128 4 912 1.536 260.37 33.947 15.08 35.484 25.70
100 128 5 1140 2.031 246.15 44.774 14.29 46.806 24.36
100 128 6 1368 2.955 203.02 48.563 15.81 51.518 26.55
100 128 7 1596 3.105 225.41 60.617 14.78 63.723 25.05
100 128 8 1824 3.653 219.01 65.630 15.60 69.283 26.33
PP  TG  B  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)  T(s)  S(t/s)
100 16 16 1856 9.851 162.42 16.238 15.77 26.089 71.14
100 16 32 3712 28.956 110.51 18.728 27.34 47.684 77.85
100 16 64 7424 95.973 66.69 36.847 27.79 132.820 55.90

@Ph0rk0z commented Oct 29, 2023

How can this work on P40 at all since it has no tensor cores?

@cebtenzzre (Collaborator):

How can this work on P40 at all since it has no tensor cores?

The MUL_MAT_SRC1_COL_STRIDE change applies to all GPUs. Johannes says that needs a compute capability check, but I'm not sure if he means for performance reasons or for correctness. The perplexity seems fine.

@ggerganov (Owner, Author):

@cebtenzzre If you apply this patch to the current PR:

diff --git a/ggml-cuda.cu b/ggml-cuda.cu
index 9c7cf357..8e27a76d 100644
--- a/ggml-cuda.cu
+++ b/ggml-cuda.cu
@@ -6366,7 +6366,7 @@ inline void ggml_cuda_op_mul_mat_cublas(
     // ldc == nrows of the matrix that cuBLAS writes into
     const int ldc = dst->backend == GGML_BACKEND_GPU && id == g_main_device ? ne0 : row_diff;
 
-    const bool is_split = row_diff != src0->ne[1];
+    const bool is_split = false;
 
     if (compute_capability >= CC_VOLTA && (src0->type == GGML_TYPE_F16 || ggml_is_quantized(src0->type)) && ggml_is_contiguous(src0)) {
         // convert src0 and src1 to fp16, multiply as fp16, convert dst to fp32

and run on single GPU (P40 or GTX970) without GGML_CUDA_FORCE_MMQ, how does it compare to the PP speed on master with GGML_CUDA_FORCE_MMQ?

@lxrite commented Oct 30, 2023

Tested on two A10 GPUs. It seems that both S_PP and S_TG have improved.

Master 2 GPUs 6e08281

CUDA_VISIBLE_DEVICES=0,1 ./batched-bench Mistral-7B-Instruct-v0.1/ggml-model-f16.gguf 14592 0 99 1 100 128 1,2,3,4,5,6,7,8,16,32,64
PP  TG  B  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)  T(s)  S(t/s)
100 128 1 228 0.261 383.16 3.471 36.87 3.732 61.09
100 128 2 456 0.537 372.30 11.854 21.60 12.391 36.80
100 128 3 684 0.857 349.99 12.185 31.51 13.043 52.44
100 128 4 912 0.973 411.09 12.503 40.95 13.476 67.68
100 128 5 1140 0.992 504.14 12.654 50.58 13.645 83.55
100 128 6 1368 1.474 407.08 12.793 60.03 14.267 95.89
100 128 7 1596 1.499 467.04 13.330 67.21 14.829 107.63
100 128 8 1824 1.791 446.78 13.665 74.94 15.456 118.02
100 128 16 3648 3.404 470.03 16.306 125.60 19.710 185.08
100 128 32 7296 6.885 464.78 20.632 198.53 27.517 265.15
100 128 64 14592 14.339 446.34 37.060 221.05 51.399 283.90

PR 2 GPUs

CUDA_VISIBLE_DEVICES=0,1 ./batched-bench Mistral-7B-Instruct-v0.1/ggml-model-f16.gguf 14592 0 99 1 100 128 1,2,3,4,5,6,7,8,16,32,64
PP  TG  B  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)  T(s)  S(t/s)
100 128 1 228 0.200 499.45 3.604 35.52 3.804 59.94
100 128 2 456 0.495 404.17 4.172 61.36 4.667 97.71
100 128 3 684 0.758 395.83 4.570 84.02 5.328 128.38
100 128 4 912 0.800 500.22 4.759 107.60 5.558 164.08
100 128 5 1140 0.985 507.70 5.145 124.38 6.130 185.96
100 128 6 1368 1.220 491.65 5.089 150.93 6.309 216.83
100 128 7 1596 1.426 490.96 5.588 160.35 7.014 227.56
100 128 8 1824 1.471 543.69 5.952 172.05 7.423 245.71
100 128 16 3648 3.113 513.94 8.284 247.23 11.397 320.08
100 128 32 7296 6.227 513.91 13.038 314.15 19.265 378.72
100 128 64 14592 13.352 479.34 28.825 284.20 42.176 345.98

But it's still slower than on 1 GPU.

Master 1 GPU 6e08281

CUDA_VISIBLE_DEVICES=0 ./batched-bench Mistral-7B-Instruct-v0.1/ggml-model-f16.gguf 14592 0 99 1 100 128 1,2,3,4,5,6,7,8,16,32,64
PP  TG  B  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)  T(s)  S(t/s)
100 128 1 228 0.056 1770.95 4.246 30.15 4.302 53.00
100 128 2 456 0.085 2363.90 4.528 56.54 4.612 98.87
100 128 3 684 0.123 2432.50 4.564 84.14 4.687 145.93
100 128 4 912 0.169 2365.58 4.646 110.20 4.815 189.41
100 128 5 1140 0.236 2121.89 4.716 135.71 4.952 230.23
100 128 6 1368 0.253 2375.43 4.809 159.71 5.061 270.29
100 128 7 1596 0.258 2715.96 5.048 177.50 5.306 300.81
100 128 8 1824 0.290 2756.22 5.268 194.39 5.558 328.18
100 128 16 3648 0.663 2414.86 6.575 311.50 7.237 504.07
100 128 32 7296 1.572 2035.24 8.468 483.72 10.040 726.70
100 128 64 14592 4.656 1374.63 18.453 443.93 23.109 631.44

@JohannesGaessler (Collaborator):

@cebtenzzre program correctness should not depend on MUL_MAT_SRC1_COL_STRIDE. Which OS did you run the test on?

@Ph0rk0z commented Oct 30, 2023

I tried dual P40s with this PR + patch and saw a regression in token generation speed (down to 5.5 t/s), with no real improvement in prompt processing or anything else. That is without forcing MMV on; with it, it behaves like master.

Dual 3090s regressed from 18.93 t/s down to 17.0 t/s and crashed on long context, as I never took the patch off.

@esmeetu commented Oct 30, 2023

Tested on dual T4s.

CUDA_VISIBLE_DEVICES=0,1 ./batched-bench ./CodeLlama-7b-hf/ggml-model-f16.gguf 14592 0 99 1 100 128 1,2,3,4,5,6,7,8,16,32,64

master:

main: n_kv_max = 14592, is_pp_shared = 0, n_gpu_layers = 99, mmq = 1

PP  TG  B  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)  T(s)  S(t/s)
100 128 1 228 0.386 259.01 4.130 30.99 4.516 50.49
100 128 2 456 0.680 294.17 18.867 13.57 19.546 23.33
100 128 3 684 0.909 330.00 19.402 19.79 20.311 33.68
100 128 4 912 1.217 328.81 19.750 25.92 20.966 43.50
100 128 5 1140 1.447 345.50 20.803 30.77 22.250 51.24
100 128 6 1368 1.839 326.33 21.261 36.12 23.100 59.22
100 128 7 1596 2.068 338.42 22.096 40.55 24.164 66.05
100 128 8 1824 2.338 342.11 22.206 46.11 24.545 74.31
100 128 16 3648 4.676 342.19 24.963 82.04 29.639 123.08
100 128 32 7296 9.761 327.85 28.503 143.70 38.264 190.68
100 128 64 14592 21.758 294.15 53.753 152.40 75.511 193.24

llama_print_timings: load time = 9256.74 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 298990.17 ms / 33632 tokens ( 8.89 ms per token, 112.49 tokens per second)
llama_print_timings: eval time = 4129.63 ms / 128 runs ( 32.26 ms per token, 31.00 tokens per second)
llama_print_timings: total time = 312069.12 ms

PR:

main: n_kv_max = 14592, is_pp_shared = 0, n_gpu_layers = 99, mmq = 1

PP  TG  B  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)  T(s)  S(t/s)
100 128 1 228 0.176 567.34 4.146 30.88 4.322 52.75
100 128 2 456 0.277 720.99 4.489 57.03 4.767 95.67
100 128 3 684 0.379 791.61 4.774 80.43 5.153 132.73
100 128 4 912 0.494 809.32 4.972 102.97 5.467 166.83
100 128 5 1140 0.592 845.26 5.495 116.47 6.087 187.30
100 128 6 1368 0.755 794.75 5.686 135.07 6.441 212.39
100 128 7 1596 0.826 847.94 6.000 149.33 6.826 233.82
100 128 8 1824 0.935 855.37 6.223 164.55 7.158 254.81
100 128 16 3648 1.986 805.63 10.519 194.70 12.505 291.72
100 128 32 7296 4.569 700.37 15.295 267.80 19.864 367.29
100 128 64 14592 11.696 547.21 38.641 212.00 50.337 289.89

llama_print_timings: load time = 8965.64 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 124935.16 ms / 33632 tokens ( 3.71 ms per token, 269.20 tokens per second)
llama_print_timings: eval time = 4145.57 ms / 128 runs ( 32.39 ms per token, 30.88 tokens per second)
llama_print_timings: total time = 137892.88 ms

@cebtenzzre (Collaborator):

GPU | Model | Test | t/s master+MMQ | t/s PR+patch | Speedup
P40 | 7b q4_0 | pp512 | 336.40 | 336.98 | no change
P40 | 13b q4_k_s | pp512 | 203.36 | 204.21 | no change

FWIW, using the GTX 970 disables mmq due to lack of dp4a, and I cannot fit a 7b q4_0 on it since it has only 4GB of VRAM. But as of 2f9ec7e, there is no clear way to disable mmq in order to make a fair comparison.

cebtenzzre added a commit to cebtenzzre/llama.cpp referencing this pull request on Nov 27, 2023: "This part makes multi-GPU faster for me with no clear drawbacks."
@ggerganov closed this on Jan 13, 2024