batched : add bench tool #3545
Conversation
Played a bit with the MMQ kernel parameters on V100 (ref #3479):

./bin/batched-bench /mnt/llama.cpp/models/open-llama/7B-v2/ggml-model-q4_0.gguf 4800 0 100 1 50 100 1,2,3,4,5,6,7,8,16,32,64

The default settings give the following performance:
Applying the following patch results in more than 3x faster TG speed, but slower PP speed for Q4_0 at low batch sizes:

diff --git a/ggml-cuda.cu b/ggml-cuda.cu
index 654d363..32eee8b 100644
--- a/ggml-cuda.cu
+++ b/ggml-cuda.cu
@@ -3552,9 +3552,9 @@ static __device__ __forceinline__ void mul_mat_q(
#define MMQ_X_Q4_0_RDNA1 64
#define MMQ_Y_Q4_0_RDNA1 64
#define NWARPS_Q4_0_RDNA1 8
-#define MMQ_X_Q4_0_AMPERE 64
-#define MMQ_Y_Q4_0_AMPERE 128
-#define NWARPS_Q4_0_AMPERE 4
+#define MMQ_X_Q4_0_AMPERE 8
+#define MMQ_Y_Q4_0_AMPERE 32
+#define NWARPS_Q4_0_AMPERE 8
#define MMQ_X_Q4_0_PASCAL 64
#define MMQ_Y_Q4_0_PASCAL 64
#define NWARPS_Q4_0_PASCAL 8
It looks like just by tuning the MMQ constants, we can achieve significant boosts for batched decoding with quantized models using the existing kernels. Ideally, this should work optimally without modifications, but it is not obvious how to achieve this in a generic way. Similar observations apply to Mac (#3524).

cc @slaren @JohannesGaessler to bring this to your attention
For the mmq defines, the "x" dimension is the dimension that varies with batch size. So the optimal way to tune the kernels would be to determine the optimal values for the y tile size and the number of warps for each x tile size. In practice, it would be sufficient to determine the optimal values only for x tile sizes that are powers of 2. Then at runtime you could just select the kernel with the optimal parameters for a given batch size.

The downside of this (apart from the effort needed to determine the optimal values, possibly multiple times) is that the compile time and binary size will increase with each additional batch size considered. So it may make sense to add a compile option that only compiles a single tile size to speed up compilation for development.
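A rough sketch of what this runtime selection could look like, purely for illustration. All names (mmq_config, mul_mat_q_tile, launch_mul_mat_q) and all tile values are hypothetical placeholders, not the actual ggml-cuda.cu API; the real values would have to come from benchmarking (e.g. with batched-bench):

```cuda
// Hypothetical sketch of runtime selection between precompiled tile sizes.
#include <cuda_runtime.h>

struct mmq_config {
    int mmq_x;   // tile size along the batch ("x") dimension
    int mmq_y;   // tile size along the weight-row ("y") dimension
    int nwarps;  // warps per block
};

// one entry per power-of-2 batch-size bucket: <=8, <=16, <=32, <=64, >64
// (made-up values - they would need to be determined by benchmarking)
static const mmq_config MMQ_CONFIGS[] = {
    {  8,  32, 8 },  // small batches (similar to the patch above)
    { 16,  64, 8 },
    { 32,  64, 4 },
    { 64, 128, 4 },
    { 64, 128, 4 },  // large batches (the current Ampere defaults)
};

template <int mmq_x, int mmq_y, int nwarps>
__global__ void mul_mat_q_tile(const void * vx, const void * vy, float * dst,
                               int ncols_x, int nrows_x, int ncols_y) {
    // placeholder body - in llama.cpp this would be the existing mul_mat_q kernel
}

static void launch_mul_mat_q(const void * vx, const void * vy, float * dst,
                             int ncols_x, int nrows_x, int ncols_y, cudaStream_t stream) {
    // pick a bucket from the batch size (ncols_y)
    int bucket = 0;
    for (int b = 8; b < ncols_y && bucket < 4; b *= 2) {
        bucket++;
    }
    const mmq_config cfg = MMQ_CONFIGS[bucket];

    const dim3 block_nums((nrows_x + cfg.mmq_y - 1)/cfg.mmq_y, (ncols_y + cfg.mmq_x - 1)/cfg.mmq_x, 1);
    const dim3 block_dims(32, cfg.nwarps, 1);

    // each bucket maps to one template instantiation compiled ahead of time
    // (the template arguments must stay in sync with MMQ_CONFIGS)
    switch (bucket) {
        case 0:  mul_mat_q_tile< 8,  32, 8><<<block_nums, block_dims, 0, stream>>>(vx, vy, dst, ncols_x, nrows_x, ncols_y); break;
        case 1:  mul_mat_q_tile<16,  64, 8><<<block_nums, block_dims, 0, stream>>>(vx, vy, dst, ncols_x, nrows_x, ncols_y); break;
        case 2:  mul_mat_q_tile<32,  64, 4><<<block_nums, block_dims, 0, stream>>>(vx, vy, dst, ncols_x, nrows_x, ncols_y); break;
        case 3:  mul_mat_q_tile<64, 128, 4><<<block_nums, block_dims, 0, stream>>>(vx, vy, dst, ncols_x, nrows_x, ncols_y); break;
        default: mul_mat_q_tile<64, 128, 4><<<block_nums, block_dims, 0, stream>>>(vx, vy, dst, ncols_x, nrows_x, ncols_y); break;
    }
}
```

As noted above, each case in the switch is a separate template instantiation, so compile time and binary size grow with the number of buckets.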
Yup, it's quite tricky. The optimal values likely depend not only on the batch size but also on the other dimension, so for different model sizes we would need different sets of optimal parameters. At least this is what my investigation of the Metal kernels showed. The way I think about this is that we have 2 sets of kernels:

- matrix-vector (mv) kernels, optimal at small batch sizes
- matrix-matrix (mm) kernels, optimal at larger batch sizes

Increasing the batch size, we transition from the first kernel being optimal to the second kernel being optimal. On top of this, the second kernel needs adjustments depending on the compute size. In the Metal implementation, we have a more general matrix-vector kernel that can be applied for batches > 1. However, I wish we had a way to avoid any kind of manual adjustment and always get the best performance for any batch size and model size. But it is not clear to me atm how to do this
I don't know what you did for the Metal kernels, but for CUDA the optimal values for the y tile size and the number of warps should not depend on model size. As long as the weight matrix has a number of rows divisible by the y tile size, the only difference should be the size of the CUDA grid, i.e. how many tiles need to be worked on. At most you should get a small effect if the grid size is too small and leaves some of the streaming multiprocessors idle, but I very much do not expect this to make much of a difference, if any.

The CUDA tile sizes need to be known at compile time for good performance, so it is fundamentally impossible to readjust the kernels at runtime. At most, I think we could write a script that repeatedly compiles llama.cpp with different tile sizes and benchmarks the performance.
What I meant is that the break-even point where the 2 kernels become equally performant depends on the model size. To illustrate: with any choice of tile sizes for the matrix-matrix kernel, its TG speed at batch == 2 will always be slower than the TG speed of the single-batch matrix-vector kernel at batch == 1. I.e. below a certain batch size, the mv kernel will dominate over the mm kernel, and this break-even batch size I think depends on the model size.

Anyway, still thinking about this and sharing a few thoughts - I'm not 100% sure about these conclusions yet. Might try the script idea at some point and provide a way to pass configuration to the backends to use a certain set of kernels based on the config.
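One possible shape for the "pass configuration to the backends" idea, sketched under stated assumptions: the break-even batch size becomes a per-model/per-GPU value measured offline (e.g. by a script driving batched-bench) and the backend dispatches between the mv and mm paths based on it. All names here are hypothetical, not existing llama.cpp code:

```cuda
// Illustrative only: a configurable mv/mm break-even point instead of a hard-coded one.
struct backend_mulmat_config {
    int mv_max_batch = 1; // hypothetical: largest batch size for which the matrix-vector
                          // path is still faster than the matrix-matrix path on this
                          // model/GPU combination (measured offline)
};

static void mul_mat_dispatch(int n_batch, const backend_mulmat_config & cfg /*, tensors... */) {
    if (n_batch <= cfg.mv_max_batch) {
        // launch the matrix-vector (mv) kernel path
    } else {
        // launch the tiled matrix-matrix (mm) kernel path, possibly itself selecting
        // tile sizes based on n_batch as sketched earlier in this thread
    }
}
```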
Inspired by this blog post, implemented a tool to generate similar stats for llama.cpp:

- PP - prompt tokens per batch
- TG - generated tokens per batch
- B - number of batches
- N_KV - required KV cache size
- T_PP - prompt processing time (i.e. time to first token)
- S_PP - prompt processing speed ((B*PP)/T_PP or PP/T_PP)
- T_TG - time to generate all batches
- S_TG - text generation speed ((B*TG)/T_TG)
- T - total time
- S - total speed (i.e. all tokens / total time)

There are 2 modes of operation:
- prompt not shared - each batch has a separate prompt of size PP (i.e. N_KV = B*(PP + TG))
- prompt is shared - there is a common prompt of size PP used by all batches (i.e. N_KV = PP + B*TG)
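To make the definitions concrete, here is a small illustrative helper (not part of the tool) that evaluates the formulas above; it assumes times are in seconds, that T = T_PP + T_TG, and that the PP/T_PP form of S_PP applies in the shared-prompt case:

```cuda
// Illustrative helpers (not part of batched-bench) evaluating the formulas above.
struct bench_run {
    int    PP, TG, B;     // prompt tokens/batch, generated tokens/batch, number of batches
    double T_PP, T_TG;    // prompt processing and text generation times, in seconds
    bool   shared_prompt; // whether the prompt is shared across batches
};

static int    n_kv (const bench_run & r) { return r.shared_prompt ? r.PP + r.B*r.TG : r.B*(r.PP + r.TG); }
static double s_pp (const bench_run & r) { return (r.shared_prompt ? r.PP : r.B*r.PP) / r.T_PP; } // t/s (PP/T_PP form assumed when shared)
static double s_tg (const bench_run & r) { return (double) r.B*r.TG / r.T_TG; }                   // t/s
static double s_tot(const bench_run & r) { return n_kv(r) / (r.T_PP + r.T_TG); }                  // t/s, assuming T = T_PP + T_TG

// e.g. PP = 100, TG = 100, B = 8:
//   prompt not shared: N_KV = 8*(100 + 100) = 1600
//   prompt is shared:  N_KV = 100 + 8*100  = 900
```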
LLaMA 7B, F16, N_KV_MAX = 16384 (8GB), M2 Ultra, prompt not shared
LLaMA 7B, F16, N_KV_MAX = 16384 (8GB), M2 Ultra, prompt is shared
LLaMA 7B, Q8_0, N_KV_MAX = 16384 (8GB), M2 Ultra, prompt not shared
# LLaMA 7B F16, V100, no prompt sharing
./bin/batched-bench /mnt/llama.cpp/models/open-llama/7B-v2/ggml-model-f16.gguf 4800 0 100 0 50 100 1,2,3,4,5,6,7,8,16,32,64