Benchmark results #89

ggerganov · 2022-10-25T17:05:10Z

Encoder

Collection of bench results for various platforms and devices.
If you want to submit info about your device, simply run the bench tool or the extra/bench-all.sh and report the results in the comments below.

Suggestions for better summary of the results are welcome

CPU	OS	Config	Model	Th	Load	Enc.	Commit
MacBook M1 Pro	MacOS 13.0.1	NEON BLAS	tiny	8	71	102	`206fc93`
MacBook M1 Pro	MacOS 13.0.1	NEON BLAS	base	8	96	220	`206fc93`
MacBook M1 Pro	MacOS 13.0.1	NEON BLAS	small	8	233	685	`206fc93`
MacBook M1 Pro	MacOS 13.0.1	NEON BLAS	medium	8	603	1928	`206fc93`
MacBook M1 Pro	MacOS 13.0.1	NEON BLAS	large	8	1158	3350	`206fc93`
---
MacBook M1 Pro	MacOS 13.0.1	NEON BLAS	small	1	251	2605	`206fc93`
MacBook M1 Pro	MacOS 13.0.1	NEON BLAS	small	4	255	884	`206fc93`
---
Mac Mini M1	MacOS	NEON BLAS	tiny	4	62	194	`fcf515d`
Mac Mini M1	MacOS	NEON BLAS	base	4	81	380	`fcf515d`
Mac Mini M1	MacOS	NEON BLAS	small	4	204	1249	`fcf515d`
Mac Mini M1	MacOS	NEON BLAS	medium	4	876	3980	`fcf515d`
Mac Mini M1	MacOS	NEON BLAS	large	4	1876	7979	`fcf515d`
---
Ryzen 9 3900X	Ubuntu 20.04	AVX2	tiny	8	107	422	`fcf515d`
Ryzen 9 3900X	Ubuntu 20.04	AVX2	base	8	137	880	`fcf515d`
Ryzen 9 3900X	Ubuntu 20.04	AVX2	small	8	280	2874	`fcf515d`
Ryzen 9 3900X	Ubuntu 20.04	AVX2	medium	8	692	9610	`fcf515d`
Ryzen 9 3900X	Ubuntu 20.04	AVX2	large	8	1317	16917	`fcf515d`
---
Ryzen 9 3900X	Ubuntu 20.04	AVX2 BLAS	tiny	4	120	780	`fcf515d`
Ryzen 9 3900X	Ubuntu 20.04	AVX2 BLAS	base	4	151	1173	`fcf515d`
Ryzen 9 3900X	Ubuntu 20.04	AVX2 BLAS	small	4	289	3062	`fcf515d`
Ryzen 9 3900X	Ubuntu 20.04	AVX2 BLAS	medium	4	711	9175	`fcf515d`
Ryzen 9 3900X	Ubuntu 20.04	AVX2 BLAS	large	4	1282	16050	`fcf515d`
---
Ryzen 9 5950X	Ubuntu 22.04	AVX2	tiny	8	135	197	`fcf515d`
Ryzen 9 5950X	Ubuntu 22.04	AVX2	base	8	176	421	`fcf515d`
Ryzen 9 5950X	Ubuntu 22.04	AVX2	small	8	357	1393	`fcf515d`
Ryzen 9 5950X	Ubuntu 22.04	AVX2	medium	8	855	4404	`fcf515d`
Ryzen 9 5950X	Ubuntu 22.04	AVX2	large	8	1576	8118	`fcf515d`
---
Raspberry Pi 4		NEON	tiny	4	1436	13839	`fcf515d`
Raspberry Pi 4		NEON	base	4	1894	30552	`fcf515d`
---
iPhone 13 Mini	iOS 16.0	NEON BLAS	base	4	97	1091	`fcf515d`
---
MacBook M1 Pro	Vivaldi	WASM	tiny	8	133	3785	`fcf515d`
MacBook M1 Pro	Vivaldi	WASM	base	8	172	8253	`fcf515d`
---
MacBook M1 Pro	Chrome	WASM	tiny	8	134	3776	`fcf515d`
MacBook M1 Pro	Chrome	WASM	base	8	168	8200	`fcf515d`
---
MacBook M1 Pro	Firefox	WASM	tiny	8	137	2626	`fcf515d`
MacBook M1 Pro	Firefox	WASM	base	8	183	6226	`fcf515d`

memcpy

MacBook M1 Pro

./bench -w 1 -t 1
memcpy: 37.59 GB/s

Ryzen 9 5950X

./bench -w 1 -t 1
memcpy: 16.74 GB/s

ggml_mul_mat

MacBook M1 Pro

./bench -w 2 -t 1
ggml_mul_mat:    64 x    64: F16    330.6 GFLOPS (128 runs) / F32    466.0 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16    737.5 GFLOPS (128 runs) / F32    838.9 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    938.6 GFLOPS (128 runs) / F32   1062.3 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16   1312.5 GFLOPS (128 runs) / F32   1835.5 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16   1765.1 GFLOPS (128 runs) / F32   2041.4 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16   1784.3 GFLOPS (104 runs) / F32   1859.2 GFLOPS (109 runs)
ggml_mul_mat:  4096 x  4096: F16   1855.1 GFLOPS ( 14 runs) / F32   1873.3 GFLOPS ( 14 runs)

Ryzen 9 5950X

WHISPER_OPENBLAS=1 make -j bench && ./bench -w 2 -t 1
ggml_mul_mat:    64 x    64: F16     56.3 GFLOPS (128 runs) / F32     70.2 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     47.8 GFLOPS (128 runs) / F32     67.0 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    185.1 GFLOPS (128 runs) / F32    332.7 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    386.4 GFLOPS (128 runs) / F32    658.6 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16    636.2 GFLOPS (128 runs) / F32   1012.0 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16    950.9 GFLOPS ( 56 runs) / F32   1296.8 GFLOPS ( 76 runs)
ggml_mul_mat:  4096 x  4096: F16   1168.6 GFLOPS (  9 runs) / F32   1403.1 GFLOPS ( 11 runs)

The text was updated successfully, but these errors were encountered:

cdosoftei · 2022-10-25T18:50:27Z

Results for Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz

CPU	OS	Model	Threads	Load [ms]	Encode [ms]
i7-4790K	Debian	tiny.en	4	165	808
i7-4790K	Debian	tiny.en	8	165	783
i7-4790K	Debian	base.en	4	212	1813
i7-4790K	Debian	base.en	8	214	1746

rjwilmsi · 2022-10-26T12:46:21Z

Results for Ryzen 5 4500U 6C/6T laptop CPU (I've just included one result for 8 threads as Encode time is much higher when threads > CPU cores).

CPU	OS	Model	Threads	Load [ms]	Encode [ms]
Ryzen 5 4500U (6C/6T)	Opensuse Leap	tiny.en	4	170.00	829.43
Ryzen 5 4500U (6C/6T)	Opensuse Leap	tiny.en	6	143.03	671.74
Ryzen 5 4500U (6C/6T)	Opensuse Leap	base.en	4	305.92	2,092.39
Ryzen 5 4500U (6C/6T)	Opensuse Leap	base.en	6	188.05	1,495.61
Ryzen 5 4500U (6C/6T)	Opensuse Leap	small.en	4	408.03	6,919.31
Ryzen 5 4500U (6C/6T)	Opensuse Leap	small.en	6	359.23	6,370.83
Ryzen 5 4500U (6C/6T)	Opensuse Leap	medium.en	4	2,238.11	25,863.28
Ryzen 5 4500U (6C/6T)	Opensuse Leap	medium.en	6	1,113.04	19,672.63
Ryzen 5 4500U (6C/6T)	Opensuse Leap	medium.en	8	973.65	39,619.20

ArtyomZemlyak · 2022-10-26T15:25:15Z

CPU	OS	Config	Model	Threads	Load [ms]	Encode [ms]
i7-11800H	WSL2 Ubuntu	AVX2	tiny	2	164.35	1087.61
i7-11800H	WSL2 Ubuntu	AVX2	tiny	4	128.94	733.24
i7-11800H	WSL2 Ubuntu	AVX2	tiny	8	137.57	619.88
i7-11800H	WSL2 Ubuntu	AVX2 AVX512	tiny	2	143.02	1087.15
i7-11800H	WSL2 Ubuntu	AVX2 AVX512	tiny	4	127.60	730.57
i7-11800H	WSL2 Ubuntu	AVX2 AVX512	tiny	8	125.62	616.27
i7-11800H	WSL2 Ubuntu	AVX2 AVX512 BLAS	tiny	2	132.59	1511.38
i7-11800H	WSL2 Ubuntu	AVX2 AVX512 BLAS	tiny	4	132.48	1407.49
i7-11800H	WSL2 Ubuntu	AVX2 AVX512 BLAS	tiny	8	133.82	1458.27

ArtyomZemlyak · 2022-10-26T15:35:06Z

CPU	OS	Config	Model	Threads	Load [ms]	Encode [ms]
i7-11800H	WSL2 Ubuntu	AVX2	base	2	174.34	2533.79
i7-11800H	WSL2 Ubuntu	AVX2	base	4	166.68	1830.67
i7-11800H	WSL2 Ubuntu	AVX2	base	8	165.53	1478.73
i7-11800H	WSL2 Ubuntu	AVX2	small	2	340.12	8714.24
i7-11800H	WSL2 Ubuntu	AVX2	small	4	394.32	6021.41
i7-11800H	WSL2 Ubuntu	AVX2	small	8	305.98	4828.84
i7-11800H	WSL2 Ubuntu	AVX2	large	2	3205.36	57109.10
i7-11800H	WSL2 Ubuntu	AVX2	large	4	2720.25	38519.89
i7-11800H	WSL2 Ubuntu	AVX2	large	8	3716.34	27739.99

ArtyomZemlyak · 2022-10-26T15:41:21Z

CPU	OS	Config	Model	Threads	Load [ms]	Encode [ms]
i7-11800H	WSL2 Ubuntu	AVX2 AVX512	large	2	1954.21	54966.84
i7-11800H	WSL2 Ubuntu	AVX2 AVX512	large	4	1455.40	37320.62
i7-11800H	WSL2 Ubuntu	AVX2 AVX512	large	8	1372.58	27937.64

ArtyomZemlyak · 2022-10-26T15:44:27Z

This performance is impressing!

M1 Pro | MacOS | | large | 8 | 1973 | 4208

ggerganov · 2022-10-26T19:32:12Z

This performance is impressing!

Yes, there is a huge performance boost due to using the built-in BLAS implementation on these devices. I will soon add OpenBLAS support for x86 architectures and see how this compares.

By the way, AVX-512 is not supported on master. I have added initial support here, but I am not sure if it works: #95

cristianglezm · 2022-10-28T20:45:56Z

CPU	OS	Config	Model	Threads	Load[ms]	encode[ms]
Intel® Core™ i5-8250U	Win11 Home	AVX2	Large	8	2226.85	61547.61

compiled with MinGW64 gcc 11.3

tazz4843 · 2022-10-29T00:06:50Z

Valve Jupiter (AMD Custom APU 0405, Zen 2 microarch, 4c8t, 16GB DDR5 @ 5200 MT/s)

CPU	OS	Config	Model	Threads	Load[ms]	encode[ms]
AMD Custom APU 0405	SteamOS 3.2	AVX2	Base	8	326.32	2592.96

Compiled with cc (GCC) 11.3.0

The performance gains on jfk.wav since last test (two weeks or so ago) are extremely impressive, ~10-20x speedup from 40 to 2-4 seconds.

yujinqiu · 2022-10-30T00:14:59Z

CPU	OS	Config	Model	Threads	Load [ms]	Encode [ms]
MacBook M1 Max	macOS Ventura	BLAS	small	1	299.09	4166.00
MacBook M1 Max	macOS Ventura	BLAS	small	4	329.45	1304.32
MacBook M1 Max	macOS Ventura	BLAS	base	1	139.10	1302.17
MacBook M1 Max	macOS Ventura	BLAS	base	4	135.96	399.45

ggerganov · 2022-10-31T17:45:02Z

@trholding
Thanks for the results.

You can generate a table with performance results by simply running the extra/bench_all.sh script.

Regarding the threads - yes, it seems that going beyond 8 threads does not help regardless of how many cores you have. My guess is that the computation is memory-bound so that's why using more threads does not improve the performance.

trholding · 2022-10-31T22:55:48Z

You can generate a table with performance results by simply running the extra/bench_all.sh script.

Hey Sorry. That didn't pan out well, I did the benchmark thrice, my account got deleted without notice. Could't get the logs as it was a web terminal. On the other hand I am happy that this happened, I was giving serious thought of purchasing a GPU+CPU plan there, so performance check of CPU was equally important. Probably or technically it was my fault - probably shouldn't have used a reverse shell and done benchmarks on a free trial, but how does one know if a service is real good or all just vapor...

rgerganov · 2022-11-05T06:43:35Z

Dell Precision 5560 laptop results:

CPU	OS	Config	Model	Threads	Load [ms]	Encode [ms]
i7-11850H	Ubuntu	AVX2	tiny	4	115.87	538.43
i7-11850H	Ubuntu	AVX2	base	4	145.14	1241.84
i7-11850H	Ubuntu	AVX2	small	4	299.30	4343.57
i7-11850H	Ubuntu	AVX2	medium	4	760.98	15238.31
i7-11850H	Ubuntu	AVX2	large	4	1404.32	27476.86
i7-11850H	Ubuntu	AVX2	tiny	8	131.96	358.81
i7-11850H	Ubuntu	AVX2	base	8	166.61	839.31
i7-11850H	Ubuntu	AVX2	small	8	320.29	2854.86
i7-11850H	Ubuntu	AVX2	medium	8	756.20	9829.62
i7-11850H	Ubuntu	AVX2	large	8	1382.38	19872.81

jaybinks · 2022-11-05T10:34:15Z

CPU	OS	Config	Model	Threads	Load [ms]	Encode [ms]
i9-9900K	WSL2 Ubuntu (GCC)	AVX2	tiny.en	4	85.71	601.56
i9-9900K	WSL2 Ubuntu (GCC)	AVX2	small.en	4	212.59	5146.23
i9-9900K	OSX 10.14.1 (hackintosh - GCC)	AVX2	tiny.en	4	198.17	455.12
i9-9900K	OSX 10.14.1 (hackintosh - GCC)	AVX2	base.en	4	272.62	909.71
i9-9900K	OSX 10.14.1 (hackintosh - GCC)	AVX2	small.en	4	598.75	2968.75
Xeon(R) Silver 4210R CPU @ 2.40GHz	Virtual Machine - Debian Stretch (GCC - master branch)	AVX2 avx512f avx512dq avx512cd avx512bw avx512vl	small.en	4	776.56	12340.41
Xeon(R) Silver 4210R CPU @ 2.40GHz	Virtual Machine - Debian Stretch (GCC - master branch)	AVX2 avx512f avx512dq avx512cd avx512bw avx512vl	tiny.en	4	295.54	1710.46

mark-beeby · 2022-11-08T09:18:49Z

CPU	OS	Config	Model	Threads	Load [ms]	Encode [ms]
i9-11950H	Pop!_OS 22.04 LTS	AVX2	Tiny	4	124.28	656.41
i9-11950H	Pop!_OS 22.04 LTS	AVX2	Tiny	8	123.70	696.41
i9-11950H	Pop!_OS 22.04 LTS	AVX2	Base	4	159.91	1754.44
i9-11950H	Pop!_OS 22.04 LTS	AVX2	Base	8	164.47	1658.55
i9-11950H	Pop!_OS 22.04 LTS	AVX2	Small	4	330.91	6161.86
i9-11950H	Pop!_OS 22.04 LTS	AVX2	Small	8	346.22	5187.85

niksedk · 2022-11-09T19:57:02Z

CPU	OS	Config	Model	Threads	Load [ms]	Encode [ms]
i7-1065G7	Windows 11	-	small.en	4	1,314.25	294,168.09

Compiled with VS 2022

Something is off, right?

ggerganov · 2022-11-09T20:13:11Z

Yup - you are missing the AVX2 flag. See if some of the comments in #5 can help you resolve this.

niksedk · 2022-11-09T20:33:55Z

OK, the AVX2 flag seems to help :)

CPU	OS	Config	Model	Threads	Load [ms]	Encode [ms]
i7-1065G7	Windows 11	AVX2	small.en	4	527.59	9,648.67

Compiled with VS 2022

j1nx · 2022-11-17T11:02:17Z

CPU	OS	Config	Model	Threads	Load [ms]	Encode [ms]	Remarks
Raspberry Pi 4 - 2GB	OpenVoiceOS	NEON	tiny	1	861.34	29428.21	With OVOS services running
Raspberry Pi 4 - 2GB	OpenVoiceOS	NEON BLAS	tiny	1	843.80	16145.62	With OVOS services running
Raspberry Pi 4 - 2GB	OpenVoiceOS	NEON	tiny	4	835.68	21509.08	With OVOS services running
Raspberry Pi 4 - 2GB	OpenVoiceOS	NEON BLAS	tiny	4	824.24	13187.96	With OVOS services running
Raspberry Pi 4 - 2GB	OpenVoiceOS	NEON	base	1	1146.02	87615.00	With OVOS services running
Raspberry Pi 4 - 2GB	OpenVoiceOS	NEON BLAS	base	1	1103.39	52228.30	With OVOS services running
Raspberry Pi 4 - 2GB	OpenVoiceOS	NEON	base	4	1183.47	55256.20	With OVOS services running
Raspberry Pi 4 - 2GB	OpenVoiceOS	NEON BLAS	base	4	1161.32	29851.40	With OVOS services running
Raspberry Pi 4 - 2GB	OpenVoiceOS	NEON	tiny	1	752.64	24018.10	Without OVOS services running
Raspberry Pi 4 - 2GB	OpenVoiceOS	NEON BLAS	tiny	1	751.96	13082.95	Without OVOS services running
Raspberry Pi 4 - 2GB	OpenVoiceOS	NEON	tiny	4	743.37	10122.80	Without OVOS services running
Raspberry Pi 4 - 2GB	OpenVoiceOS	NEON BLAS	tiny	4	742.90	9564.89	Without OVOS services running
Raspberry Pi 4 - 2GB	OpenVoiceOS	NEON	base	1	974.46	71587.61	Without OVOS services running
Raspberry Pi 4 - 2GB	OpenVoiceOS	NEON BLAS	base	1	979.65	43852.07	Without OVOS services running
Raspberry Pi 4 - 2GB	OpenVoiceOS	NEON	base	4	982.24	24814.62	Without OVOS services running
Raspberry Pi 4 - 2GB	OpenVoiceOS	NEON BLAS	base	4	982.80	19910.19	Without OVOS services running

StuartIanNaylor · 2022-11-17T11:37:39Z

From the stream repo

CPU	OS	Config	Model	Threads	Load [ms]	Encode [ms]
RK3588	Ubuntu20.04	NEON	tiny.en	4	243.54 ms	779.49 ms
RK3588	Ubuntu20.04	NEON	base.en	4	316.52 ms	1821.06 ms
RK3588	Ubuntu20.04	NEON	small.en	4	618.93 ms	7117.69 ms
RK3588	Ubuntu20.04	NEON	medium.en	4	1514.88 ms	24139.92 ms

CPU	OS	Config	Model	Threads	Load [ms]	Encode [ms]
RK3588	Ubuntu20.04	NEON	tiny	4	233.86 ms	791.01 ms
RK3588	Ubuntu20.04	NEON	base	4	297.93 ms	1813.69 ms
RK3588	Ubuntu20.04	NEON	small	4	592.18 ms	7102.28 ms
RK3588	Ubuntu20.04	NEON	medium	4	1587.36 ms	24147.87 ms

CPU	OS	Config	Model	Threads	Load [ms]	Encode [ms]
RK3588	Ubuntu20.04	NEON	tiny	8	226.48 ms	740.34 ms
RK3588	Ubuntu20.04	NEON	base	8	300.48 ms	1723.42 ms
RK3588	Ubuntu20.04	NEON	small	8	620.58 ms	6392.47 ms
RK3588	Ubuntu20.04	NEON	medium	8	1533.75 ms	21899.08 ms

I still haven't worked out the little(0-3).Big(4-7) on this thing as if I pin to big cores taskset -c 4-7

CPU	OS	Config	Model	Threads	Load [ms]	Encode [ms]
RK3588	Ubuntu20.04	NEON	tiny.en	4	234.14 ms	681.53 ms
RK3588	Ubuntu20.04	NEON	base.en	4	297.08 ms	1679.75 ms
RK3588	Ubuntu20.04	NEON	small.en	4	599.98 ms	6867.66 ms
RK3588	Ubuntu20.04	NEON	medium.en	4	1492.73 ms	23600.45 ms

I tried to compile with openBlas but seemed to kill the make

From the master repo as didn't think about the repo after trying streaming input

CPU	OS	Config	Model	Threads	Load [ms]	Encode [ms]
RK3588	Ubuntu20.04	NEON	tiny	8	226.48 ms	2681.05 ms
RK3588	Ubuntu20.04	NEON	base	8	283.56 ms	6132.44 ms
RK3588	Ubuntu20.04	NEON	small	8	583.39 ms	24397.78 ms
RK3588	Ubuntu20.04	NEON	medium	8	1490.98	85099.45 ms

dodysw · 2022-11-17T12:06:04Z

CPU	OS	Config	Model	Threads	Load [ms]	Encode [ms]
Ryzen 7 PRO 4750G	Ubuntu 22.04	AVX2	tiny.en	8	136.29	454.52
Ryzen 7 PRO 4750G	Ubuntu 22.04	AVX2	tiny	8	134.64	486.01
Ryzen 7 PRO 4750G	Ubuntu 22.04	AVX2	base	8	180.22	1184.80
Ryzen 7 PRO 4750G	Ubuntu 22.04	AVX2	base.en	8	192.86	1197.85
Ryzen 7 PRO 4750G	Ubuntu 22.04	AVX2	small	8	367.55	4179.00
Ryzen 7 PRO 4750G	Ubuntu 22.04	AVX2	small.en	8	378.27	4557.73
Ryzen 7 PRO 4750G	Ubuntu 22.04	AVX2	medium	8	923.48	15552.61
Ryzen 7 PRO 4750G	Ubuntu 22.04	AVX2	medium.en	8	952.48	15708.63
Ryzen 7 PRO 4750G	Ubuntu 22.04	AVX2	large	8	1650.28	28357.09

8 threads seemed to be the fastest. However I managed to squeeze a bit more performance by pinning CPU:

$ taskset -c 0-15 ./extra/bench-all.sh 16

CPU	OS	Config	Model	Threads	Load [ms]	Encode [ms]
Ryzen 7 PRO 4750G	Ubuntu 22.04	AVX2	tiny	16	143.17	437.73
Ryzen 7 PRO 4750G	Ubuntu 22.04	AVX2	base	16	184.10	1061.14
Ryzen 7 PRO 4750G	Ubuntu 22.04	AVX2	small	16	374.41	3645.64
Ryzen 7 PRO 4750G	Ubuntu 22.04	AVX2	medium	16	935.45	13029.54

matth · 2022-11-21T16:20:26Z

Results for AWS Graviton 3 Processor (c7g.4xlarge instance type).

Compiled with -march=native -ffast-math.

./extra/bench-all.sh 8

CPU	OS	Config	Model	Threads	Load [ms]	Encode [ms]
Graviton 3	Ubuntu 22.04	NEON	tiny	8	125.92	230.33
Graviton 3	Ubuntu 22.04	NEON	base	8	160.17	547.88
Graviton 3	Ubuntu 22.04	NEON	small	8	299.59	2138.86
Graviton 3	Ubuntu 22.04	NEON	medium	8	741.49	6999.33
Graviton 3	Ubuntu 22.04	NEON	large	8	1313.95	14174.00

./extra/bench-all.sh 16

CPU	OS	Config	Model	Threads	Load [ms]	Encode [ms]
Graviton 3	Ubuntu 22.04	NEON	tiny	16	121.92	158.61
Graviton 3	Ubuntu 22.04	NEON	base	16	156.01	386.78
Graviton 3	Ubuntu 22.04	NEON	small	16	299.85	1596.38
Graviton 3	Ubuntu 22.04	NEON	medium	16	750.93	5351.24
Graviton 3	Ubuntu 22.04	NEON	large	16	1313.82	11115.69

ggerganov · 2022-11-21T16:25:52Z

@matth Do you observe significant performance difference with / without -march=native -ffast-math?

matth · 2022-11-21T21:16:42Z

@ggerganov -ffast-math seems to make only a very small difference that could be noise between runs

-march=native does seem to make a big difference, without it FP16_VA is not reported as being enabled (I can get this with -march=armv8.4-a+bf16+fp16fml) - I think -march=native is enabling more intrinsics than this though.

Results without any -march or -ffast-math flags ...

./extra/bench-all.sh 16

CPU	OS	Config	Model	Threads	Load [ms]	Encode [ms]
Graviton 3	Ubuntu 22.04	NEON	tiny	16	124.25	320.53
Graviton 3	Ubuntu 22.04	NEON	base	16	156.91	734.22
Graviton 3	Ubuntu 22.04	NEON	small	16	301.78	2812.75
Graviton 3	Ubuntu 22.04	NEON	medium	16	714.23	9139.86
Graviton 3	Ubuntu 22.04	NEON	large	16	1298.33	18147.47

I have tried to improve by using OpenBlas and armpl.h but with they both slow it down considerably - I'll keep trying with the latter.

Are there any possibilities for further optimisations in ggml.c that can take advantage of the situation where you have bf16 functions but not BLAS or Accelerate?

nickovs · 2023-11-03T03:22:08Z

Results for the new Raspberry Pi 5. Tests performed on a board with the active cooler. uname -a output is:

Linux newpi 6.1.0-rpi4-rpi-2712 #1 SMP PREEMPT Debian 1:6.1.54-1+rpt2 (2023-10-05) aarch64 GNU/Linux

CPU	OS	Config	Model	Threads	Encode	Decode	Commit
BCM2712	Bookworm 12.2	NEON	4	tiny	1106.11	183.67	`54c978c`
BCM2712	Bookworm 12.2	NEON	4	tiny.en	1109.66	201.3	`54c978c`
BCM2712	Bookworm 12.2	NEON	4	base	2479.82	346.65	`54c978c`
BCM2712	Bookworm 12.2	NEON	4	base.en	2465.12	363.86	`54c978c`
BCM2712	Bookworm 12.2	NEON	4	small	8308.3	963.24	`54c978c`
BCM2712	Bookworm 12.2	NEON	4	small.en	8342.25	1119.25	`54c978c`
BCM2712	Bookworm 12.2	NEON	4	medium.en	26407.77	2893.55	`54c978c`
BCM2712	Bookworm 12.2	NEON	4	medium	26468.86	2919.43	`54c978c`

These results are 4.5 to 6.2 times faster than the Raspberry Pi 4.

NOTE: The packaged version of OpenBLAS has not been recompiled for the new CPU architecture, so it is about 50% slower than whisper.cpp's native NEON implementation. I will post benchmarks using OpenBLAS once I have built a version for the new CPU.

The memcpy and ggml_mul_mat benchmarks show:

memcpy: 4.64 GB/s (1 thread)
sum:    136902081526.000000

  64 x   64: Q4_0     5.5 GFLOPS (128 runs) | Q4_1     5.1 GFLOPS (128 runs)
  64 x   64: Q5_0     4.7 GFLOPS (128 runs) | Q5_1     4.9 GFLOPS (128 runs) | Q8_0     5.0 GFLOPS (128 runs)
  64 x   64: F16      5.0 GFLOPS (128 runs) | F32      4.9 GFLOPS (128 runs)
 128 x  128: Q4_0    22.9 GFLOPS (128 runs) | Q4_1    22.6 GFLOPS (128 runs)
 128 x  128: Q5_0    19.7 GFLOPS (128 runs) | Q5_1    20.3 GFLOPS (128 runs) | Q8_0    23.9 GFLOPS (128 runs)
 128 x  128: F16     26.3 GFLOPS (128 runs) | F32     13.3 GFLOPS (128 runs)
 256 x  256: Q4_0    39.0 GFLOPS (128 runs) | Q4_1    49.4 GFLOPS (128 runs)
 256 x  256: Q5_0    33.0 GFLOPS (128 runs) | Q5_1    37.5 GFLOPS (128 runs) | Q8_0    58.6 GFLOPS (128 runs)
 256 x  256: F16     64.1 GFLOPS (128 runs) | F32     48.4 GFLOPS (128 runs)
 512 x  512: Q4_0    62.6 GFLOPS (128 runs) | Q4_1    62.3 GFLOPS (128 runs)
 512 x  512: Q5_0    49.9 GFLOPS (128 runs) | Q5_1    46.1 GFLOPS (128 runs) | Q8_0    76.2 GFLOPS (128 runs)
 512 x  512: F16     80.1 GFLOPS (128 runs) | F32     51.1 GFLOPS (128 runs)
1024 x 1024: Q4_0    67.9 GFLOPS ( 32 runs) | Q4_1    67.6 GFLOPS ( 32 runs)
1024 x 1024: Q5_0    53.5 GFLOPS ( 25 runs) | Q5_1    50.4 GFLOPS ( 24 runs) | Q8_0    85.4 GFLOPS ( 40 runs)
1024 x 1024: F16     92.9 GFLOPS ( 44 runs) | F32     48.0 GFLOPS ( 23 runs)
2048 x 2048: Q4_0    71.0 GFLOPS (  5 runs) | Q4_1    72.2 GFLOPS (  5 runs)
2048 x 2048: Q5_0    55.7 GFLOPS (  4 runs) | Q5_1    52.3 GFLOPS (  4 runs) | Q8_0    87.6 GFLOPS (  6 runs)
2048 x 2048: F16     93.1 GFLOPS (  6 runs) | F32     43.9 GFLOPS (  3 runs)
4096 x 4096: Q4_0    72.2 GFLOPS (  3 runs) | Q4_1    73.9 GFLOPS (  3 runs)
4096 x 4096: Q5_0    55.9 GFLOPS (  3 runs) | Q5_1    52.7 GFLOPS (  3 runs) | Q8_0    86.9 GFLOPS (  3 runs)
4096 x 4096: F16     86.8 GFLOPS (  3 runs) | F32     38.4 GFLOPS (  3 runs)

marjisound · 2023-11-03T09:10:44Z

CPU details: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
GPU name: NVIDIA Tesla T4
OS: Linux 14 22.04.1-Ubuntu
Compiler: cc (Ubuntu 11.4.0-1ubuntu1 22.04) 11.4.0

WHISPER_CUBLAS=1 make -j bench && ./extra/bench-all.sh

I whisper.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -mavx2 -mfma -mf16c -mavx -msse3 -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I LDFLAGS:  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC:       cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:      g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

make: 'bench' is up to date.
Usage: ./bench.sh [n_threads]

Running memcpy benchmark with 1 thread

memcpy: 5.05 GB/s
sum:    -536870997.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:   64 x   64: Q4_0     3.8 GFLOPS (128 runs) / Q4_1     3.8 GFLOPS (128 runs) / F16     3.8 GFLOPS (128 runs) / F32     3.9 GFLOPS (128 runs)
ggml_mul_mat:  128 x  128: Q4_0    23.6 GFLOPS (128 runs) / Q4_1    24.0 GFLOPS (128 runs) / F16    22.1 GFLOPS (128 runs) / F32    22.4 GFLOPS (128 runs)
ggml_mul_mat:  256 x  256: Q4_0    90.3 GFLOPS (128 runs) / Q4_1   100.0 GFLOPS (128 runs) / F16    92.0 GFLOPS (128 runs) / F32    92.3 GFLOPS (128 runs)
ggml_mul_mat:  512 x  512: Q4_0   278.8 GFLOPS (128 runs) / Q4_1   277.6 GFLOPS (128 runs) / F16   244.9 GFLOPS (128 runs) / F32   242.1 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: Q4_0   859.2 GFLOPS (128 runs) / Q4_1   853.6 GFLOPS (128 runs) / F16   648.3 GFLOPS (128 runs) / F32   685.4 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: Q4_0  1583.4 GFLOPS ( 93 runs) / Q4_1  1585.1 GFLOPS ( 93 runs) / F16  1383.9 GFLOPS ( 81 runs) / F32  1359.7 GFLOPS ( 80 runs)
ggml_mul_mat: 4096 x 4096: Q4_0  2525.9 GFLOPS ( 19 runs) / Q4_1  2658.6 GFLOPS ( 20 runs) / F16  2716.0 GFLOPS ( 20 runs) / F32  2302.7 GFLOPS ( 17 runs)

Running benchmark for all models
This can take a while!

CPU	OS	Config	Model	Th	Load	Enc.	Commit
Xeon(R)	Ubuntu	AVX2 BLAS	tiny	4	429	550	`fa8dbdc`
Xeon(R)	Ubuntu	AVX2 BLAS	base	4	521	1133	`fa8dbdc`
Xeon(R)	Ubuntu	AVX2 BLAS	small	4	798	3025	`fa8dbdc`
Xeon(R)	Ubuntu	AVX2 BLAS	medium	4	1701	7639	`fa8dbdc`
Xeon(R)	Ubuntu	AVX2 BLAS	large	4	2966	12927	`fa8dbdc`

StuartIanNaylor · 2023-11-03T13:42:26Z

Whats happening with commit 8a2bee6?
I was just interested with the same Master Opi5 vs Rpi5, but seem to have an extra PP that I am sure I will find a use for
Rpi 5gb
Linux raspberrypi 6.1.0-rpi4-rpi-2712 #1 SMP PREEMPT Debian 1:6.1.54-1+rpt2 (2023-10-05) aarch64 GNU/Linux

memcpy: 5.32 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     6.0 GFLOPS (128 runs) | Q4_1     5.9 GFLOPS (128 runs)
  64 x   64: Q5_0     5.3 GFLOPS (128 runs) | Q5_1     4.9 GFLOPS (128 runs) | Q8_0     1.9 GFLOPS (128 runs)
  64 x   64: F16      6.0 GFLOPS (128 runs) | F32      5.8 GFLOPS (128 runs)
 128 x  128: Q4_0    23.9 GFLOPS (128 runs) | Q4_1    22.6 GFLOPS (128 runs)
 128 x  128: Q5_0    21.4 GFLOPS (128 runs) | Q5_1    20.4 GFLOPS (128 runs) | Q8_0    11.4 GFLOPS (128 runs)
 128 x  128: F16     28.6 GFLOPS (128 runs) | F32     26.2 GFLOPS (128 runs)
 256 x  256: Q4_0    49.8 GFLOPS (128 runs) | Q4_1    49.6 GFLOPS (128 runs)
 256 x  256: Q5_0    40.9 GFLOPS (128 runs) | Q5_1    24.8 GFLOPS (128 runs) | Q8_0    59.0 GFLOPS (128 runs)
 256 x  256: F16     63.0 GFLOPS (128 runs) | F32     29.6 GFLOPS (128 runs)
 512 x  512: Q4_0    56.6 GFLOPS (128 runs) | Q4_1    56.5 GFLOPS (128 runs)
 512 x  512: Q5_0    30.4 GFLOPS (114 runs) | Q5_1    36.5 GFLOPS (128 runs) | Q8_0    71.2 GFLOPS (128 runs)
 512 x  512: F16     64.6 GFLOPS (128 runs) | F32     35.2 GFLOPS (128 runs)
1024 x 1024: Q4_0    67.4 GFLOPS ( 32 runs) | Q4_1    68.7 GFLOPS ( 32 runs)
1024 x 1024: Q5_0    38.1 GFLOPS ( 18 runs) | Q5_1    32.3 GFLOPS ( 16 runs) | Q8_0    61.3 GFLOPS ( 29 runs)
1024 x 1024: F16     71.7 GFLOPS ( 34 runs) | F32     35.1 GFLOPS ( 17 runs)
2048 x 2048: Q4_0    71.4 GFLOPS (  5 runs) | Q4_1    71.5 GFLOPS (  5 runs)
2048 x 2048: Q5_0    38.1 GFLOPS (  3 runs) | Q5_1    36.9 GFLOPS (  3 runs) | Q8_0    63.5 GFLOPS (  4 runs)
2048 x 2048: F16     68.6 GFLOPS (  4 runs) | F32     32.0 GFLOPS (  3 runs)
4096 x 4096: Q4_0    66.8 GFLOPS (  3 runs) | Q4_1    62.4 GFLOPS (  3 runs)
4096 x 4096: Q5_0    39.5 GFLOPS (  3 runs) | Q5_1    37.0 GFLOPS (  3 runs) | Q8_0    62.7 GFLOPS (  3 runs)
4096 x 4096: F16     61.5 GFLOPS (  3 runs) | F32     29.1 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

|    CPU |     OS |           Config |       Model |  Th |    Enc. |    Dec. |      PP |  Commit |
|    --- |    --- |              --- |         --- | --- |     --- |     --- |     --- |     --- |
| Rpi5 BCM2712 | bookworm |             NEON |        tiny |   4 | 1206.23 |    6.67 |  198.84 | 8a2bee6 |
| Rpi5 BCM2712 | bookworm |             NEON |        base |   4 | 2862.56 |   11.74 |  466.51 | 8a2bee6 |
| Rpi5 BCM2712 | bookworm |             NEON |       small |   4 | 9630.88 |   32.81 | 1650.18 | 8a2bee6 |
| Rpi5 BCM2712 | bookworm |             NEON |      medium |   4 |      ms |   99.64 | 5601.57 | 8a2bee6 |

Opi5 4gb
Linux ubuntu 6.6.0 #1 SMP PREEMPT Mon Oct 30 22:54:25 GMT 2023 aarch64 aarch64 aarch64 GNU/Linux
Mainline Linux than the Rockchip BSP https://github.com/Joshua-Riek/ubuntu-rockchip/releases/tag/v1.29.1

memcpy: 10.93 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     6.8 GFLOPS (128 runs) | Q4_1     4.1 GFLOPS (128 runs)
  64 x   64: Q5_0     5.9 GFLOPS (128 runs) | Q5_1     6.0 GFLOPS (128 runs) | Q8_0     6.6 GFLOPS (128 runs)
  64 x   64: F16      4.1 GFLOPS (128 runs) | F32      6.8 GFLOPS (128 runs)
 128 x  128: Q4_0    14.0 GFLOPS (128 runs) | Q4_1    19.1 GFLOPS (128 runs)
 128 x  128: Q5_0    15.5 GFLOPS (128 runs) | Q5_1    12.7 GFLOPS (128 runs) | Q8_0    26.6 GFLOPS (128 runs)
 128 x  128: F16     22.1 GFLOPS (128 runs) | F32     21.2 GFLOPS (128 runs)
 256 x  256: Q4_0    45.0 GFLOPS (128 runs) | Q4_1    45.0 GFLOPS (128 runs)
 256 x  256: Q5_0    29.0 GFLOPS (128 runs) | Q5_1    29.6 GFLOPS (128 runs) | Q8_0    42.8 GFLOPS (128 runs)
 256 x  256: F16     42.5 GFLOPS (128 runs) | F32     42.6 GFLOPS (128 runs)
 512 x  512: Q4_0    55.8 GFLOPS (128 runs) | Q4_1    56.0 GFLOPS (128 runs)
 512 x  512: Q5_0    35.5 GFLOPS (128 runs) | Q5_1    36.7 GFLOPS (128 runs) | Q8_0    61.9 GFLOPS (128 runs)
 512 x  512: F16     80.7 GFLOPS (128 runs) | F32     49.6 GFLOPS (128 runs)
1024 x 1024: Q4_0    60.6 GFLOPS ( 29 runs) | Q4_1    61.4 GFLOPS ( 29 runs)
1024 x 1024: Q5_0    37.6 GFLOPS ( 18 runs) | Q5_1    39.3 GFLOPS ( 19 runs) | Q8_0    68.2 GFLOPS ( 32 runs)
1024 x 1024: F16     93.1 GFLOPS ( 44 runs) | F32     46.4 GFLOPS ( 22 runs)
2048 x 2048: Q4_0    63.1 GFLOPS (  4 runs) | Q4_1    64.1 GFLOPS (  4 runs)
2048 x 2048: Q5_0    39.2 GFLOPS (  3 runs) | Q5_1    41.0 GFLOPS (  3 runs) | Q8_0    70.9 GFLOPS (  5 runs)
2048 x 2048: F16     87.9 GFLOPS (  6 runs) | F32     41.4 GFLOPS (  3 runs)
4096 x 4096: Q4_0    64.2 GFLOPS (  3 runs) | Q4_1    65.3 GFLOPS (  3 runs)
4096 x 4096: Q5_0    39.7 GFLOPS (  3 runs) | Q5_1    41.7 GFLOPS (  3 runs) | Q8_0    70.7 GFLOPS (  3 runs)
4096 x 4096: F16     80.7 GFLOPS (  3 runs) | F32     38.7 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

|    CPU |     OS |           Config |       Model |  Th |    Enc. |    Dec. |      PP |  Commit |
|    --- |    --- |              --- |         --- | --- |     --- |     --- |     --- |     --- |
| Opi5 Rk3588s | 22.04.3 LTS (Jammy Jellyfish) |             NEON |        tiny |   4 |  782.52 |    3.10 |  135.25 | 8a2bee6 |
| Opi5 Rk3588s | 22.04.3 LTS (Jammy Jellyfish) |             NEON |        base |   4 | 1754.69 |   11.81 |  304.06 | 8a2bee6 |
| Opi5 Rk3588s | 22.04.3 LTS (Jammy Jellyfish) |             NEON |       small |   4 | 6226.10 |   15.26 | 1075.54 | 8a2bee6 |
| Opi5 Rk3588s | 22.04.3 LTS (Jammy Jellyfish) |             NEON |      medium |   4 |      ms |   44.75 | 3425.05 | 8a2bee6 |

ggerganov · 2023-11-03T13:53:27Z

@nickovs These are some very interesting results. Looking forward to the OpenBLAS results as well.

@StuartIanNaylor The PP timing is the "prompt processing" time for a prompt of 256 tokens. As we transcribe with whisper, the context (i.e. the previously transcribed text) grows up to n_text_ctx. For each new audio segment that we process, we have to process the context. This processing is very similar to the token-by-token text generation during decoding, but it is much faster since we process 256 tokens at once.

nickovs · 2023-11-03T20:10:52Z

By way of comparison to the benchmarks I posted above, here is are the matrix multiplication numbers for the same Raspberry Pi 5 using OpenBLAS. It is notable that Whisper.cpp's native NEON code outperforms OpenBLAS on the Pi5 for everything except FP32, where OpenBLAS wins by some margin.

  64 x   64: Q4_0     4.4 GFLOPS (128 runs) | Q4_1     4.3 GFLOPS (128 runs)
  64 x   64: Q5_0     3.7 GFLOPS (128 runs) | Q5_1     4.2 GFLOPS (128 runs) | Q8_0     4.1 GFLOPS (128 runs)
  64 x   64: F16      4.0 GFLOPS (128 runs) | F32      4.1 GFLOPS (128 runs)
 128 x  128: Q4_0     0.9 GFLOPS (128 runs) | Q4_1     0.9 GFLOPS (128 runs)
 128 x  128: Q5_0     0.9 GFLOPS (128 runs) | Q5_1     0.9 GFLOPS (128 runs) | Q8_0     0.9 GFLOPS (128 runs)
 128 x  128: F16      0.9 GFLOPS (128 runs) | F32      0.9 GFLOPS (128 runs)
 256 x  256: Q4_0     6.3 GFLOPS (128 runs) | Q4_1     6.4 GFLOPS (128 runs)
 256 x  256: Q5_0     6.4 GFLOPS (128 runs) | Q5_1     6.3 GFLOPS (128 runs) | Q8_0     6.4 GFLOPS (128 runs)
 256 x  256: F16      6.4 GFLOPS (128 runs) | F32      6.5 GFLOPS (128 runs)
 512 x  512: Q4_0    19.7 GFLOPS ( 74 runs) | Q4_1    20.4 GFLOPS ( 76 runs)
 512 x  512: Q5_0    23.7 GFLOPS ( 89 runs) | Q5_1    23.5 GFLOPS ( 89 runs) | Q8_0    23.7 GFLOPS ( 89 runs)
 512 x  512: F16     24.0 GFLOPS ( 90 runs) | F32     25.3 GFLOPS ( 95 runs)
1024 x 1024: Q4_0    35.5 GFLOPS ( 17 runs) | Q4_1    36.5 GFLOPS ( 17 runs)
1024 x 1024: Q5_0    38.9 GFLOPS ( 19 runs) | Q5_1    39.1 GFLOPS ( 19 runs) | Q8_0    38.7 GFLOPS ( 19 runs)
1024 x 1024: F16     39.3 GFLOPS ( 19 runs) | F32     40.9 GFLOPS ( 20 runs)
2048 x 2048: Q4_0    52.8 GFLOPS (  4 runs) | Q4_1    55.4 GFLOPS (  4 runs)
2048 x 2048: Q5_0    56.8 GFLOPS (  4 runs) | Q5_1    55.6 GFLOPS (  4 runs) | Q8_0    56.5 GFLOPS (  4 runs)
2048 x 2048: F16     56.1 GFLOPS (  4 runs) | F32     56.4 GFLOPS (  4 runs)
4096 x 4096: Q4_0    55.3 GFLOPS (  3 runs) | Q4_1    56.9 GFLOPS (  3 runs)
4096 x 4096: Q5_0    58.9 GFLOPS (  3 runs) | Q5_1    60.0 GFLOPS (  3 runs) | Q8_0    61.4 GFLOPS (  3 runs)
4096 x 4096: F16     59.3 GFLOPS (  3 runs) | F32     60.4 GFLOPS (  3 runs)

I have not tried all the tuning options in OpenBLAS, but the options I did try didn't really change the performance compared to the pre-packaged version.

StuartIanNaylor · 2023-11-04T02:49:41Z

I have not tried all the tuning options in OpenBLAS, but the options I did try didn't really change the performance compared to the pre-packaged version.

I think this is where we benefit from ArmV8.2 and being a subgroup of Apple Silicon first-class citizen - optimized via ARM NEON.
If you do a lscpu
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
So I guess we benefit that GGML is optimised aroung V8.2+ architecture
What should be interesting with https://github.com/ggerganov/whisper.cpp#opencl-gpu-support-via-clblast is that the GPU on the Pi5 & Rk3588(s) should be able to use OpenCL but in testing I am finding that the same and wondering if that is also similar.
I never worked out if its due to the serial nature of Whisper that you will only get a speedup if the GPU is faster than the CPU but on testing I get a huge slow down whilst in other ML tests the supposed FP32 610.6 GFLOPS of the mali G610 works mightily at approx 75% of the CPU with ArmNN tests using the GPU Tflite OpenCL delegate.
I am presuming CLBlast is somewhat similar and may not be well optimised for some data types?

These results are 4.5 to 6.2 times faster than the Raspberry Pi 4.
Not too sure about that as likely the same commit would have to be tested as seem to remember thinking RK3588s was < 5x Pi4 and likely due to memory bandwidth, quite a bit faster than a Pi5.

Linux ubuntu 6.6.0 #1 SMP PREEMPT Opi5 4GB performance governor

memcpy: 10.50 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     3.5 GFLOPS (128 runs) | Q4_1     3.2 GFLOPS (128 runs)
  64 x   64: Q5_0     2.8 GFLOPS (128 runs) | Q5_1     2.7 GFLOPS (128 runs) | Q8_0     3.5 GFLOPS (128 runs)
  64 x   64: F16      3.4 GFLOPS (128 runs) | F32      3.4 GFLOPS (128 runs)
 128 x  128: Q4_0     7.9 GFLOPS (128 runs) | Q4_1     8.1 GFLOPS (128 runs)
 128 x  128: Q5_0     6.2 GFLOPS (128 runs) | Q5_1     6.5 GFLOPS (128 runs) | Q8_0     7.9 GFLOPS (128 runs)
 128 x  128: F16      9.4 GFLOPS (128 runs) | F32      7.5 GFLOPS (128 runs)
 256 x  256: Q4_0    10.5 GFLOPS (128 runs) | Q4_1    11.1 GFLOPS (128 runs)
 256 x  256: Q5_0     7.9 GFLOPS (128 runs) | Q5_1     8.5 GFLOPS (128 runs) | Q8_0    10.3 GFLOPS (128 runs)
 256 x  256: F16     14.5 GFLOPS (128 runs) | F32      9.3 GFLOPS (128 runs)
 512 x  512: Q4_0    11.7 GFLOPS ( 44 runs) | Q4_1    12.4 GFLOPS ( 47 runs)
 512 x  512: Q5_0     8.8 GFLOPS ( 33 runs) | Q5_1     9.7 GFLOPS ( 37 runs) | Q8_0    11.4 GFLOPS ( 43 runs)
 512 x  512: F16     17.8 GFLOPS ( 67 runs) | F32      9.2 GFLOPS ( 35 runs)
1024 x 1024: Q4_0    32.2 GFLOPS ( 15 runs) | Q4_1    33.2 GFLOPS ( 16 runs)
1024 x 1024: Q5_0    24.9 GFLOPS ( 12 runs) | Q5_1    25.7 GFLOPS ( 12 runs) | Q8_0    35.2 GFLOPS ( 17 runs)
1024 x 1024: F16     38.0 GFLOPS ( 18 runs) | F32     27.5 GFLOPS ( 13 runs)
2048 x 2048: Q4_0    57.7 GFLOPS (  4 runs) | Q4_1    59.5 GFLOPS (  4 runs)
2048 x 2048: Q5_0    38.0 GFLOPS (  3 runs) | Q5_1    39.3 GFLOPS (  3 runs) | Q8_0    64.3 GFLOPS (  4 runs)
2048 x 2048: F16     77.9 GFLOPS (  5 runs) | F32     38.4 GFLOPS (  3 runs)
4096 x 4096: Q4_0    63.4 GFLOPS (  3 runs) | Q4_1    64.7 GFLOPS (  3 runs)
4096 x 4096: Q5_0    39.9 GFLOPS (  3 runs) | Q5_1    41.7 GFLOPS (  3 runs) | Q8_0    70.3 GFLOPS (  3 runs)
4096 x 4096: F16     78.6 GFLOPS (  3 runs) | F32     37.2 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

|    CPU |     OS |           Config |       Model |  Th |    Enc. |    Dec. |      PP |  Commit |
|    --- |    --- |              --- |         --- | --- |     --- |     --- |     --- |     --- |
| <todo> | <todo> |             NEON |        tiny |   4 |  853.56 |    7.37 |  161.81 | f96e1c5 |
| <todo> | <todo> |             NEON |        base |   4 | 1847.86 |   13.00 |  338.18 | f96e1c5 |
| <todo> | <todo> |             NEON |       small |   4 | 6289.17 |   39.19 | 1109.25 | f96e1c5 |
| <todo> | <todo> |             NEON |      medium |   4 |      ms |   67.99 | 3454.96 | f96e1c5 |
| <todo> | <todo> |             NEON |       large |   4 |      ms |  107.50 | 6541.15 | f96e1c5 |

Linux raspberrypi 6.1.0-rpi4-rpi-2712 Rpi5 4GB performance governor

memcpy: 6.03 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     5.7 GFLOPS (128 runs) | Q4_1     5.5 GFLOPS (128 runs)
  64 x   64: Q5_0     5.3 GFLOPS (128 runs) | Q5_1     5.1 GFLOPS (128 runs) | Q8_0     5.6 GFLOPS (128 runs)
  64 x   64: F16      5.6 GFLOPS (128 runs) | F32      5.7 GFLOPS (128 runs)
 128 x  128: Q4_0    22.8 GFLOPS (128 runs) | Q4_1    24.1 GFLOPS (128 runs)
 128 x  128: Q5_0    12.3 GFLOPS (128 runs) | Q5_1    11.8 GFLOPS (128 runs) | Q8_0    11.3 GFLOPS (128 runs)
 128 x  128: F16     15.4 GFLOPS (128 runs) | F32     26.5 GFLOPS (128 runs)
 256 x  256: Q4_0    49.7 GFLOPS (128 runs) | Q4_1    50.3 GFLOPS (128 runs)
 256 x  256: Q5_0    41.8 GFLOPS (128 runs) | Q5_1    39.0 GFLOPS (128 runs) | Q8_0    59.7 GFLOPS (128 runs)
 256 x  256: F16     65.2 GFLOPS (128 runs) | F32     48.7 GFLOPS (128 runs)
 512 x  512: Q4_0    63.0 GFLOPS (128 runs) | Q4_1    63.6 GFLOPS (128 runs)
 512 x  512: Q5_0    50.5 GFLOPS (128 runs) | Q5_1    47.3 GFLOPS (128 runs) | Q8_0    77.7 GFLOPS (128 runs)
 512 x  512: F16     85.6 GFLOPS (128 runs) | F32     53.3 GFLOPS (128 runs)
1024 x 1024: Q4_0    68.1 GFLOPS ( 32 runs) | Q4_1    69.8 GFLOPS ( 33 runs)
1024 x 1024: Q5_0    54.1 GFLOPS ( 26 runs) | Q5_1    51.2 GFLOPS ( 24 runs) | Q8_0    86.0 GFLOPS ( 41 runs)
1024 x 1024: F16     93.6 GFLOPS ( 44 runs) | F32     49.0 GFLOPS ( 23 runs)
2048 x 2048: Q4_0    70.8 GFLOPS (  5 runs) | Q4_1    72.8 GFLOPS (  5 runs)
2048 x 2048: Q5_0    56.1 GFLOPS (  4 runs) | Q5_1    53.0 GFLOPS (  4 runs) | Q8_0    88.1 GFLOPS (  6 runs)
2048 x 2048: F16     93.7 GFLOPS (  6 runs) | F32     44.4 GFLOPS (  3 runs)
4096 x 4096: Q4_0    72.6 GFLOPS (  3 runs) | Q4_1    74.8 GFLOPS (  3 runs)
4096 x 4096: Q5_0    56.2 GFLOPS (  3 runs) | Q5_1    53.3 GFLOPS (  3 runs) | Q8_0    88.4 GFLOPS (  3 runs)
4096 x 4096: F16     86.7 GFLOPS (  3 runs) | F32     39.7 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

|    CPU |     OS |           Config |       Model |  Th |    Enc. |    Dec. |      PP |  Commit |
|    --- |    --- |              --- |         --- | --- |     --- |     --- |     --- |     --- |
| <todo> | <todo> |             NEON |        tiny |   4 | 1049.00 |    6.74 |  149.32 | f96e1c5 |
| <todo> | <todo> |             NEON |        base |   4 | 2362.92 |   12.60 |  361.37 | f96e1c5 |
| <todo> | <todo> |             NEON |       small |   4 | 8081.87 |   35.65 | 1283.34 | f96e1c5 |
| <todo> | <todo> |             NEON |      medium |   4 |      ms |  105.77 | 4360.80 | f96e1c5 |
| <todo> | <todo> |             NEON |       large |   4 |      ms |  189.93 | 8158.78 | f96e1c5 |

I dunno to be honest why Gflops is higher but whilst the Enc the biggest chunk of process faster, maybe mem bandwidth?
Its like4like with the perf governor, due to pref of running Whisper that way of race-till-idle.

nickovs · 2023-11-04T18:41:12Z

@StuartIanNaylor Here is a straight up comparison of the same 54c978c commit between the Pi4 and the Pi5, both running the code compiled on the Pi4 on the Pi5 and then also recompiling the same commit on the Pi5.

Model	Pi4		Pi4 code on Pi5		Speedup on same compilation		Recompiled on Pi5		Speedup on recompiled code
	Encode	Decode	Encode	Decode	Encode	Decode	Encode	Decode	Encode
tiny	5246.14	510.57	2694.38	188.38	1.95	2.71	1106.11	183.67	4.74
tiny.en	5264.76	551.17	2744.80	203.94	1.92	2.70	1109.66	201.3	4.74
base.en	12473.07	1004.23	6345.28	363.15	1.97	2.77	2479.82	346.65	5.03
base	12453.04	972.29	6399.54	348.33	1.95	2.79	2465.12	363.86	5.05
small.en	48849.9	3316.15	24127.58	961.75	2.02	3.45	8308.3	963.24	5.88
small	49671.25	2953	24134.46	1109.70	2.06	2.66	8342.25	1119.25	5.95
medium.en	169889.39	8451.51	79045.66	2815.81	2.15	3.00	26407.77	2893.55	6.43
medium	173236.92	8531.94	79075.19	2836.38	2.19	3.01	26468.86	2919.43	6.54

This suggests that there is a little better than a 2-fold performance improvement on encode, and more like a 2.8 fold improvement on decode, just moving the code from the Pi4 to the Pi5. Recompiling on the Pi5 raises the encode performance to between 4.74 and 6.54 times faster that on the Pi4, but the decode performance remains only about 2.8 times faster than the Pi4 and doesn't benefit a great deal from the recompilation.

(Note that this table hits GitHub's 10 column limit, so the decode speedup may not be displayed, but the numbers are in the comment source.)

The key thing here as far as I'm concerned is that on the Pi5 the small model runs in better than real time, whereas on the Pi4 you were stuck using the tiny model for real-time work.

jwinarske · 2023-11-04T19:43:37Z

It would be great to have a test results db for this. I'm thinking similar to what DRM info does

StuartIanNaylor · 2023-11-05T09:38:39Z

@jwinarske that would be great as maybe a seperate repo of fixed commits as we are not benching the software but the hardware.
The Llama bench would be a good inclusion as the openLlama3b-q4 manages 20 Tokens/s on a Rk3588s-4gb.
I also like https://github.com/Tencent/ncnn/tree/master/benchmark as a pretty easy install and has a ready made list of smaller yolo type models.

Linux ubuntu 6.6.0 #1 SMP PREEMPT Opi5 4GB performance governor 54c978c

memcpy: 11.18 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     3.5 GFLOPS (128 runs) | Q4_1     3.2 GFLOPS (128 runs)
  64 x   64: Q5_0     2.7 GFLOPS (128 runs) | Q5_1     2.8 GFLOPS (128 runs) | Q8_0     3.1 GFLOPS (128 runs)
  64 x   64: F16      3.3 GFLOPS (128 runs) | F32      3.2 GFLOPS (128 runs)
 128 x  128: Q4_0     7.8 GFLOPS (128 runs) | Q4_1     8.0 GFLOPS (128 runs)
 128 x  128: Q5_0     6.2 GFLOPS (128 runs) | Q5_1     6.5 GFLOPS (128 runs) | Q8_0     7.8 GFLOPS (128 runs)
 128 x  128: F16      9.5 GFLOPS (128 runs) | F32      7.5 GFLOPS (128 runs)
 256 x  256: Q4_0    10.6 GFLOPS (128 runs) | Q4_1    11.0 GFLOPS (128 runs)
 256 x  256: Q5_0     7.9 GFLOPS (128 runs) | Q5_1     8.4 GFLOPS (128 runs) | Q8_0    10.3 GFLOPS (128 runs)
 256 x  256: F16     14.8 GFLOPS (128 runs) | F32      9.3 GFLOPS (128 runs)
 512 x  512: Q4_0    11.8 GFLOPS ( 44 runs) | Q4_1    12.4 GFLOPS ( 47 runs)
 512 x  512: Q5_0     8.9 GFLOPS ( 34 runs) | Q5_1     9.7 GFLOPS ( 37 runs) | Q8_0    11.5 GFLOPS ( 43 runs)
 512 x  512: F16     17.8 GFLOPS ( 67 runs) | F32      9.6 GFLOPS ( 36 runs)
1024 x 1024: Q4_0    32.7 GFLOPS ( 16 runs) | Q4_1    33.3 GFLOPS ( 16 runs)
1024 x 1024: Q5_0    25.2 GFLOPS ( 12 runs) | Q5_1    27.0 GFLOPS ( 13 runs) | Q8_0    36.0 GFLOPS ( 17 runs)
1024 x 1024: F16     39.4 GFLOPS ( 19 runs) | F32     28.1 GFLOPS ( 14 runs)
2048 x 2048: Q4_0    58.2 GFLOPS (  4 runs) | Q4_1    60.0 GFLOPS (  4 runs)
2048 x 2048: Q5_0    37.2 GFLOPS (  3 runs) | Q5_1    38.8 GFLOPS (  3 runs) | Q8_0    63.3 GFLOPS (  4 runs)
2048 x 2048: F16     78.3 GFLOPS (  5 runs) | F32     38.0 GFLOPS (  3 runs)
4096 x 4096: Q4_0    63.9 GFLOPS (  3 runs) | Q4_1    64.9 GFLOPS (  3 runs)
4096 x 4096: Q5_0    39.6 GFLOPS (  3 runs) | Q5_1    41.5 GFLOPS (  3 runs) | Q8_0    70.3 GFLOPS (  3 runs)
4096 x 4096: F16     78.6 GFLOPS (  3 runs) | F32     35.1 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

|    CPU |     OS |           Config |       Model |  Th |    Enc. |    Dec. |      PP |  Commit |
|    --- |    --- |              --- |         --- | --- |     --- |     --- |     --- |     --- |
| <todo> | <todo> |             NEON |        tiny |   4 |  885.27 |    7.35 |  166.54 | 54c978c |
| <todo> | <todo> |             NEON |        base |   4 | 1888.93 |   12.61 |  347.61 | 54c978c |
| <todo> | <todo> |             NEON |       small |   4 | 6397.88 |   38.49 | 1111.82 | 54c978c |
| <todo> | <todo> |             NEON |      medium |   4 |      ms |   68.98 | 3511.72 | 54c978c |

@nickovs Dunno as before as A76 gets vector mat/mul and the code is optimised for ArmV8,2+ that the poor Pi4 with openBlas was approx < 5 times slower than a RK3588s.
The above is just same commit on a Opi5-4gb so Zram and swap comes into play with bigger models but from audio in to txt out last time I pegged the Pi4 as approx just less than x5 and ignored models it didn't manage in realtime.
I guess further optimisations have happened, the decode is less important to overall time or the Enc as that is the biggest process.

(venv) pi@raspberrypi:~/llama.cpp $ ./llama-bench -m  models/3b/open-llama-3b-q4_0.gguf -t 4
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| llama 3B mostly Q4_0           |   1.84 GiB |     3.43 B | CPU        |          4 | pp 512     |      9.77 ± 0.01 |
| llama 3B mostly Q4_0           |   1.84 GiB |     3.43 B | CPU        |          4 | tg 128     |      5.42 ± 0.00 |

build: c41ea36 (1487)

ubuntu@ubuntu:~/llama.cpp$ ./llama-bench -m models/3b/open-llama-3b-q4_0.gguf -t 4
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| llama 3B mostly Q4_0           |   1.84 GiB |     3.43 B | CPU        |          4 | pp 512     |      9.14 ± 0.01 |
| llama 3B mostly Q4_0           |   1.84 GiB |     3.43 B | CPU        |          4 | tg 128     |      7.06 ± 0.05 |

"lib" is needed for windows. With this change, you can build whisper.cpp with OpenBLAS's prebuilt DLL. 1. extract a zip from https://github.com/xianyi/OpenBLAS/releases 2. copy the headers in (openblas)/include to the root directory of whisper.cpp 3. invoke cmake with -DCMAKE_LIBRARY_PATH=(openblas)\lib -DWHISPER_SUPPORT_OPENBLAS=ON 4. copy (openblas)/bin/libopenblas.dll to the same directory of whisper.dll after msbuild ggerganov/whisper.cpp#89 (comment)

Update whisper.cpp

petterreinholdtsen · 2024-02-24T07:59:31Z

Here is the result for NVIDIA GeForce GT 755M on Debian GNU/Linux 12 Bookworm using GCC 12.2.0 build with -DWHISPER_CLBLAST=ON:

whisper_init_from_file_with_params_no_state: loading model from '../nb-large-ggml-model.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
ggml_opencl: selecting platform: 'NVIDIA CUDA'
ggml_opencl: selecting device: 'NVIDIA GeForce GT 755M'
ggml_opencl: device FP16 support: false
whisper_model_load:      CPU buffer size =  3094.86 MB
whisper_model_load: model size    = 3094.36 MB
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   32.42 MB
whisper_init_state: compute buffer (encode) =  212.42 MB
whisper_init_state: compute buffer (cross)  =    9.38 MB
whisper_init_state: compute buffer (decode) =   99.24 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | 

whisper_print_timings:     load time =   712.98 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time = 29405.07 ms /     1 runs (29405.07 ms per run)
whisper_print_timings:   decode time = 25138.65 ms /   256 runs (   98.20 ms per run)
whisper_print_timings:   batchd time = 15522.25 ms /   320 runs (   48.51 ms per run)
whisper_print_timings:   prompt time = 120379.20 ms /  4096 runs (   29.39 ms per run)
whisper_print_timings:    total time = 190447.95 ms

zhouwg · 2024-03-06T10:22:31Z

benchmark result with 11th Gen Intel Core(TM) i7-11700F @ 2.50GHz + Ubuntu 20.04 + gcc version 9.4.0

CPU	OS	Mode	Threads	Load [ms]	Encode [ms]
i7-11700F	Ubuntu 20.04	tiny.en	4	46.72	4654.39
i7-11700F	Ubuntu 20.04	tiny.en	8	49.85	2981.43
i7-11700F	Ubuntu 20.04	small.en	4	175.02	51381.51
i7-11700F	Ubuntu 20.04	small.en	8	161.98	29662.80

./bench  -m ./models/ggml-small.en.bin -t 8 -w 2
  64 x   64: Q4_0     4.3 GFLOPS (128 runs) | Q4_1     4.4 GFLOPS (128 runs)
  64 x   64: Q5_0     4.0 GFLOPS (128 runs) | Q5_1     3.5 GFLOPS (128 runs) | Q8_0     4.7 GFLOPS (128 runs)
  64 x   64: F16      4.2 GFLOPS (128 runs) | F32      2.1 GFLOPS (128 runs)
 128 x  128: Q4_0    15.0 GFLOPS (128 runs) | Q4_1    15.3 GFLOPS (128 runs)
 128 x  128: Q5_0    11.9 GFLOPS (128 runs) | Q5_1    12.3 GFLOPS (128 runs) | Q8_0    21.0 GFLOPS (128 runs)
 128 x  128: F16     11.1 GFLOPS (128 runs) | F32      8.7 GFLOPS (128 runs)
 256 x  256: Q4_0    25.4 GFLOPS (128 runs) | Q4_1    29.1 GFLOPS (128 runs)
 256 x  256: Q5_0    17.4 GFLOPS (128 runs) | Q5_1    18.7 GFLOPS (128 runs) | Q8_0    49.1 GFLOPS (128 runs)
 256 x  256: F16     13.8 GFLOPS (128 runs) | F32     10.4 GFLOPS (128 runs)
 512 x  512: Q4_0    31.1 GFLOPS (116 runs) | Q4_1    33.0 GFLOPS (124 runs)
 512 x  512: Q5_0    17.1 GFLOPS ( 64 runs) | Q5_1    20.5 GFLOPS ( 77 runs) | Q8_0    66.3 GFLOPS (128 runs)
 512 x  512: F16     14.0 GFLOPS ( 53 runs) | F32      9.3 GFLOPS ( 35 runs)
1024 x 1024: Q4_0    31.9 GFLOPS ( 16 runs) | Q4_1    31.0 GFLOPS ( 15 runs)
1024 x 1024: Q5_0    20.0 GFLOPS ( 10 runs) | Q5_1    22.9 GFLOPS ( 11 runs) | Q8_0    80.1 GFLOPS ( 38 runs)
1024 x 1024: F16     14.6 GFLOPS (  7 runs) | F32      8.9 GFLOPS (  5 runs)
2048 x 2048: Q4_0    35.9 GFLOPS (  3 runs) | Q4_1    40.1 GFLOPS (  3 runs)
2048 x 2048: Q5_0    21.2 GFLOPS (  3 runs) | Q5_1    23.6 GFLOPS (  3 runs) | Q8_0    88.0 GFLOPS (  6 runs)
2048 x 2048: F16     14.4 GFLOPS (  3 runs) | F32      8.6 GFLOPS (  3 runs)
4096 x 4096: Q4_0    35.4 GFLOPS (  3 runs) | Q4_1    39.2 GFLOPS (  3 runs)
4096 x 4096: Q5_0    20.0 GFLOPS (  3 runs) | Q5_1    21.2 GFLOPS (  3 runs) | Q8_0    85.0 GFLOPS (  3 runs)
4096 x 4096: F16     13.5 GFLOPS (  3 runs) | F32      8.2 GFLOPS (  3 runs)

./bench  -m ./models/ggml-small.en.bin -t 8 -w 1
memcpy:    9.43 GB/s (heat-up)
memcpy:    9.31 GB/s ( 1 thread)
memcpy:    9.15 GB/s ( 1 thread)
memcpy:    8.74 GB/s ( 2 thread)
memcpy:    8.67 GB/s ( 3 thread)
memcpy:    8.43 GB/s ( 4 thread)
memcpy:    8.42 GB/s ( 5 thread)
memcpy:    8.70 GB/s ( 6 thread)
memcpy:    8.63 GB/s ( 7 thread)
memcpy:    8.32 GB/s ( 8 thread)
sum:    -5119997019.000000

 ./bench-all.sh 
Usage: ./bench.sh [n_threads] [encoder-only]

Running memcpy benchmark

memcpy:    9.38 GB/s (heat-up)
memcpy:    9.39 GB/s ( 1 thread)
memcpy:    9.39 GB/s ( 1 thread)
memcpy:    9.12 GB/s ( 2 thread)
memcpy:    9.05 GB/s ( 3 thread)
memcpy:    8.68 GB/s ( 4 thread)
sum:    -3071998678.000000

memcpy:    9.38 GB/s (heat-up)
memcpy:    9.39 GB/s ( 1 thread)
memcpy:    9.39 GB/s ( 1 thread)
memcpy:    9.12 GB/s ( 2 thread)
memcpy:    9.05 GB/s ( 3 thread)
memcpy:    8.68 GB/s ( 4 thread)
sum:    -3071998678.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     7.3 GFLOPS (128 runs) | Q4_1     7.8 GFLOPS (128 runs)
  64 x   64: Q5_0     6.3 GFLOPS (128 runs) | Q5_1     6.5 GFLOPS (128 runs) | Q8_0     9.4 GFLOPS (128 runs)
  64 x   64: F16      6.2 GFLOPS (128 runs) | F32      2.4 GFLOPS (128 runs)
 128 x  128: Q4_0    15.4 GFLOPS (128 runs) | Q4_1    16.6 GFLOPS (128 runs)
 128 x  128: Q5_0    10.6 GFLOPS (128 runs) | Q5_1    11.5 GFLOPS (128 runs) | Q8_0    25.9 GFLOPS (128 runs)
 128 x  128: F16      9.0 GFLOPS (128 runs) | F32      5.8 GFLOPS (128 runs)
 256 x  256: Q4_0    19.9 GFLOPS (128 runs) | Q4_1    22.8 GFLOPS (128 runs)
 256 x  256: Q5_0    12.8 GFLOPS (128 runs) | Q5_1    13.9 GFLOPS (128 runs) | Q8_0    44.2 GFLOPS (128 runs)
 256 x  256: F16      9.4 GFLOPS (128 runs) | F32      7.6 GFLOPS (128 runs)
 512 x  512: Q4_0    21.7 GFLOPS ( 81 runs) | Q4_1    23.0 GFLOPS ( 86 runs)
 512 x  512: Q5_0    12.9 GFLOPS ( 48 runs) | Q5_1    13.9 GFLOPS ( 52 runs) | Q8_0    48.6 GFLOPS (128 runs)
 512 x  512: F16      8.9 GFLOPS ( 34 runs) | F32      6.8 GFLOPS ( 26 runs)
1024 x 1024: Q4_0    22.1 GFLOPS ( 11 runs) | Q4_1    24.9 GFLOPS ( 12 runs)
1024 x 1024: Q5_0    13.1 GFLOPS (  7 runs) | Q5_1    14.0 GFLOPS (  7 runs) | Q8_0    53.4 GFLOPS ( 25 runs)
1024 x 1024: F16      8.8 GFLOPS (  5 runs) | F32      6.5 GFLOPS (  4 runs)
2048 x 2048: Q4_0    22.6 GFLOPS (  3 runs) | Q4_1    25.7 GFLOPS (  3 runs)
2048 x 2048: Q5_0    13.1 GFLOPS (  3 runs) | Q5_1    14.7 GFLOPS (  3 runs) | Q8_0    57.1 GFLOPS (  4 runs)
2048 x 2048: F16      8.7 GFLOPS (  3 runs) | F32      6.3 GFLOPS (  3 runs)
4096 x 4096: Q4_0    21.5 GFLOPS (  3 runs) | Q4_1    23.4 GFLOPS (  3 runs)
4096 x 4096: Q5_0    12.2 GFLOPS (  3 runs) | Q5_1    13.5 GFLOPS (  3 runs) | Q8_0    53.9 GFLOPS (  3 runs)
4096 x 4096: F16      8.0 GFLOPS (  3 runs) | F32      5.7 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

CPU	OS	Config	Model	Th	Enc.	Dec.	Bch5	PP	Commit
i7-11700F	Ubuntu 20.04		base	4	ms	15.82	15.05	15.71	31989a5a

there is an impressive benchmark result(compare to above bench result in PC which was purchased by RMB12000(about USD 1700) a few years ago) with Xiaomi 14's powerful mobile SoC------Qualcomm SM8650-AB Snapdragon 8 Gen 3 (4 nm) + Xiaomi's HyperOS(derived from Android 14) + Android NDK r21e:

updated on 03-20-2024, Xiaomi 14 + Android NDK r26c( NDK r26c is required for special build optimization:https://github.com/cdeos/kantv/blob/master/external/whispercpp/CMakeLists.txt#L60)

obeone · 2024-04-24T23:50:58Z

CPU	OS	Config	Model	Th	Enc.	Dec.	Bch5	PP	Commit
Macbook M3 Pro	Sonoma 14.5	NEON BLAS METAL	tiny	4	34.15	1.45	0.47	0.03	`858452d`
Macbook M3 Pro	Sonoma 14.5	NEON BLAS METAL	base	4	59.32	2.27	0.79	0.05	`858452d`
Macbook M3 Pro	Sonoma 14.5	NEON BLAS METAL	small	4	200.45	5.50	1.75	0.15	`858452d`
Macbook M3 Pro	Sonoma 14.5	NEON BLAS METAL	medium	4	534.54	12.88	3.90	0.37	`858452d`
Macbook M3 Pro	Sonoma 14.5	NEON BLAS METAL	large-v1	4	989.45	22.29	6.58	0.64	`858452d`
Macbook M3 Pro	Sonoma 14.5	NEON BLAS METAL	large-v2	4	962.34	22.38	6.61	0.64	`858452d`
Macbook M3 Pro	Sonoma 14.5	NEON BLAS METAL	large-v3	4	969.27	22.23	6.59	0.64	`858452d`

nanocosmos-ol · 2024-04-28T17:39:58Z

Different results for different code commits - older version is much faster!

CPU: AMD Ryzen 9 7950X3D 16-Core

commit 858452d Date: Wed Apr 24 14:56:30 2024 +0300

whisper_print_timings: load time = 64.61 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 878.59 ms / 1 runs ( 878.59 ms per run)
whisper_print_timings: decode time = 935.20 ms / 256 runs ( 3.65 ms per run)
whisper_print_timings: batchd time = 544.69 ms / 320 runs ( 1.70 ms per run)
whisper_print_timings: prompt time = 3865.51 ms / 4096 runs ( 0.94 ms per run)
whisper_print_timings: total time = 6225.76 ms

commit d03c60d Date: Wed Nov 8 04:53:31 2023 +0700

system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | COREML = 0 | OPENVINO = 0 |

whisper_print_timings: load time = 83.24 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 693.48 ms / 1 runs ( 693.48 ms per run)
whisper_print_timings: decode time = 874.80 ms / 256 runs ( 3.42 ms per run)
whisper_print_timings: prompt time = 2249.08 ms / 16 runs ( 140.57 ms per run)
whisper_print_timings: total time = 3817.54 ms

dwindibank · 2024-05-23T17:15:45Z

A quick question: When would you want us to run this / report results?

For context, we're looking on using space on one of our old nodes to run a large number of files through Whisper (cpp). It's a server with multiple RTX2080TIs clustered together. I just don't know if knowing that Whisper.cpp runs fast on this out of date (but high spec'd for it's time) setup is useful.

Thanks!

BBC-Esq · 2024-06-06T16:42:39Z

Hello all, I'm trying to benchmark all whisper backends but am having trouble benchmarking whisper.cpp. Since I'm unfamiliar with "compiling" I'm forced to use python bindings. I'm only aware of the following bindings but they all either haven't been updated in a long time or don't implement gpu acceleration:

Also, does whisper.cpp have "batching" by chance? Here's a sample graph I've created. Any feedback would be welcome regarding either how I'm graphing as well as how to test fairly with identical parameters and what not. Thanks!

P.S. faster-whisper doesn't have batching yet so, obviously, that's why there's only one graph for it...

GrantLau1226 · 2024-06-07T08:31:59Z

benchmark result with 11th Gen Intel Core(TM) i7-11700F @ 2.50GHz + Ubuntu 20.04 + gcc version 9.4.0

CPU OS Config Mode Threads Load [ms] Encode [ms]
i7-11700F Ubuntu 20.04 tiny.en 4 46.72 4654.39
i7-11700F Ubuntu 20.04 tiny.en 8 49.85 2981.43
i7-11700F Ubuntu 20.04 small.en 4 175.02 51381.51
i7-11700F Ubuntu 20.04 small.en 8 161.98 29662.80

./bench  -m ./models/ggml-small.en.bin -t 8 -w 2
  64 x   64: Q4_0     4.3 GFLOPS (128 runs) | Q4_1     4.4 GFLOPS (128 runs)
  64 x   64: Q5_0     4.0 GFLOPS (128 runs) | Q5_1     3.5 GFLOPS (128 runs) | Q8_0     4.7 GFLOPS (128 runs)
  64 x   64: F16      4.2 GFLOPS (128 runs) | F32      2.1 GFLOPS (128 runs)
 128 x  128: Q4_0    15.0 GFLOPS (128 runs) | Q4_1    15.3 GFLOPS (128 runs)
 128 x  128: Q5_0    11.9 GFLOPS (128 runs) | Q5_1    12.3 GFLOPS (128 runs) | Q8_0    21.0 GFLOPS (128 runs)
 128 x  128: F16     11.1 GFLOPS (128 runs) | F32      8.7 GFLOPS (128 runs)
 256 x  256: Q4_0    25.4 GFLOPS (128 runs) | Q4_1    29.1 GFLOPS (128 runs)
 256 x  256: Q5_0    17.4 GFLOPS (128 runs) | Q5_1    18.7 GFLOPS (128 runs) | Q8_0    49.1 GFLOPS (128 runs)
 256 x  256: F16     13.8 GFLOPS (128 runs) | F32     10.4 GFLOPS (128 runs)
 512 x  512: Q4_0    31.1 GFLOPS (116 runs) | Q4_1    33.0 GFLOPS (124 runs)
 512 x  512: Q5_0    17.1 GFLOPS ( 64 runs) | Q5_1    20.5 GFLOPS ( 77 runs) | Q8_0    66.3 GFLOPS (128 runs)
 512 x  512: F16     14.0 GFLOPS ( 53 runs) | F32      9.3 GFLOPS ( 35 runs)
1024 x 1024: Q4_0    31.9 GFLOPS ( 16 runs) | Q4_1    31.0 GFLOPS ( 15 runs)
1024 x 1024: Q5_0    20.0 GFLOPS ( 10 runs) | Q5_1    22.9 GFLOPS ( 11 runs) | Q8_0    80.1 GFLOPS ( 38 runs)
1024 x 1024: F16     14.6 GFLOPS (  7 runs) | F32      8.9 GFLOPS (  5 runs)
2048 x 2048: Q4_0    35.9 GFLOPS (  3 runs) | Q4_1    40.1 GFLOPS (  3 runs)
2048 x 2048: Q5_0    21.2 GFLOPS (  3 runs) | Q5_1    23.6 GFLOPS (  3 runs) | Q8_0    88.0 GFLOPS (  6 runs)
2048 x 2048: F16     14.4 GFLOPS (  3 runs) | F32      8.6 GFLOPS (  3 runs)
4096 x 4096: Q4_0    35.4 GFLOPS (  3 runs) | Q4_1    39.2 GFLOPS (  3 runs)
4096 x 4096: Q5_0    20.0 GFLOPS (  3 runs) | Q5_1    21.2 GFLOPS (  3 runs) | Q8_0    85.0 GFLOPS (  3 runs)
4096 x 4096: F16     13.5 GFLOPS (  3 runs) | F32      8.2 GFLOPS (  3 runs)

./bench  -m ./models/ggml-small.en.bin -t 8 -w 1
memcpy:    9.43 GB/s (heat-up)
memcpy:    9.31 GB/s ( 1 thread)
memcpy:    9.15 GB/s ( 1 thread)
memcpy:    8.74 GB/s ( 2 thread)
memcpy:    8.67 GB/s ( 3 thread)
memcpy:    8.43 GB/s ( 4 thread)
memcpy:    8.42 GB/s ( 5 thread)
memcpy:    8.70 GB/s ( 6 thread)
memcpy:    8.63 GB/s ( 7 thread)
memcpy:    8.32 GB/s ( 8 thread)
sum:    -5119997019.000000

 ./bench-all.sh 
Usage: ./bench.sh [n_threads] [encoder-only]

Running memcpy benchmark

memcpy:    9.38 GB/s (heat-up)
memcpy:    9.39 GB/s ( 1 thread)
memcpy:    9.39 GB/s ( 1 thread)
memcpy:    9.12 GB/s ( 2 thread)
memcpy:    9.05 GB/s ( 3 thread)
memcpy:    8.68 GB/s ( 4 thread)
sum:    -3071998678.000000

memcpy:    9.38 GB/s (heat-up)
memcpy:    9.39 GB/s ( 1 thread)
memcpy:    9.39 GB/s ( 1 thread)
memcpy:    9.12 GB/s ( 2 thread)
memcpy:    9.05 GB/s ( 3 thread)
memcpy:    8.68 GB/s ( 4 thread)
sum:    -3071998678.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     7.3 GFLOPS (128 runs) | Q4_1     7.8 GFLOPS (128 runs)
  64 x   64: Q5_0     6.3 GFLOPS (128 runs) | Q5_1     6.5 GFLOPS (128 runs) | Q8_0     9.4 GFLOPS (128 runs)
  64 x   64: F16      6.2 GFLOPS (128 runs) | F32      2.4 GFLOPS (128 runs)
 128 x  128: Q4_0    15.4 GFLOPS (128 runs) | Q4_1    16.6 GFLOPS (128 runs)
 128 x  128: Q5_0    10.6 GFLOPS (128 runs) | Q5_1    11.5 GFLOPS (128 runs) | Q8_0    25.9 GFLOPS (128 runs)
 128 x  128: F16      9.0 GFLOPS (128 runs) | F32      5.8 GFLOPS (128 runs)
 256 x  256: Q4_0    19.9 GFLOPS (128 runs) | Q4_1    22.8 GFLOPS (128 runs)
 256 x  256: Q5_0    12.8 GFLOPS (128 runs) | Q5_1    13.9 GFLOPS (128 runs) | Q8_0    44.2 GFLOPS (128 runs)
 256 x  256: F16      9.4 GFLOPS (128 runs) | F32      7.6 GFLOPS (128 runs)
 512 x  512: Q4_0    21.7 GFLOPS ( 81 runs) | Q4_1    23.0 GFLOPS ( 86 runs)
 512 x  512: Q5_0    12.9 GFLOPS ( 48 runs) | Q5_1    13.9 GFLOPS ( 52 runs) | Q8_0    48.6 GFLOPS (128 runs)
 512 x  512: F16      8.9 GFLOPS ( 34 runs) | F32      6.8 GFLOPS ( 26 runs)
1024 x 1024: Q4_0    22.1 GFLOPS ( 11 runs) | Q4_1    24.9 GFLOPS ( 12 runs)
1024 x 1024: Q5_0    13.1 GFLOPS (  7 runs) | Q5_1    14.0 GFLOPS (  7 runs) | Q8_0    53.4 GFLOPS ( 25 runs)
1024 x 1024: F16      8.8 GFLOPS (  5 runs) | F32      6.5 GFLOPS (  4 runs)
2048 x 2048: Q4_0    22.6 GFLOPS (  3 runs) | Q4_1    25.7 GFLOPS (  3 runs)
2048 x 2048: Q5_0    13.1 GFLOPS (  3 runs) | Q5_1    14.7 GFLOPS (  3 runs) | Q8_0    57.1 GFLOPS (  4 runs)
2048 x 2048: F16      8.7 GFLOPS (  3 runs) | F32      6.3 GFLOPS (  3 runs)
4096 x 4096: Q4_0    21.5 GFLOPS (  3 runs) | Q4_1    23.4 GFLOPS (  3 runs)
4096 x 4096: Q5_0    12.2 GFLOPS (  3 runs) | Q5_1    13.5 GFLOPS (  3 runs) | Q8_0    53.9 GFLOPS (  3 runs)
4096 x 4096: F16      8.0 GFLOPS (  3 runs) | F32      5.7 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Enc. Dec. Bch5 PP Commit
i7-11700F Ubuntu 20.04 base 4 ms 15.82 15.05 15.71 31989a5a
there is an impressive benchmark result(compare to above bench result in PC which was purchased by RMB12000(about USD 1700) a few years ago) with Xiaomi 14's powerful mobile SoC------Qualcomm SM8650-AB Snapdragon 8 Gen 3 (4 nm) + Xiaomi's HyperOS(derived from Android 14) + Android NDK r21e:

updated on 03-20-2024, Xiaomi 14 + Android NDK r26c( NDK r26c is required for special build optimization:https://github.com/cdeos/kantv/blob/master/external/whispercpp/CMakeLists.txt#L60)

Excuse me, may I ask which way you generated the benchmark app?I am now worried because I am not able to benchmark on my phone. Thanks for your answer.

zhouwg · 2024-06-08T01:30:15Z

running the original "bench(which generated by the original build system in project whisper.cpp)" in X86-Linux(Ubuntu 20.04).

benchmark on Android phone is another topic and scenario. the official project whisper.cpp doesn't care this:they focus on core implementation/improvement and focus on MacOS(iOS)/Windows/Linux(I personally think the Android OS is another special Linux distribution).

I maintained a dedicated ggml learning&study project focus on Android and some benchmark items are also provided in this ggml learning&study project accordingly.

BTW, the codes of above two benchmark items are exactly same to the original codes of above benchmark in the project whisper.cpp essentially/technically.

StuartIanNaylor · 2024-06-13T13:01:23Z

Hello all, I'm trying to benchmark all whisper backends but am having trouble benchmarking whisper.cpp. Since I'm unfamiliar with "compiling" I'm forced to use python bindings. I'm only aware of the following bindings but they all either haven't been updated in a long time or don't implement gpu acceleration:

https://github.com/aarnphm/whispercpp

https://github.com/stlukey/whispercpp.py

https://github.com/abdeladim-s/pywhispercpp

Also, does whisper.cpp have "batching" by chance? Here's a sample graph I've created. Any feedback would be welcome regarding either how I'm graphing as well as how to test fairly with identical parameters and what not. Thanks!

P.S. faster-whisper doesn't have batching yet so, obviously, that's why there's only one graph for it...

https://github.com/ggerganov/whisper.cpp?tab=readme-ov-file#quick-start

aleksas · 2024-09-28T22:17:12Z

System Info

The C compiler: GNU 12.2.0
The CXX compiler: GNU 12.2.0
Docker container image: debian:bookworm
Docker host: Ubuntu 22.04.4 LTS
CPU: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz

n_threads	AVX	AVX2	AVX512	FMA	NEON	ARM_FMA	METAL	F16C	FP16_VA	WASM_SIMD	BLAS	SSE3	SSSE3	VSX	CUDA	COREML	OPENVINO
4 / 8	1	1	0	1	0	0	0	1	0	0	0	1	1	0	0	0	0

memcpy

./bench -w 1 -t 1
memcpy:    4.48 GB/s (heat-up)
memcpy:    5.13 GB/s ( 1 thread)
memcpy:    5.48 GB/s ( 1 thread)
sum:    -1535998239.000000

ggml_mul_mat

./bench -w 2 -t 1
  64 x   64: Q4_0     2.6 GFLOPS (128 runs) | Q4_1     2.6 GFLOPS (128 runs)
  64 x   64: Q5_0     2.4 GFLOPS (128 runs) | Q5_1     2.3 GFLOPS (128 runs) | Q8_0     2.8 GFLOPS (128 runs)
  64 x   64: F16      3.2 GFLOPS (128 runs) | F32      0.7 GFLOPS (128 runs)
 128 x  128: Q4_0     4.3 GFLOPS (128 runs) | Q4_1     4.5 GFLOPS (128 runs)
 128 x  128: Q5_0     4.2 GFLOPS (128 runs) | Q5_1     4.0 GFLOPS (128 runs) | Q8_0     5.4 GFLOPS (128 runs)
 128 x  128: F16      5.7 GFLOPS (128 runs) | F32      2.9 GFLOPS (128 runs)
 256 x  256: Q4_0     6.9 GFLOPS (128 runs) | Q4_1     6.0 GFLOPS (128 runs)
 256 x  256: Q5_0     6.0 GFLOPS (128 runs) | Q5_1     5.4 GFLOPS (128 runs) | Q8_0     9.5 GFLOPS (128 runs)
 256 x  256: F16      8.3 GFLOPS (128 runs) | F32      5.4 GFLOPS (128 runs)
 512 x  512: Q4_0     9.2 GFLOPS ( 35 runs) | Q4_1     8.0 GFLOPS ( 30 runs)
 512 x  512: Q5_0     7.1 GFLOPS ( 27 runs) | Q5_1     7.1 GFLOPS ( 27 runs) | Q8_0    11.2 GFLOPS ( 42 runs)
 512 x  512: F16      9.0 GFLOPS ( 34 runs) | F32      5.0 GFLOPS ( 19 runs)
1024 x 1024: Q4_0    10.2 GFLOPS (  5 runs) | Q4_1     9.1 GFLOPS (  5 runs)
1024 x 1024: Q5_0     8.4 GFLOPS (  4 runs) | Q5_1     8.1 GFLOPS (  4 runs) | Q8_0    13.4 GFLOPS (  7 runs)
1024 x 1024: F16      8.8 GFLOPS (  5 runs) | F32      4.0 GFLOPS (  3 runs)
2048 x 2048: Q4_0    11.4 GFLOPS (  3 runs) | Q4_1    10.2 GFLOPS (  3 runs)
2048 x 2048: Q5_0     7.9 GFLOPS (  3 runs) | Q5_1     7.5 GFLOPS (  3 runs) | Q8_0    11.3 GFLOPS (  3 runs)
2048 x 2048: F16      7.8 GFLOPS (  3 runs) | F32      4.4 GFLOPS (  3 runs)
4096 x 4096: Q4_0     9.7 GFLOPS (  3 runs) | Q4_1     9.7 GFLOPS (  3 runs)
4096 x 4096: Q5_0     7.9 GFLOPS (  3 runs) | Q5_1     7.4 GFLOPS (  3 runs) | Q8_0    11.5 GFLOPS (  3 runs)

CPU	OS	Config	Model	Th	Load	Enc.	Commit
i7-8650U	Ubuntu 22.04.4 LTS	AVX2	tiny	4	251.99	2350.88	`c7b6988`
i7-8650U	Ubuntu 22.04.4 LTS	AVX2	base	4	387.53	5259.46	`c7b6988`
i7-8650U	Ubuntu 22.04.4 LTS	AVX2	small	4	939.49	22799.70	`c7b6988`
i7-8650U	Ubuntu 22.04.4 LTS	AVX2	medium	4	2679.57	62713.56	`c7b6988`

aleksas · 2024-10-01T18:24:52Z

System Info

The C compiler: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
The CXX compiler: g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Docker container image: nvidia/cuda:12.2.2-cudnn8-devel-ubuntu22.04
Docker host: Ubuntu 24.04 LTS
CPU: Intel(R) Xeon(R) CPU E5-1660 v3 @ 3.00GHz
GPU: NVIDIA GeForce RTX 4090

n_threads	AVX	AVX2	AVX512	FMA	NEON	ARM_FMA	METAL	F16C	FP16_VA	WASM_SIMD	BLAS	SSE3	SSSE3	VSX	CUDA	COREML	OPENVINO
8 / 16	1	1	0	1	0	0	0	1	0	0	1	1	1	0	1	0	0

memcpy

./bench -w 1 -t 1
memcpy:   13.44 GB/s (heat-up)
memcpy:   13.53 GB/s ( 1 thread)
memcpy:   13.49 GB/s ( 1 thread)
sum:    -1535998239.000000

ggml_mul_mat

./bench -w 2 -t 1
  64 x   64: Q4_0    10.3 GFLOPS (128 runs) | Q4_1     9.8 GFLOPS (128 runs)
  64 x   64: Q5_0     9.3 GFLOPS (128 runs) | Q5_1     8.7 GFLOPS (128 runs) | Q8_0    11.0 GFLOPS (128 runs)
  64 x   64: F16     11.0 GFLOPS (128 runs) | F32      3.0 GFLOPS (128 runs)
 128 x  128: Q4_0    15.5 GFLOPS (128 runs) | Q4_1    15.1 GFLOPS (128 runs)
 128 x  128: Q5_0    13.7 GFLOPS (128 runs) | Q5_1    13.2 GFLOPS (128 runs) | Q8_0    17.6 GFLOPS (128 runs)
 128 x  128: F16     15.6 GFLOPS (128 runs) | F32      9.7 GFLOPS (128 runs)
 256 x  256: Q4_0    20.0 GFLOPS (128 runs) | Q4_1    19.1 GFLOPS (128 runs)
 256 x  256: Q5_0    16.5 GFLOPS (128 runs) | Q5_1    16.0 GFLOPS (128 runs) | Q8_0    23.3 GFLOPS (128 runs)
 256 x  256: F16     19.4 GFLOPS (128 runs) | F32     14.5 GFLOPS (128 runs)
 512 x  512: Q4_0    24.0 GFLOPS ( 90 runs) | Q4_1    23.8 GFLOPS ( 89 runs)
 512 x  512: Q5_0    20.1 GFLOPS ( 76 runs) | Q5_1    19.7 GFLOPS ( 74 runs) | Q8_0    27.8 GFLOPS (104 runs)
 512 x  512: F16     22.9 GFLOPS ( 86 runs) | F32     13.6 GFLOPS ( 51 runs)
1024 x 1024: Q4_0    26.6 GFLOPS ( 13 runs) | Q4_1    27.1 GFLOPS ( 13 runs)
1024 x 1024: Q5_0    21.7 GFLOPS ( 11 runs) | Q5_1    21.5 GFLOPS ( 11 runs) | Q8_0    32.3 GFLOPS ( 16 runs)
1024 x 1024: F16     23.9 GFLOPS ( 12 runs) | F32     13.2 GFLOPS (  7 runs)
2048 x 2048: Q4_0    28.0 GFLOPS (  3 runs) | Q4_1    29.1 GFLOPS (  3 runs)
2048 x 2048: Q5_0    22.4 GFLOPS (  3 runs) | Q5_1    23.3 GFLOPS (  3 runs) | Q8_0    34.4 GFLOPS (  3 runs)
2048 x 2048: F16     24.6 GFLOPS (  3 runs) | F32     12.7 GFLOPS (  3 runs)
4096 x 4096: Q4_0    29.3 GFLOPS (  3 runs) | Q4_1    30.3 GFLOPS (  3 runs)
4096 x 4096: Q5_0    22.9 GFLOPS (  3 runs) | Q5_1    24.0 GFLOPS (  3 runs) | Q8_0    35.3 GFLOPS (  3 runs)
4096 x 4096: F16     24.4 GFLOPS (  3 runs) | F32     11.1 GFLOPS (  3 runs)

Model	Th	Load	Enc.	Commit
CUDA	tiny	8	196.73	2.67
CUDA	base	8	213.35	5.36
CUDA	small	8	313.44	16.15
CUDA	medium	8	570.86	41.80

aleksas · 2024-10-01T18:53:43Z

System Info

The C compiler: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
The CXX compiler: g++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Docker container image: nvidia/cuda:11.4.3-cudnn8-devel-ubuntu20.04
Docker host: Ubuntu 24.04 LTS
CPU: Intel(R) Core(TM) i5-6600K CPU @ 3.50GHz
GPU: NVIDIA GeForce RTX 1080 TI

n_threads	AVX	AVX2	AVX512	FMA	NEON	ARM_FMA	METAL	F16C	FP16_VA	WASM_SIMD	BLAS	SSE3	SSSE3	VSX	CUDA	COREML	OPENVINO
4 / 4	1	1	0	1	0	0	0	1	0	0	1	1	1	0	1	0	0

memcpy

./bench -w 1 -t 1
memcpy:   13.62 GB/s (heat-up)
memcpy:   13.54 GB/s ( 1 thread)
memcpy:   13.62 GB/s ( 1 thread)
sum:    -1535998239.000000

ggml_mul_mat

./bench -w 2 -t 1
  64 x   64: Q4_0    12.2 GFLOPS (128 runs) | Q4_1    11.3 GFLOPS (128 runs)
  64 x   64: Q5_0    11.3 GFLOPS (128 runs) | Q5_1    10.1 GFLOPS (128 runs) | Q8_0    13.2 GFLOPS (128 runs)
  64 x   64: F16     15.3 GFLOPS (128 runs) | F32      3.6 GFLOPS (128 runs)
 128 x  128: Q4_0    19.4 GFLOPS (128 runs) | Q4_1    16.9 GFLOPS (128 runs)
 128 x  128: Q5_0    17.0 GFLOPS (128 runs) | Q5_1    15.5 GFLOPS (128 runs) | Q8_0    21.5 GFLOPS (128 runs)
 128 x  128: F16     22.3 GFLOPS (128 runs) | F32     10.7 GFLOPS (128 runs)
 256 x  256: Q4_0    24.7 GFLOPS (128 runs) | Q4_1    20.5 GFLOPS (128 runs)
 256 x  256: Q5_0    20.4 GFLOPS (128 runs) | Q5_1    18.8 GFLOPS (128 runs) | Q8_0    28.2 GFLOPS (128 runs)
 256 x  256: F16     29.2 GFLOPS (128 runs) | F32     15.4 GFLOPS (128 runs)
 512 x  512: Q4_0    28.9 GFLOPS (108 runs) | Q4_1    25.7 GFLOPS ( 96 runs)
 512 x  512: Q5_0    24.9 GFLOPS ( 93 runs) | Q5_1    23.4 GFLOPS ( 87 runs) | Q8_0    34.3 GFLOPS (128 runs)
 512 x  512: F16     35.0 GFLOPS (128 runs) | F32     13.8 GFLOPS ( 52 runs)
1024 x 1024: Q4_0    33.6 GFLOPS ( 16 runs) | Q4_1    30.2 GFLOPS ( 15 runs)
1024 x 1024: Q5_0    28.3 GFLOPS ( 14 runs) | Q5_1    26.9 GFLOPS ( 13 runs) | Q8_0    40.4 GFLOPS ( 19 runs)
1024 x 1024: F16     33.3 GFLOPS ( 16 runs) | F32     12.9 GFLOPS (  7 runs)
2048 x 2048: Q4_0    36.1 GFLOPS (  3 runs) | Q4_1    32.8 GFLOPS (  3 runs)
2048 x 2048: Q5_0    29.5 GFLOPS (  3 runs) | Q5_1    28.5 GFLOPS (  3 runs) | Q8_0    42.6 GFLOPS (  3 runs)
2048 x 2048: F16     31.0 GFLOPS (  3 runs) | F32     12.2 GFLOPS (  3 runs)
4096 x 4096: Q4_0    36.6 GFLOPS (  3 runs) | Q4_1    33.6 GFLOPS (  3 runs)
4096 x 4096: Q5_0    30.7 GFLOPS (  3 runs) | Q5_1    29.5 GFLOPS (  3 runs) | Q8_0    42.9 GFLOPS (  3 runs)
4096 x 4096: F16     30.1 GFLOPS (  3 runs) | F32     11.7 GFLOPS (  3 runs)

Model	Th	Load	Enc.	Commit
CUDA	tiny	4	196.22	13.52
CUDA	base	4	211.40	27.67
CUDA	small	4	310.89	91.85
CUDA	medium	4	861.47	233.79

slavanorm · 2024-10-27T17:26:22Z

what is faster on mac M1, turbo compiled with coreml or turbo_q5 without it?

magnacartatron · 2024-11-11T12:30:23Z

M4 Mac Mini (Base Model) CoreML flags

./bench -w 1 -t 1 
memcpy:   34.01 GB/s (heat-up)
memcpy:   41.20 GB/s ( 1 thread)
memcpy:   41.44 GB/s ( 1 thread)
sum:    -1536000387.000000

./bench -w 2 -t 1
  64 x   64: Q4_0    40.8 GFLOPS (128 runs) | Q4_1    38.2 GFLOPS (128 runs)
  64 x   64: Q5_0    29.6 GFLOPS (128 runs) | Q5_1    28.8 GFLOPS (128 runs) | Q8_0    44.9 GFLOPS (128 runs)
  64 x   64: F16     43.4 GFLOPS (128 runs) | F32     30.3 GFLOPS (128 runs)
 128 x  128: Q4_0    71.0 GFLOPS (128 runs) | Q4_1    62.6 GFLOPS (128 runs)
 128 x  128: Q5_0    45.7 GFLOPS (128 runs) | Q5_1    42.2 GFLOPS (128 runs) | Q8_0    73.7 GFLOPS (128 runs)
 128 x  128: F16     65.4 GFLOPS (128 runs) | F32     35.2 GFLOPS (128 runs)
 256 x  256: Q4_0    81.4 GFLOPS (128 runs) | Q4_1    72.9 GFLOPS (128 runs)
 256 x  256: Q5_0    57.0 GFLOPS (128 runs) | Q5_1    50.9 GFLOPS (128 runs) | Q8_0    92.7 GFLOPS (128 runs)
 256 x  256: F16     69.4 GFLOPS (128 runs) | F32     40.9 GFLOPS (128 runs)
 512 x  512: Q4_0    85.5 GFLOPS (128 runs) | Q4_1    76.9 GFLOPS (128 runs)
 512 x  512: Q5_0    62.1 GFLOPS (128 runs) | Q5_1    54.1 GFLOPS (128 runs) | Q8_0   100.8 GFLOPS (128 runs)
 512 x  512: F16     81.0 GFLOPS (128 runs) | F32     44.7 GFLOPS (128 runs)
1024 x 1024: Q4_0    89.6 GFLOPS ( 42 runs) | Q4_1    80.0 GFLOPS ( 38 runs)
1024 x 1024: Q5_0    65.8 GFLOPS ( 31 runs) | Q5_1    56.9 GFLOPS ( 27 runs) | Q8_0   110.5 GFLOPS ( 52 runs)
1024 x 1024: F16     88.0 GFLOPS ( 41 runs) | F32     43.4 GFLOPS ( 21 runs)
2048 x 2048: Q4_0    92.2 GFLOPS (  6 runs) | Q4_1    81.4 GFLOPS (  5 runs)
2048 x 2048: Q5_0    67.2 GFLOPS (  4 runs) | Q5_1    57.9 GFLOPS (  4 runs) | Q8_0   116.6 GFLOPS (  7 runs)
2048 x 2048: F16     83.7 GFLOPS (  5 runs) | F32     37.3 GFLOPS (  3 runs)
4096 x 4096: Q4_0    92.7 GFLOPS (  3 runs) | Q4_1    81.5 GFLOPS (  3 runs)
4096 x 4096: Q5_0    67.9 GFLOPS (  3 runs) | Q5_1    58.2 GFLOPS (  3 runs) | Q8_0   119.7 GFLOPS (  3 runs)
4096 x 4096: F16     73.4 GFLOPS (  3 runs) | F32     35.8 GFLOPS (  3 runs)

M1 Ultra 48 Core GPU 64 GB - Standard Metal

/bench -w 1 -t 1
memcpy:   36.55 GB/s (heat-up)
memcpy:   41.00 GB/s ( 1 thread)
memcpy:   41.86 GB/s ( 1 thread)
sum:    -1536000387.000000

./bench -w 2 -t 1
  64 x   64: Q4_0    20.0 GFLOPS (128 runs) | Q4_1    18.4 GFLOPS (128 runs)
  64 x   64: Q5_0    15.3 GFLOPS (128 runs) | Q5_1    14.8 GFLOPS (128 runs) | Q8_0    21.1 GFLOPS (128 runs)
  64 x   64: F16     21.3 GFLOPS (128 runs) | F32     16.4 GFLOPS (128 runs)
 128 x  128: Q4_0    40.9 GFLOPS (128 runs) | Q4_1    37.3 GFLOPS (128 runs)
 128 x  128: Q5_0    27.5 GFLOPS (128 runs) | Q5_1    26.0 GFLOPS (128 runs) | Q8_0    44.1 GFLOPS (128 runs)
 128 x  128: F16     43.8 GFLOPS (128 runs) | F32     27.4 GFLOPS (128 runs)
 256 x  256: Q4_0    51.0 GFLOPS (128 runs) | Q4_1    45.2 GFLOPS (128 runs)
 256 x  256: Q5_0    33.4 GFLOPS (128 runs) | Q5_1    31.3 GFLOPS (128 runs) | Q8_0    58.6 GFLOPS (128 runs)
 256 x  256: F16     53.3 GFLOPS (128 runs) | F32     30.5 GFLOPS (128 runs)
 512 x  512: Q4_0    59.9 GFLOPS (128 runs) | Q4_1    53.1 GFLOPS (128 runs)
 512 x  512: Q5_0    39.3 GFLOPS (128 runs) | Q5_1    35.4 GFLOPS (128 runs) | Q8_0    71.5 GFLOPS (128 runs)
 512 x  512: F16     62.8 GFLOPS (128 runs) | F32     33.4 GFLOPS (125 runs)
1024 x 1024: Q4_0    65.0 GFLOPS ( 31 runs) | Q4_1    58.1 GFLOPS ( 28 runs)
1024 x 1024: Q5_0    42.6 GFLOPS ( 20 runs) | Q5_1    38.3 GFLOPS ( 18 runs) | Q8_0    80.2 GFLOPS ( 38 runs)
1024 x 1024: F16     64.9 GFLOPS ( 31 runs) | F32     30.5 GFLOPS ( 15 runs)
2048 x 2048: Q4_0    68.1 GFLOPS (  4 runs) | Q4_1    60.3 GFLOPS (  4 runs)
2048 x 2048: Q5_0    44.1 GFLOPS (  3 runs) | Q5_1    39.5 GFLOPS (  3 runs) | Q8_0    85.8 GFLOPS (  5 runs)
2048 x 2048: F16     60.0 GFLOPS (  4 runs) | F32     25.4 GFLOPS (  3 runs)
4096 x 4096: Q4_0    69.4 GFLOPS (  3 runs) | Q4_1    61.0 GFLOPS (  3 runs)
4096 x 4096: Q5_0    44.9 GFLOPS (  3 runs) | Q5_1    39.8 GFLOPS (  3 runs) | Q8_0    85.8 GFLOPS (  3 runs)
4096 x 4096: F16     50.8 GFLOPS (  3 runs) | F32     25.3 GFLOPS (  3 runs)

i5-14600k 4070 Ti Super 16GB (555 drivers), 32GB, Ubuntu 24.04 - CUDA Version

./bench -w 1 -t 1
memcpy:   23.88 GB/s (heat-up)
memcpy:   24.09 GB/s ( 1 thread)
memcpy:   24.25 GB/s ( 1 thread)
sum:    -1535998239.000000

./bench -w 2 -t 1
  64 x   64: Q4_0    26.7 GFLOPS (128 runs) | Q4_1    26.1 GFLOPS (128 runs)
  64 x   64: Q5_0    24.3 GFLOPS (128 runs) | Q5_1    22.6 GFLOPS (128 runs) | Q8_0    29.9 GFLOPS (128 runs)
  64 x   64: F16     34.6 GFLOPS (128 runs) | F32      6.5 GFLOPS (128 runs)
 128 x  128: Q4_0    43.0 GFLOPS (128 runs) | Q4_1    41.5 GFLOPS (128 runs)
 128 x  128: Q5_0    37.8 GFLOPS (128 runs) | Q5_1    34.9 GFLOPS (128 runs) | Q8_0    51.7 GFLOPS (128 runs)
 128 x  128: F16     45.8 GFLOPS (128 runs) | F32     14.7 GFLOPS (128 runs)
 256 x  256: Q4_0    57.1 GFLOPS (128 runs) | Q4_1    53.5 GFLOPS (128 runs)
 256 x  256: Q5_0    47.7 GFLOPS (128 runs) | Q5_1    44.9 GFLOPS (128 runs) | Q8_0    69.9 GFLOPS (128 runs)
 256 x  256: F16     50.6 GFLOPS (128 runs) | F32     23.1 GFLOPS (128 runs)
 512 x  512: Q4_0    66.7 GFLOPS (128 runs) | Q4_1    61.3 GFLOPS (128 runs)
 512 x  512: Q5_0    53.2 GFLOPS (128 runs) | Q5_1    50.6 GFLOPS (128 runs) | Q8_0    80.3 GFLOPS (128 runs)
 512 x  512: F16     53.2 GFLOPS (128 runs) | F32     26.2 GFLOPS ( 98 runs)
1024 x 1024: Q4_0    74.2 GFLOPS ( 35 runs) | Q4_1    67.3 GFLOPS ( 32 runs)
1024 x 1024: Q5_0    58.4 GFLOPS ( 28 runs) | Q5_1    54.0 GFLOPS ( 26 runs) | Q8_0    82.9 GFLOPS ( 39 runs)
1024 x 1024: F16     54.7 GFLOPS ( 26 runs) | F32     25.9 GFLOPS ( 13 runs)
2048 x 2048: Q4_0    78.4 GFLOPS (  5 runs) | Q4_1    70.3 GFLOPS (  5 runs)
2048 x 2048: Q5_0    61.0 GFLOPS (  4 runs) | Q5_1    55.5 GFLOPS (  4 runs) | Q8_0    85.6 GFLOPS (  5 runs)
2048 x 2048: F16     55.1 GFLOPS (  4 runs) | F32     25.9 GFLOPS (  3 runs)
4096 x 4096: Q4_0    79.4 GFLOPS (  3 runs) | Q4_1    71.7 GFLOPS (  3 runs)
4096 x 4096: Q5_0    62.5 GFLOPS (  3 runs) | Q5_1    55.1 GFLOPS (  3 runs) | Q8_0    83.0 GFLOPS (  3 runs)
4096 x 4096: F16     51.4 GFLOPS (  3 runs) | F32     24.4 GFLOPS (  3 runs)

What is strange is in the standard ./bench -m models/ggml-large-v3-turbo.bin the m1 Ultra is about twice as fast. But in practice the M4 with MacMini (CoreML converted models) is only about 10% slower in hour long transcriptions compared to the M1 Ultra 48core (using standard Metal make -j).

So the M4 is quite a beefy CPU, the ANE is nice though limiting in what it can do, GPU when running MLX models is about 2x M1 performance. E.g. getting 24 tokens per second on M1 vs 45 on M4, vs 120 on M1 Ultra using Llama 3.2 3B 4bit MLX. I'm surprised that the 4_k quant running on a 4070 Ti Super also gets about 120 tokens/s.

ggerganov added the performance CPU and memory usage - results and comparisons label Oct 25, 2022

ggerganov pinned this issue Oct 26, 2022

This comment was marked as outdated.

Sign in to view

This was referenced Jan 27, 2024

Add avx512 flags to makevars bnosac/audio.whisper#29

Merged

Integrate Metal bnosac/audio.whisper#34

Closed

kultivator-consulting pushed a commit to KultivatorConsulting/whisper.cpp that referenced this issue Feb 12, 2024

Merge pull request ggerganov#89 from marmistrz/whisper.cpp

3fd6010

Update whisper.cpp

nanocosmos-ol mentioned this issue Apr 28, 2024

CPU Performance Regression? (Older version much faster) #2099

Open

jtrmal mentioned this issue Jul 25, 2024

bench fails on Apple M3 Pro #2322

Closed

aleksas mentioned this issue Sep 28, 2024

First load time in Nvidia Jetson AGX Xavier and Orin is more than 10 minutes #2402

Closed

Benchmark results #89

Benchmark results #89

Comments

ggerganov commented Oct 25, 2022 • edited Loading

Encoder

memcpy

MacBook M1 Pro

Ryzen 9 5950X

ggml_mul_mat

MacBook M1 Pro

Ryzen 9 5950X

cdosoftei commented Oct 25, 2022 • edited by ggerganov Loading

rjwilmsi commented Oct 26, 2022

ArtyomZemlyak commented Oct 26, 2022 • edited Loading

ArtyomZemlyak commented Oct 26, 2022 • edited Loading

ArtyomZemlyak commented Oct 26, 2022 • edited Loading

ArtyomZemlyak commented Oct 26, 2022

ggerganov commented Oct 26, 2022

cristianglezm commented Oct 28, 2022 • edited Loading

tazz4843 commented Oct 29, 2022 • edited Loading

yujinqiu commented Oct 30, 2022

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

ggerganov commented Oct 31, 2022

This comment was marked as outdated.

trholding commented Oct 31, 2022

rgerganov commented Nov 5, 2022

jaybinks commented Nov 5, 2022 • edited Loading

mark-beeby commented Nov 8, 2022

niksedk commented Nov 9, 2022 • edited Loading

ggerganov commented Nov 9, 2022

niksedk commented Nov 9, 2022 • edited Loading

j1nx commented Nov 17, 2022 • edited Loading

StuartIanNaylor commented Nov 17, 2022 • edited Loading

dodysw commented Nov 17, 2022 • edited Loading

matth commented Nov 21, 2022

ggerganov commented Nov 21, 2022

matth commented Nov 21, 2022 • edited Loading

nickovs commented Nov 3, 2023

marjisound commented Nov 3, 2023

StuartIanNaylor commented Nov 3, 2023 • edited Loading

ggerganov commented Nov 3, 2023 • edited Loading

nickovs commented Nov 3, 2023

StuartIanNaylor commented Nov 4, 2023 • edited Loading

nickovs commented Nov 4, 2023

jwinarske commented Nov 4, 2023

StuartIanNaylor commented Nov 5, 2023 • edited Loading

petterreinholdtsen commented Feb 24, 2024

zhouwg commented Mar 6, 2024 • edited Loading

obeone commented Apr 24, 2024

nanocosmos-ol commented Apr 28, 2024

Different results for different code commits - older version is much faster!

dwindibank commented May 23, 2024

BBC-Esq commented Jun 6, 2024 • edited Loading

GrantLau1226 commented Jun 7, 2024

zhouwg commented Jun 8, 2024 • edited Loading

StuartIanNaylor commented Jun 13, 2024 • edited Loading

aleksas commented Sep 28, 2024

System Info

memcpy

ggml_mul_mat

aleksas commented Oct 1, 2024 • edited Loading

System Info

memcpy

ggml_mul_mat

aleksas commented Oct 1, 2024

System Info

memcpy

ggml_mul_mat

slavanorm commented Oct 27, 2024

magnacartatron commented Nov 11, 2024 • edited Loading

ggerganov commented Oct 25, 2022 •

edited

Loading

cdosoftei commented Oct 25, 2022 •

edited by ggerganov

Loading

ArtyomZemlyak commented Oct 26, 2022 •

edited

Loading

ArtyomZemlyak commented Oct 26, 2022 •

edited

Loading

ArtyomZemlyak commented Oct 26, 2022 •

edited

Loading

cristianglezm commented Oct 28, 2022 •

edited

Loading

tazz4843 commented Oct 29, 2022 •

edited

Loading

jaybinks commented Nov 5, 2022 •

edited

Loading

niksedk commented Nov 9, 2022 •

edited

Loading

niksedk commented Nov 9, 2022 •

edited

Loading

j1nx commented Nov 17, 2022 •

edited

Loading

StuartIanNaylor commented Nov 17, 2022 •

edited

Loading

dodysw commented Nov 17, 2022 •

edited

Loading

matth commented Nov 21, 2022 •

edited

Loading

StuartIanNaylor commented Nov 3, 2023 •

edited

Loading

ggerganov commented Nov 3, 2023 •

edited

Loading

StuartIanNaylor commented Nov 4, 2023 •

edited

Loading

StuartIanNaylor commented Nov 5, 2023 •

edited

Loading

zhouwg commented Mar 6, 2024 •

edited

Loading

BBC-Esq commented Jun 6, 2024 •

edited

Loading

zhouwg commented Jun 8, 2024 •

edited

Loading

StuartIanNaylor commented Jun 13, 2024 •

edited

Loading

aleksas commented Oct 1, 2024 •

edited

Loading

magnacartatron commented Nov 11, 2024 •

edited

Loading