Performance of llama.cpp on Apple Silicon A-series #4508

ggerganov · 2023-12-17T17:57:46Z

ggerganov
Dec 17, 2023
Maintainer

Summary

🟥 - benchmark data missing
🟨 - benchmark data partial
✅ - benchmark data available

PP means "prompt processing" (bs = 512), TG means "text-generation" (bs = 1)

TinyLlama 1.1B

	CPU Cores	GPU Cores	F16 PP [t/s]	F16 TG [t/s]	Q8_0 PP [t/s]	Q8_0 TG [t/s]	Q4_0 PP [t/s]	Q4_0 TG [t/s]
✅ A14 ¹	2+4	4	251.98	10.26	250.54	24.11	242.37	39.21
🟥 A15 ²	2+3	5
✅ A15 ²	2+4	4	X	X	411.16	24.12	405.30	39.03
✅ A15 ²	2+4	5	531.03	13.66	494.18	23.84	496.49	39.09
✅ A16 ³	2+4	5	565.68	20.06	511.30	34.30	505.52	54.24
✅ A17 ⁴	2+4	6	683.95	20.23	637.14	35.60	646.06	56.86

Phi-2 2.7B

	CPU Cores	GPU Cores	Q8_0 PP [t/s]	Q8_0 TG [t/s]	Q4_0 PP [t/s]	Q4_0 TG [t/s]
✅ A14 ¹	2+4	4	X	X	51.39	8.52
🟥 A15 ²	2+3	5
🟥 A15 ²	2+4	4
✅ A15 ²	2+4	5	X	X	120.47	16.73
✅ A16 ³	2+4	5	119.58	14.06	121.64	23.31
✅ A17 ⁴	2+4	6	158.03	14.74	157.33	24.71

Mistral 7B

	CPU Cores	GPU Cores	Q4_0 PP [t/s]	Q4_0 TG [t/s]
✅ A14 ¹	2+4	4	X	X
🟥 A15 ²	2+3	5
🟥 A15 ²	2+4	4
✅ A15 ²	2+4	5	X	X
🟥 A16 ³	2+4	5
✅ A17 ⁴	2+4	6	80.55	9.01

Description

This is a collection of short llama.cpp benchmarks on various Apple Silicon hardware. It can be useful to compare the performance that llama.cpp achieves across the A-series chips. A similar collection for the M-series is available here: #4167

	CPU Cores	GPU Cores	Memory [GB]	Devices
A14	2+4	4	4-6	iPhone 12 (all variants), iPad Air (4th gen), iPad (10th gen)
A15	2+3	5	4	Apple TV 4K (3rd gen)
A15	2+4	4	4	iPhone SE (3rd gen), iPhone 13 & Mini
A15	2+4	5	4-6	iPad Mini (6th gen), iPhone 13 Pro & Pro Max, iPhone 14 & Plus
A16	2+4	5	6	iPhone 14 Pro & Pro Max, iPhone 15 & Plus
A17 Pro	2+4	6	8	iPhone 15 Pro & Pro Max

Instructions

Clone the project

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout 0e18b2e

Open the examples/llama.swiftui with Xcode
Enable Release build
Deploy on your iPhone / iPad
Stop Xcode and run the app from the device. This is important because the performance when running through Xcode is significantly slower
Download the models and run the "Bench" for each one
Running the "Bench" a second time can give more accurate results
Copy the results into the comments below, adding information about the device

iPhone 13 mini ✅

model	size	params	backend	test	t/s
llama 1B Q8_0	1.09 GiB	1.10 B	Metal	pp 512	411.16 ± 6.22
llama 1B Q8_0	1.09 GiB	1.10 B	Metal	tg 128	24.12 ± 0.04
llama 1B Q4_0	0.59 GiB	1.10 B	Metal	pp 512	405.30 ± 7.26
llama 1B Q4_0	0.59 GiB	1.10 B	Metal	tg 128	39.03 ± 0.08
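
For anyone wondering how to read the "t/s" column with its ± value: a minimal sketch, assuming each test is repeated a few times, each repetition is timed, and the table reports the mean and standard deviation of tokens / seconds. The function name and the timings are made up for illustration, and whether the app uses the sample or population standard deviation is not stated here.

```python
import statistics


def tokens_per_second(n_tokens, seconds_per_run):
    """Return (mean, sample standard deviation) of throughput over runs."""
    rates = [n_tokens / s for s in seconds_per_run]
    mean = statistics.mean(rates)
    # stdev needs at least two samples; report 0.0 for a single run.
    std = statistics.stdev(rates) if len(rates) > 1 else 0.0
    return mean, std


# Hypothetical timings (seconds) for three "pp 512" repetitions.
mean, std = tokens_per_second(512, [1.25, 1.24, 1.26])
print(f"pp 512: {mean:.2f} ± {std:.2f} t/s")
```

A large ± relative to the mean usually suggests that the runs were not stable (thermal throttling, background activity), which is one reason a second "Bench" run can be worth doing.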

jhen0409 · 2023-12-17T22:47:28Z

jhen0409
Dec 17, 2023
Collaborator Sponsor

iPhone 15 Pro (A17 Pro) ✅

model	size	params	backend	test	t/s
llama 1B F16	2.05 GiB	1.10 B	Metal	pp 512	683.95 ± 8.24
llama 1B F16	2.05 GiB	1.10 B	Metal	tg 128	20.23 ± 0.08
llama 1B Q8_0	1.09 GiB	1.10 B	Metal	pp 512	637.14 ± 18.73
llama 1B Q8_0	1.09 GiB	1.10 B	Metal	tg 128	35.60 ± 0.25
llama 1B Q4_0	0.59 GiB	1.10 B	Metal	pp 512	646.06 ± 17.17
llama 1B Q4_0	0.59 GiB	1.10 B	Metal	tg 128	56.86 ± 0.17
phi2 3B Q8_0	2.75 GiB	2.78 B	Metal	pp 512	158.03 ± 14.03
phi2 3B Q8_0	2.75 GiB	2.78 B	Metal	tg 128	14.74 ± 0.07
phi2 3B Q4_0	1.49 GiB	2.78 B	Metal	pp 512	157.33 ± 14.25
phi2 3B Q4_0	1.49 GiB	2.78 B	Metal	tg 128	24.71 ± 0.04
llama 7B Q4_0	3.83 GiB	7.24 B	Metal	pp 512	80.55 ± 21.88
llama 7B Q4_0	3.83 GiB	7.24 B	Metal	tg 128	9.01 ± 0.50

7 replies

ggerganov Dec 18, 2023
Maintainer Author

The models above are now available in the app

jhen0409 Dec 18, 2023
Collaborator Sponsor

Added Phi-2 and F16 TinyLlama results. Also checked that the previous results are unchanged at commit 0e18b2e.

Dampfinchen Dec 20, 2023

Text Generation speed using Mistral is more than useable on newer iPhones it seems. Prompt processing is very slow however, even when using Metal. I wonder if this is a compute or bandwidth limitation.
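
One rough way to approach the compute-vs-bandwidth question: during text generation (bs = 1), each new token has to read essentially all model weights once, so model size × TG t/s gives a crude lower bound on the memory traffic the chip is sustaining. This is a rule-of-thumb sketch, not a measurement. It ignores the KV cache and activations, and the function name is made up for illustration. The sizes and speeds are the A17 Pro numbers reported above.

```python
def effective_bandwidth_gib_s(model_size_gib, tg_tokens_per_s):
    """Crude estimate: every generated token streams all weights once."""
    return model_size_gib * tg_tokens_per_s


# (label, model size in GiB, TG t/s) taken from the iPhone 15 Pro table above.
rows = [
    ("TinyLlama 1.1B F16", 2.05, 20.23),
    ("TinyLlama 1.1B Q4_0", 0.59, 56.86),
    ("Mistral 7B Q4_0", 3.83, 9.01),
]

for label, size_gib, tg in rows:
    bw = effective_bandwidth_gib_s(size_gib, tg)
    print(f"{label}: ~{bw:.1f} GiB/s of weight reads")
```

If the estimates come out similar across model sizes and quantizations, that would hint that TG is limited by memory bandwidth. Prompt processing (bs = 512) reuses each weight across the whole batch, so it is more likely to be limited by compute.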

shouryan01 Feb 6, 2024

What's the minimum t/s for it to be usable? For example, is 9 t/s usable for Mistral 7b?

rhematt Feb 11, 2024

It's a UX design principle. For a model to be usable, the benchmark I've been working towards is 400 ms for the first token and for each subsequent token. If a token isn't delivered to the user at least every 400 ms, the model won't be usable. Minimum usability should, therefore, be about 3 t/s (1 / 0.4 s = 2.5 t/s, rounded up)...
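
The arithmetic behind that threshold, as a small sketch. The 400 ms budget is the figure from the comment above; the function names are made up for illustration.

```python
def min_tokens_per_second(max_gap_ms):
    """Slowest generation speed that still delivers a token every max_gap_ms."""
    return 1000.0 / max_gap_ms


def gap_ms(tokens_per_s):
    """Average time between tokens at a given generation speed."""
    return 1000.0 / tokens_per_s


def is_usable(tokens_per_s, max_gap_ms=400):
    return gap_ms(tokens_per_s) <= max_gap_ms


print(min_tokens_per_second(400))  # 2.5
print(is_usable(9.01))             # True: roughly 111 ms between tokens
```

By this measure, the 9 t/s reported for Mistral 7B Q4_0 on the A17 Pro is comfortably above the minimum. Note that this only covers generation speed, not the time spent processing the prompt before the first token appears.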

ymcui · 2023-12-18T07:50:07Z

ymcui
Dec 18, 2023

iPhone 15 Pro Max (A17 Pro) ✅

model	size	params	backend	test	t/s
llama 1B F16	2.05 GiB	1.10 B	Metal	pp 512	652.70 ± 18.14
llama 1B F16	2.05 GiB	1.10 B	Metal	tg 128	19.82 ± 0.30
llama 1B Q8_0	1.09 GiB	1.10 B	Metal	pp 512	662.89 ± 11.28
llama 1B Q8_0	1.09 GiB	1.10 B	Metal	tg 128	34.95 ± 0.08
llama 1B Q4_0	0.59 GiB	1.10 B	Metal	pp 512	645.78 ± 9.16
llama 1B Q4_0	0.59 GiB	1.10 B	Metal	tg 128	54.95 ± 0.14

Tested under iOS 17.3 Developer beta 1 (21D5026f)

3 replies

ggerganov Dec 18, 2023
Maintainer Author

Thanks - the iPhone 15 Pro Max should be using the same A17 chip as the iPhone 15 Pro, correct? At least this is what I get from Wikipedia, and the numbers seem to mostly match the ones from @jhen0409 above

ymcui Dec 18, 2023

Yes. Both the iPhone 15 Pro and Pro Max use the A17. More precisely, Apple names it A17 Pro. See here.

ymcui Dec 18, 2023

@ggerganov Just updated F16 results. The model is taken from https://huggingface.co/SergiusFlavius/TinyLlama-1.1B-1T-OpenOrca-GGUF/blob/main/tinyllama-1.1b-1t-openorca.F16.gguf

ymcui · 2023-12-18T10:39:26Z

ymcui
Dec 18, 2023

iPhone 12 mini (A14) ✅

tinyllama:

model	size	params	backend	test	t/s
llama 1B F16	2.05 GiB	1.10 B	Metal	pp 512	251.98 ± 5.15
llama 1B F16	2.05 GiB	1.10 B	Metal	tg 128	10.26 ± 4.23
llama 1B Q8_0	1.09 GiB	1.10 B	Metal	pp 512	250.54 ± 0.95
llama 1B Q8_0	1.09 GiB	1.10 B	Metal	tg 128	24.11 ± 0.02
llama 1B Q4_0	0.59 GiB	1.10 B	Metal	pp 512	242.37 ± 0.81
llama 1B Q4_0	0.59 GiB	1.10 B	Metal	tg 128	39.21 ± 0.25

phi-2:

model	size	params	backend	test	t/s
phi2 3B Q8_0	2.75 GiB	2.78 B	Metal	pp 512	CRASHED
phi2 3B Q8_0	2.75 GiB	2.78 B	Metal	tg 128	CRASHED
phi2 3B Q4_0	1.49 GiB	2.78 B	Metal	pp 512	51.39 ± 11.89
phi2 3B Q4_0	1.49 GiB	2.78 B	Metal	tg 128	8.52 ± 2.78

phi-2 3B Q8_0 (2.75 GiB) cannot be loaded. The phone gets restarted when loading it.
I won't test Mistral-7B-Q4_0 (3.8 GiB) on iPhone 12 mini either, because it's too large to fit in memory (4 GiB). iPhone 12 Pro & Pro Max may have a chance to run it, as they have 6 GB RAM.
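
A back-of-the-envelope version of this "will it fit" check, as a sketch only. The `usable_fraction` parameter is a hypothetical knob, not a value documented by Apple: the memory an app can actually use depends on the device and OS, and the reports in this thread do not follow a single threshold, so treat the result as a first guess rather than a guarantee.

```python
def fits_in_memory(model_gib, device_ram_gib, usable_fraction=0.65):
    """Guess whether a model file of model_gib fits on a device.

    usable_fraction is an assumed share of RAM available to one app;
    it is a placeholder for illustration, not a measured iOS limit.
    """
    return model_gib <= device_ram_gib * usable_fraction


# Model sizes (GiB) as reported in the benchmark tables.
models = {"phi2 3B Q4_0": 1.49, "phi2 3B Q8_0": 2.75, "llama 7B Q4_0": 3.83}

for name, size in models.items():
    print(name, "fits in 4 GiB:", fits_in_memory(size, 4.0))
```

The only reliable test is still to try loading the model on the device, since the model weights are not the only memory the app needs (KV cache, compute buffers, the app itself).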

Tested under iOS 17.1.2 (21B101)

1 reply

ymcui Dec 19, 2023

Added phi-2 3B Q4_0 results. The others can't be loaded.

ymcui · 2023-12-19T05:32:29Z

ymcui
Dec 19, 2023

Some additional info with memory and relevant devices.

	CPU Cores	GPU Cores	Memory [GB]	Devices
A14	2+4	4	4-6	iPhone 12 (all variants), iPad Air (4th gen), iPad (10th gen)
A15	2+3	5	4	Apple TV 4K (3rd gen)
A15	2+4	4	4	iPhone SE (3rd gen), iPhone 13 & Mini
A15	2+4	5	4-6	iPad Mini (6th gen), iPhone 13 Pro & Pro Max, iPhone 14 & Plus
A16	2+4	5	6	iPhone 14 Pro & Pro Max, iPhone 15 & Plus
A17 Pro	2+4	6	8	iPhone 15 Pro & Pro Max

4 replies

ymcui Dec 21, 2023

Apple TV 4K (3rd gen) seems to be the only one that has the A15 (5 CPU + 5 GPU).
I checked my Apple TV 4K, and unfortunately, it is 2nd gen (A12) 😂

ggerganov Dec 21, 2023
Maintainer Author

Damn, first LLM on a TV 😄

nikolay-kapustin Dec 21, 2023

If you really need this, I can build and run these tests on an Apple TV 4K (3rd gen) :)

ggerganov Dec 21, 2023
Maintainer Author

Don't really need it, but it might be a cool achievement. LLM on a watch has already been demonstrated: https://twitter.com/shxf0072/status/1736713832045982040, but I don't think that is the case for LLM on a TV

nikolay-kapustin · 2023-12-19T23:27:43Z

nikolay-kapustin
Dec 19, 2023

iPhone 13 Pro (A15) ✅

model	size	params	backend	test	t/s
llama 1B Q4_0	0.59 GiB	1.10 B	Metal	pp 512	496.49 ± 3.82
llama 1B Q4_0	0.59 GiB	1.10 B	Metal	tg 128	39.09 ± 0.12
llama 1B Q8_0	1.09 GiB	1.10 B	Metal	pp 512	494.18 ± 4.93
llama 1B Q8_0	1.09 GiB	1.10 B	Metal	tg 128	23.84 ± 0.05
llama 1B F16	2.05 GiB	1.10 B	Metal	pp 512	531.03 ± 5.96
llama 1B F16	2.05 GiB	1.10 B	Metal	tg 128	13.66 ± 0.02
phi2 3B Q4_0	1.49 GiB	2.78 B	Metal	pp 512	120.47 ± 1.44
phi2 3B Q4_0	1.49 GiB	2.78 B	Metal	tg 128	16.73 ± 0.02

Also, the phi-2 3B Q8_0 model cannot be loaded,
and Mistral-7B-Q4_0 (3.8 GiB) does not fit in memory.

0 replies

Krish120003 · 2023-12-20T00:09:52Z

Krish120003
Dec 20, 2023

iPhone 14 Pro (A16) ✅

model	size	params	backend	test	t/s
llama 1B Q4_0	0.59 GiB	1.10 B	Metal	pp 512	505.52 ± 0.58
llama 1B Q4_0	0.59 GiB	1.10 B	Metal	tg 128	54.24 ± 0.04
llama 1B Q8_0	1.09 GiB	1.10 B	Metal	pp 512	511.30 ± 1.00
llama 1B Q8_0	1.09 GiB	1.10 B	Metal	tg 128	34.30 ± 0.13
llama 1B F16	2.05 GiB	1.10 B	Metal	pp 512	565.68 ± 0.21
llama 1B F16	2.05 GiB	1.10 B	Metal	tg 128	20.06 ± 0.04
phi2 3B Q4_0	1.49 GiB	2.78 B	Metal	pp 512	121.64 ± 0.01
phi2 3B Q4_0	1.49 GiB	2.78 B	Metal	tg 128	23.31 ± 0.05
phi2 3B Q8_0	2.75 GiB	2.78 B	Metal	pp 512	119.58 ± 0.05
phi2 3B Q8_0	2.75 GiB	2.78 B	Metal	tg 128	14.06 ± 0.14

2 replies

ggerganov Dec 20, 2023
Maintainer Author

Thanks! I guess Mistral 7B does not fit on this device?

Krish120003 Dec 20, 2023

@ggerganov It does. I can load it and run it by sending a message, but the Bench button keeps aborting, indicating that the heat up time is too long.

Pablo-Merino · 2023-12-20T17:24:07Z

Pablo-Merino
Dec 20, 2023

iPhone 12 (A14) 🟨

model	size	params	backend	test	t/s
llama 1B Q4_0	0.59 GiB	1.10 B	Metal	pp 512	227.46 ± 14.55
llama 1B Q4_0	0.59 GiB	1.10 B	Metal	tg 128	37.89 ± 0.27
llama 1B Q8_0	1.09 GiB	1.10 B	Metal	pp 512	224.22 ± 22.57
llama 1B Q8_0	1.09 GiB	1.10 B	Metal	tg 128	23.23 ± 0.12

llama 1B F16 has a long heat up time, so it aborted (28.16 seconds)
phi2 3B Q4_0 has a long heat up time, so it aborted (9.44 seconds)
phi2 Q8_0 crashes the app
Not testing the Mistral model since it's too large for the device's RAM (3.8GB model vs 4GB device)

0 replies

anchorbob · 2023-12-27T20:19:58Z

anchorbob
Dec 27, 2023

Can anyone tell me what the output metric (t/s) means? Tokens per second, or something else?

2 replies

XiongjieDai Jan 4, 2024

It's tokens per second.

anchorbob Jan 29, 2024

thanks for confirmation

xsailor511 · 2024-01-29T06:22:29Z

xsailor511
Jan 29, 2024

Can anyone share the download link for llama 1B? I can't find it on HF, or I'm not sure which one it is.

1 reply

lawyinking Feb 28, 2024

https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v0.3-GGUF/tree/main

cosmo3769 · 2024-03-13T05:32:37Z

cosmo3769
Mar 13, 2024

Hi, I was trying to load starcoderbase-3b-GGUF. It does not load in the iPhone 15 Pro simulator; it is stuck at "Loading model...".
While investigating, I encountered one warning message: "Publishing changes from background threads is not allowed; make sure to publish values from the main thread (via operators like receive(on:)) on model updates." What could be the cause of this? Thank you.

1 reply

cosmo3769 Mar 14, 2024

The above model loaded on my device (iPhone 13, iOS 17.3). But when I try to send a message or benchmark it, I get the "Heat up time is too long" message. I get the same message in the iPhone 15, iPhone 15 Pro, and iPhone 15 Pro Max simulators as well. The size of the model is 2.05 GB. @ggerganov

beebopkim · 2024-03-21T14:59:45Z

beebopkim
Mar 21, 2024

iPhone SE (3rd Generation), A15 2+4 CPU, 4 GPU, 4 GB of RAM

model	size	params	backend	test	t/s
llama 1B F16	2.05 GiB	1.10 B	Metal	pp 512	428.48 ± 1.24
llama 1B F16	2.05 GiB	1.10 B	Metal	tg 128	13.63 ± 0.03
llama 1B Q8_0	1.09 GiB	1.10 B	Metal	pp 512	141.71 ± 59.55
llama 1B Q8_0	1.09 GiB	1.10 B	Metal	tg 128	15.03 ± 0.87
llama 1B Q4_0	0.59 GiB	1.10 B	Metal	pp 512	152.40 ± 53.64
llama 1B Q4_0	0.59 GiB	1.10 B	Metal	tg 128	18.37 ± 1.67
phi2 3B Q8_0	2.75 GiB	2.78 B	Metal	pp 512	Model loaded but benchmark failed because llama.swift was killed
phi2 3B Q8_0	2.75 GiB	2.78 B	Metal	tg 128	Model loaded but benchmark failed because llama.swift was killed
phi2 3B Q4_0	1.49 GiB	2.78 B	Metal	pp 512	29.73 ± 5.09
phi2 3B Q4_0	1.49 GiB	2.78 B	Metal	tg 128	7.08 ± 2.08
llama 7B Q4_0	3.83 GiB	7.24 B	Metal	pp 512	Looks like model was loaded but benchmark failed with `Heat up time is too long`
llama 7B Q4_0	3.83 GiB	7.24 B	Metal	tg 128	Looks like model was loaded but benchmark failed with `Heat up time is too long`

0 replies

aneeshmb02 · 2024-05-10T11:11:08Z

aneeshmb02
May 10, 2024

What data/prompts are used for this?

0 replies

eakarsu · 2024-06-12T18:21:31Z

eakarsu
Jun 12, 2024

I have run llama.cpp on an iOS device (iPhone) as described here, but the models are giving garbage responses. What am I doing wrong?

0 replies

kinchahoy · 2024-11-05T10:08:35Z

kinchahoy
Nov 5, 2024

Would it be possible to update these instructions for a recent version of Xcode? I get a simple error that I can't quite figure out:
"/Users/USER/inference/llama.cpp/air-lld:1:1 81 duplicated symbols for target 'air64_v23-apple-ios14.0.0-simulator"

0 replies

Beingpax · 2024-11-26T11:13:57Z

Beingpax
Nov 26, 2024

I'm getting the same error as kinchahoy.

0 replies
