Bug: Heavy throttling during token generation on Apple Silicon #10444

Open
Azirine opened this issue Nov 21, 2024 · 8 comments
Labels: bug-unconfirmed, medium severity (used to report medium severity bugs in llama.cpp, e.g. malfunctioning features that are still usable)

Comments

@Azirine

Azirine commented Nov 21, 2024

What happened?

There is heavy throttling during token generation on Apple Silicon. The machine tested is a 14-inch MacBook Pro with an M3 Max and 128 GB of memory. In my experience, throttling occurs more often with larger models (≥70B). Qwen2.5 72B Q4_0 GGUF was tested in this case, although the throttling is not exclusive to this model.

The tests were run in High Power Mode with the original 96W adapter plugged in, to ensure that the machine was not power limited. The maximum core temperature during throttling (the middle of the 4th run in this case) hovered between 60 and 70°C, so the throttling should not be due to thermal limits. I have experienced this issue for months across many different versions of llama.cpp, so it is not version specific.

Name and Version

version: 4104 (0fff7fd)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin24.1.0

What operating system are you seeing the problem on?

Mac

Relevant log output

Steps to reproduce:
./llama-bench -m qwen2.5-72b-instruct-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
sleep 10
./llama-bench -m qwen2.5-72b-instruct-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
sleep 10
... (repeated)
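
The run-and-sleep sequence above can be scripted so the number of repetitions and the pause are explicit. This is only a minimal Python sketch, not something from the original report: the names `bench_command` and `run_repro`, the default of six repetitions, and the assumption that `llama-bench` sits in the current directory are all illustrative.

```python
import subprocess
import time


def bench_command(model_path, binary="./llama-bench"):
    """Build the llama-bench invocation used in the steps above."""
    return [binary, "-m", model_path, "-mmp", "0", "-fa", "1", "-p", "0", "-n", "32"]


def run_repro(model_path, repetitions=6, pause_s=10,
              runner=subprocess.run, sleep=time.sleep):
    """Run the benchmark `repetitions` times, pausing `pause_s` seconds between runs.

    `runner` and `sleep` are injectable (hypothetical convenience) so the loop
    can be exercised without actually launching llama-bench.
    """
    outputs = []
    for i in range(repetitions):
        result = runner(bench_command(model_path), capture_output=True, text=True)
        outputs.append(result.stdout)
        if i < repetitions - 1:
            sleep(pause_s)  # idle gap between runs, as in the manual steps
    return outputs
```

Calling `run_repro("qwen2.5-72b-instruct-q4_0.gguf")` would return the captured stdout of each run in order, which makes it easy to compare the first and last runs.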

Results:
| model                          |       size |     params | backend    | threads | fa | mmap |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0                 |  38.53 GiB |    72.96 B | Metal,BLAS |      12 |  1 |    0 |          tg32 |          8.68 ± 0.07 |
| qwen2 70B Q4_0                 |  38.53 GiB |    72.96 B | Metal,BLAS |      12 |  1 |    0 |          tg32 |          8.60 ± 0.15 |
| qwen2 70B Q4_0                 |  38.53 GiB |    72.96 B | Metal,BLAS |      12 |  1 |    0 |          tg32 |          8.51 ± 0.11 |
| qwen2 70B Q4_0                 |  38.53 GiB |    72.96 B | Metal,BLAS |      12 |  1 |    0 |          tg32 |          5.40 ± 1.10 |
| qwen2 70B Q4_0                 |  38.53 GiB |    72.96 B | Metal,BLAS |      12 |  1 |    0 |          tg32 |          2.67 ± 0.58 |
| qwen2 70B Q4_0                 |  38.53 GiB |    72.96 B | Metal,BLAS |      12 |  1 |    0 |          tg32 |          2.37 ± 0.58 |

build: 0fff7fd7 (4104)
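
To put a number on the drop, the t/s column can be parsed and each run compared against the first one. The following is a rough Python sketch under my own assumptions: the helper names `parse_tps` and `throttled_runs` and the 80% threshold are arbitrary choices for illustration, not anything defined by llama.cpp.

```python
import re

# Matches the trailing "t/s" cell of a llama-bench Markdown row, e.g. "| 8.68 ± 0.07 |".
# Header and separator rows contain no "number ± number" cell, so they are skipped.
_TPS_CELL = re.compile(r"\|\s*([0-9]+(?:\.[0-9]+)?)\s*±\s*([0-9]+(?:\.[0-9]+)?)\s*\|\s*$")


def parse_tps(table_text):
    """Return the mean tokens/s of every data row in llama-bench table output."""
    means = []
    for line in table_text.splitlines():
        match = _TPS_CELL.search(line)
        if match:
            means.append(float(match.group(1)))
    return means


def throttled_runs(means, threshold=0.8):
    """Return indices of runs slower than `threshold` times the first run."""
    if not means:
        return []
    baseline = means[0]
    return [i for i, mean in enumerate(means) if mean < threshold * baseline]
```

Applied to the six rows above, the means are 8.68, 8.60, 8.51, 5.40, 2.67 and 2.37 t/s, so runs 4 to 6 fall below 80% of the first run, and the last run is at roughly 27% of the baseline (2.37 / 8.68).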
Azirine added the bug-unconfirmed and medium severity labels Nov 21, 2024
@Azirine
Author

Azirine commented Nov 21, 2024

I have coordinated with an Apple Senior Specialist in an attempt to resolve this issue. On their advice, I have tested this throttling under many different conditions (such as on a brand new OS install with nothing else installed), but the issue remained. I have also provided them with screen recordings of the throttling as it occurred, along with detailed diagnostics obtained with Apple's proprietary Capture Data tool. With this data, they concluded that there was no hardware issue with my device, and that there was no overheating during the tests.

@slaren
Collaborator

slaren commented Nov 21, 2024

I see the same throttling with my M3 Max (same config as yours), but there is not much we can do about that. llama.cpp does not keep any state between runs, so the issue is entirely within the OS or hardware.

@ggerganov
Owner

ggerganov commented Nov 21, 2024

It's probably something related to M3, because I can't reproduce it on either a MacBook M1 Pro or a Mac Studio M2 Ultra:

./llama-bench -m models/qwen2.5-32b-coder-instruct/ggml-model-q4_k.gguf -mmp 0 -fa 1 -p 0 \
    -n 32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,

MacBook M1 Pro

| model | size | params | backend | threads | fa | mmap | test | t/s |
| --- | ---: | ---: | --- | ---: | -: | ---: | ---: | ---: |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.72 ± 0.08 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.62 ± 0.12 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.64 ± 0.12 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.63 ± 0.12 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.69 ± 0.16 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.57 ± 0.15 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.71 ± 0.13 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.67 ± 0.15 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.67 ± 0.12 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.66 ± 0.10 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.66 ± 0.13 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.68 ± 0.16 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.64 ± 0.11 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.59 ± 0.11 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.66 ± 0.11 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.66 ± 0.09 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.62 ± 0.09 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.70 ± 0.13 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.61 ± 0.15 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.68 ± 0.15 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.63 ± 0.14 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.67 ± 0.14 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.68 ± 0.14 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.69 ± 0.11 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.64 ± 0.09 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.69 ± 0.13 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.61 ± 0.12 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.69 ± 0.15 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.59 ± 0.11 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.62 ± 0.10 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.73 ± 0.07 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.63 ± 0.16 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.68 ± 0.12 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.71 ± 0.17 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.69 ± 0.14 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.63 ± 0.11 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.61 ± 0.17 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.70 ± 0.10 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.68 ± 0.14 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.65 ± 0.15 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.64 ± 0.13 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.65 ± 0.10 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.66 ± 0.08 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.61 ± 0.11 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.75 ± 0.20 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.58 ± 0.09 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.63 ± 0.12 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.68 ± 0.14 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.70 ± 0.13 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.64 ± 0.11 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.68 ± 0.13 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.61 ± 0.11 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.70 ± 0.13 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.57 ± 0.15 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.66 ± 0.07 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.72 ± 0.12 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.62 ± 0.18 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.65 ± 0.15 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.67 ± 0.15 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.74 ± 0.14 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.66 ± 0.10 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.65 ± 0.12 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.68 ± 0.14 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.67 ± 0.15 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.66 ± 0.18 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.59 ± 0.08 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.71 ± 0.12 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.66 ± 0.13 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.66 ± 0.21 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.60 ± 0.13 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.69 ± 0.15 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.68 ± 0.15 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.64 ± 0.10 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.69 ± 0.14 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.63 ± 0.07 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.59 ± 0.10 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.76 ± 0.13 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.68 ± 0.12 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.68 ± 0.11 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.68 ± 0.09 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.84 ± 0.33 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.67 ± 0.11 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.61 ± 0.16 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.72 ± 0.11 |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Metal,BLAS | 8 | 1 | 0 | tg32 | 68.61 ± 0.10 |

build: 3ee6382 (4132)

M2 Ultra

| model | size | params | backend | threads | fa | mmap | test | t/s |
| --- | ---: | ---: | --- | ---: | -: | ---: | ---: | ---: |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.82 ± 0.01 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.80 ± 0.03 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.76 ± 0.04 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.77 ± 0.03 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.77 ± 0.02 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.78 ± 0.02 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.80 ± 0.01 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.84 ± 0.04 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.86 ± 0.03 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.93 ± 0.06 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.81 ± 0.02 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.77 ± 0.02 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.79 ± 0.02 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.93 ± 0.02 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.88 ± 0.02 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.93 ± 0.02 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.89 ± 0.04 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.94 ± 0.03 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.90 ± 0.03 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.88 ± 0.03 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.88 ± 0.02 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.84 ± 0.02 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.88 ± 0.04 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.90 ± 0.04 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.90 ± 0.05 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.87 ± 0.02 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.85 ± 0.03 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.81 ± 0.06 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.78 ± 0.01 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.78 ± 0.02 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.88 ± 0.04 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.89 ± 0.03 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.88 ± 0.03 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.88 ± 0.03 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.88 ± 0.02 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.88 ± 0.06 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.88 ± 0.04 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.86 ± 0.03 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.90 ± 0.02 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.89 ± 0.03 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.87 ± 0.04 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.86 ± 0.02 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.89 ± 0.02 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.88 ± 0.03 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.85 ± 0.05 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.87 ± 0.04 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.86 ± 0.04 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.90 ± 0.03 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.92 ± 0.03 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.87 ± 0.04 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.91 ± 0.02 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.89 ± 0.02 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.91 ± 0.03 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.92 ± 0.02 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.89 ± 0.03 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.90 ± 0.02 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.77 ± 0.10 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.57 ± 0.11 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.74 ± 0.07 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.75 ± 0.06 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.80 ± 0.05 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.83 ± 0.05 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.89 ± 0.04 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.92 ± 0.02 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.91 ± 0.03 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.89 ± 0.02 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.92 ± 0.02 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.78 ± 0.14 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.93 ± 0.02 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.90 ± 0.02 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.86 ± 0.03 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.84 ± 0.04 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.76 ± 0.02 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.80 ± 0.04 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.89 ± 0.02 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.90 ± 0.03 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.87 ± 0.03 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.91 ± 0.03 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.90 ± 0.02 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.89 ± 0.03 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.93 ± 0.03 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.89 ± 0.01 |
| qwen2 32B Q4_K | 18.48 GiB | 32.76 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 25.87 ± 0.02 |

build: 1bb30bf (4149)

@ggerganov
Owner

It does not seem to reproduce on M4 Mac Mini either:

| model | size | params | backend | threads | fa | mmap | test | t/s |
| --- | ---: | ---: | --- | ---: | -: | ---: | ---: | ---: |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.43 ± 0.98 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.46 ± 0.31 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.59 ± 0.41 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.29 ± 0.41 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 101.03 ± 0.37 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 101.27 ± 0.42 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 101.14 ± 0.49 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 101.20 ± 0.47 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 101.12 ± 0.48 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.90 ± 0.52 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.34 ± 1.67 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 98.96 ± 0.37 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.48 ± 0.07 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.36 ± 0.93 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 101.34 ± 0.28 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 101.05 ± 0.21 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.99 ± 0.32 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 101.00 ± 0.51 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 101.09 ± 0.49 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.99 ± 0.52 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.99 ± 0.77 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.65 ± 0.15 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.45 ± 0.25 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.80 ± 0.38 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.24 ± 0.20 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 101.03 ± 0.14 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 101.23 ± 0.17 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.99 ± 0.07 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.34 ± 0.67 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.58 ± 0.61 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.77 ± 0.88 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.38 ± 0.30 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.29 ± 0.19 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.63 ± 0.49 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.19 ± 0.67 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.97 ± 0.75 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.79 ± 0.38 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 101.10 ± 0.42 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.62 ± 0.51 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.79 ± 0.61 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.72 ± 0.45 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.61 ± 0.56 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.70 ± 0.47 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.39 ± 0.63 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.58 ± 0.55 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.46 ± 0.14 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.23 ± 0.63 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 98.77 ± 0.90 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 98.12 ± 0.31 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.25 ± 0.57 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.46 ± 0.28 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.03 ± 0.99 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.34 ± 0.35 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.72 ± 0.32 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.74 ± 0.23 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.76 ± 0.22 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.39 ± 0.38 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.28 ± 0.16 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.34 ± 0.40 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.88 ± 0.26 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.50 ± 0.35 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.54 ± 0.45 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.60 ± 0.29 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.85 ± 0.46 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.60 ± 0.47 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.62 ± 1.60 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.37 ± 0.66 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.25 ± 0.29 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.32 ± 0.28 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.00 ± 0.32 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.49 ± 0.65 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.47 ± 0.39 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.78 ± 0.34 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.71 ± 0.15 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.05 ± 0.35 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.31 ± 0.43 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.60 ± 0.76 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.16 ± 0.27 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 98.97 ± 0.31 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.22 ± 0.31 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.46 ± 0.31 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.71 ± 0.37 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.40 ± 0.47 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.37 ± 0.27 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 98.31 ± 1.15 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.28 ± 0.19 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.68 ± 0.45 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.26 ± 0.47 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.23 ± 0.45 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.21 ± 0.59 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 98.80 ± 0.18 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.13 ± 0.31 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 99.65 ± 0.53 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.45 ± 0.44 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.35 ± 0.50 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.50 ± 0.32 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.51 ± 0.21 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.78 ± 0.32 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | Metal,BLAS | 4 | 1 | 0 | tg32 | 100.49 ± 0.45 |

@Azirine
Author

Azirine commented Nov 22, 2024

@ggerganov Can you reproduce it with the exact model and steps (with sleep 10)?

./llama-gguf-split --merge qwen2.5-72b-instruct-q4_0-00001-of-00011.gguf qwen2.5-72b-instruct-q4_0.gguf
SHA256: 8ba2efc6f55d3ddb9544ed2b4f7e8df8d79b87308e4cd624561063d0bc1e0033
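
To confirm that a merged file is byte-identical to the one described here, its SHA-256 can be compared against the digest above. A minimal Python sketch, assuming only the standard `hashlib` module; the helper names `sha256_of_file` and `matches_expected` are made up for illustration:

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in 1 MiB chunks
    so a multi-gigabyte GGUF file does not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's digest with an expected hex string (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_expected("qwen2.5-72b-instruct-q4_0.gguf", "8ba2efc6f55d3ddb9544ed2b4f7e8df8d79b87308e4cd624561063d0bc1e0033")` should be true only if the local merge produced the same file.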

@max-krasnyansky
Collaborator

Here is the M2 Max behavior. I cannot run that Qwen 72B model because my M2 Max has 32 GB of RAM, but a long enough run shows GPU throttling quite well.

~/src/llama.cpp-master$ ./build-macos/bin/llama-bench -m ../gguf/llama-v3.1-8b-instruct.q4_0.gguf -ngl 99 -mmp 0 -fa 1 -p 0 -n 512,512,512,512,512,512
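| model | size | params | backend | threads | fa | mmap | test | t/s |
| --- | ---: | ---: | --- | ---: | -: | ---: | ---: | ---: |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Metal,BLAS | 8 | 1 | 0 | tg512 | 50.36 ± 4.06 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Metal,BLAS | 8 | 1 | 0 | tg512 | 45.84 ± 0.81 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Metal,BLAS | 8 | 1 | 0 | tg512 | 44.51 ± 0.59 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Metal,BLAS | 8 | 1 | 0 | tg512 | 44.11 ± 0.33 |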

You can check the actual frequencies and power with:

$ sudo powermetrics --samplers gpu_power | grep 'Power:\|frequency:'
...
GPU HW active frequency: 444 MHz.               <<< Idle
GPU Power: 8 mW
GPU HW active frequency: 470 MHz
GPU Power: 10 mW
GPU HW active frequency: 1385 MHz               <<< Initial burst
GPU Power: 17490 mW
GPU HW active frequency: 1227 MHz
GPU Power: 28022 mW
GPU HW active frequency: 919 MHz                 <<< Starts throttling
GPU Power: 17929 mW
GPU HW active frequency: 925 MHz
GPU Power: 18163 mW
GPU HW active frequency: 916 MHz
GPU Power: 17993 mW
GPU HW active frequency: 909 MHz
GPU Power: 17915 mW
GPU HW active frequency: 903 MHz
GPU Power: 17857 mW
GPU HW active frequency: 892 MHz
GPU Power: 17661 mW
GPU HW active frequency: 879 MHz
GPU Power: 17409 mW
GPU HW active frequency: 873 MHz               <<< settles around this freq/power

As slaren already explained, there is not much we can do: frequency scaling is controlled internally by the OS and the DCVS firmware.
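
For anyone who wants to log this rather than eyeball it, the filtered `powermetrics` lines can be paired into frequency/power samples. This is a rough Python sketch that assumes only the two line formats shown in the output above (`GPU HW active frequency: N MHz` and `GPU Power: N mW`); the function names `parse_gpu_samples` and `settled_fraction` are hypothetical.

```python
import re

_FREQ = re.compile(r"GPU HW active frequency:\s*(\d+)\s*MHz")
_POWER = re.compile(r"GPU Power:\s*(\d+)\s*mW")


def parse_gpu_samples(text):
    """Pair each frequency line with the power line that follows it.

    Returns a list of (frequency_mhz, power_mw) tuples; power_mw is None when
    a frequency line has no matching power line (e.g. truncated output).
    """
    samples = []
    pending_freq = None
    for line in text.splitlines():
        freq = _FREQ.search(line)
        if freq:
            if pending_freq is not None:
                samples.append((pending_freq, None))
            pending_freq = int(freq.group(1))
            continue
        power = _POWER.search(line)
        if power and pending_freq is not None:
            samples.append((pending_freq, int(power.group(1))))
            pending_freq = None
    if pending_freq is not None:
        samples.append((pending_freq, None))
    return samples


def settled_fraction(samples):
    """Ratio of the last sampled frequency to the peak sampled frequency."""
    if not samples:
        raise ValueError("no samples parsed")
    freqs = [freq for freq, _ in samples]
    return freqs[-1] / max(freqs)
```

On the trace above, the peak is 1385 MHz and the last sample is 873 MHz, so the GPU settles at roughly 63% of its initial burst frequency.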

@ggerganov
Owner

@Azirine Here are the same steps as yours, using a Q4_0 model that I converted from the F16 model with llama-quantize ... q4_0. The throttling does not reproduce on M2 Ultra:

$ ▶ bash test.sh 
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model                          |       size |     params | backend    | threads | fa | mmap |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0                 |  38.39 GiB |    72.71 B | Metal,BLAS |      16 |  1 |    0 |          tg32 |         14.90 ± 0.04 |

build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model                          |       size |     params | backend    | threads | fa | mmap |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0                 |  38.39 GiB |    72.71 B | Metal,BLAS |      16 |  1 |    0 |          tg32 |         14.96 ± 0.01 |

build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model                          |       size |     params | backend    | threads | fa | mmap |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0                 |  38.39 GiB |    72.71 B | Metal,BLAS |      16 |  1 |    0 |          tg32 |         14.96 ± 0.01 |

build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model                          |       size |     params | backend    | threads | fa | mmap |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0                 |  38.39 GiB |    72.71 B | Metal,BLAS |      16 |  1 |    0 |          tg32 |         14.95 ± 0.01 |

build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model                          |       size |     params | backend    | threads | fa | mmap |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0                 |  38.39 GiB |    72.71 B | Metal,BLAS |      16 |  1 |    0 |          tg32 |         14.97 ± 0.00 |

build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model                          |       size |     params | backend    | threads | fa | mmap |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0                 |  38.39 GiB |    72.71 B | Metal,BLAS |      16 |  1 |    0 |          tg32 |         14.96 ± 0.01 |

build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model                          |       size |     params | backend    | threads | fa | mmap |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0                 |  38.39 GiB |    72.71 B | Metal,BLAS |      16 |  1 |    0 |          tg32 |         14.98 ± 0.01 |

build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model                          |       size |     params | backend    | threads | fa | mmap |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0                 |  38.39 GiB |    72.71 B | Metal,BLAS |      16 |  1 |    0 |          tg32 |         14.97 ± 0.00 |

build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model                          |       size |     params | backend    | threads | fa | mmap |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0                 |  38.39 GiB |    72.71 B | Metal,BLAS |      16 |  1 |    0 |          tg32 |         14.97 ± 0.00 |

build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model                          |       size |     params | backend    | threads | fa | mmap |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0                 |  38.39 GiB |    72.71 B | Metal,BLAS |      16 |  1 |    0 |          tg32 |         14.98 ± 0.00 |

build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model                          |       size |     params | backend    | threads | fa | mmap |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0                 |  38.39 GiB |    72.71 B | Metal,BLAS |      16 |  1 |    0 |          tg32 |         14.97 ± 0.01 |

build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model                          |       size |     params | backend    | threads | fa | mmap |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0                 |  38.39 GiB |    72.71 B | Metal,BLAS |      16 |  1 |    0 |          tg32 |         15.00 ± 0.03 |

build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model                          |       size |     params | backend    | threads | fa | mmap |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0                 |  38.39 GiB |    72.71 B | Metal,BLAS |      16 |  1 |    0 |          tg32 |         14.97 ± 0.01 |

build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model                          |       size |     params | backend    | threads | fa | mmap |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0                 |  38.39 GiB |    72.71 B | Metal,BLAS |      16 |  1 |    0 |          tg32 |         14.97 ± 0.01 |

build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model                          |       size |     params | backend    | threads | fa | mmap |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0                 |  38.39 GiB |    72.71 B | Metal,BLAS |      16 |  1 |    0 |          tg32 |         14.99 ± 0.00 |

build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model                          |       size |     params | backend    | threads | fa | mmap |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0                 |  38.39 GiB |    72.71 B | Metal,BLAS |      16 |  1 |    0 |          tg32 |         14.99 ± 0.01 |

build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model                          |       size |     params | backend    | threads | fa | mmap |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0                 |  38.39 GiB |    72.71 B | Metal,BLAS |      16 |  1 |    0 |          tg32 |         14.99 ± 0.01 |

build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model                          |       size |     params | backend    | threads | fa | mmap |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0                 |  38.39 GiB |    72.71 B | Metal,BLAS |      16 |  1 |    0 |          tg32 |         14.99 ± 0.00 |

build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model                          |       size |     params | backend    | threads | fa | mmap |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0                 |  38.39 GiB |    72.71 B | Metal,BLAS |      16 |  1 |    0 |          tg32 |         14.99 ± 0.00 |

build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model                          |       size |     params | backend    | threads | fa | mmap |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0                 |  38.39 GiB |    72.71 B | Metal,BLAS |      16 |  1 |    0 |          tg32 |         14.98 ± 0.00 |

build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model                          |       size |     params | backend    | threads | fa | mmap |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0                 |  38.39 GiB |    72.71 B | Metal,BLAS |      16 |  1 |    0 |          tg32 |         14.99 ± 0.01 |

build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model                          |       size |     params | backend    | threads | fa | mmap |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0                 |  38.39 GiB |    72.71 B | Metal,BLAS |      16 |  1 |    0 |          tg32 |         14.99 ± 0.01 |

build: 6dfcfef0 (4153)

sha256sum ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf 
6ad1ee4cd0330387434608b20dd0ebd26bc9a9355abb9042166d587ef6e17538  ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf
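
For anyone comparing their own runs against the log above, here is a minimal sketch of how the `t/s` column could be pulled out of saved `llama-bench` output and checked for a throttled run. It is not part of llama.cpp. The function names `parse_tps` and `flag_slow_runs` and the 10% tolerance are assumptions chosen for illustration, and the sample rows are copied from the results above.

```python
import re
import statistics

# Matches a llama-bench result row and captures the test name (e.g. tg32)
# plus the mean and standard deviation from the final "t/s" column.
ROW = re.compile(r"\|\s*(tg\d+|pp\d+)\s*\|\s*([\d.]+)\s*±\s*([\d.]+)\s*\|\s*$")


def parse_tps(log_text):
    """Return one (test, mean_tps, std_tps) tuple per result row in the log."""
    rows = []
    for line in log_text.splitlines():
        m = ROW.search(line)
        if m:
            rows.append((m.group(1), float(m.group(2)), float(m.group(3))))
    return rows


def flag_slow_runs(rows, tolerance=0.10):
    """Return indices of runs whose mean t/s is more than `tolerance`
    (an arbitrary, assumed threshold) below the median of all runs."""
    if not rows:
        return []
    median = statistics.median(mean for _, mean, _ in rows)
    return [i for i, (_, mean, _) in enumerate(rows) if mean < median * (1 - tolerance)]


# Three result rows copied from the log above; header and separator rows are skipped.
SAMPLE = """\
| model                          |       size |     params | backend    | threads | fa | mmap |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0                 |  38.39 GiB |    72.71 B | Metal,BLAS |      16 |  1 |    0 |          tg32 |         14.97 ± 0.00 |
| qwen2 70B Q4_0                 |  38.39 GiB |    72.71 B | Metal,BLAS |      16 |  1 |    0 |          tg32 |         14.98 ± 0.00 |
| qwen2 70B Q4_0                 |  38.39 GiB |    72.71 B | Metal,BLAS |      16 |  1 |    0 |          tg32 |         15.00 ± 0.03 |
"""

if __name__ == "__main__":
    rows = parse_tps(SAMPLE)
    print(flag_slow_runs(rows))  # [] for these three rows: no run is flagged
```

Comparing against the median rather than the first run keeps the check usable when the first run itself is the slow one.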

@Azirine
Author

Azirine commented Nov 23, 2024

I observed that this behaviour is different from the throttling that occurs after the initial burst.

Throttling after initial burst:
[Screenshot: mpv-shot0001]

Heavy throttling (note nominal thermal pressure):
[Screenshot: mpv-shot0002]

Slow recovery in frequency and power after one minute:
[Screenshot: mpv-shot0003]
