Bug: gemma-2-9b-it inference speed very slow 1.73 tokens per second #9906
Labels: bug-unconfirmed, low severity
What happened?
System Info
Device: Ascend 910B3
OS: Ubuntu 20.04.6 LTS
Arch: aarch64
command:
./build/bin/llama-cli -m ./models/gemma-2-9b-it.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm none -mg 0
log:
llama_perf_sampler_print: sampling time = 152.66 ms / 414 runs ( 0.37 ms per token, 2711.82 tokens per second)
llama_perf_context_print: load time = 7152.40 ms
llama_perf_context_print: prompt eval time = 619.28 ms / 14 tokens ( 44.23 ms per token, 22.61 tokens per second)
llama_perf_context_print: eval time = 230735.63 ms / 399 runs ( 578.28 ms per token, 1.73 tokens per second)
llama_perf_context_print: total time = 231911.05 ms / 413 tokens
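As a sanity check on the headline number, the per-token latency and throughput reported above can be recomputed directly from the eval time and run count in the log (a minimal sketch, using only the figures printed by llama_perf):

```python
# Recompute the eval-phase figures from the llama_perf log above.
eval_ms = 230735.63   # total eval time in ms
runs = 399            # number of generated tokens

ms_per_token = eval_ms / runs              # ~578.28 ms per token
tokens_per_s = 1000.0 * runs / eval_ms     # ~1.73 tokens per second
print(f"{ms_per_token:.2f} ms/token, {tokens_per_s:.2f} tokens/s")
```

The reported 1.73 tokens/s is consistent with the raw timings, so the slowdown is real eval-time cost rather than a logging artifact.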
Name and Version
./build/bin/llama-cli --version
version: 3923 (becfd38)
built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for aarch64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output
During generation, CPU usage is very high while NPU usage stays low, which suggests the Ascend NPU is not actually being used for inference despite -ngl 33.
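One common cause of this symptom is a binary built without the CANN (Ascend NPU) backend, in which case -ngl is silently ignored and everything runs on the CPU. A hedged sketch of how to rebuild with the backend enabled, assuming the Ascend CANN toolkit is installed (the GGML_CANN flag is from llama.cpp's docs/backend/CANN.md; paths and the rest of the command are taken from this report):

```shell
# Rebuild with the CANN backend enabled (requires the Ascend CANN toolkit
# to be installed and its environment sourced; flag name per
# llama.cpp docs/backend/CANN.md).
cmake -B build -DGGML_CANN=on -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# Re-run the same command and check the startup log for lines indicating
# that layers were offloaded to the CANN device rather than the CPU.
./build/bin/llama-cli -m ./models/gemma-2-9b-it.gguf \
  -p "Building a website can be done in 10 simple steps:" \
  -n 400 -e -ngl 33 -sm none -mg 0
```

If the backend was already compiled in, the startup log's device/offload lines are the next thing to inspect.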