Measure perplexity delta between Q4_0 and F16 "output" tensor #1003
Labels: `generation quality` (Quality of model output), `good first issue` (Good for newcomers), `help wanted` (Extra attention is needed), `high priority` (Very important issue)
The last tensor of the transformer (called `output` in llama.cpp) is one of the biggest ones (see llama.cpp, line 945 at commit 0ad9646). I wonder how much the perplexity improves by keeping that particular tensor in F16 format instead of quantizing it.
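For illustration, here is a minimal sketch of the per-tensor decision this experiment changes, assuming the choice is made by tensor name during quantization. The helper `should_quantize` and the exact tensor names are assumptions for illustration, not the actual llama.cpp code path:

```cpp
#include <string>

// Hypothetical predicate isolating the decision this issue experiments with:
// which tensors to quantize to Q4_0 and which to keep in F16. In llama.cpp
// the analogous check lives inside the model-quantization loop; the tensor
// names below follow the LLaMA checkpoint layout.
static bool should_quantize(const std::string & name, int n_dims) {
    if (n_dims != 2) {
        return false; // only 2D weight matrices are quantized
    }
    if (name == "output.weight") {
        return false; // keep the final projection ("output") in F16
    }
    // optionally also keep the token embeddings in F16:
    // if (name == "tok_embeddings.weight") return false;
    return true;
}
```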
Results

Perplexity deltas are measured against the all-Q4_0 reference, so negative means lower (better) perplexity.

- Q4_0, M1 Pro (with BLAS): [655] 6.2897 (reference)
- Q4_0 + F16 "output", M1 Pro (with BLAS): [655] 6.2355, perplexity delta: -0.0542
- Q4_0 + F16 "tok_embd", M1 Pro (with BLAS): [655] 6.2838, perplexity delta: -0.0059
- Q4_0 + F16 "output" + F16 "tok_embd", M1 Pro (with BLAS): [655] 6.2357, perplexity delta: -0.0540
M1 Pro results

| tok_embd | output | perplexity | delta   | model size |
| -------- | ------ | ---------- | ------- | ---------- |
| Q4_0     | Q4_0   | 6.2897     | 0.0000  | 3.9G       |
| Q4_0     | F16    | 6.2355     | -0.0542 | 4.1G       |
| F16      | Q4_0   | 6.2838     | -0.0059 | 4.1G       |
| F16      | F16    | 6.2357     | -0.0540 | 4.3G       |
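For reference, numbers like the ones above come from the `perplexity` example in llama.cpp, run against the WikiText-2 test set; the bracketed `[655]` appears to be the final evaluation chunk reported by the tool. A sketch of the invocation, where the local file paths are assumptions:

```bash
# Sketch: measure perplexity of a quantized model on WikiText-2.
# Adjust the model and dataset paths to your local files.
./perplexity -m ./models/7B/ggml-model-q4_0.bin -f ./wikitext-2-raw/wiki.test.raw
```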