Replies: 1 comment
-
I should have asked a model first, my apologies. In case anyone else looks this up: the output weights are quantized, but the math that produces the logits is done in an fp32 context, so the logits are fp32.
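To make that concrete, here is a minimal NumPy sketch of the idea. It is not llama.cpp's actual kernels; the int8-plus-scale layout is a simplified assumption. The point is only that quantized output weights get dequantized for the matmul and the accumulation happens in fp32, so the logits come out fp32 regardless of the storage format.

```python
import numpy as np

# Toy sketch -- not llama.cpp's kernels. The output weight matrix is stored
# quantized (here: int8 values plus one fp32 scale per row), but the matmul
# that produces the logits dequantizes on the fly and accumulates in fp32.
rng = np.random.default_rng(0)

n_embd, n_vocab = 64, 1000
w_fp32 = rng.standard_normal((n_vocab, n_embd)).astype(np.float32)

# Quantize the output weights to int8 with a per-row fp32 scale.
scales = np.abs(w_fp32).max(axis=1, keepdims=True) / 127.0
w_q = np.round(w_fp32 / scales).astype(np.int8)

hidden = rng.standard_normal(n_embd).astype(np.float32)  # final-layer activation

# Dequantize and multiply in fp32: the logits are fp32 no matter how the
# weights were stored on disk.
logits = (w_q.astype(np.float32) * scales) @ hidden
print(logits.dtype)  # float32
```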
-
I noticed that llama_cpp_python's model.scores was always returning NumPy arrays with dtype fp32, and I filed a bug, assuming they had bound the underlying llama_get_logits() function incorrectly. But then I came here and saw that llama_get_logits() itself returns a plain float * (fp32).
So those are just hardcoded, up-converted fp16 values, right? Whether from fp16, bf16, or a k_l-quantized model (8-bit floats)?
Am I missing something?
I have a bf16 model with an 8k context and a 256,000-token vocabulary. That means the fp32 logits buffer is ~8 GB. But the model is bf16 and its weights are only ~4 GB (right?). That's a lot of unnecessary RAM, if I'm not mistaken.
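As a sanity check on the ~8 GB figure, here is the arithmetic. The context and vocabulary sizes are just the numbers quoted above; the 16-bit line is hypothetical, showing what a half-width logits buffer would cost:

```python
import numpy as np

n_ctx = 8 * 1024    # context size from the post
n_vocab = 256_000   # vocabulary size from the post

# One fp32 logit per vocab entry per position in the context.
fp32_bytes = n_ctx * n_vocab * np.dtype(np.float32).itemsize
fp16_bytes = n_ctx * n_vocab * np.dtype(np.float16).itemsize  # hypothetical 16-bit buffer

print(f"fp32 logits buffer:   {fp32_bytes / 1e9:.1f} GB")  # ~8.4 GB
print(f"16-bit logits buffer: {fp16_bytes / 1e9:.1f} GB")  # ~4.2 GB
```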