Replies: 1 comment
-
I should have asked a model first, my apologies. In case anyone else looks this up: the output weights are quantized, but the math that produces the logits is done in an fp32 context, so the logits are fp32.
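To make that concrete, here is a minimal NumPy sketch of the idea. It is not llama.cpp's actual kernels; the int8-plus-scale layout is a simplified assumption. The point is only that quantized output weights get dequantized for the matmul and the accumulation happens in fp32, so the logits come out fp32 regardless of the storage format.

```python
import numpy as np

# Toy sketch -- not llama.cpp's kernels. The output weight matrix is stored
# quantized (here: int8 values plus one fp32 scale per row), but the matmul
# that produces the logits dequantizes on the fly and accumulates in fp32.
rng = np.random.default_rng(0)

n_embd, n_vocab = 64, 1000
w_fp32 = rng.standard_normal((n_vocab, n_embd)).astype(np.float32)

# Quantize the output weights to int8 with a per-row fp32 scale.
scales = np.abs(w_fp32).max(axis=1, keepdims=True) / 127.0
w_q = np.round(w_fp32 / scales).astype(np.int8)

hidden = rng.standard_normal(n_embd).astype(np.float32)  # final-layer activation

# Dequantize and multiply in fp32: the logits are fp32 no matter how the
# weights were stored on disk.
logits = (w_q.astype(np.float32) * scales) @ hidden
print(logits.dtype)  # float32
```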
-
I noticed that llama_cpp_python's model.scores was always returning NumPy arrays with dtype fp32, and I filed a bug, assuming they had bound the underlying llama_get_logits() function incorrectly. But then I came here and saw that llama_get_logits() itself returns a plain float * (fp32).
So those are just hardcoded, up-converted fp16 values, right? Whether from fp16, bf16, or a k_l-quantized model (8-bit floats)?
Am I missing something?
I have a bf16 model with an 8k context and a 256,000-token vocabulary. That means the fp32 logits buffer is ~8 GB. But the model is bf16 and its weights are only ~4 GB (right?). That's a lot of unnecessary RAM, if I'm not mistaken.
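As a sanity check on the ~8 GB figure, here is the arithmetic. The context and vocabulary sizes are just the numbers quoted above; the 16-bit line is hypothetical, showing what a half-width logits buffer would cost:

```python
import numpy as np

n_ctx = 8 * 1024    # context size from the post
n_vocab = 256_000   # vocabulary size from the post

# One fp32 logit per vocab entry per position in the context.
fp32_bytes = n_ctx * n_vocab * np.dtype(np.float32).itemsize
fp16_bytes = n_ctx * n_vocab * np.dtype(np.float16).itemsize  # hypothetical 16-bit buffer

print(f"fp32 logits buffer:   {fp32_bytes / 1e9:.1f} GB")  # ~8.4 GB
print(f"16-bit logits buffer: {fp16_bytes / 1e9:.1f} GB")  # ~4.2 GB
```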