Hi
Thanks for the great library!
I need to run inference on a ton of sequences and get their log probabilities. I have approximately 100K sequences that can be binned into groups of 100 sharing a significant common prefix. For example, I have 100 sequences that start with 'Wikipedia was built in' and have different suffixes.
Does the library automatically figure out the optimal KV cache reuse, or can I specify it somehow?
If I build batches of, say, size 200, where one batch might have 100 sequences starting with 'Wikipedia was built in' and another 100 starting with 'Google was built in', will the vLLM engine automatically optimize the KV cache to reuse the computation done for the shared prefix?
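To make the batching strategy concrete, here is a minimal sketch of how I'm thinking of binning sequences by a shared word-level prefix before submitting them as batches. This is just illustrative grouping logic on my side (the function name and word-level prefix key are my own choices; vLLM's caching, if any, would operate on token blocks rather than words):

```python
from collections import defaultdict

def bin_by_prefix(sequences, prefix_len=4):
    """Group sequences by their first `prefix_len` words, so that
    requests sharing a prefix can be submitted in the same batch.
    Illustrative only: any real prefix reuse would happen inside
    the engine at the token level, not at the word level."""
    bins = defaultdict(list)
    for seq in sequences:
        key = " ".join(seq.split()[:prefix_len])
        bins[key].append(seq)
    return bins

seqs = [
    "Wikipedia was built in 2001 by volunteers",
    "Wikipedia was built in the early 2000s",
    "Google was built in a garage",
]
bins = bin_by_prefix(seqs, prefix_len=4)
# Yields one bin per shared 4-word prefix:
# "Wikipedia was built in" (2 sequences) and "Google was built in" (1 sequence)
```

The question is whether this kind of manual grouping even matters, or whether the engine detects the shared prefix on its own regardless of batch composition.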
Since I only need the log probs and not the next generated token, I've set the maximum number of generated tokens to 1, but can I somehow skip the generation step entirely and only get the log probs?
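For context, the quantity I'm ultimately after is just the total log probability of each sequence, i.e. the sum of per-token log probabilities. A small sketch of that reduction, assuming per-token logprobs are already available from the engine (the numeric values below are made up, and the `None` handling reflects that some APIs return no logprob for the very first token):

```python
import math

def sequence_logprob(token_logprobs):
    """Total log probability of a sequence:
    sum over i of log P(token_i | tokens_<i).
    Skips None entries, which some APIs return for the first token."""
    return sum(lp for lp in token_logprobs if lp is not None)

# Made-up per-token logprobs for illustration
lps = [None, -0.5, -1.25, -0.25]
total = sequence_logprob(lps)  # -2.0
prob = math.exp(total)         # probability of the whole sequence
```

So anything that returns prompt-level logprobs without running a decode step would be sufficient for my use case.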
@linbeyoung This might be relevant for you: sgl-project/sglang#81
I don't know how this issue got closed, since I still have some of these questions. Reopening it.