Hi
Thanks for the great library!
I need to run inference on a ton of sequences and get their log probabilities. I have approximately 100K sequences that can be binned into groups of 100 sharing a significant common prefix. For example, I have 100 sequences that start with 'Wikipedia was built in' and have different suffixes.
Does the library automatically figure out the optimal KV cache reuse, or can I specify it somehow?
If I build batches of, say, size 200, where one batch might have 100 sequences starting with 'Wikipedia was built in' and another 100 starting with 'Google was built in', will the vLLM engine automatically optimize the KV cache to reuse the computation done for the shared prefix?
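To make the batching strategy concrete, here is a minimal sketch of how I'm thinking of binning sequences by a shared word-level prefix before submitting them as batches. This is just illustrative grouping logic on my side (the function name and word-level prefix key are my own choices; vLLM's caching, if any, would operate on token blocks rather than words):

```python
from collections import defaultdict

def bin_by_prefix(sequences, prefix_len=4):
    """Group sequences by their first `prefix_len` words, so that
    requests sharing a prefix can be submitted in the same batch.
    Illustrative only: any real prefix reuse would happen inside
    the engine at the token level, not at the word level."""
    bins = defaultdict(list)
    for seq in sequences:
        key = " ".join(seq.split()[:prefix_len])
        bins[key].append(seq)
    return bins

seqs = [
    "Wikipedia was built in 2001 by volunteers",
    "Wikipedia was built in the early 2000s",
    "Google was built in a garage",
]
bins = bin_by_prefix(seqs, prefix_len=4)
# Yields one bin per shared 4-word prefix:
# "Wikipedia was built in" (2 sequences) and "Google was built in" (1 sequence)
```

The question is whether this kind of manual grouping even matters, or whether the engine detects the shared prefix on its own regardless of batch composition.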
Since I only need the log probs and not the next generated token, I've set the maximum number of generated tokens to 1, but can I somehow skip the generation step entirely and only get the log probs?
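For context, the quantity I'm ultimately after is just the total log probability of each sequence, i.e. the sum of per-token log probabilities. A small sketch of that reduction, assuming per-token logprobs are already available from the engine (the numeric values below are made up, and the `None` handling reflects that some APIs return no logprob for the very first token):

```python
import math

def sequence_logprob(token_logprobs):
    """Total log probability of a sequence:
    sum over i of log P(token_i | tokens_<i).
    Skips None entries, which some APIs return for the first token."""
    return sum(lp for lp in token_logprobs if lp is not None)

# Made-up per-token logprobs for illustration
lps = [None, -0.5, -1.25, -0.25]
total = sequence_logprob(lps)  # -2.0
prob = math.exp(total)         # probability of the whole sequence
```

So anything that returns prompt-level logprobs without running a decode step would be sufficient for my use case.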
@linbeyoung This might be relevant for you: sgl-project/sglang#81
I don't know how this issue got closed, since I still have some of these questions. Reopening it.