How to do speculative sampling with vllm? #1042

lsy643 · 2023-09-14T08:44:44Z

When I try to use speculative sampling in vLLM, each sequence has more than one input token in the generation step, and the error "an illegal memory access was encountered" is reported.
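For context, here is a minimal sketch of the accept/reject step in speculative sampling, which is why the target model has to score several tokens per sequence in a single generation step. This is plain Python, not vLLM code: the function name `speculative_verify` and its arguments are hypothetical, and the per-position probability lists are assumed to come from a draft model and a target model that are not shown.

```python
import random


def speculative_verify(draft_tokens, draft_probs, target_probs, rng=random):
    """Verify k draft tokens against the target model (illustrative only).

    draft_tokens: k token ids proposed by the draft model.
    draft_probs:  k distributions (lists of floats), one per draft position.
    target_probs: k + 1 distributions from the target model, obtained by
                  scoring all k draft tokens in one forward pass.
    Returns the tokens that are kept for this step.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # Accept the draft token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            continue
        # On rejection, resample from the normalized residual max(0, p - q)
        # and discard the remaining draft tokens.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        weights = residual if sum(residual) > 0 else p
        accepted.append(rng.choices(range(len(p)), weights=weights)[0])
        return accepted
    # Every draft token was accepted: sample one extra token from the
    # target distribution at the position after the last draft token.
    final = target_probs[len(draft_tokens)]
    accepted.append(rng.choices(range(len(final)), weights=final)[0])
    return accepted
```

The point relevant to this issue is that `target_probs` needs k + 1 entries per sequence, so the target model must attend over several new tokens at once rather than a single query token.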

Can you guys suggest a way to support speculative sampling with vllm?

Thanks a lot

casper-hansen · 2023-09-14T18:08:40Z

I would love to see speculative decoding implemented in vLLM as it could potentially 2x the throughput. It's not possible at the moment, though.

cadedaniel · 2024-01-24T06:50:11Z

It is currently not possible. Once #2188 is done, you can enable it.

lsy643 changed the title ~~Can the Op single_query_cached_kv_attention in PageAttention Support Multiple token in one sequence?~~ How to do speculative sampling with vllm? Sep 14, 2023

hmellor closed this as not planned on Apr 4, 2024
