[Feature] [Spec decode]: Combine chunked prefill with speculative decoding #5016
Comments
Hello, you mentioned optimizations for scoring time in #4630.
I think the multi-query attention kernel here is not the same thing as MQA; it is more like the append stage in flashinfer, am I right? So the ideal implementation would be for the ModelRunner and Backend to support the append stage, and the Backend should already support it if it supports chunked prefill. In addition, is this issue about solving the scheduling problem of speculative decoding? Could you give a detailed introduction to what needs to be done in this issue?
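To illustrate the distinction being discussed, here is a minimal NumPy sketch (not vLLM or flashinfer code; all names are hypothetical) contrasting the decode stage, where each sequence contributes one query token, with the append stage, where several new query tokens per sequence (e.g. the k draft tokens scored in speculative decoding) attend to the KV cache plus each other under a causal mask:

```python
# Minimal sketch only: plain NumPy, not vLLM/flashinfer kernels.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def append_attention(q_new, k_cache, v_cache, k_new, v_new):
    """Attend m new query tokens to the KV cache plus the appended tokens,
    with a causal mask inside the appended span. m == 1 reduces to decode."""
    m, d = q_new.shape
    n = k_cache.shape[0]
    k = np.concatenate([k_cache, k_new], axis=0)   # (n + m, d)
    v = np.concatenate([v_cache, v_new], axis=0)
    scores = q_new @ k.T / np.sqrt(d)              # (m, n + m)
    # Causal mask: appended token i may see the cache and appended tokens 0..i.
    mask = np.arange(n + m)[None, :] > (n + np.arange(m)[:, None])
    scores[mask] = -np.inf
    return softmax(scores) @ v                      # (m, d)

# Example: score 3 draft tokens against a 5-token KV cache in one call,
# instead of expanding the batch into 3 separate single-token queries.
d = 8
out = append_attention(np.random.randn(3, d),
                       np.random.randn(5, d), np.random.randn(5, d),
                       np.random.randn(3, d), np.random.randn(3, d))
print(out.shape)  # (3, 8)
```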
That's awesome. You should chat with @LiuXiaoxuanPKU, who is removing batch expansion from vLLM. FYI, this issue is about combining the ITL improvements obtained from chunked prefill scheduling with spec decode.
I can look into this.
🚀 The feature, motivation and pitch
Speculative decoding can achieve a 50%+ latency reduction, but in vLLM it can suffer from the throughput-optimized default scheduling strategy, which eagerly prioritizes prefills: a long prompt arriving mid-stream stalls in-flight decodes and inflates inter-token latency (ITL). Chunked prefill is recent work in vLLM that mitigates this by spreading the prefill work over many decode batches. We can combine chunked prefill with speculative decoding's dynamic speculation length to get the best of both worlds.
This is a complex task that requires some design; if you're interested, please reach out.
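As a rough illustration of the kind of scheduling being proposed, here is a hypothetical sketch (not the vLLM scheduler; `Sequence`, `schedule_step`, and the budgeting rule are assumptions for illustration). Each step has a fixed token budget; decode sequences are admitted first with a speculation length that shrinks as the budget fills, and whatever budget remains goes to a prefill chunk rather than a whole prompt:

```python
# Hypothetical scheduling sketch, assuming a per-step token budget and a
# dynamic speculation length. Not vLLM code.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Sequence:
    seq_id: int
    is_prefill: bool
    remaining_prompt_tokens: int = 0      # only meaningful for prefills

@dataclass
class ScheduledWork:
    decode: List[Tuple[int, int]] = field(default_factory=list)   # (seq_id, spec_len)
    prefill: List[Tuple[int, int]] = field(default_factory=list)  # (seq_id, chunk_len)

def schedule_step(running: List[Sequence],
                  token_budget: int = 512,
                  max_spec_len: int = 4) -> ScheduledWork:
    work = ScheduledWork()
    budget = token_budget

    # 1) Decodes first: each costs 1 target token plus spec_len draft tokens.
    for seq in (s for s in running if not s.is_prefill):
        if budget <= 0:
            break
        # Dynamic speculation length: speculate less when the step is crowded.
        spec_len = min(max_spec_len, max(0, budget - 1))
        budget -= 1 + spec_len
        work.decode.append((seq.seq_id, spec_len))

    # 2) Spend the leftover budget on a prefill *chunk*, so a long prompt
    #    is spread over many steps instead of stalling in-flight decodes.
    for seq in (s for s in running if s.is_prefill):
        if budget <= 0:
            break
        chunk = min(seq.remaining_prompt_tokens, budget)
        budget -= chunk
        work.prefill.append((seq.seq_id, chunk))

    return work

# Example: 2 decodes and a 1500-token prompt share one 512-token step.
print(schedule_step([Sequence(0, False), Sequence(1, False),
                     Sequence(2, True, remaining_prompt_tokens=1500)]))
```

The design question this issue raises is how to make the budgeting rule interact with speculative decoding's acceptance rate and dynamic speculation length, rather than the fixed heuristic shown above.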
Alternatives
No response
Additional context
cc @LiuXiaoxuanPKU @comaniac @rkooo567