[Feature] [Spec decode]: Combine chunked prefill with speculative decoding #5016
Comments
Hello, you mentioned optimizations for scoring time in #4630.
I think the multi-query attention kernel here is not the same thing as MQA; it is more like the append stage in flashinfer, am I right? So the ideal implementation would be for the ModelRunner and Backend to support the append stage, and the Backend should already support it if it supports chunked prefill. In addition, is this issue about solving the scheduling problem of speculative decoding? Could you give a detailed introduction to what needs to be done in this issue?
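To illustrate the distinction being discussed, here is a minimal NumPy sketch (not vLLM or flashinfer code; all names are hypothetical) contrasting the decode stage, where each sequence contributes one query token, with the append stage, where several new query tokens per sequence (e.g. the k draft tokens scored in speculative decoding) attend to the KV cache plus each other under a causal mask:

```python
# Minimal sketch only: plain NumPy, not vLLM/flashinfer kernels.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def append_attention(q_new, k_cache, v_cache, k_new, v_new):
    """Attend m new query tokens to the KV cache plus the appended tokens,
    with a causal mask inside the appended span. m == 1 reduces to decode."""
    m, d = q_new.shape
    n = k_cache.shape[0]
    k = np.concatenate([k_cache, k_new], axis=0)   # (n + m, d)
    v = np.concatenate([v_cache, v_new], axis=0)
    scores = q_new @ k.T / np.sqrt(d)              # (m, n + m)
    # Causal mask: appended token i may see the cache and appended tokens 0..i.
    mask = np.arange(n + m)[None, :] > (n + np.arange(m)[:, None])
    scores[mask] = -np.inf
    return softmax(scores) @ v                      # (m, d)

# Example: score 3 draft tokens against a 5-token KV cache in one call,
# instead of expanding the batch into 3 separate single-token queries.
d = 8
out = append_attention(np.random.randn(3, d),
                       np.random.randn(5, d), np.random.randn(5, d),
                       np.random.randn(3, d), np.random.randn(3, d))
print(out.shape)  # (3, 8)
```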
That's awesome. You should chat with @LiuXiaoxuanPKU, who is removing batch expansion from vLLM. FYI, this issue is about combining the ITL improvements obtained from chunked prefill scheduling with spec decode.
I can look into this.
🚀 The feature, motivation and pitch
Speculative decoding can achieve a 50%+ latency reduction, but in vLLM it can suffer from the throughput-optimized default scheduling strategy, which eagerly prioritizes prefills: a long prompt arriving mid-stream stalls in-flight decodes and inflates inter-token latency (ITL). Chunked prefill is recent work in vLLM that mitigates this by spreading the prefill work over many decode batches. We can combine chunked prefill with speculative decoding's dynamic speculation length to get the best of both worlds.
This is a complex task that requires some design; if you're interested, please reach out.
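As a rough illustration of the kind of scheduling being proposed, here is a hypothetical sketch (not the vLLM scheduler; `Sequence`, `schedule_step`, and the budgeting rule are assumptions for illustration). Each step has a fixed token budget; decode sequences are admitted first with a speculation length that shrinks as the budget fills, and whatever budget remains goes to a prefill chunk rather than a whole prompt:

```python
# Hypothetical scheduling sketch, assuming a per-step token budget and a
# dynamic speculation length. Not vLLM code.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Sequence:
    seq_id: int
    is_prefill: bool
    remaining_prompt_tokens: int = 0      # only meaningful for prefills

@dataclass
class ScheduledWork:
    decode: List[Tuple[int, int]] = field(default_factory=list)   # (seq_id, spec_len)
    prefill: List[Tuple[int, int]] = field(default_factory=list)  # (seq_id, chunk_len)

def schedule_step(running: List[Sequence],
                  token_budget: int = 512,
                  max_spec_len: int = 4) -> ScheduledWork:
    work = ScheduledWork()
    budget = token_budget

    # 1) Decodes first: each costs 1 target token plus spec_len draft tokens.
    for seq in (s for s in running if not s.is_prefill):
        if budget <= 0:
            break
        # Dynamic speculation length: speculate less when the step is crowded.
        spec_len = min(max_spec_len, max(0, budget - 1))
        budget -= 1 + spec_len
        work.decode.append((seq.seq_id, spec_len))

    # 2) Spend the leftover budget on a prefill *chunk*, so a long prompt
    #    is spread over many steps instead of stalling in-flight decodes.
    for seq in (s for s in running if s.is_prefill):
        if budget <= 0:
            break
        chunk = min(seq.remaining_prompt_tokens, budget)
        budget -= chunk
        work.prefill.append((seq.seq_id, chunk))

    return work

# Example: 2 decodes and a 1500-token prompt share one 512-token step.
print(schedule_step([Sequence(0, False), Sequence(1, False),
                     Sequence(2, True, remaining_prompt_tokens=1500)]))
```

The design question this issue raises is how to make the budgeting rule interact with speculative decoding's acceptance rate and dynamic speculation length, rather than the fixed heuristic shown above.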
Alternatives
No response
Additional context
cc @LiuXiaoxuanPKU @comaniac @rkooo567