[Feature]: Reduce LoRA latency via speculative decoding #6912

cadedaniel · 2024-07-29T19:43:19Z

🚀 The feature, motivation and pitch

The speculative decoding framework allows the target model to have LoRAs; however, the work to set up batch expansion for them has not yet been done. We can implement batch expansion for LoRA and thereby allow speculative decoding with LoRA.

The work required is basically to implement batch expansion but pass through the LoRA arguments. See "Let’s talk about code" in the following notes: https://docs.google.com/document/d/1z4Tgb1FcDr3YXvFPelyn-T-DEnLqqrlrxRi3TvIyAmg/edit
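As a rough illustration of what "batch expansion with LoRA pass-through" means, here is a minimal, self-contained sketch. It is not vLLM's actual code: `LoRARef`, `SeqRow`, and `expand_batch` are hypothetical names made up for this example. The idea is that a sequence with k draft tokens becomes k + 1 rows for the target model to score, and every expanded row keeps the LoRA adapter of the sequence it came from.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LoRARef:
    """Hypothetical stand-in for a per-request LoRA adapter handle."""
    name: str


@dataclass
class SeqRow:
    """Hypothetical batch row: one sequence for the target model to score."""
    parent_id: int
    token_ids: List[int]
    lora: Optional[LoRARef]


def expand_batch(rows: List[SeqRow], proposals: List[List[int]]) -> List[SeqRow]:
    """Expand each row with k draft tokens into k + 1 scoring rows.

    Expanded row j holds the original tokens plus the first j draft tokens
    and inherits the parent's LoRA adapter unchanged.
    """
    if len(rows) != len(proposals):
        raise ValueError("need exactly one proposal list per row")
    expanded: List[SeqRow] = []
    for row, draft in zip(rows, proposals):
        for j in range(len(draft) + 1):
            expanded.append(
                SeqRow(
                    parent_id=row.parent_id,
                    token_ids=row.token_ids + draft[:j],
                    lora=row.lora,  # pass the LoRA argument through
                )
            )
    return expanded


# Two requests: one with an adapter, one without.
rows = [SeqRow(0, [1, 2], LoRARef("adapter-a")), SeqRow(1, [3], None)]
expanded = expand_batch(rows, [[10, 11], [20]])
for r in expanded:
    print(r.parent_id, r.token_ids, r.lora)
```

The first request (two draft tokens) expands to three rows and the second (one draft token) to two rows, so the target model scores five rows in one pass, each with the correct adapter or none.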

I expect this to work well for larger models (e.g. 70B) but to be more difficult with smaller models due to latency constraints and vLLM overheads. Perhaps with a speculator like ngram / eagle / mlpspeculator it can work for 7B models as well.

Note that this work does not include applying LoRA to the speculator; that can be future work.

Alternatives

No response

Additional context

No response

kevmo314 · 2024-07-31T00:31:45Z

I took a first pass. Admittedly, there's a lot here I'm not familiar with, but I would really like this feature, so I'll invest some time in it and see if I can make some progress. If anyone else is interested, I'm happy to collaborate.

cadedaniel · 2024-07-31T00:33:13Z

Awesome! I also recommend checking out https://www.youtube.com/watch?v=9wNAgpX6z_4 if you're new to speculative decoding in vLLM.

skylee-01 · 2024-08-27T09:40:13Z

May I ask how soon this feature will be supported? @cadedaniel

cadedaniel added the feature request label Jul 29, 2024

cadedaniel mentioned this issue Jul 29, 2024

[Bug]: Shape error encountered in speculative decoding when enable_lora=True #4872

Open

cadedaniel mentioned this issue Aug 5, 2024

[WIP] Speculative decoding using a draft model #2188

Closed
