[Feature]: Reduce LoRA latency via speculative decoding #6912

Open
cadedaniel opened this issue Jul 29, 2024 · 3 comments

@cadedaniel
Collaborator

cadedaniel commented Jul 29, 2024

🚀 The feature, motivation and pitch

The speculative decoding framework allows the target model to have LoRAs; however, the work to set up batch expansion for them has not yet been done. We can implement batch expansion for LoRA and thereby allow speculative decoding for LoRA.

The work required is basically to implement batch expansion but pass through the LoRA arguments. See "Let’s talk about code" in the following notes: https://docs.google.com/document/d/1z4Tgb1FcDr3YXvFPelyn-T-DEnLqqrlrxRi3TvIyAmg/edit
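As a rough illustration of what "batch expansion but pass through the LoRA arguments" could mean, here is a minimal standalone sketch. It is not vLLM's actual code: `LoRARequest`, `SeqToScore`, and `expand_batch` are hypothetical names invented for this example. The assumption is that a sequence with k proposal tokens is expanded into k + 1 rows (one per proposal prefix) so the target model can score all of them in a single forward pass, and that each row must keep the LoRA adapter of the sequence it came from.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LoRARequest:
    """Hypothetical stand-in for a per-request LoRA adapter handle."""
    name: str
    adapter_id: int


@dataclass
class SeqToScore:
    """One sequence waiting to be scored by the target model (hypothetical)."""
    request_id: str
    token_ids: List[int]
    lora_request: Optional[LoRARequest] = None


def expand_batch(
    seqs: List[SeqToScore], proposals: List[List[int]]
) -> List[SeqToScore]:
    """Expand each sequence with k proposal tokens into k + 1 rows.

    Row i holds the original tokens plus the first i proposal tokens.
    The LoRA request is copied onto every expanded row, so the target
    model would apply the same adapter as for the original sequence.
    """
    if len(seqs) != len(proposals):
        raise ValueError("need exactly one proposal list per sequence")

    expanded: List[SeqToScore] = []
    for seq, proposal in zip(seqs, proposals):
        for i in range(len(proposal) + 1):
            expanded.append(
                SeqToScore(
                    request_id=f"{seq.request_id}-{i}",
                    token_ids=seq.token_ids + proposal[:i],
                    # The pass-through: without this line, the expanded
                    # rows would be scored by the base model only.
                    lora_request=seq.lora_request,
                )
            )
    return expanded


# Usage: one request with a LoRA adapter, one without.
seqs = [
    SeqToScore("a", [1, 2, 3], LoRARequest("sql-adapter", 1)),
    SeqToScore("b", [4, 5]),
]
expanded = expand_batch(seqs, proposals=[[10, 11], [20]])
```

In this sketch, a batch that mixes LoRA and non-LoRA requests stays mixed after expansion; how the real scorer and model runner consume the adapter information is outside what this example covers.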

I expect this to work well for larger models (e.g. 70B) but to be more difficult with smaller models due to latency constraints and vLLM overheads. Perhaps with a speculator like ngram / eagle / mlpspeculator it can work for 7B models as well.

Note that this work does not include applying LoRA to the speculator; that can be future work.

Alternatives

No response

Additional context

No response

@kevmo314

I took a first pass. Admittedly, there's a lot of knowledge here I'm not so familiar with, but I would really like this feature, so I'll invest some time in it and see if I can make some progress. If anyone else is interested, I'm happy to collaborate.

@cadedaniel
Collaborator Author

Awesome! I also recommend checking out https://www.youtube.com/watch?v=9wNAgpX6z_4 if you're new to speculative decoding in vLLM.

@skylee-01
Contributor

May I ask how soon this feature will be supported? @cadedaniel
