Implement Speculative Decoding #242
Conversation
Code Metrics Report

| Language | Files | Lines | Blanks | Comments | Code | Complexity |
|----------|-------|-------|--------|----------|-------|------------|
| Rust     | 72    | 23863 | 1572   | 530      | 21761 | 1325       |
| Total    | 72    | 23863 | 1572   | 530      | 21761 | 1325       |

Estimated Cost to Develop: 85,737
Estimated Schedule Effort: 11.916649 months
Estimated People Required: 5.112342
Processed 793364 bytes, 0.793 megabytes (SI)
It would be very useful to relax the requirement that the main and draft models use exactly the same tokenizer, as discussed here: vllm-project/vllm#2188
Yes, this implementation only checks whether the vocabs are the same: see this check.
I understand that the same-vocab case is much easier to code, but if this requirement is relaxed, people can use a ready-made small draft model even if their LLM is incompatible with it (which will often be the case).
That sounds great! Can you please give an example of how I should relax the requirement? |
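One possible direction, sketched below as a hypothetical illustration rather than what this PR or vllm-project/vllm#2188 actually implements: round-trip the draft model's proposed tokens through text and re-encode with the target model's tokenizer, so that verification always operates in the target vocabulary. The `Tok` trait and `remap_draft_tokens` function are names invented for this sketch.

```rust
/// Hypothetical minimal tokenizer interface, standing in for whatever
/// tokenizer type the pipelines actually expose.
trait Tok {
    fn decode(&self, ids: &[u32]) -> String;
    fn encode(&self, text: &str) -> Vec<u32>;
}

/// Map tokens proposed by the draft model into the target model's
/// vocabulary by round-tripping through text. This is lossy at token
/// boundaries, but it removes the shared-vocabulary requirement.
fn remap_draft_tokens(draft_ids: &[u32], draft_tok: &dyn Tok, target_tok: &dyn Tok) -> Vec<u32> {
    let text = draft_tok.decode(draft_ids);
    target_tok.encode(&text)
}
```

The caveat with this kind of remapping is that the number of target-vocabulary tokens can differ from the number of drafted tokens, so the verification loop would have to track positions in target-token space.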
This PR adds the base framework for speculative decoding (SD). Further speed improvements, as well as self-speculative decoding, will be added afterwards.
Speculative decoding: https://arxiv.org/pdf/2211.17192
This refactors the pipeline structure to make the sampling process more abstract; it also abstracts the scheduling and KV cache management.
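As a rough, hypothetical illustration of that kind of abstraction (the trait names and signatures here are invented for this sketch, not the PR's actual types), sampling, scheduling, and KV cache management could each sit behind their own trait, so a speculative pipeline can drive a draft and a target pipeline uniformly:

```rust
/// Hypothetical interfaces only; the real trait names and signatures in
/// the PR will differ.
trait KvCacheManager {
    fn allocate(&mut self, seq_id: usize);
    fn free(&mut self, seq_id: usize);
}

trait Scheduler {
    /// Decide which sequence ids run in the next step.
    fn schedule(&mut self) -> Vec<usize>;
}

trait Pipeline {
    /// Run a forward pass and return per-sequence logits.
    fn forward(&mut self, seq_ids: &[usize]) -> Vec<Vec<f32>>;
    /// Sample the next token id for one sequence from its logits.
    fn sample(&self, logits: &[f32]) -> u32;
}
```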
Restriction

- The draft and target models must have the same vocabulary.
Algorithm
Given a draft model $q$ and a target model $p$ with probability distributions $q_i(x)$ and $p_i(x)$ for each token position $i$:
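Per the cited paper, the rule for each drafted token is: keep it if $q_i(x) \le p_i(x)$; otherwise keep it with probability $p_i(x)/q_i(x)$; on rejection, sample a replacement from the normalized residual distribution $\mathrm{norm}(\max(0, p_i(x) - q_i(x)))$ and accept no further draft tokens. Below is a minimal Rust sketch of that acceptance step; it is a standalone illustration, not this PR's actual code, and the function names and the convention of passing in a uniform random draw `u` are assumptions made here.

```rust
/// Acceptance rule from the speculative decoding paper (arXiv 2211.17192).
/// `p` and `q` are the target and draft probabilities assigned to the
/// drafted token; `u` is a uniform random draw in [0, 1).
fn accept_draft_token(p: f32, q: f32, u: f32) -> bool {
    // Always keep the token when the target model is at least as
    // confident as the draft model; otherwise keep it with prob p / q.
    q <= p || u < p / q
}

/// On rejection, sample a replacement token from the normalized residual
/// distribution norm(max(0, p(x) - q(x))). `p` and `q` are the full
/// per-token distributions; `u` is a uniform random draw in [0, 1).
fn resample_on_rejection(p: &[f32], q: &[f32], u: f32) -> usize {
    let residual: Vec<f32> = p
        .iter()
        .zip(q)
        .map(|(&pi, &qi)| (pi - qi).max(0.0))
        .collect();
    let total: f32 = residual.iter().sum();
    if total <= 0.0 {
        // Degenerate case (p == q everywhere): any sample from p is valid;
        // fall back to the target model's argmax for simplicity.
        let mut best = 0;
        for (i, &pi) in p.iter().enumerate() {
            if pi > p[best] {
                best = i;
            }
        }
        return best;
    }
    // Inverse-CDF sampling over the (unnormalized) residual weights.
    let mut threshold = u * total;
    for (i, &r) in residual.iter().enumerate() {
        if threshold < r {
            return i;
        }
        threshold -= r;
    }
    residual.len() - 1
}
```

Rejecting and resampling this way keeps the accepted tokens distributed exactly as if they had been sampled from the target model alone, which is why the draft model can be arbitrarily cheap without changing output quality.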