GitHub repo: https://github.com/dilab-zju/self-speculative-decoding

It uses a subset of the model's own layers to generate draft tokens and achieves roughly a 1.78x speedup. No separate draft model is needed; the only thing that requires care is the KV cache.

It also appears to support sampling-based decoding. Since the draft shares the target model's hidden size and intermediate size, the KV cache is reusable; the only remaining work is reclaiming KV cache memory when tokens are rejected.

What will the future of vLLM speculative sampling look like? Is there a rough plan?

@cadedaniel @LiuXiaoxuanPKU
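To make the KV-cache reclaim point concrete, here is a toy, runnable sketch. It makes simplifying assumptions beyond the source: the draft is the first `DRAFT_LAYERS` layers with an early-exit head (the paper actually skips a Bayesian-optimized subset of layers), acceptance is greedy token matching (real speculative sampling uses rejection-sampling-style verification), and "attention" is a non-causal mean over the cache purely to make the bookkeeping visible. None of the names below are vLLM's API.

```python
import torch

# Toy sizes; the "draft model" is just the first DRAFT_LAYERS layers.
VOCAB, HIDDEN, N_LAYERS, DRAFT_LAYERS, K = 100, 32, 8, 4, 4
torch.manual_seed(0)

embed  = torch.nn.Embedding(VOCAB, HIDDEN)
layers = torch.nn.ModuleList([torch.nn.Linear(HIDDEN, HIDDEN) for _ in range(N_LAYERS)])
head   = torch.nn.Linear(HIDDEN, VOCAB)

# kv[i]: one cached row per processed position for layer i. Because draft and
# target share these layers, entries the draft writes for layers
# 0..DRAFT_LAYERS-1 stay valid for the verification pass and later rounds.
kv = [torch.empty(0, HIDDEN) for _ in range(N_LAYERS)]

def run_layers(h, lo, hi):
    """Process t new positions through layers lo..hi-1, growing the cache.
    The mean over cached rows stands in for attention (and ignores causal
    masking); the point here is only the cache bookkeeping."""
    for i in range(lo, hi):
        kv[i] = torch.cat([kv[i], h])
        h = torch.tanh(layers[i](h + kv[i].mean(0)))
    return h

def rollback(keep):
    """Reclaim KV-cache memory held by rejected speculative tokens."""
    for i in range(N_LAYERS):
        kv[i] = kv[i][:keep]

@torch.no_grad()
def generate(prompt, rounds=5):
    out = prompt.tolist()
    next_tok = head(run_layers(embed(prompt), 0, N_LAYERS))[-1].argmax().item()
    out.append(next_tok)
    for _ in range(rounds):
        base_len = kv[0].shape[0]
        # Draft: early-exit from the shallow prefix of the same model.
        spec, mids = [], []
        t = torch.tensor([next_tok])
        for _ in range(K):
            m = run_layers(embed(t), 0, DRAFT_LAYERS)
            mids.append(m)
            t = head(m).argmax(-1)
            spec.append(t.item())
        # Verify: finish layers DRAFT_LAYERS..N_LAYERS-1 in one batch,
        # starting from the draft's hidden states, so the early-layer
        # compute and cache entries are shared rather than redone.
        preds = head(run_layers(torch.cat(mids), DRAFT_LAYERS, N_LAYERS))
        preds = preds.argmax(-1).tolist()
        n_acc = 0
        while n_acc < K and spec[n_acc] == preds[n_acc]:
            n_acc += 1
        commit = spec[:n_acc] + ([preds[n_acc]] if n_acc < K else [])
        rollback(base_len + min(n_acc + 1, K))  # drop rejected positions
        out.extend(commit)
        next_tok = commit[-1]
    return out

print(generate(torch.tensor([1, 2, 3])))
```

The only rejection-handling needed is the `rollback` truncation: every layer's cache is cut back to the accepted prefix, which is exactly the "KV cache memory reclaim when rejecting tokens" described above.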
Hi @MeJerry215. Once #2188 is merged, self-speculative decoding can be added easily as a replacement for the draft model. Follow along in that PR for more details.
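To illustrate the "replacement for the draft model" point, here is a hypothetical interface sketch; none of these names come from vLLM or that PR. Both the draft-model and self-speculative variants fill the same proposer role, so the verification machinery would not need to change:

```python
from typing import List, Protocol

class Proposer(Protocol):
    """Anything that can propose k candidate tokens for verification."""
    def propose(self, token_ids: List[int], k: int) -> List[int]: ...

class DraftModelProposer:
    """Classic speculative decoding: a separate small draft model."""
    def __init__(self, draft_model):
        self.draft_model = draft_model
    def propose(self, token_ids: List[int], k: int) -> List[int]:
        return self.draft_model.generate(token_ids, max_new_tokens=k)

class SelfSpeculativeProposer:
    """Self-speculative decoding: the target model itself, restricted to a
    subset of its layers, plays the draft role."""
    def __init__(self, target_model, n_draft_layers: int):
        self.target_model = target_model
        self.n_draft_layers = n_draft_layers
    def propose(self, token_ids: List[int], k: int) -> List[int]:
        # 'n_layers' is a hypothetical kwarg standing in for whatever
        # mechanism limits the forward pass to the draft layers.
        return self.target_model.generate(
            token_ids, max_new_tokens=k, n_layers=self.n_draft_layers)
```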