[Misc] Reduce medusa weight #10422

skylee-01 · 2024-11-18T13:19:29Z

Medusa uses N heads to predict N tokens in speculative decoding, and each head is trained with its own lm_head. In actual deployments, usually only the ResidualBlock is trained, not the lm_head, so this PR keeps a single copy of lm_head and shares it across all heads. In practice, every lm_head removed saves about 1 GB of HBM, which is crucial on graphics cards such as the 4090. The saved memory also lets Medusa predict more tokens ahead.
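As a rough check on the "about 1 GB per lm_head" figure, here is a back-of-the-envelope sketch. The `MedusaShape` class and all the numbers in it are illustrative assumptions, not values taken from this PR: a hidden size of 4096, a vocabulary of 128256 tokens, 4 heads, 2-byte (fp16/bf16) weights, and one hidden-by-hidden linear layer with bias per ResidualBlock.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MedusaShape:
    """Hypothetical shape description; every value here is an assumption."""
    hidden_size: int
    vocab_size: int
    num_heads: int
    bytes_per_param: int = 2  # assumes fp16 / bf16 weights

    def lm_head_bytes(self) -> int:
        # One bias-free projection from hidden states to vocabulary logits.
        return self.hidden_size * self.vocab_size * self.bytes_per_param

    def residual_block_bytes(self) -> int:
        # Assumes one hidden x hidden linear layer plus bias per head.
        params = self.hidden_size * self.hidden_size + self.hidden_size
        return params * self.bytes_per_param

    def total_bytes(self, share_lm_head: bool) -> int:
        # Each head always keeps its own ResidualBlock; only the number of
        # lm_head copies changes when the projection is shared.
        lm_head_copies = 1 if share_lm_head else self.num_heads
        return (self.num_heads * self.residual_block_bytes()
                + lm_head_copies * self.lm_head_bytes())


# Assumed, Llama-3-8B-like dimensions with 4 Medusa heads.
shape = MedusaShape(hidden_size=4096, vocab_size=128256, num_heads=4)
saved = (shape.total_bytes(share_lm_head=False)
         - shape.total_bytes(share_lm_head=True))

print(f"one lm_head:      {shape.lm_head_bytes() / 2**30:.2f} GiB")
print(f"saved by sharing: {saved / 2**30:.2f} GiB")
```

Under these assumptions a single lm_head is just under 1 GiB, so sharing one copy across 4 heads frees roughly 3 GiB. The actual saving depends on the model's vocabulary size, hidden size, dtype, and tensor-parallel degree.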

github-actions · 2024-11-18T13:19:42Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

skylee-01 · 2024-11-18T14:32:39Z

The test didn't pass, which is strange. I don't see anything unusual in the code.

[Misc] Reduce medusa weight

61c8447
