Align vLLM's beam search implementation with HF generate #857
Conversation
@zhuohan123 Awesome! Thanks for the amazing work! 🚀
As we discussed offline, I think the PR only needs small fixes for further clarification. I like the changes in the system design. Thanks again for the hard work.
Looks very good to me! Many thanks for the hard work. I believe this will make vLLM a unique inference engine that effectively supports beam search. Nice work!
@zhuohan123 BTW, please close the issues fixed by this PR.
This PR refactors the changes in #646.

The goal of this PR is to align the beam search with `hf_model.generate()`, which is itself aligned with many older frameworks, including `tensor2tensor` and `fairseq`. When meeting a finished beam candidate, our old beam search algorithm always keeps the finished beam and reduces the beam width of the remaining search by 1. In HF, however, the beam width is always a fixed number, and the top-"beam width" running candidates are selected for the next iteration (see the sketch below).

This change breaks the assumption that every sequence group in vLLM always contains a fixed number of sequences (which previously always equaled `best_of`). Therefore, we need to grow the number of sequences in a sequence group dynamically. After this PR, every request starts with a single sequence (for prompt computation) and later grows into multiple sequences during decoding, based on its sampling algorithm.

Should be merged after #867.
TODOs:
cc @hsm1997