
[RFC]: Reimplement and separate beam search on top of vLLM core #8306

Closed
1 task done
youkaichao opened this issue Sep 9, 2024 · 21 comments · Fixed by #9105
@youkaichao
Member

Motivation.

A rework of #6226

After discussing further with the community, we find that the common use case for beam search is:

  1. throughput oriented
  2. mainly offline batch inference
  3. using one set of beam search parameters for all prompts in the batch

After discussing with many contributors, we find:

Because beam search is a search algorithm, it conflicts with all the other sampling algorithms. As a result, many features in vLLM already directly assert that beam search is not used, e.g.

```python
assert len(input_seq_group_metadata.seq_data) == 1, (
    "Beam search "
    "not supported in speculative decoding")
```

```python
assert len(seqs) == 1, (
    "Beam search not supported in multi-step decoding.")
seq = seqs[0]
```

Keeping beam search as-is in the codebase will not benefit current beam search users, as no optimization will target better beam search performance. What's worse, very few developers understand beam search. Keeping it as-is will not only increase the number of beam search bugs as the codebase evolves, but also increase the maintenance cost for all contributors.

In search of a win-win solution, on behalf of the vLLM team, I propose to separate beam search from the vLLM core and reimplement it on top of the core code.

To be specific, we can:

  1. remove beam search logic from the scheduler
  2. add an LLM.beam_search interface that calls the engine to generate one token with logprobs at every step, and maintain the beam-search logic only in the LLM.beam_search function (a rough sketch follows this list)
  3. add a beam search emulator on top of the commonly used OpenAI API server, which internally calls the generation endpoint to generate one step with logprobs, and maintain the beam-search logic only in the emulator
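For concreteness, here is a minimal sketch of what such a wrapper could look like. It assumes a one-step generation callable that returns top-k (token_id, logprob) pairs (in vLLM this would be a call into the engine with max_tokens=1 and logprobs set to the beam width); the BeamCandidate class and function names are illustrative, not the actual API:

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class BeamCandidate:
    # Tokens accumulated so far for this beam and its cumulative log-probability.
    tokens: List[int] = field(default_factory=list)
    cum_logprob: float = 0.0

def beam_search(generate_step: Callable[[List[int], int], List[Tuple[int, float]]],
                prompt_tokens: List[int], beam_width: int,
                max_tokens: int, eos_token_id: int) -> List[BeamCandidate]:
    """Generic beam search driven by a one-step generation function.

    `generate_step(tokens, k)` is assumed to return the top-k
    (token_id, logprob) pairs for the next position.
    """
    beams = [BeamCandidate(tokens=list(prompt_tokens))]
    completed: List[BeamCandidate] = []

    for _ in range(max_tokens):
        new_beams: List[BeamCandidate] = []
        for beam in beams:
            for token_id, logprob in generate_step(beam.tokens, beam_width):
                cand = BeamCandidate(tokens=beam.tokens + [token_id],
                                     cum_logprob=beam.cum_logprob + logprob)
                # Finished beams go to `completed`; the rest stay live.
                if token_id == eos_token_id:
                    completed.append(cand)
                else:
                    new_beams.append(cand)
        # Keep only the best `beam_width` live beams for the next step.
        beams = sorted(new_beams, key=lambda b: b.cum_logprob,
                       reverse=True)[:beam_width]
        if not beams:
            break

    completed.extend(beams)
    return sorted(completed, key=lambda c: c.cum_logprob, reverse=True)
```

The key property is that the vLLM core only ever sees ordinary one-token generation requests; all beam bookkeeping lives in the wrapper.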

From the initial discussion, one concern is the efficiency of such an implementation, since from the vLLM core's perspective the request will come and go again and again. This should be solvable in two ways:

  1. turning on prefix caching can reuse computation from the last step, so that we don't need to recompute the KV cache of the prompt again and again (see the configuration sketch after this list)
  2. after separating beam search and the vLLM core, they can be optimized individually; the simplified code will be much easier to optimize
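As a rough illustration of point 1 (the model name is only a placeholder and the exact keyword arguments depend on the vLLM version), enabling prefix caching when constructing the LLM could look like:

```python
from vllm import LLM

# With prefix caching enabled, the KV cache of the shared prompt (and of the
# tokens already generated by a beam) can be reused when a beam arm is
# resubmitted as a fresh one-token request.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,
)
```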

vLLM is a community project, and we'd like not only to seek opinions from beam-search users, but also their contributions. Your help is truly needed to shape the future of beam-search support in vLLM.

Proposed Change.

Summary of the change: implement beam search on top of the vLLM core and add wrappers for users; remove beam search from the vLLM core (scheduler).

Feedback Period.

1 week, from 9/9 to 9/15 (both inclusive)

CC List.

@hrsmanian @zhouyuan @lanking520 @nightflight-dk @HeegonJin @SemMulder @darabos @DhruvaBansal00 @tmostak @physicsrob @YooSungHyun @denadai2 @sjmielke @Reichenbachian @AaronFriel @hinnefe2 @mflaxman10
@WoosukKwon @zhuohan123 @simon-mo

Any Other Things.

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@youkaichao youkaichao added the RFC label Sep 9, 2024
@simon-mo simon-mo changed the title [RFC]: Reimplement and separate beam search on top of vllm core [RFC]: Reimplement and separate beam search on top of vLLM core Sep 9, 2024
@AaronFriel

AaronFriel commented Sep 9, 2024

after separating beam search and the vllm core, they can be optimized individually. The simplified code will be much easier to optimize.

This is a good goal to work toward, as ensuring that API interfaces (OpenAI, beam search, or otherwise) can efficiently and reliably schedule new sequences benefits all consumers.

turning on prefix caching can reuse computation from the last step so that we don't need to recompute the kv cache of prompt again and again.

The flood of vLLM notifications is hard to keep up with, so I may be out of date. My understanding was that prefix caching was not precise and was block based, resulting in some amount of excess computation. Is there an issue to allow APIs to specify the "prefix length" that should be cached?

This new approach could see performance degrade when the sequence length approaches a multiple of the KV block length, if each arm of the beam search schedules a new sequence and must prefill O(kv_block_size) tokens in addition to decoding O(1) tokens. Ideally both would be O(1), with a hint allowing beam search to cache the entire prefix.
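As a back-of-the-envelope illustration (the numbers are made up): if prefix caching can only reuse complete KV blocks of the parent sequence, the tokens past the last full block boundary must be re-prefilled each time a beam arm is rescheduled.

```python
def extra_prefill_per_step(seq_len: int, block_size: int) -> int:
    # Tokens past the last complete block are not covered by block-granular
    # prefix caching and would be recomputed when a beam arm is rescheduled.
    return seq_len % block_size

print(extra_prefill_per_step(seq_len=1023, block_size=16))  # -> 15
print(extra_prefill_per_step(seq_len=1024, block_size=16))  # -> 0
print(extra_prefill_per_step(seq_len=1023, block_size=1))   # -> 0
```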

@youkaichao
Member Author

youkaichao commented Sep 9, 2024

My understanding was that prefix caching was not precise and was block based, resulting in some amount of excess computation

We can set the block size to 1 for the vLLM instance when we use beam search; then we don't have to waste any computation.
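A hedged sketch of that configuration (whether block_size=1 is actually accepted depends on the vLLM version and attention backend, so treat it as illustrative):

```python
from vllm import LLM

# block_size=1 makes block-granular prefix caching exact: a rescheduled beam
# arm never has to re-prefill tokens past the cached prefix. The model name is
# a placeholder.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,
    block_size=1,
)
```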

@simon-mo
Collaborator

simon-mo commented Sep 9, 2024

There are also some alternative implementations of this, e.g. moving this functionality into a special class of Worker or Executor, which can be configured when beam search is turned on for any engine that needs it.

@AaronFriel

@youkaichao How well does the KV cache handle a block size of 1, in terms of compute or memory overhead?

@youkaichao
Member Author

@AaronFriel I don't think setting the block size to 1 will affect performance a lot, but we need to test and measure the impact.

@youkaichao
Member Author

@simon-mo can you explain more? What special functions / interfaces would these new Worker or Executor need?

@youkaichao youkaichao pinned this issue Sep 12, 2024
@nFunctor
Contributor

Hello, thanks for your work. I am not sure whether I should create a new issue for this; technically these are still comments!

I have done some manual testing of your new beam search implementation, and here are some observations, together with a late RFC response:

  • I assume the new offline method will be made available in the server completions just like before? The current implementation in the OpenAI server is very basic anyway, and personally I would not need much more.

  • The speed win is more apparent the longer the model generates. The current (soon to be legacy?) implementation suffers from a lot of GPU idleness when generating around 500 tokens or more.

  • I have not seen v2_block_manager do anything in my tests, and the same goes for the scheduler. This is probably normal, considering that I've pushed GPU usage close to the maximum.

  • I believe the new (= HF) implementation is not the same as the old one (I have not read its code, admittedly)? The results for the same prompts and num_beams differ between old and new.

  • We now have the possibility to change the temperature in beam_search_params ([Feature]: Beam Search with Temperature > 0 #8067, opened a while ago).

The tests were run on Llama 3.1 8B AWQ / RTX 3090 / three basic completion prompts like "Today is a good day". I can provide more details if that's of any use.

@youkaichao youkaichao unpinned this issue Sep 27, 2024
@youkaichao
Member Author

@nFunctor it's great to hear that you find the new implementation faster! We do plan to add beam search back to the OpenAI server, with an implementation similar to LLM.beam_search. Please stay tuned.

Regarding exact equivalence with the old implementation: we cannot guarantee that generating 500 tokens produces exactly the same output as the HuggingFace implementation (or the old one). As long as the algorithm still follows beam search, it should be fine. We have checked that the first 64 tokens are the same, which should be enough for practical usage.

@nFunctor
Contributor

Thanks for your response @youkaichao. What do you think about the temperature implementation?

The new method can still be slower if generation is done with fewer beams and fewer tokens. As you say, the block_size parameter is what is holding the method back. I tried activating FlashInfer with block_size 8 and it indeed gave some speedup (12s -> 10s in one experiment).

What I found, as a byproduct, is that the mentioned Llama becomes completely incoherent with FlashInfer and the old beam method, repeating the same phrase (I am running an instruct-tuned model outside of its chat template format, but still). So, in regards to what you said about the results being similar, maybe the new results will actually be more "numerically stable" in some cases.

@youkaichao
Member Author

What do you think about the temperature implementation?

Beam search is a search algorithm; I don't see how it is related to temperature.

the new results will be actually more "numerically stable" in some cases

glad to hear that.

@nFunctor
Contributor

@youkaichao If I understand correctly, .generate returns log(softmax(logits / T)) as logprobs, so the temperature affects the sequences' cumulative scores and can lead to significant deviations, cf. this explanation. In the new implementation, setting a non-zero temperature in beam_search_params in vllm.py does change the generated sequence.

We don't change the logic of the algorithm, but since the next-token distribution changes, so might the results.
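A small self-contained illustration of the point (plain PyTorch, not vLLM code): the per-token logprobs are log(softmax(logits / T)), so changing T rescales the gaps between candidate scores, and once those scores are accumulated across steps the top beams can differ.

```python
import torch

logits = torch.tensor([2.0, 1.5, 0.2])  # next-token logits for three candidates

for T in (1.0, 0.5):
    logprobs = torch.log_softmax(logits / T, dim=-1)
    print(f"T={T}: {logprobs.tolist()}")

# With T=1.0 the top two candidates differ by 0.5 nats; with T=0.5 the gap
# doubles to 1.0 nat. Since different beams see different logits at each step,
# accumulating these rescaled scores can change which beams rank highest.
```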

@youkaichao
Member Author

Sorry, I don't get it. What is your ask?

@yunyipower

yunyipower commented Oct 10, 2024

@youkaichao hi, it's great to see your design. Does it support multi-batch beam search or not? I mean at the operator level, not a loop over the prompt list.

@youkaichao
Member Author

multi-batch beam-search

what is multi-batch beam-search?

@varuniyer

varuniyer commented Oct 13, 2024

@youkaichao The beam search docs for vllm.LLM still list these TODO items:

TODO: how does beam search work together with length penalty, frequency penalty, and stopping criteria, etc.?

I see you mentioned that beam search conflicts with the sampling algorithm. However, logit processors (currently an argument of the constructor of the SamplingParams object passed into generate) can be used to add penalties like these. They can affect the top k beams selected at each iteration of search even without sampling. Is there progress on supporting logit processors in the new beam search implementation? The closest issue I found is #9253 regarding stop conditions but not logit processors.
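For reference, attaching a logits processor via SamplingParams today looks roughly like this (the penalty below is a toy example, and whether the same hook is honored by the new beam search path is exactly the open question):

```python
import torch
from vllm import LLM, SamplingParams

def discourage_repeats(token_ids: list[int], logits: torch.Tensor) -> torch.Tensor:
    # Toy frequency-style penalty: push down the logits of tokens that have
    # already been generated in this sequence.
    for tid in set(token_ids):
        logits[tid] -= 1.0
    return logits

llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
params = SamplingParams(max_tokens=64, logits_processors=[discourage_repeats])
outputs = llm.generate(["Today is a good day"], params)
```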

@yunyipower

multi-batch beam-search

what is multi-batch beam-search?

Inference in batches, say a batch size of 16?

@liho00

liho00 commented Oct 14, 2024

How do I enable beam search in the vLLM OpenAI server? There don't seem to be any engine args available for it.

@HeegonJin

HeegonJin commented Oct 16, 2024

As mentioned in #9253, the current implementation does not stop generating when the EOS token is encountered, and continues until it reaches the maximum token limit. This appears to be the major issue.

@nFunctor
Contributor

@HeegonJin yes, and I tried my workaround in #9264 (we will see if the team approves).

As an external contributor I lack a full understanding of what's going on, but it seems to me that a beam gets completed yet never gets pushed to the completed beams because the EOS check is not applied thoroughly enough. The proposed stop conditions seem to do that, but they are not exactly elegant.
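To make that concrete, the routing being described looks roughly like the helper below (illustrative, not the actual vLLM code):

```python
from typing import List, Tuple

Beam = Tuple[List[int], float]  # (token ids, cumulative logprob)

def expand_beam(beam: Beam, top_candidates: List[Tuple[int, float]],
                eos_token_id: int,
                live: List[Beam], completed: List[Beam]) -> None:
    """Extend one beam by its top-k candidates, routing EOS hits to `completed`.

    The bug under discussion is the case where an EOS continuation ends up in
    `live` (or is never checked), so the finished beam keeps generating until
    max_tokens instead of being finalized.
    """
    tokens, cum_logprob = beam
    for token_id, logprob in top_candidates:
        extended = (tokens + [token_id], cum_logprob + logprob)
        if token_id == eos_token_id:
            completed.append(extended)
        else:
            live.append(extended)
```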

@youkaichao
Member Author

@youkaichao The beam search docs for vllm.LLM still list these TODO items:

TODO: how does beam search work together with length penalty, frequency penalty, and stopping criteria, etc.?

I see you mentioned that beam search conflicts with the sampling algorithm. However, logit processors (currently an argument of the constructor of the SamplingParams object passed into generate) can be used to add penalties like these. They can affect the top k beams selected at each iteration of search even without sampling. Is there progress on supporting logit processors in the new beam search implementation? The closest issue I found is #9253 regarding stop conditions but not logit processors.

We plan to improve beam search so that all the sampling parameters work.

@denadai2

To all: I added a feature request for a more powerful beam search (as it was in the old vLLM) here: #10754
