[RFC] Drop beam search support #6226
Comments
We should superficially disable beam search in the next vLLM release (e.g. assert False, "beam search is deprecated, see #6226") and see the reaction. If there is a lot of noise, then we should consider taking a path that maintains compatibility. |
Beam search gives consistent results and is used in production-level systems where predictable results are important, so dropping beam search would be a bad idea IMHO. Setting temperature=0 provides somewhat predictable results, but not always. |
The MLPerf inference benchmark requires the beam search feature, so I think this is still useful in the industry. Here's the link to the MLPerf inference rules: thanks, -yuan |
Regarding MLPerf Inference @zhouyuan , it is only needed for the GPT-J benchmark (which was the first LLM task they added) and is not used for Llama 2 70B or Mixtral 8x7B (which are more recent). I don't believe beam search will be used in future tasks since it is generally not practical for cost-effective deployment. |
As an alternative that still serves customers who want similar features, I would like to propose a new param for our current vLLM system; let's call it num_q (number of queries, i.e. the former num_beams). Say we set num_q=5: it behaves similarly to best_of or n, but instead it takes the top 5 tokens for the first generated token and continues generation from each of them. Request:
Response:
Customers are guaranteed to get 5 different responses along with their logprobs, and in the meantime they can still conduct beam search themselves by choosing the sequence with the best logprobs. Doing this introduces far less complication into vLLM, and it also gives users the freedom to decide which sequence they want to use.
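A minimal sketch of this fan-out idea, written with Hugging Face transformers rather than vLLM internals (the function name, the gpt2 default, and the greedy continuation of each branch are illustrative assumptions, not part of the proposal):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def num_q_generate(prompt, num_q=5, max_new_tokens=64, model_name="gpt2"):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    input_ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        first_logits = model(input_ids).logits[0, -1]                 # distribution over the first new token
    top = torch.topk(torch.log_softmax(first_logits, dim=-1), num_q)  # top num_q candidate first tokens
    branches = []
    for logprob, token_id in zip(top.values, top.indices):
        branch = torch.cat([input_ids, token_id.view(1, 1)], dim=-1)  # prompt + one distinct first token
        out = model.generate(branch, max_new_tokens=max_new_tokens - 1, do_sample=False,
                             output_scores=True, return_dict_in_generate=True)
        scores = model.compute_transition_scores(out.sequences, out.scores, normalize_logits=True)
        text = tok.decode(out.sequences[0][input_ids.shape[1]:], skip_special_tokens=True)
        branches.append((text, logprob.item() + scores.sum().item())) # completion and its cumulative logprob
    return sorted(branches, key=lambda b: b[1], reverse=True)         # caller picks by logprob
```

The caller gets num_q distinct completions ranked by cumulative logprob and can keep only the best one, reproducing a rough beam-search-like selection outside the engine.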
|
@cadedaniel Thanks for the suggestion! Here's what we've decided to do:
|
+1, our teams observe benefits for reliability and occasionally even latency from beam search, highly relevant in Prod |
Yes. The most commonly used now is top-p and top-k sampling. |
I kindly suggest maintaining beam search support, as it is the primary option for translation tasks, even with LLMs. |
@nightflight-dk Thanks for your input! Are you using vLLM in production? If so, we'd be happy to discuss our plan with you. |
A potential use-case we have is that sometimes using … Adding the fact that the model should choose from … |
I think the typical use case for taking multiple samples is when you have a method for "trying" a sample. Perhaps the first sample "fails", and then you want to try the second sample, etc. (Our specific use case is formal proof search.) Beam search is well suited for this application, because the beams provide diversity. With random sampling I could end up retrying the same "almost surely good" idea over and over, instead of continuing to the second idea. It's true that beams ranking lower are likely bad. But trying a bad idea still beats trying the same good idea twice. That said, I'm a fan of simpler code. If random sampling is much faster than beam search, we can just deduplicate the samples or something. I will run some experiments to measure how this will affect us. |
We have noticed that token level logprobs from beam search are quite informational compared to those from nucleus sampling. A lot of our workflows depend on these logprobs and I'd suggest keeping beam search support as well! |
We heavily depend on beam search at Heavy.ai in VLLM in production with customers to give optimal accuracy for text-to-SQL tasks (https://www.heavy.ai/heavyiq/overview), and would lose significant accuracy with it turned off. Perhaps we could implement it ourselves using the log probabilities (would be nervous about the performance though) or freeze our version to 0.5.2, but neither is ideal at all. We are also looking at various sampled approaches using a judge model to pick the best, and here again taking the top-n beam search generations provides better accuracy than setting a non-zero temperature and taking n samples. From the above I understand the motives but I'd request that this be reconsidered. It's not just us either, pretty much all the SOTA text-to-SQL approaches use beam search to get best accuracy. |
Beam search is a deal breaker for our use case. We use it extensively in prod. We have found that it increases the accuracy of our LLM's responses by roughly 1%, which is absolutely critical for our use case. Unfortunately if vLLM stops supporting beam search we'll have to switch to an unoptimized inference engine. |
We are considering using beam search as it actually improves performance, and we are reviewing its use at the production level. Dropping it might alone make us reconsider using vLLM. The speed and complexity of the implementation can be seen as a trade-off for better performance and the ability to infer the model's choice paths. Must it really be deleted? We do not want that. |
We, at Spotify, use vLLM beam search to sample multiple answers from the same prompt in multiple textual tasks. This breaking change would hurt us significantly and we may have to reconsider vllm usage for some of our use cases, if there are no alternatives :( please, reconsider it Feel free to DM me |
We are very much relying on beam search for our biomedical industry applications to significantly boost performance in our setting. That benefit is large enough to consider alternative projects for serving, but we would hate to have to abandon vllm :( |
We are using beam search in production and would appreciate its continued support |
For production use cases, please also indicate why you chose beam search and not the other sampling methods. Many public API services do not provide beam search; what would you do if you didn't have beam search (i.e. any workaround)? A possible workaround: LLMs are very smart at present, so if you just want output diversity, how about adding a system prompt instructing the model to produce more diverse output? |
As a user of guidance/AICI/other methods of constraining LLM output, I find that disabling beam search can reduce the quality of outputs, for the reasons users describe above. We've noticed that across a wide array of models, these two facts interact poorly:
For vLLM with open source models, beam search helps overcome this obstacle, in effect giving the model a weak form of backtracking. With LLM APIs, we maintain a list of tokens to which we add a small negative weight; however, this list is not exhaustive and, of course, we need to derive the token IDs for each unique tokenizer. In my experience, beam search works better than negative weighting these tokens, and is more straightforward and adaptable to multiple models. This is a sample of our "verboten tokens" file:

```json
[
  "\",",
  "],",
  "[\"",
  "[]",
  ",\"",
  "\"]",
  "][",
  "},",
  "\",\"",
  "{{",
  "\"\"",
  "}}",
  "{\"",
  "]]"
]
```
@AaronFriel have you tried the guided decoding at https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#extra-parameters-for-chat-api ? |
I'm familiar with Guidance, yes, I mentioned it in my reply. |
My understanding is that guided decoding in particular benefits from beam search, for reasons alluded to here and here, i.e. you can get into nasty situations with guided decoding where the probabilities of earlier tokens can be skewed by the probabilities of later tokens, even if some of those combinations are disallowed by the guided choice/regex/json. |
We also use beam search in a production deployment of vllm and would probably have to migrate off of vllm without it. We're optimizing for accuracy on a structured task, not diversity of output, and have found that beam search produces the best results. |
My HEAVY.AI colleagues have already commented, but to add a little detail ... we use beam search so that we can performantly constrain things to our specific SQL syntax. We've found it to be faster than alternatives and are using it in production across multiple accounts. |
@hrsmanian @zhouyuan @lanking520 @nightflight-dk @HeegonJin @SemMulder @darabos @DhruvaBansal00 @tmostak @physicsrob @YooSungHyun @denadai2 @sjmielke @Reichenbachian @AaronFriel @hinnefe2 @mflaxman10 Due to strong pushback from the community, we have decided to reconsider this proposal. vLLM will continue to support beam search until further notice. We will try to find other ways to optimize the system overheads and get back to this proposal after more exploration. Thanks everyone for the feedback! |
@WoosukKwon thank you. Super appreciated! To give more context, Spotify, among other use cases, needs to have long exact inference (e.g. for recommendations). Thus, beam search is great for this :) |
An update on this thread: For users who need beam search, I'd like to know how sensitive you are w.r.t. the latency and throughput of the inference. Per my understanding, beam search is quite slow in terms of both latency and throughput. If you use beam search, I assume you are not very sensitive to the speed, but just want the quality of generation from beam search. Why do I ask this? Because I'd like to move the beam search logic one level above the current vLLM. Say we have an inference engine that exposes an OpenAI API server; it seems we can emulate an API server with beam search by asking the underlying server to produce one token at a time, with multiple logprobs:

```python
def beam_search_proxy(sequence, beam_width, max_tokens):
    candidates = [sequence]
    finished = []
    while candidates:
        new_candidates = []
        for seq in candidates:
            # Ask the API server for a single token, with beam_width logprobs.
            for token, logprob in generate(seq, max_tokens=1, logprobs=beam_width):
                new_candidates.append(new_seq(seq, token, logprob))
        # Move completed sequences out of the active beam; is_finished() is assumed
        # to account for EOS and the max_tokens budget.
        finished += [x for x in new_candidates if x.is_finished()]
        new_candidates = [x for x in new_candidates if not x.is_finished()]
        # Keep the beam_width best partial sequences by cumulative logprob.
        new_candidates.sort(key=lambda x: x.cumulative_logprobs, reverse=True)
        candidates = new_candidates[:beam_width]
    finished.sort(key=lambda x: x.cumulative_logprobs, reverse=True)
    return finished[:beam_width]
```

The sharing of memory and computation among sequences can be achieved via prefix caching. Disclaimer: I'm not familiar with beam search, and the semantics of the above function can be wrong. Please just read it for the idea: emulate beam search with a normal OpenAI API server. If we can go in this direction, the outcome would be:
|
We are very sensitive to throughput, but not latency. We need the highest possible throughput with beam search. If there's a substantial drop in overall compute efficiency, or drop of beam search support, we would migrate our inference elsewhere (or possibly fork, although TBH we don't want to be in the business of optimizing inference.) For what it's worth, I think it's unlikely that moving to a higher level abstraction would work without a substantial drop in throughput. My weak evidence for this: #1646 We currently monkeypatch our VLLM in production to make the fork operation performant. I honestly hate that we do this, but the cost implications of not doing it are unacceptable. |
@physicsrob can you elaborate on that? |
The point is, very few developers understand beam search, and many new features directly hard fail when beam search is used; see, for example, vllm/vllm/spec_decode/batch_expansion.py (lines 303 to 305 in 6e36f4f) and vllm/vllm/engine/output_processor/multi_step.py (lines 100 to 103 in 6e36f4f).
I'm pretty sure this will happen more often in the future. If we keep beam search in vLLM, even if the performance is untouched, you will find more and more bugs related to beam search. By separating beam search and vLLM, both of them can be optimized separately. And it is even possible that, in the end, beam search on top of the new vLLM turns out better than the current beam search inside vLLM. |
Same here. But we find throughput very poor already. 2,500 t/s with n=1, 100 t/s with 10 beams. (I was always wondering if we're doing something wrong. 😅)
I'm not qualified to judge, but I like the idea of moving beam search to a higher layer. I can imagine it may make it easier to do batching for the beams. E.g. in your example:
Perhaps this could be replaced with a batch completion:
So we only make one inference call per token, which covers all beams at once. |
we should definitely add batching for the beams. please take the code snippet as just a demonstration of the idea lol.
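For illustration, a rough sketch of what one batched beam-expansion step could look like (generate_batch and the candidate methods are hypothetical helpers, not an existing vLLM or OpenAI API):

```python
def expand_beams_batched(candidates, beam_width):
    # One beam-search step: a single batched call returns, for every candidate
    # prefix, its top-beam_width (token, logprob) continuations.
    results = generate_batch([c.text for c in candidates], max_tokens=1, logprobs=beam_width)
    new_candidates = []
    for cand, top_k in zip(candidates, results):
        for token, logprob in top_k:
            new_candidates.append(cand.extend(token, logprob))
    # Keep only the best beam_width continuations across all beams.
    new_candidates.sort(key=lambda c: c.cumulative_logprobs, reverse=True)
    return new_candidates[:beam_width]
```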
this is possible, because beam search is a very complicated search algorithm. in normal decoding, you can stream back every token you generate. However, in beam search, the tokens you get might be discarded later. In fact, many tokens will be decoded and then discarded.
thank you for your support! |
Another update on this thread: For people who use beam search, what are the common sampling parameters? Besides basic beam search, do you need to compose beam search with the other features, e.g. beam search + guided decoding? Beam search with temperature? Beam search with presence_penalty/frequency_penalty/repetition_penalty/length_penalty? Beam search with logprobs? It is hard for me to imagine what beam search + guided decoding means, and what the comparison criterion is for beam search with presence_penalty/frequency_penalty/repetition_penalty/length_penalty (i.e. is the penalty included when judging the quality of candidates?). Basically, because beam search is a search algorithm, it usually conflicts with all the other sampling algorithms. And as I mentioned before, many features in vLLM already directly assert that beam search is not used. Please provide further feedback on your specific use cases of beam search. If throughput is the only concern with moving beam search one level above the vLLM core, I'm pretty sure we should be able to optimize the speed to be as fast as the current vLLM implementation. |
For me, it's just beam search with up to 100 beams. No other sampling features used. Temperature=0. Currently we rely on getting back logprobs for the generated samples. We use these to get an overall "sentence likelihood" which is then used for comparisons across different generations. (It's a best-first search.) There are some conceptual issues with this and we plan to switch to a better scoring function. |
@darabos thanks for the response! followup questions: do you use the openai api server for beam search, or do you use the LLM class?
when you create one vLLM instance to use beam search, do you need a different beam search width for different prompts, or do they all have the same beam search width? |
Oh my, this feature will be personalized for my needs! 😄
The LLM class. Here's an excerpt from our code that hopefully includes the bits you're looking for. The model is often a DeepSeek-Coder 1.3b fine-tune.

```python
def __init__(self):
    self.model = vllm.LLM(model=model_name, gpu_memory_utilization=0.5, max_model_len=10240)

def candidates(self, ...):
    sp = vllm.SamplingParams(temperature=0, n=num_beams, use_beam_search=True, max_tokens=100, early_stopping=True)
    outputs = self.model.generate(prompts, sp, use_tqdm=False)
```

I don't think there is a lot of thought behind how we set these parameters.
The same. The way our code works, and I think this may be a typical use of beam search, is that we want to try the best generation, then the second best, etc. Generating 16 samples is just a compromise. Often we won't use all 16, other times we would need more than 16. The ideal for us would be if we could pull N samples one by one, without guessing N ahead of time. I know this is not on the table with beam search. |
@darabos thanks! your explanation helps a lot |
Hi, sorry for the late response but I was in parental leave. I cannot go into details ATM but I want to give further details about our use case. However, we use beam search to predict some catalog codes for recommendation purposes. We do this for offline inference but we are considering doing online inference as well. Why beam search? Because each catalog item corresponds to a sequence of codes and our model has to predict existing sequences. Early results with top-k sampling are significantly worse than beam search. We usually have ~ 30 or 50 beams and sequences that are between 3 and 15 long. I know, speed is crucial but this task's quite intense :( |
to all: I added a feature request for a more powerful beam search (as it was in the old vllm) here #10754 |
TL;DR: To reduce system complexity and enable future optimizations, we propose discontinuing beam search support.
Due to strong pushback from the community, we have decided to reconsider this proposal. vLLM will continue to support beam search until further notice. Thanks everyone for the feedback!
Motivation.
Currently, vLLM supports 3 types of sampling: greedy, random, and beam search. Beam search, which dynamically creates and removes top-k branches at each step, is the most complex of the three. Traditionally, beam search has been popular for NLP tasks like translation and summarization. However, in the LLM era, beam search has become less common. Major LLM APIs such as GPT, Gemini, and Claude do not support it.
In vLLM, beam search initially motivated the idea of PagedAttention. In fact, vLLM excels at beam search compared to other inference engines, since PagedAttention can efficiently handle the dynamic nature of beam search and minimize its KV cache usage. Despite this, implementing beam search introduces significant system complexity and hinders potential optimizations: it complicates the system while being rarely used.
To resolve this, we propose eliminating beam search support, which will provide the following benefits:
Reduced Complexity in Sampling and Output Processing
More Predictable Block Table
Potential Future Removal of SequenceGroup
Proposed Change.
We plan to execute this in 3 steps:
We are open to reintroducing beam search if there is strong demand from the community. Please share any concerns regarding this decision. We apologize for any inconvenience caused by this change.
Feedback Period.
No response
CC List.
No response
Any Other Things.
No response