
beam search support #722

Closed
leiwen83 opened this issue Jul 28, 2023 · 8 comments

@leiwen83

Feature request

Beam search is a useful feature provided by the transformers library, but it seems to be missing in TGI.
Would it be supported?
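
(For illustration, a minimal sketch of how beam search is typically invoked in the transformers library; the model name and generation settings below are placeholders, not part of the original request:)

# Minimal sketch of beam search with the transformers library.
# The model name and generation settings are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("What is Deep Learning?", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    num_beams=4,          # keep 4 candidate hypotheses alive at each step
    early_stopping=True,  # stop once every beam has finished
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))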

Motivation

Beam search would be helpful for response quality.

Your contribution

I'd give it a try if this feature is implemented.

@Narsil
Collaborator

Narsil commented Jul 31, 2023

Hi @leiwen83

Indeed, beam search is not implemented; however, we have a different algorithm which seems to work just as well or even better.

best_of takes the best of n potential sampled replies: #736 (comment)

Is that option what you're looking for?
It seems to perform better with current LLMs, where sampling is better than greedy decoding for most answers.
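
(A minimal sketch of requesting best_of from a running TGI server over its HTTP API; the server URL and parameter values are assumptions for the example, and the payload shape follows the curl call shown later in this thread:)

# Minimal sketch: ask a running TGI server for the best of n sampled replies.
# The server URL and parameter values here are illustrative.
import requests

payload = {
    "inputs": "What is Deep Learning?",
    "parameters": {
        "max_new_tokens": 128,
        "do_sample": True,   # best_of relies on sampling several candidates
        "temperature": 0.8,
        "best_of": 2,        # sample 2 candidates and keep the higher-scoring one
    },
}
response = requests.post("http://127.0.0.1:8080/generate", json=payload, timeout=120)
print(response.json()["generated_text"])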

@jiguanglizipao

I vote for beam search. When using PagedAttention, beam search can share a single prefill operation and save computation on long prompts.

@Quang-elec44

@jiguanglizipao I agree with you; it seems that the "best_of" argument does not provide good results. Moreover, in the case of my model, using "do_sample" leads to unwanted results.

@PawelFaron

PawelFaron commented Sep 18, 2023

Would be great to have. best_of is great but way too slow.
With best_of=1 I have time_per_token="92.055402ms"
With best_of=2 I have time_per_token="307.8662ms"

@Narsil
Collaborator

Narsil commented Sep 19, 2023

Beam search is much worse than best_of performance-wise.

The timing difference you show here is surprisingly large. How did you measure it
(model, hardware, where did you get the timing information from)?

@PawelFaron

PawelFaron commented Sep 19, 2023

@Narsil Thanks for your response. You are probably right; I'm just sharing my observations so far.

The timing comes from the Docker container itself; it prints it after generating text.
More about my setup:
I'm using a g2-standard-4 instance from GCP with a T4 GPU.

Starting the Docker container like this:

model=meta-llama/Llama-2-13b-chat-hf
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=$token

docker run --gpus all --shm-size 1g \
    -e HUGGING_FACE_HUB_TOKEN=$token \
    -p 4000:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:1.0.3 \
    --model-id $model --quantize bitsandbytes-nf4 \
    --max-input-length=4095 --max-total-tokens=4096 --trust-remote-code

Testing with that:

curl 127.0.0.1:4000 \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":2048, "temperature": 0.8, "best_of": 2, "do_sample": true}}' \
    -H 'Content-Type: application/json'

@Narsil
Collaborator

Narsil commented Sep 19, 2023

Oh, I see: bnb-nf4 is just super slow on anything above batch_size=1, and best_of=2 runs two sequences in parallel, so the effective batch size is 2.

It has nothing to do with best_of itself.


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on Apr 19, 2024
github-actions bot closed this as not planned on Apr 24, 2024