[Bug]: vllm stuck when using prompt_token_ids and setting prompt_logprobs #5872
Comments
Hi, can you give #5846 a try? I think it would fix this bug. Thanks!
Can confirm it is fixed! Thank you
Unfortunately I'm still seeing this issue with lm_eval after applying #5846, but now it triggers when submitting more than one prompt at a time. I've attached a minimal reproduction script.
Thanks for the reproducer. I find that I can only trigger the endless loop when both …
I have the same problem. I have confirmed that #5846 cannot fix this bug!
I found an alternative approach: if you only require log_probs, you can decode the tokens yourself. Simply comment out the following two lines in …
Full function:
If you don't need the decoded text, just set SamplingParams.detokenize to False.
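A minimal sketch of that workaround, assuming detokenize=False can be combined with prompt_logprobs in your vLLM version (the next comment reports a conflict); the model name and prompt are placeholders, and the text is recovered from the token IDs with the tokenizer instead of vLLM's detokenizer:

```python
# Sketch of the workaround: skip vLLM's detokenization and decode the prompt
# token IDs yourself. Whether detokenize=False may be combined with
# prompt_logprobs depends on the vLLM version (see the next comment).
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # model is a placeholder
tokenizer = llm.get_tokenizer()

params = SamplingParams(temperature=0.0, max_tokens=1,
                        prompt_logprobs=1, detokenize=False)
outputs = llm.generate(["The quick brown fox jumps over the lazy dog."],
                       sampling_params=params)

for out in outputs:
    # Per-position prompt logprobs are still attached to the output;
    # recover the prompt text yourself from the token IDs.
    text = tokenizer.decode(out.prompt_token_ids)
    print(text, out.prompt_logprobs is not None)
```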
@zifeitong But if you set detokenize=False, it conflicts with prompt_logprobs=1!
output:
Hello, has this problem been solved yet? I want to use sampling_kwargs = SamplingParams(temperature=0, prompt_logprobs=0, max_tokens=1) to compute perplexity, but I find that generation keeps stalling and GPU utilization is quite low.
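For reference, a rough sketch of how perplexity could be computed from prompt_logprobs once the hang itself is resolved. It assumes the RequestOutput.prompt_logprobs layout of vLLM ~0.5.x (the first entry is None, subsequent entries map a token id to a Logprob object); the model and prompt are placeholders.

```python
# Sketch: perplexity over the prompt tokens from prompt_logprobs.
# Assumes vLLM ~0.5.x output layout; model and prompt are placeholders.
import math

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0, prompt_logprobs=0, max_tokens=1)
outputs = llm.generate(["The quick brown fox jumps over the lazy dog."],
                       sampling_params=params)

for out in outputs:
    token_ids = out.prompt_token_ids
    # The first prompt position has no logprob (None), so skip it.
    logprobs = [
        lp_dict[token_id].logprob
        for token_id, lp_dict in zip(token_ids[1:], out.prompt_logprobs[1:])
    ]
    ppl = math.exp(-sum(logprobs) / len(logprobs))
    print(f"prompt perplexity: {ppl:.3f}")
```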
Your current environment
🐛 Describe the bug
The Issue
When using the LLM class with both prompt_token_ids and prompt_logprobs, I have found that vLLM sometimes gets stuck. A minimal reproducing example is as follows, running with the official docker image:
docker run --gpus all --shm-size=10g --rm -e CUDA_VISIBLE_DEVICES=0 -v "$(pwd):/app" --entrypoint python3 vllm/vllm-openai:v0.5.0.post1 /app/vllm_reproduce.py
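The original /app/vllm_reproduce.py script is not included above; the following is only a hypothetical sketch of what such a script might look like based on the description in this issue. The model name, prompt, and N_REQUESTS value are assumptions, and the prompt_token_ids= keyword is the legacy form of LLM.generate used around v0.5.0.

```python
# Hypothetical reproduction sketch (the original /app/vllm_reproduce.py is not
# shown in this issue). Model, prompt, and N_REQUESTS are assumptions.
from vllm import LLM, SamplingParams

N_REQUESTS = 256  # the issue notes that small values may not trigger the hang

llm = LLM(model="facebook/opt-125m")
tokenizer = llm.get_tokenizer()

prompts = ["The quick brown fox jumps over the lazy dog."] * N_REQUESTS
prompt_token_ids = [tokenizer.encode(p) for p in prompts]

sampling_params = SamplingParams(temperature=0.0, max_tokens=1, prompt_logprobs=1)

# Passing raw token IDs (instead of text) together with prompt_logprobs is the
# combination reported to hang.
outputs = llm.generate(prompt_token_ids=prompt_token_ids,
                       sampling_params=sampling_params)
print(f"finished {len(outputs)} requests")
```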
The generation would get stuck. Note that this does not happen every time; for example, with a relatively small N_REQUESTS, the generation sometimes runs just fine. If we detokenize and use raw text as input, or if we set prompt_logprobs=None, the code does not get stuck.
The Analysis
I've done some initial analysis. The code seems to get stuck at the detokenization stage of the output_processor.
The decode_prompt_logprobs_inplace function (https://github.com/vllm-project/vllm/blob/v0.5.0.post1/vllm/transformers_utils/detokenizer.py#L24-L87) seems suspicious to me. I checked two things: first, printing len(prev_tokens); second, checking the advancement logic to see whether the condition if token_id == all_token_ids[token_position] was ever met.
I've found that for examples that run fine, prev_tokens is usually empty and the condition is rarely True. For the example where vLLM got stuck, I saw this:
It seems prev_tokens was growing unexpectedly.
Overall, the detokenization logic seems confusing to me, especially the parts where prev_tokens is updated. I am not sure why the problem does not occur when prompt_token_ids is not used, and I am not sure whether this issue has been observed before. This is supposed to be a pretty common use case for evaluations (e.g., in lm-evaluation-harness).