[Bug]: Incoherent Offline Inference Single Video with Qwen2-VL #9723
Comments
@alex-jw-brooks can you add this model to your test suite to check whether the current model implementation is ok?
@hector-gr Did you manage to get coherent single-image inference, or do you also experience the same issue (#9732) there?
Single image inference works fine in this setup. Note that I only tried a few small images, so it might be related to your issue.
Did you install vLLM from pip or from source?
@hector-gr Can you please try a 4k or 5120x1440 image? :)
@bhavyajoshi-mahindra Do you experience the same issue as me and need to run large images, or is your model incoherent even for small images? I simply installed it via pip and installed the Qwen utils. However, if you are just interested in quickly deploying your model, definitely take a look at the Qwen2-VL repo; they give you commands on how to make it work.
I did go through the Qwen2-VL repo and tried exactly what was mentioned, but I got this error:
That's why I want to know which versions of vLLM and transformers should be used, and how to install them (from pip or from source), in order to infer with my custom Qwen2-VL GPTQ 4-bit model for single images.
You should use either vLLM v0.6.1 with transformers v4.44, or vLLM v0.6.3 with transformers v4.45+.
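For anyone unsure which combination they are running, a quick sanity check (not from the thread, just a version printout) is:

```python
# Print the installed versions to confirm they match one of the suggested
# pairs (vLLM 0.6.1 + transformers 4.44, or vLLM 0.6.3 + transformers 4.45+).
import transformers
import vllm

print("vllm:", vllm.__version__)
print("transformers:", transformers.__version__)
```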
I used a [...] with output [...]
Note: downgrading to [...] allows for coherent generation after the video input.
@hector-gr Thank you for testing! :) May I ask how you installed vLLM? I am assuming your test ran on 0.6.3? Could you please also test the same 4k and 5120x1440 images with the OpenAI API endpoint (with the latest vLLM)? And how long did the 4k image take to process for you? In my testing, small images process in ~200 ms, but 4k ones take several seconds.
Can you please mention the CUDA, Torch, and Python versions as well?
@hector-gr Seemingly [...]
I tried to reproduce your results and got the same problem. Have you solved it yet? Is it related to there being too many image tokens?
I ended up with vLLM 0.6.3, transformers 4.46.1, torch 2.4.0, CUDA 12.1, and Python 3.10.
I got this error:
Note: "Qwen2VLForConditionalGeneration" is in the list of supported models, but I still got the error. @hector-gr can you help me with this?
Created new issues:
The OpenAI API endpoint works correctly with those two images (base64 encoded).
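For context, a minimal sketch of how such a base64-encoded image request to the vLLM OpenAI-compatible server might look (the server URL, model name, and image path below are placeholders, not the exact values used in this test):

```python
# Assumes a vLLM OpenAI-compatible server is already running at localhost:8000
# serving a Qwen2-VL model; "large_image.jpg" is a placeholder for one of the
# 4k / 5120x1440 test images mentioned above.
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("large_image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```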
Your current environment
The output of `python collect_env.py`
Model Input Dumps
No response
🐛 Describe the bug
I get incoherent generation outputs when using offline vLLM for inference with videos. This happens both with URLs and local paths, with the 7B or 72B model, and with or without tensor parallelism. The setup works well (provides coherent answers) when providing only text or text+image, but not video. The outputs are also very different from those generated by transformers with the same arguments.
The code below follows the example in the Qwen repo (https://github.com/QwenLM/Qwen2-VL?tab=readme-ov-file#inference-locally), and is also what seems to be recommended in the vLLM docs:
with output:
For transformers, the code is the default shown in the Qwen repo, which is indeed very similar. I checked through other issues and commits, and from my understanding this feature is supported; the only differences in implementation seem to be minimal (#8408 (comment)).
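For reference, a minimal sketch of this kind of offline video-inference script, loosely following the Qwen repo example (the video path, model size, and sampling parameters below are placeholders, not the exact values from the original report):

```python
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

MODEL_PATH = "Qwen/Qwen2-VL-7B-Instruct"

# Allow a single video per prompt; raise the limit if passing several clips.
llm = LLM(model=MODEL_PATH, limit_mm_per_prompt={"video": 1})

sampling_params = SamplingParams(temperature=0.1, top_p=0.9, max_tokens=256)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            # Placeholder local path; a URL can be used instead.
            {"type": "video", "video": "my_video.mp4"},
            {"type": "text", "text": "Describe this video."},
        ],
    },
]

# Build the chat prompt and extract the video frames with qwen_vl_utils.
processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)

outputs = llm.generate(
    [{"prompt": prompt, "multi_modal_data": {"video": video_inputs}}],
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
```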