[Bug]: The accuracy of vllm-Qwen2-VL-7B-Instruct is low. #8408
Comments
@fyabc can you help look into this if you have time? Thanks!
@DarkLight1337 @xiangxinhello I will take a look at it.
@xiangxinhello Hi, you set dtype to 'float32' in your example code. I want to confirm: which dtype do you use in vLLM, and which in transformers?
If vLLM is using fp32 and transformers is using fp16, the difference may be acceptable... @ShuaiBai623 can you take a look at this diff?
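(For reference, a minimal sketch of pinning the same dtype on both sides; the model path is copied from the reproduction script in the issue body, the variable names are only illustrative, and loading both engines in one process is just to show the two knobs side by side.)

import torch
from transformers import Qwen2VLForConditionalGeneration
from vllm import LLM

MODEL_PATH = '/workspace/mnt/storage/trt-llama/Qwen2-VL-7B-Instruct'

# vLLM: the dtype is chosen when the engine is built.
vllm_engine = LLM(model=MODEL_PATH, dtype='float16')

# transformers: the dtype is chosen when the weights are loaded.
hf_model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_PATH, torch_dtype=torch.float16, device_map='auto',
)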
vLLM gives the same result with float32 and float16; both show the error.
Hi @fyabc, do you support Qwen-VL-Chat?
@xiangxinhello #8029 already added support for Qwen-VL-Chat; you can try the latest vllm-0.6.1.
击掌(531,516),(581,596) transformers-qwen2-vl-fp16
Hi @fyabc, the models transformers-qwen-vl-float16 and vllm-qwen-vl-float16 show discrepancies. Could you help me with this?
Hi @DarkLight1337 @fyabc, https://github.com/QwenLM/Qwen-VL
@xiangxinhello Hi, I have tested Qwen2-VL-7B-Instruct fp16/fp32 on vLLM and HF, and got the same outputs.
Hi @fyabc. Environment: A100-PCIE-40GB, transformers-Qwen2-VL-7B-Instruct. This is the transformers test script:
model = Qwen2VLForConditionalGeneration.from_pretrained(...)
processor = AutoProcessor.from_pretrained("/workspace/mnt/storage/trt-llama/Qwen2-VL-7B-Instruct")
messages = [...]
text = processor.apply_chat_template(...)
generated_ids = model.generate(**inputs, max_new_tokens=128)
The main issue is that there is a slight difference in the coordinate values between the two.
@xiangxinhello Hi, can you add repetition_penalty=1.05 on the transformers side and compare the outputs again?
@fyabc
@fyabc, I set transformers model.generation_config.repetition_penalty = 1.05 and the output is ['击掌(531,516),(587,594)'].
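(A minimal sketch of keeping the penalty identical on both sides; the vLLM values are taken from the script in the issue body, and the transformers side shows the generation-config route used in the comment above.)

from transformers import GenerationConfig
from vllm import SamplingParams

# vLLM side: repetition_penalty is part of SamplingParams.
sampling_params = SamplingParams(
    temperature=0.1, top_p=0.001, repetition_penalty=1.05, max_tokens=256,
)

# transformers side: the same penalty lives on the generation config; it can be
# assigned to model.generation_config (as above) or passed to model.generate().
gen_config = GenerationConfig(repetition_penalty=1.05, max_new_tokens=128)
print(sampling_params.repetition_penalty, gen_config.repetition_penalty)  # 1.05 1.05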
Diving into the code step by step to check the difference between vLLM and HF, I found some small implementation differences between two components. In my test case, by aligning the implementations of these two components, HF produced the same output as vLLM. However, I have not performed a more comprehensive set of tests. Personally, I feel these differences are quite small; errors of this magnitude are generally acceptable.
Hi @kq-chen and @fyabc, thank you for your help. This is the vllm-Qwen2-VL-7B-Instruct test script (see the script under "Your current environment" below):
This is the huggingface-Qwen2-VL-7B-Instruct test script:
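A minimal sketch of what that transformers-side script could look like, reconstructed from the fragments quoted earlier in the thread; the dtype, device placement, and decoding steps are assumptions based on standard Qwen2-VL usage, while the paths, prompt, and repetition penalty are copied from elsewhere in this issue.

import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

MODEL_PATH = '/workspace/mnt/storage/trt-llama/Qwen2-VL-7B-Instruct'
IMAGE_PATH = '/workspace/mnt/storage/llm_storge/vllm/examples/demo.jpeg'

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_PATH, torch_dtype=torch.float16, device_map='auto',
)
model.generation_config.repetition_penalty = 1.05  # as set in the comment above
processor = AutoProcessor.from_pretrained(MODEL_PATH)

messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': [
        {'type': 'image', 'image': IMAGE_PATH, 'max_pixels': 12845056},
        {'type': 'text', 'text': '输出击掌的检测框'},  # "output the detection box of the high-five"
    ]},
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors='pt',
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)
]
print(processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False,
))
# ['击掌(531,516),(587,594)'] was reported above with this repetition penalty.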
Your current environment
from PIL import Image
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info
MODEL_PATH = '/workspace/mnt/storage/trt-llama/Qwen2-VL-7B-Instruct'
IMAGE_PATH = '/workspace/mnt/storage/llm_storge/vllm/examples/demo.jpeg'
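# Build the offline vLLM engine; dtype='float32' here (float16 was reported above to give the same coordinates).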
llm = LLM(
    model=MODEL_PATH,
    dtype='float32',
    limit_mm_per_prompt={'image': 10, 'video': 10},
)
sampling_params = SamplingParams(
    temperature=0.1, top_p=0.001, repetition_penalty=1.05, max_tokens=256,
    stop_token_ids=[],
)
messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': [
        {
            'type': 'image',
            'image': IMAGE_PATH,
            'max_pixels': 12845056,
        },
        {
            'type': 'text',
            'text': '输出击掌的检测框',
        },
    ]},
]
processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
)
image_inputs, video_inputs = process_vision_info(messages)
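# Pack the extracted image/video inputs into vLLM's multi-modal data dict.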
mm_data = {}
if image_inputs is not None:
    mm_data['image'] = image_inputs
if video_inputs is not None:
    mm_data['video'] = video_inputs
llm_inputs = {
    'prompt': prompt,
    'multi_modal_data': mm_data,
}
# 击掌(529,516),(583,594)
outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)
Model Input Dumps
No response
🐛 Describe the bug
Qwen2-VL-7B-Instruct: vllm-qwenvl-fp16 has a bug. The outputs of vllm-qwenvl and transformers-qwenvl differ.
击掌(529,513),(584,605) vllm-fp16
击掌(531,516),(581,596) transformers-qwen2-vl-fp16
The coordinates of vllm are (529,513),(584,605).
The coordinates of transformers are (536,509),(588,602).
The difference between them is significant.
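To put a rough number on the gap, a small sketch that compares the two fp16 boxes as reported above; the value in the final comment is what it prints for those inputs.

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

vllm_box = (529, 513, 584, 605)   # vllm-fp16 output above
hf_box = (531, 516, 581, 596)     # transformers-fp16 output above
print(box_iou(vllm_box, hf_box))  # ~0.79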