[Usage]: ValueError: Unexpected weight for Qwen2-VL GPTQ 4-bit custom model. #9832
Comments
I have switched to Linux (Colab).
Here is my environment.
Can you try using the latest main branch of vLLM? #9772 might already have fixed this issue.
cc @mgoin
Still getting the same error after installing vLLM from the main branch.
Using main before #9817 landed, I am able to load

@DarkLight1337 this gets into the larger issue we have with enabling quantization for more modules in vLLM, but many quantization methods/configurations do not have proper "ignored" lists of modules.

As an example, if you look at Qwen's official GPTQ checkpoint for Qwen2-VL, you can see that all of the "model." submodules are quantized but none of the "visual." ones are: https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4?show_file_info=model.safetensors.index.json

However, within that model's gptq quantization_config there is nothing specifying that those modules were ignored - it looks like the config should be applied everywhere: https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4/blob/main/config.json#L20-L30

```json
"quantization_config": {
    "bits": 4,
    "damp_percent": 0.1,
    "dataset": null,
    "desc_act": false,
    "group_size": 128,
    "modules_in_block_to_quantize": null,
    "quant_method": "gptq",
    "sym": true,
    "true_sequential": true
},
```

Luckily, not all quant configs have this issue - obviously compressed-tensors has an ignore list, and AWQ has a "modules_to_not_convert" list.
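For reference, a minimal sketch (not vLLM code) of the check being described here: read a checkpoint's config.json and look for any key that declares skipped modules. The key names are the ones mentioned in this thread; the helper function itself is made up.

```python
import json

from huggingface_hub import hf_hub_download


def skipped_modules(repo_id: str) -> list[str]:
    """Return the modules a checkpoint's quantization_config says were not quantized."""
    config_path = hf_hub_download(repo_id, filename="config.json")
    with open(config_path) as f:
        quant_cfg = json.load(f).get("quantization_config", {})

    # compressed-tensors uses "ignore", AWQ uses "modules_to_not_convert";
    # GPTQ configs like the one above carry neither key.
    for key in ("ignore", "modules_to_not_convert"):
        if quant_cfg.get(key):
            return list(quant_cfg[key])
    return []


# For Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4 this returns [], even though the
# "visual." submodules were never quantized - exactly the ambiguity above.
print(skipped_modules("Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4"))
```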
Is it feasible to change the model initialization code to switch between the regular and the quantized version based on whether the corresponding weight is available from the model file?
Not easily at all. We commonly rely on the assumption that we can allocate and distribute the model parameters by looking at the model config. Model loading from the weights is a separate step.
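A toy illustration of that two-step pattern (not vLLM's actual loader); the layer shapes and the 4-bit/group-size-128 packing are assumptions made only for this sketch.

```python
import torch
import torch.nn as nn


def build_linear(in_features: int, out_features: int, quantized: bool) -> nn.Module:
    """Step 1: allocate parameters purely from the config, before any weights are read."""
    if quantized:
        # A 4-bit GPTQ layer allocates packed int32 qweights and fp16 scales
        # instead of a dense fp16 weight; shapes come from the config alone.
        layer = nn.Module()
        layer.qweight = nn.Parameter(
            torch.empty(in_features // 8, out_features, dtype=torch.int32),
            requires_grad=False)
        layer.scales = nn.Parameter(
            torch.empty(in_features // 128, out_features, dtype=torch.float16),
            requires_grad=False)
        return layer
    return nn.Linear(in_features, out_features, dtype=torch.float16)


# Because the GPTQ config has no ignore list, the visual layers get allocated
# in quantized form too...
layer = build_linear(4096, 4096, quantized=True)

# ...so Step 2 (weight loading) later finds a plain "weight" tensor in the
# checkpoint where it expected "qweight"/"scales", which is the kind of
# "Unexpected weight" error reported in this issue.
```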
Just so I understand: is something wrong with how the model was quantized, or with how vLLM loads it?
Hmm, a more practical way might be to let the user specify additional config arguments via CLI then...
@bhavyajoshi-mahindra the issue is that AutoGPTQ will not quantize the visual section of Qwen2-VL, but it does not leave anything in the config to signify that those linear layers are skipped.

@DarkLight1337 I think we should simply add a special case for GPTQ models, like was done here for AWQ:

vllm/vllm/model_executor/models/internvl.py Lines 453 to 463 in 5608e61
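For illustration, a rough sketch of what such a GPTQ special case could look like, modeled on the AWQ handling referenced above; the helper is hypothetical, and the vLLM class names and import paths are assumptions that may differ from the actual code.

```python
from typing import Optional

from vllm.model_executor.layers.quantization.base_config import QuantizationConfig
from vllm.model_executor.layers.quantization.gptq import GPTQConfig
from vllm.model_executor.layers.quantization.gptq_marlin import GPTQMarlinConfig


def maybe_ignore_quant_config(
        quant_config: Optional[QuantizationConfig]) -> Optional[QuantizationConfig]:
    """Drop the quant config for modules that GPTQ checkpoints leave unquantized.

    GPTQ configs carry no ignore list, so the vision tower has to be
    special-cased; other methods (AWQ, compressed-tensors) declare the
    skipped modules in the config itself.
    """
    if isinstance(quant_config, (GPTQConfig, GPTQMarlinConfig)):
        return None
    return quant_config


# The model would then build its vision tower with something like
#   Qwen2VisionTransformer(..., quant_config=maybe_ignore_quant_config(quant_config))
# while the language model keeps the original quant_config.
```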
That may work for now. Does AWQ have an implicit list of modules that it quantizes? What if this changes in the future?
The thread here seems to indicate that AWQ should work, but I get the same issue with the AWQ version.
Yet the layer is specified as unconverted in the config file:
I'm trying with the latest main.
Thanks for testing @cedonley, it seems if you run
Any update on the issue? |
Hi @bhavyajoshi-mahindra, I only have a workaround for Qwen2-VL locally, so I have been sitting on it while I think about a more general solution. I will work on a PR using just the workaround for now.
Your current environment
I tried to run inference with my custom Qwen2-VL GPTQ 4-bit model using the code below:
I got this error:
Note:

```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Feb__8_05:53:42_Coordinated_Universal_Time_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
```
Can anyone help me with this?
How would you like to use vllm
I want to run inference of a [specific model](put link here). I don't know how to integrate it with vllm.
Before submitting a new issue...