
Questions about model quantization framework design #1652

Closed
lixiaolx opened this issue Nov 14, 2023 · 3 comments

Comments

lixiaolx commented Nov 14, 2023

In the current quantization design, vLLM builds each model out of layers that are initialized directly as the layer types required by the chosen quantization algorithm. This design has two inconveniences, which become more pronounced as the number of supported models and quantization algorithms grows:
1. New model support: when quantization support is added for a new model (currently only LLaMA is supported), the layer-building and weight-loading parts of that model's model.py must be reworked (see https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py#L52 and https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py#L311).

2. New algorithm support: the design is not flexible enough for new quantization algorithms. Currently only two linear layers, column-parallel and row-parallel, are abstracted; supporting a new non-linear quantized operator means modifying model.py directly. As the number of supported algorithms grows, the layer-construction logic in model.py becomes complex and hard to maintain (https://github.com/vllm-project/vllm/pull/1508/files#diff-48d2ca5476d5b776f6401436fcf015c5ce4dc1a23d2b78a09e08fb85acc3697cR83). A simplified sketch of this build-time pattern follows.
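
To make the second point concrete, here is a minimal sketch of the build-time pattern described above. All names (AWQColumnLinear, build_qkv_proj, the quant_config dict) are hypothetical illustrations, not vLLM's actual API:

```python
import torch.nn as nn

class AWQColumnLinear(nn.Module):
    """Stand-in for an algorithm-specific column-parallel linear layer."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.proj = nn.Linear(in_features, out_features)  # placeholder internals

def build_qkv_proj(hidden_size: int, quant_config=None) -> nn.Module:
    # model.py must branch on every supported algorithm at build time, so
    # adding an algorithm (or a quantized non-linear op) means editing this file.
    if quant_config is None:
        return nn.Linear(hidden_size, 3 * hidden_size)
    if quant_config.get("method") == "awq":
        return AWQColumnLinear(hidden_size, 3 * hidden_size)
    raise ValueError(f"unsupported quantization method: {quant_config.get('method')}")
```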

For reference, two designs that handle multi-model, multi-algorithm quantization more gracefully (https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/bitsandbytes.py#L127 and https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/tensorrt_llm/models/quantized/quant.py#L177) both follow two steps:

  1. First build the complete, unquantized model layer by layer on the CPU side.
  2. Then call a convert_quant_model function that performs the layer replacement required by the chosen algorithm (a minimal sketch follows below).
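
Here is a minimal sketch of that two-step flow, loosely modelled on transformers' bitsandbytes integration. QuantLinear, the int8 packing, and the should_quantize predicate are all assumptions for illustration, not the actual API of either project:

```python
import torch
import torch.nn as nn

class QuantLinear(nn.Module):
    """Hypothetical int8 drop-in replacement for nn.Linear (illustration only)."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        # Symmetric per-tensor int8 quantization of the original fp weight.
        scale = linear.weight.detach().abs().max() / 127.0
        self.register_buffer("weight_int8",
                             torch.round(linear.weight.detach() / scale).to(torch.int8))
        self.register_buffer("scale", scale)
        self.bias = linear.bias

    def forward(self, x):
        # Dequantize on the fly; a real kernel would run an int8 GEMM instead.
        w = self.weight_int8.to(x.dtype) * self.scale
        return nn.functional.linear(x, w, self.bias)

def convert_quant_model(model: nn.Module, should_quantize=lambda name, mod: True):
    """Step 2: walk the full-precision model from step 1, swap layers in place."""
    for name, child in list(model.named_children()):
        if isinstance(child, nn.Linear) and should_quantize(name, child):
            setattr(model, name, QuantLinear(child))
        else:
            convert_quant_model(child, should_quantize)
    return model
```

Because the swap happens after construction, the model definition never mentions quantization; both advantages below follow from that separation.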

This design has two advantages:
1. New model adaptation: only the load_weight part needs to be updated; the original model definition does not change.
2. New algorithm support: any special layer changes stay confined to the quantization algorithm's own code, with no changes to model.py.

Could vLLM's quantization framework be upgraded along these lines?

simon-mo (Collaborator) commented

@zhuohan123 thoughts?

zhuohan123 (Member) commented

Thank you for your suggestions! Please check out this PR: #1622. I believe it implements what you have in mind.

lixiaolx (Author) commented

> Thank you for your suggestions! Please check out this PR: #1622. I believe it implements what you have in mind.

@zhuohan123, hi,
First, I read your new PR carefully, and I am glad to see the quantization-related code refactored; supporting more models and loading model weights is now more flexible and concise.

Second, I have two quantization-related questions:

  • Non-linear layers: after quantization is introduced, how are non-linear layers that need to be quantized handled? (For example, RMSNorm in SmoothQuant; operator fusion may also be needed for good performance after quantization.) With the current approach, this can only be done model by model, by modifying each model.py. Won't every quantization-related non-linear change then require one modification per model? (See the sketch after this list.)

  • Column vs. row parameter counts: the column-parallel and row-parallel linears have now been merged into a unified class that no longer distinguishes the two. Could this cause problems when a new quantization algorithm is introduced, for example one where the column and row variants require different numbers of parameters? How would that be handled?
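
On the first question, one possibility under the new design would be to treat non-linear layers the same way as linear ones, swapping them in a conversion pass instead of editing each model.py. The sketch below is purely illustrative (the class names and the scale-folding detail are my assumptions, not PR #1622's implementation):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Plain RMSNorm as built in the full-precision model."""
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x):
        var = x.pow(2).mean(-1, keepdim=True)
        return self.weight * x * torch.rsqrt(var + self.eps)

class SmoothQuantRMSNorm(RMSNorm):
    """Illustrative replacement: folds SmoothQuant's per-channel smoothing
    scale into the norm weight so the following int8 linear sees smoothed
    activations; a fused int8 kernel could be substituted here as well."""
    def __init__(self, norm: RMSNorm, smoothing_scale: torch.Tensor):
        super().__init__(norm.weight.numel(), norm.eps)
        with torch.no_grad():
            self.weight.copy_(norm.weight / smoothing_scale)
```

A convert_quant_model-style pass could then replace RMSNorm with SmoothQuantRMSNorm alongside the linear swaps, avoiding one model.py change per model.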

hmellor closed this as completed on Mar 13, 2024