
Questions about model quantization framework design #1652

Closed
lixiaolx opened this issue Nov 14, 2023 · 3 comments

Comments

lixiaolx commented Nov 14, 2023

In the current quantization design, vLLM builds each model out of layers that are initialized directly as the layer types required by the chosen quantization algorithm. This design has two inconveniences, which become more pronounced as the number of supported models and quantization algorithms grows:
1. New model support: when quantization support is added for a new model (currently only LLaMA is supported), the layer-building and weight-loading parts of that model's model.py must be reworked (see https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py#L52 and https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py#L311).

2. New algorithm support: the design is not flexible enough for new quantization algorithms. Currently only two linear layers, column-parallel and row-parallel, are abstracted; supporting a new non-linear quantized operator means modifying model.py directly. As the number of supported algorithms grows, the layer-construction logic in model.py becomes complex and hard to maintain (https://github.com/vllm-project/vllm/pull/1508/files#diff-48d2ca5476d5b776f6401436fcf015c5ce4dc1a23d2b78a09e08fb85acc3697cR83). A simplified sketch of this build-time pattern follows.
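
To make the second point concrete, here is a minimal sketch of the build-time pattern described above. All names (AWQColumnLinear, build_qkv_proj, the quant_config dict) are hypothetical illustrations, not vLLM's actual API:

```python
import torch.nn as nn

class AWQColumnLinear(nn.Module):
    """Stand-in for an algorithm-specific column-parallel linear layer."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.proj = nn.Linear(in_features, out_features)  # placeholder internals

def build_qkv_proj(hidden_size: int, quant_config=None) -> nn.Module:
    # model.py must branch on every supported algorithm at build time, so
    # adding an algorithm (or a quantized non-linear op) means editing this file.
    if quant_config is None:
        return nn.Linear(hidden_size, 3 * hidden_size)
    if quant_config.get("method") == "awq":
        return AWQColumnLinear(hidden_size, 3 * hidden_size)
    raise ValueError(f"unsupported quantization method: {quant_config.get('method')}")
```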

For reference, two designs that handle multi-model, multi-algorithm quantization more gracefully (https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/bitsandbytes.py#L127 and https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/tensorrt_llm/models/quantized/quant.py#L177) both follow two steps:

  1. First build the complete, unquantized model layer by layer on the CPU side.
  2. Then call a convert_quant_model function that performs the layer replacement required by the chosen algorithm (a minimal sketch follows below).
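
Here is a minimal sketch of that two-step flow, loosely modelled on transformers' bitsandbytes integration. QuantLinear, the int8 packing, and the should_quantize predicate are all assumptions for illustration, not the actual API of either project:

```python
import torch
import torch.nn as nn

class QuantLinear(nn.Module):
    """Hypothetical int8 drop-in replacement for nn.Linear (illustration only)."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        # Symmetric per-tensor int8 quantization of the original fp weight.
        scale = linear.weight.detach().abs().max() / 127.0
        self.register_buffer("weight_int8",
                             torch.round(linear.weight.detach() / scale).to(torch.int8))
        self.register_buffer("scale", scale)
        self.bias = linear.bias

    def forward(self, x):
        # Dequantize on the fly; a real kernel would run an int8 GEMM instead.
        w = self.weight_int8.to(x.dtype) * self.scale
        return nn.functional.linear(x, w, self.bias)

def convert_quant_model(model: nn.Module, should_quantize=lambda name, mod: True):
    """Step 2: walk the full-precision model from step 1, swap layers in place."""
    for name, child in list(model.named_children()):
        if isinstance(child, nn.Linear) and should_quantize(name, child):
            setattr(model, name, QuantLinear(child))
        else:
            convert_quant_model(child, should_quantize)
    return model
```

Because the swap happens after construction, the model definition never mentions quantization; both advantages below follow from that separation.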

This design has two advantages:
1. New model adaptation: only the load_weight part needs to be updated; the original model definition does not change.
2. New algorithm support: any special layer changes stay confined to the quantization algorithm's own code, with no changes to model.py.

Could vLLM's quantization framework be upgraded along these lines?

simon-mo (Collaborator) commented

@zhuohan123 thoughts?

zhuohan123 (Member) commented

Thank you for your suggestions! Please check out this PR: #1622. I believe it implements what you have in mind.

lixiaolx (Author) commented

> Thank you for your suggestions! Please check out this PR: #1622. I believe it implements what you have in mind.

@zhuohan123, hi,
First, I read your new PR carefully, and I am glad to see the quantization-related code refactored; supporting more models and loading model weights is now more flexible and concise.

Second, I have two quantization-related questions:

  • Non-linear layers: after quantization is introduced, how are non-linear layers that need to be quantized handled? (For example, RMSNorm in SmoothQuant; operator fusion may also be needed for good performance after quantization.) With the current approach, this can only be done model by model, by modifying each model.py. Won't every quantization-related non-linear change then require one modification per model? (See the sketch after this list.)

  • Column vs. row parameter counts: the column-parallel and row-parallel linears have now been merged into a unified class that no longer distinguishes the two. Could this cause problems when a new quantization algorithm is introduced, for example one where the column and row variants require different numbers of parameters? How would that be handled?
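
On the first question, one possibility under the new design would be to treat non-linear layers the same way as linear ones, swapping them in a conversion pass instead of editing each model.py. The sketch below is purely illustrative (the class names and the scale-folding detail are my assumptions, not PR #1622's implementation):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Plain RMSNorm as built in the full-precision model."""
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x):
        var = x.pow(2).mean(-1, keepdim=True)
        return self.weight * x * torch.rsqrt(var + self.eps)

class SmoothQuantRMSNorm(RMSNorm):
    """Illustrative replacement: folds SmoothQuant's per-channel smoothing
    scale into the norm weight so the following int8 linear sees smoothed
    activations; a fused int8 kernel could be substituted here as well."""
    def __init__(self, norm: RMSNorm, smoothing_scale: torch.Tensor):
        super().__init__(norm.weight.numel(), norm.eps)
        with torch.no_grad():
            self.weight.copy_(norm.weight / smoothing_scale)
```

A convert_quant_model-style pass could then replace RMSNorm with SmoothQuantRMSNorm alongside the linear swaps, avoiding one model.py change per model.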

hmellor closed this as completed on Mar 13, 2024