Questions about model quantization framework design #1652
@zhuohan123 thoughts?
Thank you for your suggestions! Please check out this PR: #1622. I believe this PR implements what you are thinking.
Hi @zhuohan123, I also have two quantization-related questions:
In the current quantization design, vLLM builds the model by initializing each layer directly as the layer class required by the chosen quantization algorithm. This design has two inconveniences, which become more pronounced as the number of supported models and quantization algorithms grows:
1. New model support: quantizing a new model (currently only LLaMA is supported) requires reworking both the layer-construction and weight-loading code in that model's model.py (https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py#L52, https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py#L311).
2. New algorithm support: the current abstraction only covers column-parallel and row-parallel linear layers, so it is not flexible enough for new quantization algorithms. Supporting an algorithm that introduces non-linear quantized operators means editing model.py directly, and as the number of supported algorithms grows, the layer-construction logic in model.py becomes complex and hard to maintain (https://github.com/vllm-project/vllm/pull/1508/files#diff-48d2ca5476d5b776f6401436fcf015c5ce4dc1a23d2b78a09e08fb85acc3697cR83). A simplified sketch of the current pattern follows this list.
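To make the concern concrete, here is a minimal, hypothetical sketch of the current pattern. The class names and parameters are placeholders, not vLLM's actual API: the point is only that the model definition itself must branch on the quantization method when building its layers.

```python
from typing import Optional

import torch.nn as nn


class AWQLinear(nn.Module):
    """Placeholder for an algorithm-specific quantized linear layer.

    A real implementation would hold packed int weights, scales, zero
    points, and a custom forward() kernel.
    """

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features


class LlamaMLP(nn.Module):
    """The model definition itself decides which layer class to build."""

    def __init__(self, hidden: int, intermediate: int,
                 quant_method: Optional[str] = None):
        super().__init__()
        if quant_method is None:
            # Unquantized path: plain linear layers.
            self.gate_up_proj = nn.Linear(hidden, 2 * intermediate, bias=False)
            self.down_proj = nn.Linear(intermediate, hidden, bias=False)
        elif quant_method == "awq":
            # Quantized path: a different class with a different weight layout.
            self.gate_up_proj = AWQLinear(hidden, 2 * intermediate)
            self.down_proj = AWQLinear(intermediate, hidden)
        else:
            raise ValueError(f"Unsupported quantization method: {quant_method}")
        # Every newly supported model repeats this branching, and every new
        # algorithm adds more branches here and in load_weights().
```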
For reference, the following existing designs handle multi-model, multi-algorithm quantization more cleanly:
(https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/bitsandbytes.py#L127, https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/tensorrt_llm/models/quantized/quant.py#L177)
There are two advantages to this design:
1. New model support: adapting a new model for quantization only requires updating the load_weight part; the original model definition does not need to change.
2. New algorithm support: any special layer changes are confined to the quantization algorithm's own code, and no changes are required in model.py (see the sketch after this list).
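Below is a minimal sketch of the module-replacement approach used by the referenced transformers/TensorRT-LLM integrations: the float model is built normally, then quantization-aware layers are swapped in afterwards. The function name, the `quant_linear_cls` parameter, and the `skip_modules` default are illustrative assumptions, not an existing vLLM or transformers API.

```python
from typing import Optional, Set, Type

import torch.nn as nn


def replace_linear_with_quantized(model: nn.Module,
                                  quant_linear_cls: Type[nn.Module],
                                  skip_modules: Optional[Set[str]] = None) -> nn.Module:
    """Recursively swap every nn.Linear for an algorithm-specific layer.

    The model definition stays untouched; only this helper and the
    quantization algorithm's own layer class know about quantization.
    """
    skip_modules = skip_modules or {"lm_head"}
    for name, child in model.named_children():
        if name in skip_modules:
            continue
        if isinstance(child, nn.Linear):
            setattr(model, name,
                    quant_linear_cls(child.in_features, child.out_features))
        else:
            # Recurse into submodules (attention blocks, MLPs, ...).
            replace_linear_with_quantized(child, quant_linear_cls, skip_modules)
    return model
```

With this split, supporting a new model only means teaching its load_weights() how checkpoint tensors map onto the replaced layers, and supporting a new algorithm only means providing its own quantized layer class (plus any extra replacement rules), with no edits to model.py.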
Could vLLM's quantization design be upgraded along these lines?