TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models #1622
Conversation
@WoosukKwon This PR is ready for review!
Hi @zhuohan123, what's the expected date for this PR to get merged?
@zhuohan123 Thanks for the amazing work! I'm generally good with this change, except that some assumptions here will be violated by #1580.
def create_weights(self, input_size: int, output_size: int,
                   params_dtype: torch.dtype) -> Dict[str, torch.Tensor]:
    weight = Parameter(torch.empty(output_size,
                                   input_size,
                                   device=torch.cuda.current_device(),
                                   dtype=params_dtype),
                       requires_grad=False)
    set_weight_attrs(weight, {"input_dim": 1, "output_dim": 0})
    return {"weight": weight}
Don't we need to include the bias term here? While the weight-only quantization methods usually don't quantize the bias, other quantization methods may quantize the bias as well.
Let's leave this to a future PR. The reason I don't include the bias here is that `RowParallelLinear` applies `all_reduce` before adding the bias. Therefore, we cannot directly add the bias in the `apply_weights` function.
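To make the ordering constraint concrete, here is a minimal hypothetical sketch (not the actual `RowParallelLinear`; the class and attribute names are made up) of a row-parallel forward pass. Each rank computes a partial result from its weight shard, the partials are summed with `all_reduce`, and only then is the bias added; if `apply_weights` added the bias itself, it would effectively be added once per tensor-parallel rank.

```python
import torch
import torch.distributed as dist


class RowParallelLinearSketch(torch.nn.Module):
    """Illustrative only: shows why the bias is added after all_reduce."""

    def __init__(self, linear_method, weights: dict, bias: torch.Tensor = None):
        super().__init__()
        self.linear_method = linear_method  # object exposing apply_weights()
        self.weights = weights              # dict returned by create_weights()
        self.bias = bias                    # full (unsharded) bias, or None

    def forward(self, x_parallel: torch.Tensor) -> torch.Tensor:
        # Each rank multiplies by its own weight shard -> partial result.
        out = self.linear_method.apply_weights(self.weights, x_parallel)
        # Sum the partial results across tensor-parallel ranks.
        if dist.is_initialized() and dist.get_world_size() > 1:
            dist.all_reduce(out)
        # Add the bias exactly once, after the reduction. If apply_weights
        # added it, the bias would be summed world_size times.
        if self.bias is not None:
            out = out + self.bias
        return out
```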
return {"weight": weight} | ||
|
||
def apply_weights(self, | ||
weights: Dict[str, torch.Tensor], |
Can we include non-parameter tensors ("buffers" in PyTorch) here?
In this case, we need to separate `weights` and `buffers` in the `create_weights` function. Let's leave this to a future PR.
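For reference, a small hypothetical sketch (names invented for illustration, not part of this PR) of what separating parameters from buffers in a `create_weights`-style method could look like; the layer would then `register_parameter` the former and `register_buffer` the latter.

```python
from typing import Dict, Tuple

import torch
from torch.nn import Parameter


def create_weights_and_buffers(
    input_size: int,
    output_size: int,
    params_dtype: torch.dtype,
) -> Tuple[Dict[str, Parameter], Dict[str, torch.Tensor]]:
    # Parameters: tensors the weight loader fills from the checkpoint.
    weight = Parameter(torch.empty(output_size, input_size, dtype=params_dtype),
                       requires_grad=False)
    # Buffers: plain tensors, e.g. a precomputed lookup table used by a
    # quantization kernel, that should not be nn.Parameters.
    lookup_table = torch.empty(256, dtype=params_dtype)
    return {"weight": weight}, {"lookup_table": lookup_table}
```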
@WoosukKwon All comments fixed and the PR is ready to be merged!
Should be merged in a day or two!
LGTM! Please run some tests (if not exhaustive) before the merge! Thanks a million for the refactoring. Great work!
Refactor the tensor parallelism, quantization, and weight-loading codes.

Summary of the new features enabled by this PR:
- **All models** are able to be quantized with AWQ and SqueezeLLM, and [soon GPTQ](vllm-project#1580).
- Model loading code became much simpler.
- Model parallelism is supported for all MQA/GQA models, even when the number of key/value heads is smaller than the tensor parallel size (see the small layout sketch below).
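To make the MQA/GQA point concrete, here is a small hypothetical sketch (the function name and exact scheme are illustrative assumptions, not necessarily how the PR implements it) of one way KV heads can be laid out when the tensor parallel size exceeds the number of KV heads: each KV head is replicated across several ranks so that every rank still owns a KV head.

```python
def kv_head_layout(total_kv_heads: int, tp_size: int):
    """Illustrative only: (KV heads per rank, ranks sharing the same KV head)."""
    if tp_size >= total_kv_heads:
        # Fewer KV heads than ranks: replicate each KV head across ranks.
        assert tp_size % total_kv_heads == 0
        return 1, tp_size // total_kv_heads
    # More KV heads than ranks: shard the KV heads as usual.
    assert total_kv_heads % tp_size == 0
    return total_kv_heads // tp_size, 1


# Example: a GQA model with 4 KV heads on tensor parallel size 8:
# each rank holds 1 KV head, and each KV head is shared by 2 ranks.
print(kv_head_layout(4, 8))    # -> (1, 2)
print(kv_head_layout(32, 4))   # -> (8, 1)
```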
After this refactor:
- Quantization is implemented as "linear methods" in `vllm/model_executor/layers/quantized_linear/`. Every linear method is a subclass of the `LinearMethodBase` class. Each linear method needs to provide a method to create the weights and a method to apply the weights to an input tensor. Each weight also includes attributes (such as the `input_dim`/`output_dim` set via `set_weight_attrs` above) that drive weight loading. A simplified end-to-end sketch of this pattern follows below.
- The model-parallel linear layers are in `vllm/model_executor/layers/linear.py`. They take the linear method as an argument during initialization, create the weights using the linear method, and implement the weight-loading functions for those weights. The current model-parallel linear layers include:
  - `ReplicatedLinear`: fully replicated linear layer.
  - `ColumnParallelLinear`.
  - `RowParallelLinear`.
  - `MergedColumnParallelLinear`: a column-parallel linear layer that is a concatenation of multiple column-parallel linear layers. It follows the same forward logic as `ColumnParallelLinear`, but during weight loading each sub-weight matrix is sharded and loaded separately.
  - `QKVParallelLinear`: a special column-parallel linear layer for the QKV transformation. This class handles the weight loading for grouped-query attention, where the number of key/value heads differs from the number of query heads.
- The vocabulary-parallel embedding is in `vllm/model_executor/layers/vocab_parallel_embedding.py`. A new class `ParallelLMHead` is created for the output logits layer.

The code has been tested for
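As a rough, hedged illustration of the linear-method pattern described above (a simplified sketch only, not the actual vLLM classes; the sketch class names are made up, and the real implementations in `linear.py` and `quantized_linear/` handle sharding, bias, and more):

```python
from typing import Dict

import torch
import torch.nn.functional as F
from torch.nn import Parameter


class UnquantizedLinearMethodSketch:
    """Simplified stand-in for a LinearMethodBase subclass."""

    def create_weights(self, input_size: int, output_size: int,
                       params_dtype: torch.dtype) -> Dict[str, torch.Tensor]:
        # The method decides the weight layout; a quantized method would
        # instead allocate packed int tensors, scales, zero points, etc.
        weight = Parameter(torch.empty(output_size, input_size,
                                       dtype=params_dtype),
                           requires_grad=False)
        return {"weight": weight}

    def apply_weights(self, weights: Dict[str, torch.Tensor],
                      x: torch.Tensor) -> torch.Tensor:
        # Plain matmul; a quantized method would call its dequant/matmul kernel.
        return F.linear(x, weights["weight"])


class ReplicatedLinearSketch(torch.nn.Module):
    """Simplified stand-in for ReplicatedLinear: no sharding, no bias."""

    def __init__(self, input_size: int, output_size: int, linear_method):
        super().__init__()
        self.linear_method = linear_method
        # The layer owns the parameters but delegates their creation to the
        # linear method, so the same layer code works for any quantization.
        self.weights = linear_method.create_weights(input_size, output_size,
                                                    torch.get_default_dtype())
        for name, param in self.weights.items():
            self.register_parameter(name, param)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear_method.apply_weights(self.weights, x)


# The same layer works unchanged with any other linear method that
# implements create_weights/apply_weights.
layer = ReplicatedLinearSketch(16, 32, UnquantizedLinearMethodSketch())
out = layer(torch.randn(4, 16))  # shape: (4, 32)
```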
TODO:
- Pass `linear_method` to the model.
- Remove `input_is_parallel=True` and `gather_output=False`.
- Move `KVCache` to `attention.py`.