
TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models #1622

Merged: 53 commits into main on Nov 16, 2023

Conversation

@zhuohan123 (Member) commented on Nov 10, 2023:

This PR refactors the tensor-parallelism, quantization, and weight-loading code.

Summary of the new features enabled by this PR:

  • All models can now be quantized with AWQ and SqueezeLLM; GPTQ support is coming soon (#1580).
  • The model-loading code becomes much simpler.
  • Model parallelism is now supported for all MQA/GQA models, even when the number of key/value heads is smaller than the tensor-parallel size (see the sketch below).
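
For illustration, here is a minimal sketch of the head-distribution arithmetic this feature relies on; the function name and layout are hypothetical assumptions, not vLLM's actual implementation.

```python
# Hypothetical sketch: how many KV heads each tensor-parallel rank holds, and
# how many ranks share (replicate) the same KV head when there are fewer KV
# heads than TP ranks.
def kv_heads_per_rank(total_num_kv_heads: int, tp_size: int) -> tuple:
    if tp_size >= total_num_kv_heads:
        # Fewer KV heads than ranks: replicate each KV head across several
        # ranks so every rank still holds one KV head.
        assert tp_size % total_num_kv_heads == 0
        return 1, tp_size // total_num_kv_heads
    # More KV heads than ranks: shard the KV heads evenly.
    assert total_num_kv_heads % tp_size == 0
    return total_num_kv_heads // tp_size, 1

print(kv_heads_per_rank(8, 16))  # -> (1, 2): each KV head lives on 2 ranks
print(kv_heads_per_rank(32, 4))  # -> (8, 1): each rank holds 8 distinct heads
```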

After this refactor:

  • The weight-only quantization configs and their corresponding linear implementations are moved to vllm/model_executor/layers/quantized_linear/.
  • The quantized linear implementations inherit from the LinearMethodBase class. Each linear method provides a method to create the weights and a method to apply the weights to an input tensor. Each created weight also carries extra attributes (e.g., which dimensions are sharded) that the weight-loading code relies on (see the sketch after this list).
  • The model-parallel linear layers are moved to vllm/model_executor/layers/linear.py. They take the linear method as an argument during initialization, create their weights through it, and implement the weight-loading functions for those weights. The current model-parallel linear layers are:
    • ReplicatedLinear: fully replicated linear layer.
    • ColumnParallelLinear.
    • RowParallelLinear.
    • [New] MergedColumnParallelLinear: a column-parallel linear layer that is a concatenation of multiple column-parallel linear layers. It follows the same forward logic as ColumnParallelLinear, but during weight loading each sub-weight matrix is sharded and loaded separately.
    • [New] QKVParallelLinear: a special column-parallel linear layer for the QKV transformation. This class handles the weight loading for grouped-query attention, where the number of key/value heads differs from the number of query heads.
  • The embedding layers and the output-logit layer are moved to vllm/model_executor/layers/vocab_parallel_embedding.py. A new class, ParallelLMHead, is created for the output-logit layer.
  • For each model, the weight-loading function is drastically simplified. Please refer to the new loading code of each model for details.
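
As a rough illustration of the structure described above, here is a simplified sketch of the linear-method interface and how a (non-parallel) linear layer would use it. This is not vLLM's exact API; the class bodies are trimmed and the signatures are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torch.nn import Module, Parameter
from typing import Dict, Optional


class LinearMethodBase:
    """Interface sketch: a linear method creates the weights and applies them."""

    def create_weights(self, input_size: int, output_size: int,
                       params_dtype: torch.dtype) -> Dict[str, torch.Tensor]:
        raise NotImplementedError

    def apply_weights(self, weights: Dict[str, torch.Tensor], x: torch.Tensor,
                      bias: Optional[torch.Tensor] = None) -> torch.Tensor:
        raise NotImplementedError


class UnquantizedLinearMethod(LinearMethodBase):
    """Plain GEMM; a quantized method would instead create packed weights,
    scales, etc., and dequantize inside apply_weights."""

    def create_weights(self, input_size, output_size, params_dtype):
        weight = Parameter(torch.empty(output_size, input_size,
                                       dtype=params_dtype),
                           requires_grad=False)
        return {"weight": weight}

    def apply_weights(self, weights, x, bias=None):
        return F.linear(x, weights["weight"], bias)


class ReplicatedLinear(Module):
    """The layer owns the parameters created by the linear method (and, in the
    real code, the weight-loading logic) and delegates the math to it."""

    def __init__(self, input_size: int, output_size: int,
                 linear_method: LinearMethodBase,
                 params_dtype: torch.dtype = torch.float16):
        super().__init__()
        self.linear_method = linear_method
        self.weights = linear_method.create_weights(input_size, output_size,
                                                    params_dtype)
        for name, weight in self.weights.items():
            self.register_parameter(name, weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear_method.apply_weights(self.weights, x)
```

A quantized method would return several tensors from create_weights (roughly: packed weights plus scales and zero points) with sharding attributes attached, and the parallel linear layers would work with it unchanged.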

The code has been tested with the following configurations:

  • LLaMA x (full precision, AWQ, SqueezeLLM) x (with TP, without TP)
  • Mistral x (full precision, AWQ) x (with TP, without TP)
  • All other models x (with TP, without TP)

TODO:

  • Clean up the code
    • Only feed linear_method to the model
    • Remove input_is_parallel=True and gather_output=False
    • Maybe in another PR: Move KVCache to attention.py
  • Add code comments
  • Merge with the main branch
  • Fix the error in [BugFix] get_num_kv_heads #1640

@zhuohan123 (Member, Author) commented:

@WoosukKwon This PR is ready for review!

@void-main commented:

Hi @zhuohan123, when is this PR expected to be merged?

@WoosukKwon (Collaborator) left a comment:

@zhuohan123 Thanks for the amazing work! I'm generally good with this change, except that some assumptions here will be violated by #1580.

Resolved review threads on:
  • vllm/model_executor/layers/quantized_linear/base_config.py
  • vllm/model_executor/model_loader.py
  • vllm/model_executor/weight_utils.py
  • vllm/model_executor/layers/quantized_linear/__init__.py
  • vllm/model_executor/utils.py (2 threads)
  • vllm/model_executor/layers/linear.py (2 threads)
Comment on lines +49 to +57
```python
def create_weights(self, input_size: int, output_size: int,
                   params_dtype: torch.dtype) -> Dict[str, torch.Tensor]:
    # Allocate the unquantized weight on the current GPU.
    weight = Parameter(torch.empty(output_size,
                                   input_size,
                                   device=torch.cuda.current_device(),
                                   dtype=params_dtype),
                       requires_grad=False)
    # Mark which dimensions are sharded (input dim 1, output dim 0) so the
    # weight loader knows how to split checkpoint tensors across TP ranks.
    set_weight_attrs(weight, {"input_dim": 1, "output_dim": 0})
    return {"weight": weight}
```
Collaborator:

Don't we need to include the bias term here? While the weight-only quantization methods usually don't quantize the bias, other quantization methods may quantize the bias as well.

@zhuohan123 (Member, Author):

Let's leave this for a future PR. The reason I don't include the bias here is that RowParallelLinear applies all_reduce before adding the bias, so we cannot add the bias directly in the apply_weights function.
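
For context, a simplified sketch of the ordering issue (hypothetical helper, not vLLM's code): each rank computes a partial matmul over its input shard, the partials are summed with an all_reduce, and the bias is added once afterwards. If apply_weights added the bias on every rank, it would effectively be added tp_size times.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def row_parallel_forward(x_shard: torch.Tensor,
                         weight_shard: torch.Tensor,
                         bias: torch.Tensor,
                         tp_group=None) -> torch.Tensor:
    # Partial output from this rank's shard of the input dimension.
    partial = F.linear(x_shard, weight_shard)
    # Sum the partial outputs across tensor-parallel ranks.
    dist.all_reduce(partial, group=tp_group)
    # Add the bias exactly once, after the reduction; adding it inside
    # apply_weights (before the all_reduce) would sum it tp_size times.
    return partial + bias
```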

Resolved review thread on vllm/model_executor/layers/linear.py.
return {"weight": weight}

def apply_weights(self,
weights: Dict[str, torch.Tensor],
Collaborator:

Can we include non-parameter tensors ("buffers" in PyTorch) here?

@zhuohan123 (Member, Author):

In that case, we would need to separate weights and buffers in the create_weights function. Let's leave this for a future PR.
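
One hypothetical shape for that future change (purely illustrative, not part of this PR): create_weights could return parameters and buffers separately so the owning layer can register each kind appropriately.

```python
import torch
from torch.nn import Parameter
from typing import Dict, Tuple

# Hypothetical extension: return (parameters, buffers) so the layer can call
# register_parameter for loadable weights and register_buffer for
# non-parameter state (e.g., lookup tables used by some quantization schemes).
def create_weights(input_size: int, output_size: int,
                   params_dtype: torch.dtype
                   ) -> Tuple[Dict[str, Parameter], Dict[str, torch.Tensor]]:
    weight = Parameter(torch.empty(output_size, input_size,
                                   dtype=params_dtype),
                       requires_grad=False)
    params = {"weight": weight}
    buffers: Dict[str, torch.Tensor] = {}
    return params, buffers
```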

@zhuohan123 (Member, Author) commented:

@WoosukKwon All comments fixed and the PR is ready to be merged!

@zhuohan123 (Member, Author) replied:

> Hi @zhuohan123, when is this PR expected to be merged?

Should be merged in a day or two!

@WoosukKwon (Collaborator) left a comment:

LGTM! Please run some tests (if not exhaustive) before the merge! Thanks a million for the refactoring. Great work!

@zhuohan123 merged commit 7076fa1 into main on Nov 16, 2023. 2 checks passed.
@zhuohan123 deleted the refactor-quantization branch on November 28, 2023.
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request on Feb 13, 2024.
sjchoi1 pushed a commit to casys-kaist-internal/vllm that referenced this pull request on May 7, 2024.