TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models #1622
Conversation
@WoosukKwon This PR is ready for review!
Hi @zhuohan123, what's the expected date for this PR to get merged?
@zhuohan123 Thanks for the amazing work! I'm generally good with this change, except that some assumptions here will be violated by #1580.
def create_weights(self, input_size: int, output_size: int,
                   params_dtype: torch.dtype) -> Dict[str, torch.Tensor]:
    weight = Parameter(torch.empty(output_size,
                                   input_size,
                                   device=torch.cuda.current_device(),
                                   dtype=params_dtype),
                       requires_grad=False)
    set_weight_attrs(weight, {"input_dim": 1, "output_dim": 0})
    return {"weight": weight}
Don't we need to include the bias term here? While the weight-only quantization methods usually don't quantize the bias, other quantization methods may quantize the bias as well.
Let's leave this to a future PR. The reason I don't include the bias here is that `RowParallelLinear` applies `all_reduce` before adding the bias. Therefore, we cannot directly add the bias in the `apply_weights` function.
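To make the ordering constraint concrete, here is a minimal hypothetical sketch (not the actual `RowParallelLinear`; the class and attribute names are made up) of a row-parallel forward pass. Each rank computes a partial result from its weight shard, the partials are summed with `all_reduce`, and only then is the bias added; if `apply_weights` added the bias itself, it would effectively be added once per tensor-parallel rank.

```python
import torch
import torch.distributed as dist


class RowParallelLinearSketch(torch.nn.Module):
    """Illustrative only: shows why the bias is added after all_reduce."""

    def __init__(self, linear_method, weights: dict, bias: torch.Tensor = None):
        super().__init__()
        self.linear_method = linear_method  # object exposing apply_weights()
        self.weights = weights              # dict returned by create_weights()
        self.bias = bias                    # full (unsharded) bias, or None

    def forward(self, x_parallel: torch.Tensor) -> torch.Tensor:
        # Each rank multiplies by its own weight shard -> partial result.
        out = self.linear_method.apply_weights(self.weights, x_parallel)
        # Sum the partial results across tensor-parallel ranks.
        if dist.is_initialized() and dist.get_world_size() > 1:
            dist.all_reduce(out)
        # Add the bias exactly once, after the reduction. If apply_weights
        # added it, the bias would be summed world_size times.
        if self.bias is not None:
            out = out + self.bias
        return out
```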
return {"weight": weight} | ||
|
||
def apply_weights(self, | ||
weights: Dict[str, torch.Tensor], |
Can we include non-parameter tensors ("buffers" in PyTorch) here?
In this case, we need to separate `weights` and `buffers` in the `create_weights` function. Let's leave this to a future PR.
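For reference, a small hypothetical sketch (names invented for illustration, not part of this PR) of what separating parameters from buffers in a `create_weights`-style method could look like; the layer would then `register_parameter` the former and `register_buffer` the latter.

```python
from typing import Dict, Tuple

import torch
from torch.nn import Parameter


def create_weights_and_buffers(
    input_size: int,
    output_size: int,
    params_dtype: torch.dtype,
) -> Tuple[Dict[str, Parameter], Dict[str, torch.Tensor]]:
    # Parameters: tensors the weight loader fills from the checkpoint.
    weight = Parameter(torch.empty(output_size, input_size, dtype=params_dtype),
                       requires_grad=False)
    # Buffers: plain tensors, e.g. a precomputed lookup table used by a
    # quantization kernel, that should not be nn.Parameters.
    lookup_table = torch.empty(256, dtype=params_dtype)
    return {"weight": weight}, {"lookup_table": lookup_table}
```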
@WoosukKwon All comments fixed and the PR is ready to be merged!
Should be merged in a day or two!
LGTM! Please run some tests (if not exhaustive) before the merge! Thanks a million for the refactoring. Great work!
Refactor the tensor parallelism, quantization, and weight-loading codes.

Summary of the new features enabled by this PR:
- **All models** are able to be quantized with AWQ and SqueezeLLM, and [soon GPTQ](vllm-project#1580).
- Model loading code became much simpler.
- Model parallelism is supported for all MQA/GQA models, even when the number of key/value heads is smaller than the tensor parallel size (see the small layout sketch below).
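To make the MQA/GQA point concrete, here is a small hypothetical sketch (the function name and exact scheme are illustrative assumptions, not necessarily how the PR implements it) of one way KV heads can be laid out when the tensor parallel size exceeds the number of KV heads: each KV head is replicated across several ranks so that every rank still owns a KV head.

```python
def kv_head_layout(total_kv_heads: int, tp_size: int):
    """Illustrative only: (KV heads per rank, ranks sharing the same KV head)."""
    if tp_size >= total_kv_heads:
        # Fewer KV heads than ranks: replicate each KV head across ranks.
        assert tp_size % total_kv_heads == 0
        return 1, tp_size // total_kv_heads
    # More KV heads than ranks: shard the KV heads as usual.
    assert total_kv_heads % tp_size == 0
    return total_kv_heads // tp_size, 1


# Example: a GQA model with 4 KV heads on tensor parallel size 8:
# each rank holds 1 KV head, and each KV head is shared by 2 ranks.
print(kv_head_layout(4, 8))    # -> (1, 2)
print(kv_head_layout(32, 4))   # -> (8, 1)
```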
After this refactor:
- Quantization is implemented as "linear methods" in `vllm/model_executor/layers/quantized_linear/`. Every linear method is a subclass of the `LinearMethodBase` class. Each linear method needs to provide a method to create the weights and a method to apply the weights to an input tensor. Each weight also includes attributes (such as the `input_dim`/`output_dim` set via `set_weight_attrs` above) that drive weight loading. A simplified end-to-end sketch of this pattern follows below.
- The model-parallel linear layers are in `vllm/model_executor/layers/linear.py`. They take the linear method as an argument during initialization, create the weights using the linear method, and implement the weight-loading functions for those weights. The current model-parallel linear layers include:
  - `ReplicatedLinear`: fully replicated linear layer.
  - `ColumnParallelLinear`.
  - `RowParallelLinear`.
  - `MergedColumnParallelLinear`: a column-parallel linear layer that is a concatenation of multiple column-parallel linear layers. It follows the same forward logic as `ColumnParallelLinear`, but during weight loading each sub-weight matrix is sharded and loaded separately.
  - `QKVParallelLinear`: a special column-parallel linear layer for the QKV transformation. This class handles the weight loading for grouped-query attention, where the number of key/value heads differs from the number of query heads.
- The vocabulary-parallel embedding is in `vllm/model_executor/layers/vocab_parallel_embedding.py`. A new class `ParallelLMHead` is created for the output logits layer.

The code has been tested for
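As a rough, hedged illustration of the linear-method pattern described above (a simplified sketch only, not the actual vLLM classes; the sketch class names are made up, and the real implementations in `linear.py` and `quantized_linear/` handle sharding, bias, and more):

```python
from typing import Dict

import torch
import torch.nn.functional as F
from torch.nn import Parameter


class UnquantizedLinearMethodSketch:
    """Simplified stand-in for a LinearMethodBase subclass."""

    def create_weights(self, input_size: int, output_size: int,
                       params_dtype: torch.dtype) -> Dict[str, torch.Tensor]:
        # The method decides the weight layout; a quantized method would
        # instead allocate packed int tensors, scales, zero points, etc.
        weight = Parameter(torch.empty(output_size, input_size,
                                       dtype=params_dtype),
                           requires_grad=False)
        return {"weight": weight}

    def apply_weights(self, weights: Dict[str, torch.Tensor],
                      x: torch.Tensor) -> torch.Tensor:
        # Plain matmul; a quantized method would call its dequant/matmul kernel.
        return F.linear(x, weights["weight"])


class ReplicatedLinearSketch(torch.nn.Module):
    """Simplified stand-in for ReplicatedLinear: no sharding, no bias."""

    def __init__(self, input_size: int, output_size: int, linear_method):
        super().__init__()
        self.linear_method = linear_method
        # The layer owns the parameters but delegates their creation to the
        # linear method, so the same layer code works for any quantization.
        self.weights = linear_method.create_weights(input_size, output_size,
                                                    torch.get_default_dtype())
        for name, param in self.weights.items():
            self.register_parameter(name, param)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear_method.apply_weights(self.weights, x)


# The same layer works unchanged with any other linear method that
# implements create_weights/apply_weights.
layer = ReplicatedLinearSketch(16, 32, UnquantizedLinearMethodSketch())
out = layer(torch.randn(4, 16))  # shape: (4, 32)
```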
TODO:
- Pass `linear_method` to the model.
- Remove `input_is_parallel=True` and `gather_output=False`.
- Move `KVCache` to `attention.py`.