Add initial support for GPTQ #1580
Conversation
@zhuohan123 The PR is ready for review. Please take a look!
@zhaoyang-star The kernels used in this PR are not optimized, so you will not see any speedup for now. We will optimize the quantized GEMM kernels in the next PR. @LimpidEarth The kernel we are using now only supports 4-bit quantization, but we can extend it in the next PR. BTW, it seems most of the GPTQ models found in the HF model hub use 4-bit quantization. Could you point us to an 8-bit GPTQ model?
Most new GPTQ models released by TheBloke now have an 8-bit version (-1g, 128g, 32g).
Thanks for the work! The changes in general LGTM. However, there are some places that will conflict with #1622, mainly because of the auxiliary variables like `g_idx` and `shifter`. What do you think about the merging plan? Should we merge this first or #1622 first? Whichever plan we go with, I can help with the merge.
logger.warning(f"{self.quantization} quantization is not fully " | ||
"optimized yet. The speed can be slower than " | ||
"non-quantized models.") |
Does this mean all quantization methods are not optimized yet?
Unfortunately, yes. For the three quantization methods we support, we are using the original authors' kernels, which can be further optimized. In particular, the SqueezeLLM and GPTQ kernels are slow when the batch size is greater than 1. As for AWQ, I think its kernel is much better than the other two, but it is still slow for large batch sizes and does not support bfloat16.
self.shifter = torch.tensor(
    [0, 4, 8, 12, 16, 20, 24, 28],
    device="cuda",
    dtype=torch.int32,
)
Is this `shifter` a must-have? The problem here is that we created a "non-parameter" tensor. We will need to modify the weight creation code in #1622 to make this creation work.
How difficult will it be to add support for such tensors? While this tensor is only used for the PyTorch-based GPTQ matmul implementation and will eventually become unused once we develop a more optimized kernel, such non-parameter buffers can be used for other quantization methods. I believe we should take this into account in the new design.
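For context, here is a minimal sketch of how a shift table like this is typically used to unpack GPTQ's packed 4-bit weights in pure PyTorch. The packing layout (eight consecutive weights per int32, along the row dimension) is an assumption for illustration, not necessarily the exact layout used in this PR.

```python
import torch

# Shift amounts for the eight 4-bit nibbles packed into one int32 word.
SHIFTER = torch.tensor([0, 4, 8, 12, 16, 20, 24, 28], dtype=torch.int32)

def unpack_int4(qweight: torch.Tensor) -> torch.Tensor:
    """Unpack int32-packed 4-bit values into integers in [0, 15].

    Assumes qweight has shape [rows, cols] with eight consecutive weights
    packed along the row dimension (illustrative layout only).
    """
    shifts = SHIFTER.to(qweight.device).view(1, -1, 1)
    nibbles = (qweight.unsqueeze(1) >> shifts) & 0xF  # [rows, 8, cols]
    return nibbles.reshape(-1, qweight.shape[-1])     # [rows * 8, cols]
```

Because the shift table is a fixed constant rather than a learned weight, it would naturally be registered as a buffer, which is exactly the "non-parameter" tensor case raised above.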
out_shape = x.shape[:-1] + (self.qweight.shape[-1], )
reshaped_x = x.reshape(-1, x.shape[-1])
num_tokens = x.shape[:-1].numel()
if num_tokens <= 32:
Why do we have an `if` here? Is the CUDA kernel slower when `num_tokens > 32`, or will the CUDA kernel not work at all?
Good point. Actually, the current GPTQ kernel is designed for batch size 1 and performs extremely poorly when the batch size is large, often taking 10+ minutes for the initial memory profiling. As a workaround, I implemented a simple PyTorch-based GPTQ matmul that is faster than the original kernel for large batch sizes. Still, both implementations are quite bad and probably much slower than optimized implementations like exllama.
Added a comment on this.
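As a rough illustration of the fallback path described above (not the PR's actual code), a PyTorch-only GPTQ matmul for large batches can dequantize the packed weights and then run a dense `torch.matmul`. The names `scales`, `zeros`, and `g_idx` and their shapes are assumptions based on the usual GPTQ checkpoint layout.

```python
import torch

def gptq_matmul_torch(x: torch.Tensor,       # [num_tokens, in_features], fp16
                      qweight: torch.Tensor,  # [in_features // 8, out_features], int32
                      scales: torch.Tensor,   # [num_groups, out_features], fp16
                      zeros: torch.Tensor,    # [num_groups, out_features], int32 (already unpacked)
                      g_idx: torch.Tensor     # [in_features], group index per input channel
                      ) -> torch.Tensor:
    """Dequantize the 4-bit weights and fall back to a dense matmul.

    Illustrative sketch of a PyTorch-only GPTQ matmul for large batch
    sizes, not this PR's actual implementation.
    """
    shifter = torch.tensor([0, 4, 8, 12, 16, 20, 24, 28],
                           dtype=torch.int32, device=qweight.device)
    # Unpack eight 4-bit weights from each int32 word: [in_features, out_features].
    w = (qweight.unsqueeze(1) >> shifter.view(1, -1, 1)) & 0xF
    w = w.reshape(-1, qweight.shape[-1])
    # Dequantize each input channel with its group's scale and zero point.
    idx = g_idx.long()
    w = (w - zeros[idx]).to(scales.dtype) * scales[idx]
    return torch.matmul(x, w)
```

The dense dequantization materializes the full fp16 weight matrix per call, which is why the batch-size-1 CUDA kernel is still preferred for small `num_tokens`.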
)
# Initialize g_idx to be sequential.
# This is required because old GPTQ models may not have g_idx.
start_idx = self.tp_rank * self.input_size_per_partition
Just leaving this as a note: this line will also require some modification in #1622.
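For illustration, a sequential `g_idx` for one tensor-parallel partition could be built as below. The names `group_size`, `tp_rank`, and `input_size_per_partition` mirror the names in the diff, but the exact formula is an assumption about how older checkpoints without `g_idx` are handled.

```python
import torch

def make_sequential_g_idx(tp_rank: int,
                          input_size_per_partition: int,
                          group_size: int) -> torch.Tensor:
    """Map each input channel of this partition to its quantization group
    in order, for old GPTQ checkpoints that do not ship a g_idx tensor."""
    start_idx = tp_rank * input_size_per_partition
    channels = torch.arange(start_idx, start_idx + input_size_per_partition)
    return (channels // group_size).to(torch.int32)

# Example: rank 1 of 2, 4096 channels per partition, group size 128
# -> channels 4096..8191 map to groups 32..63.
```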
@WoosukKwon Got it, and looking forward to the PR with 8-bit support! The main reason for wanting an 8-bit GPTQ model is that we found its evaluation results are better than the 4-bit model's on our domain tasks.
…inear logic and extend quantization support to all models (#1622)

Refactor the tensor parallelism, quantization, and weight-loading codes. Summary of the new features enabled by this PR:
- **All models** are able to be quantized with AWQ and SqueezeLLM, and [soon GPTQ](#1580).
- Model loading code became much simpler.
- Support model parallelism for all MQA/GQA models when the number of key/value heads is smaller than the tensor parallel size.
Closed as some of the changes are already merged and #916 will be merged instead.
This PR is a simplified version of the great PR #916.
The main difference is that this PR does not use the exllama kernels, while #916 does.
The purpose of this PR is to minimize the code changes in a single PR and avoid possible conflicts with @zhuohan123's ongoing refactoring effort.
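Once this lands, loading a GPTQ checkpoint would presumably mirror the existing AWQ flow. The snippet below is a hypothetical usage sketch: the model id and the `quantization="gptq"` value are assumptions by analogy with `quantization="awq"`, not something confirmed by this PR.

```python
from vllm import LLM, SamplingParams

# Hypothetical example: serve a 4-bit GPTQ checkpoint from the HF hub.
llm = LLM(model="TheBloke/Llama-2-7B-GPTQ",  # assumed model id
          quantization="gptq")               # assumed flag, by analogy with "awq"

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```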