SqueezeLLM Support #1326

Merged — 19 commits merged into vllm-project:main on Oct 22, 2023
Conversation

chooper1 (Contributor)
This PR adds support for the SqueezeLLM quantization method, which is described in the following preprint: https://arxiv.org/abs/2306.07629, and which has open-source GPU inference code and quantization code available at: https://github.com/SqueezeAILab/SqueezeLLM. SqueezeLLM is a post-training quantization framework that allows for high-accuracy and runtime-efficient quantization at low bit precision. SqueezeLLM leverages non-uniform quantization to better represent the underlying distribution by shifting the quantization signposts to the optimal positions. This PR contains the kernels and quantization configurations files in order to run the 4-bit dense-only non-uniform quantization scheme outlined in the preprint.
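For readers unfamiliar with non-uniform quantization, here is a minimal NumPy/scikit-learn sketch of the core idea: place the sixteen 4-bit "signposts" with k-means over the weight values, then dequantize by table lookup. The actual SqueezeLLM method uses sensitivity-weighted k-means and custom CUDA kernels; everything below (function names, plain k-means) is illustrative only, not code from this PR.

import numpy as np
from sklearn.cluster import KMeans

def quantize_nonuniform_4bit(weights: np.ndarray):
    """Cluster weight values into 16 centroids that serve as the lookup table."""
    flat = weights.reshape(-1, 1)
    km = KMeans(n_clusters=16, n_init=10, random_state=0).fit(flat)
    codes = km.labels_.astype(np.uint8).reshape(weights.shape)   # 4-bit indices per weight
    lut = km.cluster_centers_.reshape(-1).astype(weights.dtype)  # 16 non-uniform signposts
    return codes, lut

def dequantize(codes: np.ndarray, lut: np.ndarray) -> np.ndarray:
    """Recover approximate weights by indexing the lookup table."""
    return lut[codes]

w = np.random.randn(128, 256).astype(np.float32)
codes, lut = quantize_nonuniform_4bit(w)
w_hat = dequantize(codes, lut)
print("mean abs error:", np.abs(w - w_hat).mean())

Because the signposts follow the empirical weight distribution rather than a uniform grid, the same 4 bits spend more resolution where the weights actually concentrate, which is the accuracy argument made in the preprint.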

@WoosukKwon (Collaborator)
Hi @chooper1, thanks for submitting the PR! Before getting into review, could you check the code format? Please run the following and upstream the changes:

pip install -r requirements-dev.txt
./format.sh

@WoosukKwon WoosukKwon self-requested a review October 12, 2023 07:22
@casper-hansen (Contributor) commented Oct 12, 2023

This is super interesting work, especially after the release of the quantization code to produce newly quantized models.

I am curious whether Woosuk or the author could run benchmarks/benchmark_throughput.py to compare the throughput of FP16 versus SqueezeLLM.

EDIT: I am getting very low throughput at low batch sizes, around 14.35 tokens/s. Is this the expected performance?
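For a quick sanity check outside benchmarks/benchmark_throughput.py, a rough measurement can also be taken through the vLLM Python API. The sketch below is assumption-laden: the model path is a placeholder, and quantization="squeezellm" is the argument this PR appears to introduce; adjust both to whatever checkpoint you actually have.

import time
from vllm import LLM, SamplingParams

# Placeholder path; point this at a real SqueezeLLM-quantized checkpoint.
llm = LLM(model="path/to/llama-7b-squeezellm", quantization="squeezellm")
params = SamplingParams(temperature=0.8, max_tokens=128)
prompts = ["Explain non-uniform quantization in one paragraph."] * 64

start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start
gen_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{gen_tokens / elapsed:.2f} generated tokens/s")  # crude throughput estimate

Note that batch size matters a lot here: weight-only quantization kernels tend to help most when the workload is memory-bound, so small-batch numbers can look very different from large-batch ones.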

@WoosukKwon (Collaborator) left a comment

@chooper1 Sorry for the late review. The PR looks good to me! I've updated it with the latest main branch and modified QuantizationConfig as I found that the original interface of QuantizationConfig was overfitted to AWQ. Thanks again for the great work!

As a next step, I hope we can see more SqueezeLLM models, especially Mistral and Falcon. Also, please consider optimizing the matmul kernel.
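To make the refactor described above concrete, here is a sketch of what a method-agnostic QuantizationConfig interface might look like, with SqueezeLLM as one subclass. The method names and fields are illustrative assumptions, not the exact classes merged in this PR.

from abc import ABC, abstractmethod
from typing import Any, Dict, List

import torch

class QuantizationConfig(ABC):
    """Base class that each quantization method (AWQ, SqueezeLLM, ...) subclasses."""

    @abstractmethod
    def get_name(self) -> str:
        """Short identifier, e.g. 'squeezellm'."""

    @abstractmethod
    def get_supported_act_dtypes(self) -> List[torch.dtype]:
        """Activation dtypes the method's kernels accept."""

    @classmethod
    @abstractmethod
    def from_config(cls, config: Dict[str, Any]) -> "QuantizationConfig":
        """Build the config from the checkpoint's quantization metadata."""

class SqueezeLLMConfig(QuantizationConfig):
    def __init__(self, weight_bits: int = 4) -> None:
        self.weight_bits = weight_bits

    def get_name(self) -> str:
        return "squeezellm"

    def get_supported_act_dtypes(self) -> List[torch.dtype]:
        return [torch.half]

    @classmethod
    def from_config(cls, config: Dict[str, Any]) -> "SqueezeLLMConfig":
        return cls(weight_bits=config.get("wbits", 4))

Keeping the base class free of AWQ-specific notions (group size, zero points) is what lets a lookup-table method like SqueezeLLM plug in without special cases.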

@WoosukKwon WoosukKwon merged commit 1f24755 into vllm-project:main Oct 22, 2023
2 checks passed
skrider pushed a commit to skrider/vllm that referenced this pull request Oct 27, 2023
Co-authored-by: squeeze-ai-lab <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>
@gesanqiu (Contributor) commented Nov 8, 2023

Hi @casper-hansen, it seems there are some issues with SqueezeLLM-gradients. Have you produced SqueezeLLM gradients for Llama-2-13B? I modified _model.set_devices() to _model.cuda() and _model.num_linear_layers to 40, and still hit an OOM problem even with two A40 (48 GB) devices.
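One possible workaround (an assumption on my part, not taken from the SqueezeLLM-gradients code) is that _model.cuda() puts the whole FP16 13B model on a single GPU, so the second A40 never gets used. A sketch of sharding the model across both devices with Hugging Face accelerate's device_map="auto" instead:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    torch_dtype=torch.float16,
    device_map="auto",                     # shard layers across all visible GPUs
    max_memory={0: "44GiB", 1: "44GiB"},   # leave headroom on each 48GB A40
)
model.eval()
print(model.hf_device_map)  # inspect which layers landed on which GPU

Whether this drops into the gradient script cleanly depends on how that script moves tensors between devices, so treat it as a starting point rather than a fix.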

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
Co-authored-by: squeeze-ai-lab <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>
sjchoi1 pushed a commit to casys-kaist-internal/vllm that referenced this pull request May 7, 2024
Co-authored-by: squeeze-ai-lab <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>