Fix GPTQ/RTN 3.x example & fix asym quantize (#1611)
Signed-off-by: Kaihui-intel <[email protected]>
Signed-off-by: Tang, Kaihui <[email protected]>
Co-authored-by: chen, suyue <[email protected]>
Kaihui-intel and chensuyue authored Feb 21, 2024
1 parent 482f87c commit 813d930
Showing 6 changed files with 797 additions and 2 deletions.
@@ -0,0 +1,137 @@
Step-by-Step
============
This document gives step-by-step instructions for running large language models (LLMs) on 4th Gen Intel® Xeon® Scalable Processors (codenamed Sapphire Rapids) with PyTorch and Intel® Extension for PyTorch.

The script `run_clm_no_trainer.py` currently supports quantization of `GPTJ`, `OPT`, `LLaMA2`, `BLOOM` and `Falcon` models and validates last-word prediction accuracy with [lm_eval](https://github.com/EleutherAI/lm-evaluation-harness.git). Support for more models is being added.

# Prerequisite
## 1. Create Environment
```bash
# Installation
pip install -r requirements.txt
```

# Run

Here is how to run the scripts:

**Causal Language Modeling (CLM)**

`run_clm_no_trainer.py` quantizes large language models using the [NeelNanda/pile-10k](https://huggingface.co/datasets/NeelNanda/pile-10k) dataset for calibration and validates accuracy on `lambada_openai`, `piqa`, `winogrande`, `hellaswag` and other tasks provided by lm_eval. Example commands follow.
### GPT-J-6b

#### Quantization
```bash
# "--approach weight_only" is used to enable weight only quantization.
# "--woq_algo GPTQ" is used to enable GPTQ algorithms
# "--double_quant_type BNB_NF4" is used to enable double quant algorithms
python run_clm_no_trainer.py \
--model EleutherAI/gpt-j-6B \
--dataset NeelNanda/pile-10k \
--quantize \
--approach weight_only \
--woq_algo GPTQ \
--woq_bits 4 \
--woq_scheme asym \
--woq_group_size 128 \
--gptq_max_seq_length 2048 \
--gptq_use_max_length \
--accuracy \
--tasks "lambada_openai" \
--double_quant_type "BNB_NF4"

# "--woq_algo RTN" is used to enable RTN algorithms
python run_clm_no_trainer.py \
--model EleutherAI/gpt-j-6B \
--dataset NeelNanda/pile-10k \
--quantize \
--approach weight_only \
--woq_algo RTN \
--woq_bits 4 \
--woq_scheme asym \
--woq_group_size 128 \
--accuracy \
--tasks "lambada_openai" \
--double_quant_type "BNB_NF4"
```
**Notes**: Weight-only quantization based on fake quantization is supported as a preview feature and covers the RTN, GPTQ[1], AWQ[2] and TEQ algorithms. For more details, please refer to the [weight-only quantization documentation](https://github.com/intel/neural-compressor/blob/master/docs/source/quantization_weight_only.md). Our GPTQ API supports various CLMs, including GPT-J, OPT, BLOOM, LLaMA, Falcon, MPT and ChatGLM. To quantize a different CLM with GPTQ, simply replace the `--model` argument.
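
To make the `--woq_algo RTN`, `--woq_bits 4`, `--woq_scheme asym` and `--woq_group_size 128` options more concrete, here is a minimal NumPy sketch of round-to-nearest (RTN) asymmetric fake quantization with one scale and zero point per group of input weights. It is an illustration under stated assumptions, not the Neural Compressor implementation: the function name `rtn_asym_fake_quant` is hypothetical, the sketch assumes the input dimension is divisible by the group size, and it ignores double quantization.

```python
import numpy as np


def rtn_asym_fake_quant(weight, bits=4, group_size=128):
    """Hypothetical helper: asymmetric RTN fake quantization per weight group.

    Returns the dequantized ("fake quantized") weights and the integer codes.
    """
    _, in_features = weight.shape
    if in_features % group_size != 0:
        raise ValueError("this sketch assumes in_features is divisible by group_size")

    # Each row of `w` is one group of consecutive weights along the input dimension.
    w = weight.astype(np.float64).reshape(-1, group_size)
    qmax = 2 ** bits - 1  # 15 for 4-bit codes

    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)

    # Asymmetric scheme: map [w_min, w_max] onto [0, qmax] with a scale and zero point.
    scale = np.maximum(w_max - w_min, 1e-8) / qmax  # guard against constant groups
    zero_point = np.round(-w_min / scale)

    # Round to nearest, shift by the zero point, and clip to the code range.
    q = np.clip(np.round(w / scale) + zero_point, 0, qmax)

    # Fake quantization: immediately map the codes back to floating point.
    dequant = (q - zero_point) * scale
    return dequant.reshape(weight.shape), q.reshape(weight.shape).astype(np.uint8)


# Example on random data (not real model weights).
rng = np.random.default_rng(0)
weight = rng.standard_normal((8, 256))
dequant, codes = rtn_asym_fake_quant(weight)
max_error = np.abs(dequant - weight).max()
```

Because each group gets its own minimum and maximum, the asymmetric scheme can represent groups whose values are not centered on zero without wasting codes. A symmetric scheme would instead use a single scale per group and no zero point.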


### OPT-125m

#### Quantization

```bash
# "--approach weight_only" is used to enable weight only quantization.
# "--woq_algo GPTQ" is used to enable GPTQ algorithms
# "--double_quant_type BNB_NF4" is used to enable double quant algorithms
python run_clm_no_trainer.py \
--model facebook/opt-125m \
--dataset NeelNanda/pile-10k \
--quantize \
--approach weight_only \
--woq_algo GPTQ \
--woq_bits 4 \
--woq_scheme asym \
--woq_group_size 128 \
--gptq_max_seq_length 2048 \
--gptq_use_max_length \
--accuracy \
--tasks "lambada_openai" \
--double_quant_type "BNB_NF4"

# "--woq_algo RTN" is used to enable RTN algorithms
python run_clm_no_trainer.py \
--model facebook/opt-125m \
--dataset NeelNanda/pile-10k \
--quantize \
--approach weight_only \
--woq_algo RTN \
--woq_bits 4 \
--woq_scheme asym \
--woq_group_size 128 \
--accuracy \
--tasks "lambada_openai" \
--double_quant_type "BNB_NF4"
```

### LLAMA2-7b/13b/30b
#### Quantization

```bash
# "--approach weight_only" is used to enable weight only quantization.
# "--double_quant_type BNB_NF4" is used to enable double quant algorithms
# "--woq_algo GPTQ" is used to enable GPTQ algorithms
python run_clm_no_trainer.py \
--model meta-llama/Llama-2-7b-hf \
--dataset NeelNanda/pile-10k \
--quantize \
--approach weight_only \
--woq_algo GPTQ \
--woq_bits 4 \
--woq_scheme asym \
--woq_group_size 128 \
--gptq_max_seq_length 2048 \
--gptq_use_max_length \
--accuracy \
--tasks "lambada_openai" \
--double_quant_type "BNB_NF4"

# "--woq_algo RTN" is used to enable RTN algorithms
python run_clm_no_trainer.py \
--model meta-llama/Llama-2-7b-hf \
--dataset NeelNanda/pile-10k \
--quantize \
--approach weight_only \
--woq_algo RTN \
--woq_bits 4 \
--woq_scheme asym \
--woq_group_size 128 \
--accuracy \
--tasks "lambada_openai" \
--double_quant_type "BNB_NF4"
```
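
The commands above differ only in the `--model` and `--woq_algo` values, plus the two GPTQ-specific flags. As a convenience, the following sketch builds the same argument lists from Python. The helper name `build_woq_command` and the `MODELS` list are hypothetical, and only flags that appear in the commands above are used.

```python
import shlex


def build_woq_command(model, algo="GPTQ", tasks="lambada_openai"):
    """Hypothetical helper: argument list for one weight-only quantization run."""
    cmd = [
        "python", "run_clm_no_trainer.py",
        "--model", model,
        "--dataset", "NeelNanda/pile-10k",
        "--quantize",
        "--approach", "weight_only",
        "--woq_algo", algo,
        "--woq_bits", "4",
        "--woq_scheme", "asym",
        "--woq_group_size", "128",
    ]
    if algo == "GPTQ":
        # These two flags appear only in the GPTQ commands above.
        cmd += ["--gptq_max_seq_length", "2048", "--gptq_use_max_length"]
    cmd += ["--accuracy", "--tasks", tasks, "--double_quant_type", "BNB_NF4"]
    return cmd


MODELS = ["EleutherAI/gpt-j-6B", "facebook/opt-125m", "meta-llama/Llama-2-7b-hf"]

for model in MODELS:
    for algo in ("GPTQ", "RTN"):
        # Print the shell command; pass the list to subprocess.run(..., check=True)
        # instead to actually launch the run.
        print(shlex.join(build_woq_command(model, algo)))
```

Passing the list form to `subprocess.run` avoids shell quoting issues, while `shlex.join` gives a copy-pasteable command line for inspection.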


[1]. Frantar, Elias, et al. "GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers." arXiv preprint arXiv:2210.17323 (2023).
[2]. Lin, Ji, et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." arXiv preprint arXiv:2306.00978 (2023).
@@ -0,0 +1,13 @@
accelerate
protobuf
sentencepiece != 0.1.92
datasets >= 1.1.3
torch >= 1.10
transformers
pytest
wandb
einops
neural-compressor
intel-extension-for-transformers
git+https://github.com/EleutherAI/lm-evaluation-harness.git@cc9778fbe4fa1a709be2abed9deb6180fd40e7e2
peft