Commit: add llm evaluate for language modeling (#1350)

Signed-off-by: Xin He <[email protected]>
Signed-off-by: YIYANGCAI <[email protected]>
Signed-off-by: chensuyue <[email protected]>
Showing 41 changed files with 868 additions and 3,713 deletions.
175 changes: 175 additions & 0 deletions
examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm/README.md
Step-by-Step
============

This document provides step-by-step instructions for running large language models (LLMs) on the 4th Gen Intel® Xeon® Scalable Processor (code-named Sapphire Rapids) with PyTorch and Intel® Extension for PyTorch.

The script `run_clm_no_trainer.py` supports quantization of `GPTJ`, `OPT`, `LLaMA2`, `BLOOM`, and `Falcon` models and validates last-word-prediction accuracy with [lm_eval](https://github.com/EleutherAI/lm-evaluation-harness.git); more models are being added.
# Prerequisite

## 1. Create Environment

```bash
# Installation
pip install -r requirements.txt
```
# Run

Here is how to run the scripts:

**Causal Language Modeling (CLM)**

`run_clm_no_trainer.py` quantizes large language models using the [NeelNanda/pile-10k](https://huggingface.co/datasets/NeelNanda/pile-10k) dataset for calibration and validates accuracy on `lambada_openai`, `piqa`, `winogrande`, `hellaswag`, and other datasets provided by lm_eval. Example commands for each model follow, after a short sketch of the overall flow.
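
Before the per-model commands, here is a minimal sketch of the flow they drive: load a model, calibrate on pile-10k, quantize, and save. It assumes the Neural Compressor 2.x post-training API; the `build_calib_dataloader` helper and the exact recipe keys are illustrative assumptions, not code from `run_clm_no_trainer.py`.

```python
# Minimal sketch of the quantize-then-save flow, assuming the Neural Compressor
# 2.x post-training API; this is not the actual code of run_clm_no_trainer.py.
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer
from neural_compressor import PostTrainingQuantConfig, quantization

model_name = "EleutherAI/gpt-j-6B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def build_calib_dataloader(tokenizer, n_samples=128, seq_len=512):
    """Tokenize a small slice of NeelNanda/pile-10k for calibration (illustrative helper)."""
    ds = load_dataset("NeelNanda/pile-10k", split=f"train[:{n_samples}]")
    ids = [tokenizer(t, truncation=True, max_length=seq_len,
                     return_tensors="pt").input_ids[0] for t in ds["text"]]
    return DataLoader(ids, batch_size=1)

# Smooth quant with alpha=1.0, matching the GPT-J command below.
conf = PostTrainingQuantConfig(
    backend="ipex",  # lower through Intel Extension for PyTorch
    recipes={"smooth_quant": True, "smooth_quant_args": {"alpha": 1.0}},
)
q_model = quantization.fit(model, conf, calib_dataloader=build_calib_dataloader(tokenizer))
q_model.save("saved_results")
```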
### GPT-J-6b

#### Quantization

```bash
# "--sq" is used to enable smooth quant
# "--int8_bf16_mixed" is used to enable int8-bf16 mixed mode on platforms that natively support bf16
python run_clm_no_trainer.py \
--model EleutherAI/gpt-j-6B \
--quantize \
--sq \
--alpha 1.0 \
--output_dir "saved_results" \
--ipex
```

**Note**: Smooth quantization here is based on torch.jit. Because `example_inputs` does not include the past key values, the quantized model cannot be used for text generation. For the text-generation task, please refer to [this example](https://github.com/intel/intel-extension-for-transformers/tree/main/examples/huggingface/pytorch/text-generation/quantization). A short sketch of the limitation follows.
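
To make the note concrete, here is a small sketch (using an arbitrary small model for speed) of why a trace taken without past key values cannot serve generation: torch.jit specializes the graph to the example signature.

```python
# Sketch: a jit trace is specialized to its example inputs. Traced without
# past_key_values, the graph has no path for the cache generate() feeds back.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", torchscript=True).eval()
input_ids = torch.randint(0, 1000, (1, 32))  # input_ids only, no past_key_values
traced = torch.jit.trace(model, (input_ids,), strict=False)

outputs = traced(input_ids)  # fine: matches the traced signature
# traced(input_ids, past_key_values) would not work: the cache inputs were
# never traced, which is why this model cannot be used for text generation.
```

Next, the same model can be quantized with the weight-only approach: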
```bash
# "--approach weight_only" is used to enable weight-only quantization.
python run_clm_no_trainer.py \
--model EleutherAI/gpt-j-6B \
--quantize \
--approach weight_only \
--woq_bits 4 \
--woq_group_size 128 \
--woq_scheme asym \
--woq_algo RTN \
--woq_enable_mse_search \
--output_dir "saved_results"
```

**Note**: Weight-only quantization based on fake quantization is supported as a preview and covers the RTN, GPTQ [1], AWQ [2], and TEQ algorithms. For more details, please refer to [this document](https://github.com/intel/neural-compressor/blob/master/docs/source/quantization_weight_only.md).
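
As a rough picture of what these flags configure, here is a hedged sketch of a 4-bit RTN weight-only config in the Neural Compressor 2.x dictionary style described in the linked document; the exact keys may differ across versions.

```python
# Sketch: 4-bit asymmetric RTN weight-only quantization, assuming the
# Neural Compressor 2.x PostTrainingQuantConfig dictionary format.
from neural_compressor import PostTrainingQuantConfig, quantization

conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # apply to all matching ops
            "weight": {
                "bits": 4,           # --woq_bits 4
                "group_size": 128,   # --woq_group_size 128
                "scheme": "asym",    # --woq_scheme asym
                "algorithm": "RTN",  # --woq_algo RTN
            },
        },
    },
)
# RTN rounds weights to nearest, so no calibration dataloader is required.
q_model = quantization.fit(model, conf)  # `model` loaded as in the sketch above
q_model.save("saved_results")
```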
#### Accuracy with lm_eval

```bash
python run_clm_no_trainer.py \
--model EleutherAI/gpt-j-6B \
--accuracy \
--batch_size 112 \
--tasks "lambada_openai" \
--int8 \
--ipex \
--output_dir "saved_results" # load the int8 model
# To validate the FP32 model, remove "--int8" and "--output_dir".
```
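
The accuracy path wraps the lm-evaluation-harness. A minimal stand-alone equivalent for the FP32 baseline might look as follows, assuming the pinned harness revision exposes `evaluator.simple_evaluate` with these arguments:

```python
# Sketch: FP32 lambada_openai accuracy via lm-evaluation-harness directly
# (assumed simple_evaluate signature for the revision pinned in requirements.txt).
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=EleutherAI/gpt-j-6B",
    tasks=["lambada_openai"],
    batch_size=112,
)
print(results["results"]["lambada_openai"])  # accuracy and perplexity metrics
```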
### OPT-1.3b/2.7b/6.7b

#### Quantization

```bash
# "--sq" is used to enable smooth quant
# "--int8_bf16_mixed" is used to enable int8-bf16 mixed mode on platforms that natively support bf16
python run_clm_no_trainer.py \
--model facebook/opt-2.7b \
--quantize \
--sq \
--alpha 0.5 \
--ipex \
--output_dir "saved_results" \
--int8_bf16_mixed
```

#### Accuracy with lm_eval

```bash
python run_clm_no_trainer.py \
--model facebook/opt-2.7b \
--accuracy \
--batch_size 112 \
--tasks "lambada_openai" \
--int8 \
--ipex \
--output_dir "saved_results" # load the int8 model
# To validate the FP32 model, remove "--int8" and "--output_dir".
```
### LLaMA2-7b/13b/30b

> Note: LLaMA2 requires IPEX >= 2.1 to get better accuracy.

#### Quantization

```bash
# "--sq" is used to enable smooth quant
# "--int8_bf16_mixed" is used to enable int8-bf16 mixed mode on platforms that natively support bf16
python run_clm_no_trainer.py \
--model meta-llama/Llama-2-7b-hf \
--quantize \
--sq \
--alpha 0.8 \
--ipex \
--output_dir "saved_results" \
--int8_bf16_mixed
```

#### Accuracy with lm_eval

```bash
python run_clm_no_trainer.py \
--model meta-llama/Llama-2-7b-hf \
--accuracy \
--batch_size 112 \
--tasks "lambada_openai" \
--int8 \
--ipex \
--output_dir "saved_results" # load the int8 model
# To validate the FP32 model, remove "--int8" and "--output_dir".
```
### BLOOM

#### Quantization

```bash
# "--sq" is used to enable smooth quant
python run_clm_no_trainer.py \
--model bigscience/bloom-560m \
--quantize \
--ipex \
--sq \
--alpha 0.5 \
--output_dir "saved_results"
```

#### Accuracy with lm_eval

```bash
python run_clm_no_trainer.py \
--model bigscience/bloom-560m \
--accuracy \
--batch_size 112 \
--tasks "lambada_openai" \
--int8 \
--ipex \
--output_dir "saved_results" # load the int8 model
# To validate the FP32 model, remove "--int8" and "--output_dir".
```
### Falcon-7b

#### Quantization

```bash
# "--sq" is used to enable smooth quant
python run_clm_no_trainer.py \
--model tiiuae/falcon-7b-instruct \
--quantize \
--sq \
--alpha 0.5 \
--output_dir "saved_results"
```

#### Accuracy with lm_eval

```bash
python run_clm_no_trainer.py \
--model tiiuae/falcon-7b-instruct \
--accuracy \
--batch_size 112 \
--tasks "lambada_openai" \
--int8 \
--output_dir "saved_results" # load the int8 model
# To validate the FP32 model, remove "--int8" and "--output_dir".
```
[1]. Frantar, Elias, et al. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." arXiv preprint arXiv:2210.17323 (2022).
[2]. Lin, Ji, et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." arXiv preprint arXiv:2306.00978 (2023).
File renamed without changes.
13 changes: 13 additions & 0 deletions
examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm/requirements.txt
```
accelerate
protobuf
sentencepiece != 0.1.92
datasets >= 1.1.3
torch >= 1.10
transformers
pytest
wandb
einops
neural-compressor
intel-extension-for-transformers
git+https://github.com/EleutherAI/lm-evaluation-harness.git@83dbfbf6070324f3e5872f63e49d49ff7ef4c9b3
git+https://github.com/huggingface/peft.git@6c44096c7b8d55a2ecf24be9bc68393467e1584a
```
13 changes: 13 additions & 0 deletions
examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm/run-gptq-llm.sh
```bash
python examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm/run_clm_no_trainer.py \
--model facebook/opt-125m \
--dataset NeelNanda/pile-10k \
--seed 0 \
--quantize \
--approach weight_only \
--woq_algo GPTQ \
--woq_bits 4 \
--woq_group_size 128 \
--gptq_pad_max_length 2048 \
--gptq_use_max_length \
--gptq_gpu \
--gptq_debug
```
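
For reference, here is a hedged sketch of how these GPTQ flags might map onto a Neural Compressor weight-only config; the `gptq_args` recipe keys follow the 2.x documentation and are assumptions here, not the script's internals.

```python
# Sketch: GPTQ 4-bit weight-only config mirroring the shell flags above,
# assuming Neural Compressor 2.x recipe keys.
from neural_compressor import PostTrainingQuantConfig, quantization

conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={".*": {"weight": {"bits": 4,             # --woq_bits 4
                                    "group_size": 128,     # --woq_group_size 128
                                    "algorithm": "GPTQ"}}},# --woq_algo GPTQ
    recipes={"gptq_args": {"use_max_length": True,         # --gptq_use_max_length
                           "pad_max_length": 2048}},       # --gptq_pad_max_length 2048
)
# Unlike RTN, GPTQ needs calibration data (e.g. NeelNanda/pile-10k via --dataset);
# `model` and `calib_dataloader` are as in the earlier sketches.
q_model = quantization.fit(model, conf, calib_dataloader=calib_dataloader)
```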