add llm evaluate for language modeling (#1350)
Signed-off-by: Xin He <[email protected]>
Signed-off-by: YIYANGCAI <[email protected]>
Signed-off-by: chensuyue <[email protected]>
xin3he authored Dec 6, 2023
1 parent 0a06448 commit 789779b
Showing 41 changed files with 868 additions and 3,713 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -21,7 +21,7 @@ In particular, the tool provides the key features, typical examples, and open co

* Support a wide range of Intel hardware such as [Intel Xeon Scalable Processors](https://www.intel.com/content/www/us/en/products/details/processors/xeon/scalable.html), [Intel Xeon CPU Max Series](https://www.intel.com/content/www/us/en/products/details/processors/xeon/max-series.html), [Intel Data Center GPU Flex Series](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/flex-series.html), and [Intel Data Center GPU Max Series](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/max-series.html) with extensive testing; support AMD CPU, ARM CPU, and NVidia GPU through ONNX Runtime with limited testing

* Validate popular LLMs such as LLama2, [LLama](examples/onnxrt/nlp/huggingface_model/text_generation/llama/quantization/ptq_static), [MPT](https://github.com/intel/intel-extension-for-transformers/blob/main/examples/huggingface/pytorch/text-generation/quantization/README.md), [Falcon](https://github.com/intel/intel-extension-for-transformers/blob/main/examples/huggingface/pytorch/language-modeling/quantization/README.md), [GPT-J](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_static/fx), [Bloom](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant), [OPT](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant), and more than 10,000 broad models such as [Stable Diffusion](/examples/pytorch/nlp/huggingface_models/text-to-image/quantization), [BERT-Large](/examples/pytorch/nlp/huggingface_models/text-classification/quantization/ptq_static/fx), and [ResNet50](/examples/pytorch/image_recognition/torchvision_models/quantization/ptq/cpu/fx) from popular model hubs such as [Hugging Face](https://huggingface.co/), [Torch Vision](https://pytorch.org/vision/stable/index.html), and [ONNX Model Zoo](https://github.com/onnx/models#models), by leveraging zero-code optimization solution [Neural Coder](/neural_coder#what-do-we-offer) and automatic [accuracy-driven](/docs/source/design.md#workflow) quantization strategies
* Validate popular LLMs such as [LLama2](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm), [Falcon](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm), [GPT-J](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm), [Bloom](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm), [OPT](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm), and more than 10,000 broad models such as [Stable Diffusion](/examples/pytorch/nlp/huggingface_models/text-to-image/quantization), [BERT-Large](/examples/pytorch/nlp/huggingface_models/text-classification/quantization/ptq_static/fx), and [ResNet50](/examples/pytorch/image_recognition/torchvision_models/quantization/ptq/cpu/fx) from popular model hubs such as [Hugging Face](https://huggingface.co/), [Torch Vision](https://pytorch.org/vision/stable/index.html), and [ONNX Model Zoo](https://github.com/onnx/models#models), by leveraging zero-code optimization solution [Neural Coder](/neural_coder#what-do-we-offer) and automatic [accuracy-driven](/docs/source/design.md#workflow) quantization strategies

* Collaborate with cloud marketplaces such as [Google Cloud Platform](https://console.cloud.google.com/marketplace/product/bitnami-launchpad/inc-tensorflow-intel?project=verdant-sensor-286207), [Amazon Web Services](https://aws.amazon.com/marketplace/pp/prodview-yjyh2xmggbmga#pdp-support), and [Azure](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/bitnami.inc-tensorflow-intel), software platforms such as [Alibaba Cloud](https://www.intel.com/content/www/us/en/developer/articles/technical/quantize-ai-by-oneapi-analytics-on-alibaba-cloud.html), [Tencent TACO](https://new.qq.com/rain/a/20221202A00B9S00) and [Microsoft Olive](https://github.com/microsoft/Olive), and open AI ecosystem such as [Hugging Face](https://huggingface.co/blog/intel), [PyTorch](https://pytorch.org/tutorials/recipes/intel_neural_compressor_for_pytorch.html), [ONNX](https://github.com/onnx/models#models), [ONNX Runtime](https://github.com/microsoft/onnxruntime), and [Lightning AI](https://github.com/Lightning-AI/lightning/blob/master/docs/source-pytorch/advanced/post_training_quantization.rst)

2 changes: 1 addition & 1 deletion docs/source/smooth_quant.md
@@ -373,7 +373,7 @@ A list of models that achieved a <1% accuracy drop is shown below.
Please note that for models marked with an asterisk (*), we set all add ops to FP32 during the quantization step to achieve desirable results.
## Example

User could refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant/README.md) on how to use smooth quant.
Users can refer to the [examples](https://github.com/intel/neural-compressor/blob/master/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm) to learn how to use smooth quant.

```python
recipes = {
    ...
```
112 changes: 70 additions & 42 deletions examples/.config/model_params_pytorch.json
@@ -450,20 +450,83 @@
"main_script": "run_clm.py",
"batch_size": 8
},
"gpt_j_wikitext_weight_only":{
"model_src_dir": "nlp/huggingface_models/language-modeling/quantization/ptq_weight_only",
"dataset_location": "",
"input_model": "/tf_dataset2/models/pytorch/gpt-j-6B",
"main_script": "run_clm.py",
"batch_size": 8
},
"gpt_neox":{
"model_src_dir": "nlp/huggingface_models/language-modeling/quantization/ptq_static/fx",
"dataset_location": "/tf_dataset/pytorch/glue_data_new/oscar",
"input_model": "/tf_dataset2/models/huggingface/gpt-neox-japanese-2.7b",
"main_script": "run_clm.py",
"batch_size": 8
},
"opt_125m_woq_awq":{
"model_src_dir": "nlp/huggingface_models/language-modeling/quantization/llm",
"dataset_location": "",
"input_model": "",
"main_script": "run_clm_no_trainer.py",
"batch_size": 8
},
"opt_125m_woq_gptq":{
"model_src_dir": "nlp/huggingface_models/language-modeling/quantization/llm",
"dataset_location": "",
"input_model": "",
"main_script": "run_clm_no_trainer.py",
"batch_size": 8
},
"opt_125m_woq_teq":{
"model_src_dir": "nlp/huggingface_models/language-modeling/quantization/llm",
"dataset_location": "",
"input_model": "",
"main_script": "run_clm_no_trainer.py",
"batch_size": 8
},
"opt_125m_ipex":{
"model_src_dir": "nlp/huggingface_models/language-modeling/quantization/llm",
"dataset_location": "",
"input_model": "",
"main_script": "run_clm_no_trainer.py",
"batch_size": 8
},
"opt_125m_ipex_sq":{
"model_src_dir": "nlp/huggingface_models/language-modeling/quantization/llm",
"dataset_location": "",
"input_model": "",
"main_script": "run_clm_no_trainer.py",
"batch_size": 8
},
"bloom_560m_ipex_sq": {
"model_src_dir": "nlp/huggingface_models/language-modeling/quantization/llm",
"dataset_location": "",
"input_model": "bigscience/bloom-560m",
"batch_size": 1,
"main_script": "run_clm_no_trainer.py"
},
"llama2_7b_ipex_sq":{
"model_src_dir": "nlp/huggingface_models/language-modeling/quantization/llm",
"dataset_location": "",
"input_model": "",
"main_script": "run_clm_no_trainer.py",
"batch_size": 1
},
"gpt_j_ipex_sq":{
"model_src_dir": "nlp/huggingface_models/language-modeling/quantization/llm",
"dataset_location": "",
"input_model": "",
"main_script": "run_clm_no_trainer.py",
"batch_size": 1
},
"gpt_j_woq_rtn":{
"model_src_dir": "nlp/huggingface_models/language-modeling/quantization/llm",
"dataset_location": "",
"input_model": "",
"main_script": "run_clm_no_trainer.py",
"batch_size": 1
},
"falcon_7b_sq":{
"model_src_dir": "nlp/huggingface_models/language-modeling/quantization/llm",
"dataset_location": "",
"input_model": "",
"main_script": "run_clm_no_trainer.py",
"batch_size": 1
},
"xlm-roberta-base_MRPC": {
"model_src_dir": "nlp/huggingface_models/text-classification/quantization/ptq_static/fx",
"dataset_location": "",
@@ -583,41 +646,6 @@
"main_script": "run_glue.py",
"batch_size": 64
},
"bloom-560m_sq": {
"model_src_dir": "nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant",
"dataset_location": "",
"input_model": "bigscience/bloom-560m",
"batch_size": 1,
"main_script": "eval_lambada.py"
},
"bloom-176b_sq": {
"model_src_dir": "nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant",
"dataset_location": "",
"input_model": "bigscience/bloom",
"batch_size": 1,
"main_script": "eval_lambada.py"
},
"opt-125m_sq": {
"model_src_dir": "nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant",
"dataset_location": "",
"input_model": "facebook/opt-125m",
"batch_size": 1,
"main_script": "eval_lambada.py"
},
"opt-6.7b_sq": {
"model_src_dir": "nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant",
"dataset_location": "",
"input_model": "facebook/opt-6.7b",
"batch_size": 1,
"main_script": "eval_lambada.py"
},
"gpt-j-6B_sq": {
"model_src_dir": "nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant",
"dataset_location": "",
"input_model": "EleutherAI/gpt-j-6B",
"batch_size": 1,
"main_script": "eval_lambada.py"
},
"wide_resnet101_2_fx": {
"model_src_dir": "oob_models/gen-efficientnet-pytorch",
"dataset_location": "/tf_dataset/pytorch/ImageNet/raw",
8 changes: 4 additions & 4 deletions examples/README.md
@@ -664,13 +664,13 @@ Intel® Neural Compressor validated examples with multiple compression technique
<td>EleutherAI/gpt-j-6B</td>
<td>Natural Language Processing</td>
<td>Post-Training Static Quantization</td>
<td><a href="./pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_static/fx">fx</a> / <a href="./pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant">smooth quant</a></td>
<td><a href="./pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_static/fx">fx</a> / <a href="./pytorch/nlp/huggingface_models/language-modeling/quantization/llm">smooth quant</a></td>
</tr>
<tr>
<td>EleutherAI/gpt-j-6B</td>
<td>Natural Language Processing</td>
<td>Post-Training Weight Only Quantization</td>
<td><a href="./pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_weight_only">weight_only</a></td>
<td><a href="./pytorch/nlp/huggingface_models/language-modeling/quantization/llm">weight_only</a></td>
</tr>
<tr>
<td>abeja/gpt-neox-japanese-2.7b</td>
@@ -682,13 +682,13 @@ Intel® Neural Compressor validated examples with multiple compression technique
<td>bigscience/bloom</td>
<td>Natural Language Processing</td>
<td>Post-Training Static Quantization</td>
<td><a href="./pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant">smooth quant</a></td>
<td><a href="./pytorch/nlp/huggingface_models/language-modeling/quantization/llm">smooth quant</a></td>
</tr>
<tr>
<td>facebook/opt</td>
<td>Natural Language Processing</td>
<td>Post-Training Static Quantization</td>
<td><a href="./pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant">smooth quant</a></td>
<td><a href="./pytorch/nlp/huggingface_models/language-modeling/quantization/llm">smooth quant</a></td>
</tr>
<tr>
<td>SD Diffusion</td>
175 changes: 175 additions & 0 deletions examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm/README.md
@@ -0,0 +1,175 @@
Step-by-Step
============
This document provides step-by-step instructions for running large language models (LLMs) on the 4th Gen Intel® Xeon® Scalable Processor (codenamed Sapphire Rapids) with PyTorch and Intel® Extension for PyTorch.

The script `run_clm_no_trainer.py` currently supports quantization of `GPTJ`, `OPT`, `LLaMA2`, `BLOOM`, and `Falcon` models and validates last-word-prediction accuracy with [lm_eval](https://github.com/EleutherAI/lm-evaluation-harness.git); more models are being added.

# Prerequisite
## 1. Create Environment
```bash
# Installation
pip install -r requirements.txt
```

# Run

Here is how to run the scripts:

**Causal Language Modeling (CLM)**

`run_clm_no_trainer.py` quantizes large language models using the [NeelNanda/pile-10k](https://huggingface.co/datasets/NeelNanda/pile-10k) dataset for calibration and validates accuracy on `lambada_openai`, `piqa`, `winogrande`, `hellaswag`, and other tasks provided by lm_eval. Example commands are shown below.
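As a rough illustration of the calibration side, here is a minimal sketch of turning the pile-10k dataset into a calibration dataloader. The tokenizer, sequence length, and batch size are illustrative assumptions, not the script's exact settings.

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
calib_set = load_dataset("NeelNanda/pile-10k", split="train")

def tokenize(example):
    # Truncate each sample to a fixed length; 512 tokens is an assumption.
    return tokenizer(example["text"], truncation=True, max_length=512)

calib_set = calib_set.map(tokenize, remove_columns=calib_set.column_names)
calib_set.set_format(type="torch", columns=["input_ids", "attention_mask"])
# batch_size=1 sidesteps padding concerns for variable-length samples.
calib_dataloader = DataLoader(calib_set, batch_size=1)
```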
### GPT-J-6b

#### Quantization
```bash
# "--sq" is used to enable smooth quant
# "--int8_bf16_mixed" is used to enable int8-bf16 mixed mode for platform that natively supports bf16
python run_clm_no_trainer.py \
--model EleutherAI/gpt-j-6B \
--quantize \
--sq \
--alpha 1.0 \
--output_dir "saved_results" \
--ipex
```

**Notes**: Smooth quantization here is based on `torch.jit`. Without past key values in `example_inputs`, the quantized model cannot be used for text generation. For text-generation tasks, please refer to this [link](https://github.com/intel/intel-extension-for-transformers/tree/main/examples/huggingface/pytorch/text-generation/quantization)
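For orientation, here is a minimal sketch of how a smooth-quant recipe is passed through Neural Compressor's post-training API. It assumes a float `model` and the `calib_dataloader` from the sketch above are already prepared; the alpha value mirrors the command above, but this is not the script verbatim.

```python
from neural_compressor import PostTrainingQuantConfig, quantization

conf = PostTrainingQuantConfig(
    backend="ipex",  # use Intel Extension for PyTorch, as "--ipex" does
    recipes={"smooth_quant": True, "smooth_quant_args": {"alpha": 1.0}},
)
q_model = quantization.fit(
    model,                              # assumed: an FP32 causal-LM model
    conf,
    calib_dataloader=calib_dataloader,  # assumed: a calibration dataloader
)
q_model.save("saved_results")
```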

```bash
# "--approach weight_only" is used to enable weight only quantization.
python run_clm_no_trainer.py \
--model EleutherAI/gpt-j-6B \
--quantize \
--approach weight_only \
--woq_bits 4 \
--woq_group_size 128 \
--woq_scheme asym \
--woq_algo RTN \
--woq_enable_mse_search \
--output_dir "saved_results"
```
**Notes**: Weight-only quantization based on fake quantization is supported as a preview feature and covers the RTN, GPTQ [1], AWQ [2], and TEQ algorithms. For more details, please refer to this [link](https://github.com/intel/neural-compressor/blob/master/docs/source/quantization_weight_only.md)
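As a companion to the command above, here is a minimal sketch of the corresponding weight-only configuration in Neural Compressor. The `op_type_dict` layout follows the weight-only quantization documentation linked above; the `".*"` pattern, the `model`, and the saved path are illustrative assumptions.

```python
from neural_compressor import PostTrainingQuantConfig, quantization

conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # apply to every matching op
            "weight": {
                "bits": 4,           # mirrors "--woq_bits 4"
                "group_size": 128,   # mirrors "--woq_group_size 128"
                "scheme": "asym",    # mirrors "--woq_scheme asym"
                "algorithm": "RTN",  # mirrors "--woq_algo RTN"
            },
        },
    },
)
# RTN needs no calibration data; algorithms such as AWQ or TEQ do.
q_model = quantization.fit(model, conf)
q_model.save("saved_results")
```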


#### Accuracy with lm_eval
```bash
# INT8 Accuracy
python run_clm_no_trainer.py \
--model EleutherAI/gpt-j-6B \
--accuracy \
--batch_size 112 \
--tasks "lambada_openai"\
--int8 \
--ipex \
--output_dir "saved_results" # load int8 model
# to validate FP32 model, please remove "--int8" and "--output_dir".
```
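Under the hood, this last-word-prediction check corresponds to a plain lm_eval harness call. Below is a minimal sketch against the pinned harness revision in `requirements.txt`, whose 0.3-style API exposes `evaluator.simple_evaluate`; loading the quantized model requires the script's own loading path, so the FP32 checkpoint is used here for illustration.

```python
from lm_eval import evaluator

# Evaluate the FP32 checkpoint directly from the Hub; the task name
# matches "--tasks lambada_openai" above.
results = evaluator.simple_evaluate(
    model="hf-causal",                           # Hugging Face causal-LM adapter
    model_args="pretrained=EleutherAI/gpt-j-6B",
    tasks=["lambada_openai"],
    batch_size=112,
)
print(results["results"]["lambada_openai"])      # accuracy and perplexity
```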
### OPT-1.3b/2.7b/6.7b

#### Quantization

```bash
# "--sq" is used to enable smooth quant
# "--int8_bf16_mixed" is used to enable int8-bf16 mixed mode for platform that natively supports bf16
python run_clm_no_trainer.py \
--model facebook/opt-2.7b \
--quantize \
--sq \
--alpha 0.5 \
--ipex \
--output_dir "saved_results" \
--int8_bf16_mixed
```

#### Accuracy with lm_eval
```bash
python run_clm_no_trainer.py \
--model facebook/opt-2.7b \
--accuracy \
--batch_size 112 \
--tasks "lambada_openai" \
--int8 \
--ipex \
--output_dir "saved_results" # load int8 model
# to validate FP32 model, please remove "--int8" and "--output_dir".
```
### LLaMA2-7b/13b/70b
>Note: LLaMA requires IPEX >= 2.1 to achieve better accuracy.
#### Quantization

```bash
# "--sq" is used to enable smooth quant
# "--int8_bf16_mixed" is used to enable int8-bf16 mixed mode for platform that natively supports bf16
python run_clm_no_trainer.py \
--model meta-llama/Llama-2-7b-hf \
--quantize \
--sq \
--alpha 0.8 \
--ipex \
--output_dir "saved_results" \
--int8_bf16_mixed
```

#### Accuracy with lm_eval
```bash
python run_clm_no_trainer.py \
--model meta-llama/Llama-2-7b-hf \
--accuracy \
--batch_size 112 \
--tasks "lambada_openai" \
--int8 \
--ipex \
--output_dir "saved_results" # load int8 model
# to validate FP32 model, please remove "--int8" and "--output_dir".
```

### BLOOM
#### Quantization
```bash
# "--sq" is used to enable smooth quant
python run_clm_no_trainer.py \
--model bigscience/bloom-560m \
--quantize \
--ipex \
--sq \
--alpha 0.5 \
--output_dir "saved_results"
```
#### Accuracy with lm_eval
```bash
python run_clm_no_trainer.py \
--model bigscience/bloom-560m \
--accuracy \
--batch_size 112 \
--tasks "lambada_openai" \
--int8 \
--ipex \
--output_dir "saved_results" # load int8 model
# to validate FP32 model, please remove "--int8" and "--output_dir".
```

### Falcon-7b
#### Quantization
```bash
# "--sq" is used to enable smooth quant
python run_clm_no_trainer.py \
--model tiiuae/falcon-7b-instruct \
--quantize \
--sq \
--alpha 0.5 \
--output_dir "saved_results"
```
#### Accuracy with lm_eval
```bash
python run_clm_no_trainer.py \
--model tiiuae/falcon-7b-instruct \
--accuracy \
--batch_size 112 \
--tasks "lambada_openai" \
--int8 \
--output_dir "saved_results" # load int8 model
# to validate FP32 model, please remove "--int8" and "--output_dir".
```


[1] Frantar, Elias, et al. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." arXiv preprint arXiv:2210.17323 (2022).
[2] Lin, Ji, et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." arXiv preprint arXiv:2306.00978 (2023).
13 changes: 13 additions & 0 deletions examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm/requirements.txt
@@ -0,0 +1,13 @@
accelerate
protobuf
sentencepiece != 0.1.92
datasets >= 1.1.3
torch >= 1.10
transformers
pytest
wandb
einops
neural-compressor
intel-extension-for-transformers
git+https://github.com/EleutherAI/lm-evaluation-harness.git@83dbfbf6070324f3e5872f63e49d49ff7ef4c9b3
git+https://github.com/huggingface/peft.git@6c44096c7b8d55a2ecf24be9bc68393467e1584a
@@ -0,0 +1,13 @@
python examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm/run_clm_no_trainer.py \
--model facebook/opt-125m \
--dataset NeelNanda/pile-10k \
--seed 0 \
--quantize \
--approach weight_only \
--woq_algo GPTQ \
--woq_bits 4 \
--woq_group_size 128 \
--gptq_pad_max_length 2048 \
--gptq_use_max_length \
--gptq_gpu \
--gptq_debug