Merge branch 'master' into xinhe/xpu
xin3he authored Oct 31, 2023
2 parents 5fa1ae4 + 9036dce commit 1b87325
Showing 32 changed files with 1,237 additions and 141 deletions.
2 changes: 1 addition & 1 deletion .azure-pipelines/scripts/models/env_setup.sh
@@ -78,7 +78,7 @@ if [[ "${inc_new_api}" == "false" ]]; then
fi

cd ${model_src_dir}
-pip install ruamel_yaml
+pip install ruamel.yaml==0.17.40
pip install psutil
pip install protobuf==4.23.4
if [[ "${framework}" == "tensorflow" ]]; then
40 changes: 34 additions & 6 deletions docs/source/quantization_weight_only.md
@@ -129,6 +129,36 @@ torch.save(compressed_model.state_dict(), "compressed_model.pt")

The `saved_results` folder contains two files, `best_model.pt` and `qconfig.json`; the generated `q_model` is a fake-quantized model.
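
For illustration, the saved artifacts can be restored later with the `load` helper from `neural_compressor.utils.pytorch` (a minimal sketch; `fp32_model` is assumed to be the original float model that was quantized):

```python
from neural_compressor.utils.pytorch import load

# Rebuild the weight-only quantized model from the saved_results folder;
# fp32_model is assumed to be the original float model.
q_model = load("saved_results", fp32_model, weight_only=True)
```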


### **WOQ algorithms tuning**

To find the best algorithm, users can simply omit the algorithm setting. Compared with specifying a particular algorithm, this tuning process traverses a set of pre-defined WOQ configurations and identifies the one that yields the best result. For detailed usage, please refer to the [tuning strategy](./tuning_strategies.md#Basic).

> **Note:** Currently, this behavior is specific to the `ONNX Runtime` backend.

**Pre-defined configurations**

| WOQ configuration | Setting |
|:-----------------:|:-------:|
| RTN_G32ASYM | {"algorithm": "RTN", "group_size": 32, "scheme": "asym"} |
| GPTQ_G32ASYM | {"algorithm": "GPTQ", "group_size": 32, "scheme": "asym"} |
| GPTQ_G32ASYM_DISABLE_LAST_MATMUL | {"algorithm": "GPTQ", "group_size": 32, "scheme": "asym"} <br> & disable last MatMul |
| GPTQ_G128ASYM | {"algorithm": "GPTQ", "group_size": 128, "scheme": "asym"} |
| AWQ_G32ASYM | {"algorithm": "AWQ", "group_size": 32, "scheme": "asym"} |

**User code example**

```python
conf = PostTrainingQuantConfig(
    approach="weight_only",
    quant_level="auto",  # quant_level supports "auto" or 1 for WOQ config tuning
)
q_model = quantization.fit(model, conf, eval_func=eval_func, calib_dataloader=dataloader)
q_model.save("saved_results")
```

Refer to this [link](../../examples/onnxrt/nlp/huggingface_model/text_generation/llama/quantization/weight_only) for an example of WOQ algorithm tuning on ONNX Llama models.

## Layer Wise Quantization

Large language models (LLMs) have shown exceptional performance across various tasks, but their substantial parameter size poses significant challenges for deployment. Layer-wise quantization (LWQ) greatly reduces the memory footprint required to quantize LLMs, usually by 80-90%, which means users can quantize LLMs on a single node using a GPU or CPU, even on memory-constrained devices, making quantization of huge LLMs possible.
@@ -143,22 +173,19 @@ Large language models (LLMs) have shown exceptional performance across various t
|:--------------:|:----------:|
| RTN | &#10004; |
| AWQ | &#10005; |
-| GPTQ | &#10005; |
+| GPTQ | &#10004; |
| TEQ | &#10005; |

### Example
```python
from neural_compressor import PostTrainingQuantConfig, quantization
-from neural_compressor.adaptor.torch_utils.layer_wise_quant import load_shell
+from neural_compressor.adaptor.torch_utils.layer_wise_quant import load_empty_model
from neural_compressor.utils.pytorch import load  # needed below for load(); added so the example runs end to end

-fp32_model = load_shell(model_name_or_path, AutoModelForCausalLM, torchscript=True)
+fp32_model = load_empty_model(model_name_or_path, torchscript=True)
conf = PostTrainingQuantConfig(
    approach="weight_only",
    recipes={
        "layer_wise_quant": True,
-        "layer_wise_quant_args": {
-            "model_path": "facebook/opt-125m",
-        },
        "rtn_args": {"enable_full_range": True},
    },
)
@@ -171,6 +198,7 @@ q_model = quantization.fit(
)
output_dir = "./saved_model"
q_model.save(output_dir)
+q_model = load(output_dir, fp32_model, weight_only=True, layer_wise=True)
```

## Reference
2 changes: 2 additions & 0 deletions docs/source/tuning_strategies.md
@@ -181,6 +181,8 @@ flowchart TD
> For [smooth quantization](./smooth_quant.md), users can tune the smooth quantization alpha by providing a list of scalars for the `alpha` item. The tuning process will take place at the **start stage** of the tuning procedure. For detailed usage, please refer to the [smooth quantization example](./smooth_quant.md#Example).
> For [weight-only quantization](./quantization_weight_only.md), users can tune the weight-only algorithms from the available [pre-defined configurations](./quantization_weight_only.md#woq-algorithms-tuning). The tuning process will take place at the **start stage** of the tuning procedure, preceding the smooth quantization alpha tuning. For detailed usage, please refer to the [weight-only quantization example](./quantization_weight_only.md#woq-algorithms-tuning).
*Please note that this behavior is specific to the `ONNX Runtime` backend.*
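
A minimal sketch of the alpha-list usage described above, assuming the `recipes` interface of `PostTrainingQuantConfig` (the candidate values are illustrative):

```python
from neural_compressor import PostTrainingQuantConfig

# Passing a list of scalars for alpha (instead of a single value) lets the
# tuning procedure search over the candidates at its start stage.
conf = PostTrainingQuantConfig(
    recipes={
        "smooth_quant": True,
        "smooth_quant_args": {"alpha": [0.5, 0.6, 0.7]},  # illustrative candidates
    }
)
```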

**1.** Default quantization

7 changes: 7 additions & 0 deletions examples/.config/model_params_onnxrt.json
@@ -322,6 +322,13 @@
"main_script": "main.py",
"batch_size": 1
},
"beit": {
"model_src_dir": "image_recognition/beit/quantization/ptq_static",
"dataset_location": "/tf_dataset/pytorch/ImageNet/raw",
"input_model": "/tf_dataset2/models/onnx/beit/beit_base_patch16_224_pt22k_ft22kto1k.onnx",
"main_script": "main.py",
"batch_size": 1
},
"mobilebert_squad_mlperf_qdq": {
"model_src_dir": "nlp/onnx_model_zoo/mobilebert/quantization/ptq_static",
"dataset_location": "/tf_dataset2/datasets/squad",
6 changes: 6 additions & 0 deletions examples/README.md
@@ -1133,6 +1133,12 @@ Intel® Neural Compressor validated examples with multiple compression technique
<td>Post-Training Static Quantization</td>
<td><a href="./onnxrt/body_analysis/onnx_model_zoo/arcface/quantization/ptq_static">qlinearops</a></td>
</tr>
<tr>
<td>BEiT</td>
<td>Image Recognition</td>
<td>Post-Training Static Quantization</td>
<td><a href="./onnxrt/image_recognition/beit/quantization/ptq_static">qlinearops</a></td>
</tr>
<tr>
<td>CodeBert</td>
<td>Natural Language Processing</td>
@@ -47,13 +47,14 @@
"outputs": [],
"source": [
"# install neural-compressor from source\n",
"import sys\n",
"!git clone https://github.com/intel/neural-compressor.git\n",
"%cd ./neural-compressor\n",
"!pip install -r requirements.txt\n",
"!python setup.py install\n",
"!{sys.executable} -m pip install -r requirements.txt\n",
"!{sys.executable} setup.py install\n",
"%cd ..\n",
"# or install stable basic version from pypi\n",
"# pip install neural-compressor"
"# pip install neural-compressor\n"
]
},
{
@@ -65,10 +66,8 @@
},
"outputs": [],
"source": [
"# install onnx related packages\n",
"!pip install onnx onnxruntime onnxruntime-extensions\n",
"# install other packages used in this notebook.\n",
"!pip install torch transformers accelerate coloredlogs sympy numpy sentencepiece protobuf optimum"
"# install required packages\n",
"!{sys.executable} install -r requirements.txt\n"
]
},
{
@@ -168,7 +167,7 @@
"source": [
"!export GLUE_DIR=./glue_data\n",
"!wget https://raw.githubusercontent.com/Shimao-Zhang/Download_GLUE_Data/master/download_glue_data.py\n",
"!python download_glue_data.py --data_dir=GLUE_DIR --tasks=SST"
"!{sys.executable} download_glue_data.py --data_dir=GLUE_DIR --tasks=SST\n"
]
},
{
@@ -193,7 +192,7 @@
"int8_model_path = \"onnx-model/int8-model.onnx\"\n",
"data_path = \"./GLUE_DIR/SST-2\"\n",
"task = \"sst-2\"\n",
"batch_size = 8"
"batch_size = 8\n"
]
},
{
@@ -343,7 +342,7 @@
" label=label\n",
" )\n",
" features.append(feats)\n",
" return features"
" return features\n"
]
},
{
@@ -377,7 +376,7 @@
" model_name_or_path=model_name_or_path,\n",
" model_type=\"distilbert\",\n",
" task=task)\n",
"dataloader = DataLoader(framework=\"onnxruntime\", dataset=dataset, batch_size=batch_size)"
"dataloader = DataLoader(framework=\"onnxruntime\", dataset=dataset, batch_size=batch_size)\n"
]
},
{
@@ -448,7 +447,7 @@
" elif output_mode == \"regression\":\n",
" processed_preds = np.squeeze(self.pred_list)\n",
" result = transformers.glue_compute_metrics(self.task, processed_preds, self.label_list)\n",
" return result[self.return_key[self.task]]"
" return result[self.return_key[self.task]]\n"
]
},
{
@@ -486,7 +485,7 @@
" ort_inputs.update({inputs_names[i]: inputs[i]})\n",
" predictions = session.run(None, ort_inputs)\n",
" metric.update(predictions[0], labels)\n",
" return metric.result()"
" return metric.result()\n"
]
},
{
@@ -567,7 +566,7 @@
" num_heads=num_heads,\n",
" hidden_size=hidden_size,\n",
" optimization_options=opt_options)\n",
"model = model_optimizer.model"
"model = model_optimizer.model\n"
]
},
{
@@ -722,7 +721,7 @@
" config,\n",
" eval_func=eval_func,\n",
" calib_dataloader=dataloader)\n",
"q_model.save(int8_model_path)"
"q_model.save(int8_model_path)\n"
]
},
{
12 changes: 12 additions & 0 deletions examples/notebook/onnxruntime/requirements.txt
@@ -0,0 +1,12 @@
onnx
onnxruntime
onnxruntime-extensions
torch
transformers
accelerate
coloredlogs
sympy
numpy
sentencepiece
protobuf
optimum
@@ -45,14 +45,15 @@
"outputs": [],
"source": [
"# install neural-compressor from source\n",
"import sys\n",
"!git clone https://github.com/intel/neural-compressor.git\n",
"%cd ./neural-compressor\n",
"!pip install -r requirements.txt\n",
"!python setup.py install\n",
"!{sys.executable} -m pip install -r requirements.txt\n",
"!{sys.executable} setup.py install\n",
"%cd ..\n",
"\n",
"# or install stable basic version from pypi\n",
"!pip install neural-compressor"
"!{sys.executable} -m pip install neural-compressor\n"
]
},
{
@@ -62,7 +63,7 @@
"outputs": [],
"source": [
"# install other packages used in this notebook.\n",
"!pip install torch>=1.9.0 transformers>=4.16.0 accelerate sympy numpy sentencepiece!=0.1.92 protobuf<=3.20.3 datasets>=1.1.3 scipy scikit-learn Keras-Preprocessing"
"!{sys.executable} -m pip install -r requirements.txt\n"
]
},
{
@@ -303,10 +304,10 @@
"outputs": [],
"source": [
"# fp32 benchmark\n",
"!python benchmark.py --input_model ./pytorch_model.bin 2>&1|tee fp32_benchmark.log\n",
"!{sys.executable} benchmark.py --input_model ./pytorch_model.bin 2>&1|tee fp32_benchmark.log\n",
"\n",
"# int8 benchmark\n",
"!python benchmark.py --input_model ./saved_results/best_model.pt 2>&1|tee int8_benchmark.log\n"
"!{sys.executable} benchmark.py --input_model ./saved_results/best_model.pt 2>&1|tee int8_benchmark.log\n"
]
}
],
11 changes: 11 additions & 0 deletions examples/notebook/pytorch/requirements.txt
@@ -0,0 +1,11 @@
torch>=1.9.0
transformers>=4.16.0
accelerate
sympy
numpy
sentencepiece!=0.1.92
protobuf<=3.20.3
datasets>=1.1.3
scipy
scikit-learn
Keras-Preprocessing
8 changes: 8 additions & 0 deletions examples/notebook/tensorflow/resnet/requirements.txt
@@ -0,0 +1,8 @@
numpy
neural-compressor
tensorflow
datasets
requests
urllib3
pyOpenSSL
git+https://github.com/huggingface/huggingface_hub
33 changes: 17 additions & 16 deletions examples/notebook/tensorflow/resnet/resnet_quantization.ipynb
@@ -29,12 +29,11 @@
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"!conda install python==3.10 -y\n",
"!pip install neural-compressor\n",
"!wget -nc https://storage.googleapis.com/intel-optimized-tensorflow/models/v1_6/resnet50_fp32_pretrained_model.pb\n",
"!pip install tensorflow\n",
"!pip install datasets\n",
"!pip install git+https://github.com/huggingface/huggingface_hub"
"!{sys.executable} -m pip install -r requirements.txt \n",
"\n",
"!wget -nc https://storage.googleapis.com/intel-optimized-tensorflow/models/v1_6/resnet50_fp32_pretrained_model.pb\n"
]
},
{
@@ -43,9 +42,11 @@
"metadata": {},
"outputs": [],
"source": [
"print(sys.executable)\n",
"!{sys.executable} -m pip list\n",
"import tensorflow as tf\n",
"import numpy as np\n",
"import datasets"
"import datasets\n"
]
},
{
@@ -63,8 +64,8 @@
"source": [
"# login to huggingface to download the imagenet-1k dataset\n",
"# you should replace this read-only token with your own by create one on (https://huggingface.co/settings/tokens)\n",
"# !huggingface-cli login --token <YOUR HUGGINGFACE TOKEN>\n",
"!huggingface-cli login --token hf_xxxxxxxxxxxxxxxxxxxxxx"
"from huggingface_hub.hf_api import HfFolder\n",
"HfFolder.save_token('hf_xxxxxxxxxxxxxxxxxxxxxx')\n"
]
},
{
@@ -75,8 +76,8 @@
"source": [
"from datasets import load_dataset\n",
"# load dataset in streaming way will get an IterableDatset\n",
"calib_dataset = load_dataset('imagenet-1k', split='train', streaming=True, use_auth_token=True)\n",
"eval_dataset = load_dataset('imagenet-1k', split='validation', streaming=True, use_auth_token=True)"
"calib_dataset = load_dataset('imagenet-1k', split='train', streaming=True, token=True)\n",
"eval_dataset = load_dataset('imagenet-1k', split='validation', streaming=True, token=True)\n"
]
},
{
@@ -97,7 +98,7 @@
" return datasets.Dataset.from_dict(data)\n",
"\n",
"sub_calib_dataset = sample_data(calib_dataset, MAX_SAMPLE_LENGTG)\n",
"sub_eval_dataset = sample_data(eval_dataset, MAX_SAMPLE_LENGTG)"
"sub_eval_dataset = sample_data(eval_dataset, MAX_SAMPLE_LENGTG)\n"
]
},
{
@@ -136,7 +137,7 @@
" batch_inputs = []\n",
" labels = []\n",
" def __len__(self):\n",
" return self.length"
" return self.length\n"
]
},
{
@@ -146,7 +147,7 @@
"outputs": [],
"source": [
"calib_dataloader = CustomDataloader(dataset=sub_calib_dataset, batch_size=32)\n",
"eval_dataloader = CustomDataloader(dataset=sub_eval_dataset, batch_size=32)"
"eval_dataloader = CustomDataloader(dataset=sub_eval_dataset, batch_size=32)\n"
]
},
{
@@ -193,7 +194,7 @@
" return acc\n",
"\n",
"q_model = quantization.fit(\"./resnet50_fp32_pretrained_model.pb\", conf=conf, calib_dataloader=calib_dataloader, eval_func=eval_func)\n",
"q_model.save(\"resnet50_int8.pb\")"
"q_model.save(\"resnet50_int8.pb\")\n"
]
},
{
@@ -221,7 +222,7 @@
"metadata": {},
"outputs": [],
"source": [
"!python resnet_benchmark.py --input_model resnet50_fp32_pretrained_model.pb 2>&1|tee fp32_benchmark.log"
"!{sys.executable} resnet_benchmark.py --input_model resnet50_fp32_pretrained_model.pb 2>&1|tee fp32_benchmark.log\n"
]
},
{
@@ -237,7 +238,7 @@
"metadata": {},
"outputs": [],
"source": [
"!python resnet_benchmark.py --input_model resnet50_int8.pb 2>&1|tee int8_benchmark.log"
"!{sys.executable} resnet_benchmark.py --input_model resnet50_int8.pb 2>&1|tee int8_benchmark.log\n"
]
},
{
