Merge branch 'master' into xinhe/xpu
xin3he authored Oct 31, 2023
2 parents 5fa1ae4 + 9036dce commit 1b87325
Showing 32 changed files with 1,237 additions and 141 deletions.
2 changes: 1 addition & 1 deletion .azure-pipelines/scripts/models/env_setup.sh
@@ -78,7 +78,7 @@ if [[ "${inc_new_api}" == "false" ]]; then
fi

cd ${model_src_dir}
-pip install ruamel_yaml
+pip install ruamel.yaml==0.17.40
pip install psutil
pip install protobuf==4.23.4
if [[ "${framework}" == "tensorflow" ]]; then
40 changes: 34 additions & 6 deletions docs/source/quantization_weight_only.md
@@ -129,6 +129,36 @@ torch.save(compressed_model.state_dict(), "compressed_model.pt")

The `saved_results` folder contains two files, `best_model.pt` and `qconfig.json`; the generated `q_model` is a fake-quantized model.
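
For illustration, the saved artifacts can be restored later with the `load` helper from `neural_compressor.utils.pytorch` (a minimal sketch; `fp32_model` is assumed to be the original float model that was quantized):

```python
from neural_compressor.utils.pytorch import load

# Rebuild the weight-only quantized model from the saved_results folder;
# fp32_model is assumed to be the original float model.
q_model = load("saved_results", fp32_model, weight_only=True)
```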


### **WOQ algorithms tuning**

To find the best algorithm, users can simply omit the algorithm setting. Compared with specifying a particular algorithm, this tuning process traverses a set of pre-defined WOQ configurations and identifies the one that yields the best result. For detailed usage, please refer to the [tuning strategy](./tuning_strategies.md#Basic).

> **Note:** Currently, this behavior is specific to the `ONNX Runtime` backend.

**Pre-defined configurations**

| WOQ configuration | Setting |
|:-----------------:|:-------:|
| RTN_G32ASYM | {"algorithm": "RTN", "group_size": 32, "scheme": "asym"} |
| GPTQ_G32ASYM | {"algorithm": "GPTQ", "group_size": 32, "scheme": "asym"} |
| GPTQ_G32ASYM_DISABLE_LAST_MATMUL | {"algorithm": "GPTQ", "group_size": 32, "scheme": "asym"} <br> & disable last MatMul |
| GPTQ_G128ASYM | {"algorithm": "GPTQ", "group_size": 128, "scheme": "asym"} |
| AWQ_G32ASYM | {"algorithm": "AWQ", "group_size": 32, "scheme": "asym"} |

**User code example**

```python
conf = PostTrainingQuantConfig(
    approach="weight_only",
    quant_level="auto",  # quant_level supports "auto" or 1 for WOQ config tuning
)
q_model = quantization.fit(model, conf, eval_func=eval_func, calib_dataloader=dataloader)
q_model.save("saved_results")
```

Refer to this [link](../../examples/onnxrt/nlp/huggingface_model/text_generation/llama/quantization/weight_only) for an example of WOQ algorithm tuning on ONNX Llama models.

## Layer Wise Quantization

Large language models (LLMs) have shown exceptional performance across various tasks, but their substantial parameter size poses significant challenges for deployment. Layer-wise quantization (LWQ) greatly reduces the memory footprint required to quantize LLMs, usually by 80-90%, which means users can quantize LLMs on a single node using a GPU or CPU, even on memory-constrained devices, making quantization of huge LLMs possible.
@@ -143,22 +173,19 @@ Large language models (LLMs) have shown exceptional performance across various t
|:--------------:|:----------:|
| RTN | &#10004; |
| AWQ | &#10005; |
-| GPTQ | &#10005; |
+| GPTQ | &#10004; |
| TEQ | &#10005; |

### Example
```python
from neural_compressor import PostTrainingQuantConfig, quantization
-from neural_compressor.adaptor.torch_utils.layer_wise_quant import load_shell
+from neural_compressor.adaptor.torch_utils.layer_wise_quant import load_empty_model
from neural_compressor.utils.pytorch import load  # needed below for load(); added so the example runs end to end

-fp32_model = load_shell(model_name_or_path, AutoModelForCausalLM, torchscript=True)
+fp32_model = load_empty_model(model_name_or_path, torchscript=True)
conf = PostTrainingQuantConfig(
    approach="weight_only",
    recipes={
        "layer_wise_quant": True,
-        "layer_wise_quant_args": {
-            "model_path": "facebook/opt-125m",
-        },
        "rtn_args": {"enable_full_range": True},
    },
)
@@ -171,6 +198,7 @@ q_model = quantization.fit(
)
output_dir = "./saved_model"
q_model.save(output_dir)
+q_model = load(output_dir, fp32_model, weight_only=True, layer_wise=True)
```

## Reference
2 changes: 2 additions & 0 deletions docs/source/tuning_strategies.md
@@ -181,6 +181,8 @@ flowchart TD
> For [smooth quantization](./smooth_quant.md), users can tune the smooth quantization alpha by providing a list of scalars for the `alpha` item. The tuning process will take place at the **start stage** of the tuning procedure. For detailed usage, please refer to the [smooth quantization example](./smooth_quant.md#Example).
> For [weight-only quantization](./quantization_weight_only.md), users can tune the weight-only algorithms from the available [pre-defined configurations](./quantization_weight_only.md#woq-algorithms-tuning). The tuning process will take place at the **start stage** of the tuning procedure, preceding the smooth quantization alpha tuning. For detailed usage, please refer to the [weight-only quantization example](./quantization_weight_only.md#woq-algorithms-tuning).
*Please note that this behavior is specific to the `ONNX Runtime` backend.*
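
A minimal sketch of the alpha-list usage described above, assuming the `recipes` interface of `PostTrainingQuantConfig` (the candidate values are illustrative):

```python
from neural_compressor import PostTrainingQuantConfig

# Passing a list of scalars for alpha (instead of a single value) lets the
# tuning procedure search over the candidates at its start stage.
conf = PostTrainingQuantConfig(
    recipes={
        "smooth_quant": True,
        "smooth_quant_args": {"alpha": [0.5, 0.6, 0.7]},  # illustrative candidates
    }
)
```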

**1.** Default quantization

7 changes: 7 additions & 0 deletions examples/.config/model_params_onnxrt.json
@@ -322,6 +322,13 @@
"main_script": "main.py",
"batch_size": 1
},
"beit": {
"model_src_dir": "image_recognition/beit/quantization/ptq_static",
"dataset_location": "/tf_dataset/pytorch/ImageNet/raw",
"input_model": "/tf_dataset2/models/onnx/beit/beit_base_patch16_224_pt22k_ft22kto1k.onnx",
"main_script": "main.py",
"batch_size": 1
},
"mobilebert_squad_mlperf_qdq": {
"model_src_dir": "nlp/onnx_model_zoo/mobilebert/quantization/ptq_static",
"dataset_location": "/tf_dataset2/datasets/squad",
6 changes: 6 additions & 0 deletions examples/README.md
@@ -1133,6 +1133,12 @@ Intel® Neural Compressor validated examples with multiple compression technique
<td>Post-Training Static Quantization</td>
<td><a href="./onnxrt/body_analysis/onnx_model_zoo/arcface/quantization/ptq_static">qlinearops</a></td>
</tr>
<tr>
<td>BEiT</td>
<td>Image Recognition</td>
<td>Post-Training Static Quantization</td>
<td><a href="./onnxrt/image_recognition/beit/quantization/ptq_static">qlinearops</a></td>
</tr>
<tr>
<td>CodeBert</td>
<td>Natural Language Processing</td>
@@ -47,13 +47,14 @@
"outputs": [],
"source": [
"# install neural-compressor from source\n",
"import sys\n",
"!git clone https://github.com/intel/neural-compressor.git\n",
"%cd ./neural-compressor\n",
"!pip install -r requirements.txt\n",
"!python setup.py install\n",
"!{sys.executable} -m pip install -r requirements.txt\n",
"!{sys.executable} setup.py install\n",
"%cd ..\n",
"# or install stable basic version from pypi\n",
"# pip install neural-compressor"
"# pip install neural-compressor\n"
]
},
{
@@ -65,10 +66,8 @@
},
"outputs": [],
"source": [
"# install onnx related packages\n",
"!pip install onnx onnxruntime onnxruntime-extensions\n",
"# install other packages used in this notebook.\n",
"!pip install torch transformers accelerate coloredlogs sympy numpy sentencepiece protobuf optimum"
"# install required packages\n",
"!{sys.executable} install -r requirements.txt\n"
]
},
{
@@ -168,7 +167,7 @@
"source": [
"!export GLUE_DIR=./glue_data\n",
"!wget https://raw.githubusercontent.com/Shimao-Zhang/Download_GLUE_Data/master/download_glue_data.py\n",
"!python download_glue_data.py --data_dir=GLUE_DIR --tasks=SST"
"!{sys.executable} download_glue_data.py --data_dir=GLUE_DIR --tasks=SST\n"
]
},
{
@@ -193,7 +192,7 @@
"int8_model_path = \"onnx-model/int8-model.onnx\"\n",
"data_path = \"./GLUE_DIR/SST-2\"\n",
"task = \"sst-2\"\n",
"batch_size = 8"
"batch_size = 8\n"
]
},
{
@@ -343,7 +342,7 @@
" label=label\n",
" )\n",
" features.append(feats)\n",
" return features"
" return features\n"
]
},
{
@@ -377,7 +376,7 @@
" model_name_or_path=model_name_or_path,\n",
" model_type=\"distilbert\",\n",
" task=task)\n",
"dataloader = DataLoader(framework=\"onnxruntime\", dataset=dataset, batch_size=batch_size)"
"dataloader = DataLoader(framework=\"onnxruntime\", dataset=dataset, batch_size=batch_size)\n"
]
},
{
@@ -448,7 +447,7 @@
" elif output_mode == \"regression\":\n",
" processed_preds = np.squeeze(self.pred_list)\n",
" result = transformers.glue_compute_metrics(self.task, processed_preds, self.label_list)\n",
" return result[self.return_key[self.task]]"
" return result[self.return_key[self.task]]\n"
]
},
{
@@ -486,7 +485,7 @@
" ort_inputs.update({inputs_names[i]: inputs[i]})\n",
" predictions = session.run(None, ort_inputs)\n",
" metric.update(predictions[0], labels)\n",
" return metric.result()"
" return metric.result()\n"
]
},
{
@@ -567,7 +566,7 @@
" num_heads=num_heads,\n",
" hidden_size=hidden_size,\n",
" optimization_options=opt_options)\n",
"model = model_optimizer.model"
"model = model_optimizer.model\n"
]
},
{
@@ -722,7 +721,7 @@
" config,\n",
" eval_func=eval_func,\n",
" calib_dataloader=dataloader)\n",
"q_model.save(int8_model_path)"
"q_model.save(int8_model_path)\n"
]
},
{
12 changes: 12 additions & 0 deletions examples/notebook/onnxruntime/requirements.txt
@@ -0,0 +1,12 @@
onnx
onnxruntime
onnxruntime-extensions
torch
transformers
accelerate
coloredlogs
sympy
numpy
sentencepiece
protobuf
optimum
@@ -45,14 +45,15 @@
"outputs": [],
"source": [
"# install neural-compressor from source\n",
"import sys\n",
"!git clone https://github.com/intel/neural-compressor.git\n",
"%cd ./neural-compressor\n",
"!pip install -r requirements.txt\n",
"!python setup.py install\n",
"!{sys.executable} -m pip install -r requirements.txt\n",
"!{sys.executable} setup.py install\n",
"%cd ..\n",
"\n",
"# or install stable basic version from pypi\n",
"!pip install neural-compressor"
"!{sys.executable} -m pip install neural-compressor\n"
]
},
{
@@ -62,7 +63,7 @@
"outputs": [],
"source": [
"# install other packages used in this notebook.\n",
"!pip install torch>=1.9.0 transformers>=4.16.0 accelerate sympy numpy sentencepiece!=0.1.92 protobuf<=3.20.3 datasets>=1.1.3 scipy scikit-learn Keras-Preprocessing"
"!{sys.executable} -m pip install -r requirements.txt\n"
]
},
{
@@ -303,10 +304,10 @@
"outputs": [],
"source": [
"# fp32 benchmark\n",
"!python benchmark.py --input_model ./pytorch_model.bin 2>&1|tee fp32_benchmark.log\n",
"!{sys.executable} benchmark.py --input_model ./pytorch_model.bin 2>&1|tee fp32_benchmark.log\n",
"\n",
"# int8 benchmark\n",
"!python benchmark.py --input_model ./saved_results/best_model.pt 2>&1|tee int8_benchmark.log\n"
"!{sys.executable} benchmark.py --input_model ./saved_results/best_model.pt 2>&1|tee int8_benchmark.log\n"
]
}
],
11 changes: 11 additions & 0 deletions examples/notebook/pytorch/requirements.txt
@@ -0,0 +1,11 @@
torch>=1.9.0
transformers>=4.16.0
accelerate
sympy
numpy
sentencepiece!=0.1.92
protobuf<=3.20.3
datasets>=1.1.3
scipy
scikit-learn
Keras-Preprocessing
8 changes: 8 additions & 0 deletions examples/notebook/tensorflow/resnet/requirements.txt
@@ -0,0 +1,8 @@
numpy
neural-compressor
tensorflow
datasets
requests
urllib3
pyOpenSSL
git+https://github.com/huggingface/huggingface_hub
33 changes: 17 additions & 16 deletions examples/notebook/tensorflow/resnet/resnet_quantization.ipynb
@@ -29,12 +29,11 @@
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"!conda install python==3.10 -y\n",
"!pip install neural-compressor\n",
"!wget -nc https://storage.googleapis.com/intel-optimized-tensorflow/models/v1_6/resnet50_fp32_pretrained_model.pb\n",
"!pip install tensorflow\n",
"!pip install datasets\n",
"!pip install git+https://github.com/huggingface/huggingface_hub"
"!{sys.executable} -m pip install -r requirements.txt \n",
"\n",
"!wget -nc https://storage.googleapis.com/intel-optimized-tensorflow/models/v1_6/resnet50_fp32_pretrained_model.pb\n"
]
},
{
@@ -43,9 +42,11 @@
"metadata": {},
"outputs": [],
"source": [
"print(sys.executable)\n",
"!{sys.executable} -m pip list\n",
"import tensorflow as tf\n",
"import numpy as np\n",
"import datasets"
"import datasets\n"
]
},
{
@@ -63,8 +64,8 @@
"source": [
"# login to huggingface to download the imagenet-1k dataset\n",
"# you should replace this read-only token with your own by create one on (https://huggingface.co/settings/tokens)\n",
"# !huggingface-cli login --token <YOUR HUGGINGFACE TOKEN>\n",
"!huggingface-cli login --token hf_xxxxxxxxxxxxxxxxxxxxxx"
"from huggingface_hub.hf_api import HfFolder\n",
"HfFolder.save_token('hf_xxxxxxxxxxxxxxxxxxxxxx')\n"
]
},
{
@@ -75,8 +76,8 @@
"source": [
"from datasets import load_dataset\n",
"# load dataset in streaming way will get an IterableDatset\n",
"calib_dataset = load_dataset('imagenet-1k', split='train', streaming=True, use_auth_token=True)\n",
"eval_dataset = load_dataset('imagenet-1k', split='validation', streaming=True, use_auth_token=True)"
"calib_dataset = load_dataset('imagenet-1k', split='train', streaming=True, token=True)\n",
"eval_dataset = load_dataset('imagenet-1k', split='validation', streaming=True, token=True)\n"
]
},
{
@@ -97,7 +98,7 @@
" return datasets.Dataset.from_dict(data)\n",
"\n",
"sub_calib_dataset = sample_data(calib_dataset, MAX_SAMPLE_LENGTG)\n",
"sub_eval_dataset = sample_data(eval_dataset, MAX_SAMPLE_LENGTG)"
"sub_eval_dataset = sample_data(eval_dataset, MAX_SAMPLE_LENGTG)\n"
]
},
{
@@ -136,7 +137,7 @@
" batch_inputs = []\n",
" labels = []\n",
" def __len__(self):\n",
" return self.length"
" return self.length\n"
]
},
{
@@ -146,7 +147,7 @@
"outputs": [],
"source": [
"calib_dataloader = CustomDataloader(dataset=sub_calib_dataset, batch_size=32)\n",
"eval_dataloader = CustomDataloader(dataset=sub_eval_dataset, batch_size=32)"
"eval_dataloader = CustomDataloader(dataset=sub_eval_dataset, batch_size=32)\n"
]
},
{
@@ -193,7 +194,7 @@
" return acc\n",
"\n",
"q_model = quantization.fit(\"./resnet50_fp32_pretrained_model.pb\", conf=conf, calib_dataloader=calib_dataloader, eval_func=eval_func)\n",
"q_model.save(\"resnet50_int8.pb\")"
"q_model.save(\"resnet50_int8.pb\")\n"
]
},
{
@@ -221,7 +222,7 @@
"metadata": {},
"outputs": [],
"source": [
"!python resnet_benchmark.py --input_model resnet50_fp32_pretrained_model.pb 2>&1|tee fp32_benchmark.log"
"!{sys.executable} resnet_benchmark.py --input_model resnet50_fp32_pretrained_model.pb 2>&1|tee fp32_benchmark.log\n"
]
},
{
@@ -237,7 +238,7 @@
"metadata": {},
"outputs": [],
"source": [
"!python resnet_benchmark.py --input_model resnet50_int8.pb 2>&1|tee int8_benchmark.log"
"!{sys.executable} resnet_benchmark.py --input_model resnet50_int8.pb 2>&1|tee int8_benchmark.log\n"
]
},
{
