From 8769a7bacb78af1e1722c78ffd1b88d9e3bd25c3 Mon Sep 17 00:00:00 2001 From: zehao-intel Date: Tue, 28 May 2024 14:41:00 +0800 Subject: [PATCH 01/13] Documents of INC TF 3x New API Signed-off-by: zehao-intel --- docs/3x/Quant/TF_Quant.md | 182 +++++++++++++++++++ docs/3x/SQ/TF_SQ.md | 162 +++++++++++++++++ docs/3x/TensorFlow.md | 211 +++++++++++++++++++++++ neural_compressor/tensorflow/__init__.py | 1 + 4 files changed, 556 insertions(+) create mode 100644 docs/3x/Quant/TF_Quant.md create mode 100644 docs/3x/SQ/TF_SQ.md create mode 100644 docs/3x/TensorFlow.md diff --git a/docs/3x/Quant/TF_Quant.md b/docs/3x/Quant/TF_Quant.md new file mode 100644 index 00000000000..d08e03e3809 --- /dev/null +++ b/docs/3x/Quant/TF_Quant.md @@ -0,0 +1,182 @@ + +Quantization +=============== + +1. [Quantization Introduction](#quantization-introduction) +2. [Quantization Fundamentals](#quantization-fundamentals) +3. [Accuracy Aware Tuning](#accuracy-aware-tuning) + +4. [Get Started](#get-started) + 5.1 [Without Accuracy Aware Tuning](#without-accuracy-aware-tuning) + 5.2 [With Accuracy Aware Tuning](#with-accuracy-aware-tuning) + 5.3 [Specify Quantization Rules](#specify-quantization-rules) + +## Quantization Introduction + +Quantization is a very popular deep learning model optimization technique invented for improving the speed of inference. It minimizes the number of bits required by converting a set of real-valued numbers into the lower bit data representation, such as int8 and int4, mainly on inference phase with minimal to no loss in accuracy. This way reduces the memory requirement, cache miss rate, and computational cost of using neural networks and finally achieve the goal of higher inference performance. On Intel 3rd Gen Intel® Xeon® Scalable Processors, user could expect up to 4x theoretical performance speedup. We expect further performance improvement with [Intel® Advanced Matrix Extensions](https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/overview.html) on 4th Gen Intel® Xeon® Scalable Processors. + +## Quantization Fundamentals + +`Affine quantization` and `Scale quantization` are two common range mapping techniques used in tensor conversion between different data types. + +For TensorFlow, all quantizable operators support `Scale quantization`, while a parts of operators support `Affine quantization`. For Keras, the quantizable layers only support `Scale quantization`. + +The math equation is like: $$X_{int8} = round(Scale \times X_{fp32} + ZeroPoint)$$. + +**Affine Quantization** + +This is so-called `asymmetric quantization`, in which we map the min/max range in the float tensor to the integer range. Here int8 range is [-128, 127], uint8 range is [0, 255]. + +here: + +If INT8 is specified, $Scale = (|X_{f_{max}} - X_{f_{min}}|) / 127$ and $ZeroPoint = -128 - X_{f_{min}} / Scale$. + +or + +If UINT8 is specified, $Scale = (|X_{f_{max}} - X_{f_{min}}|) / 255$ and $ZeroPoint = - X_{f_{min}} / Scale$. + +**Scale Quantization** + +This is so-called `Symmetric quantization`, in which we use the maximum absolute value in the float tensor as float range and map to the corresponding integer range. + +The math equation is like: + +here: + +If INT8 is specified, $Scale = max(abs(X_{f_{max}}), abs(X_{f_{min}})) / 127$ and $ZeroPoint = 0$. + +or + +If UINT8 is specified, $Scale = max(abs(X_{f_{max}}), abs(X_{f_{min}})) / 255$ and $ZeroPoint = 128$. 
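The two mappings can be tried out directly. The snippet below is a minimal standalone NumPy sketch added for illustration (it is not the Intel Neural Compressor implementation); it follows the widely used convention $X_{int8} = round(X_{fp32}/Scale) + ZeroPoint$ with a 255-step range for the asymmetric case, and the helper names are placeholders.

```python
import numpy as np


def affine_quant_int8(x):
    """Asymmetric (affine) quantization of a float tensor to int8."""
    x_min, x_max = float(x.min()), float(x.max())
    scale = max((x_max - x_min) / 255, 1e-8)          # 256 int8 values -> 255 steps
    zero_point = np.round(-128 - x_min / scale)       # maps x_min to -128
    x_q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return x_q, scale, zero_point


def scale_quant_int8(x):
    """Symmetric (scale) quantization of a float tensor to int8."""
    scale = max(abs(float(x.min())), abs(float(x.max()))) / 127
    x_q = np.clip(np.round(x / max(scale, 1e-8)), -127, 127).astype(np.int8)
    return x_q, scale


x = np.random.uniform(-3.0, 5.0, size=(4, 4)).astype(np.float32)
x_q, s, zp = affine_quant_int8(x)
x_dq = (x_q.astype(np.float32) - zp) * s              # dequantize to check the error
print("max affine round-trip error:", np.abs(x - x_dq).max())
```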
+ + +> ***Note*** +> Sometimes the reduce_range feature, that's using 7 bit width (1 sign bit + 6 data bits) to represent int8 range, may be needed on some early Xeon platforms, it's because those platforms may have overflow issues due to fp16 intermediate calculation result when executing int8 dot product operation. After AVX512_VNNI instruction is introduced, this issue gets solved by supporting fp32 intermediate data. + + + +### Quantization Approaches + +Quantization has three different approaches: +1) post training dynamic quantization +2) post training static quantization +3) quantization aware training. + +Currently, only `post training static quantization` is supported by INC TF 3X API. For this approach, the min/max range in weights and activations are collected offline on a so-called `calibration` dataset. This dataset should be able to represent the data distribution of those unseen inference dataset. The `calibration` process runs on the original fp32 model and dumps out all the tensor distributions for `Scale` and `ZeroPoint` calculations. Usually preparing 100 samples are enough for calibration. + +This approach is major quantization approach people should try because it could provide the better performance comparing with `post training dynamic quantization`. + + +## Accuracy Aware Tuning + +Accuracy aware tuning is one of unique features provided by Intel(R) Neural Compressor, compared with other 3rd party model compression tools. This feature can be used to solve accuracy loss pain points brought by applying low precision quantization and other lossy optimization methods. + +This tuning algorithm creates a tuning space by querying framework quantization capability and model structure, selects the ops to be quantized by the tuning strategy, generates quantized graph, and evaluates the accuracy of this quantized graph. The optimal model will be yielded if the pre-defined accuracy goal is met. + +Neural compressor also support to quantize all quantizable ops without accuracy tuning, user can decide whether to tune the model accuracy or not. Please refer to "Get Start" below. + +### Working Flow + +User could refer to below chart to understand the whole tuning flow. + +accuracy aware tuning working flow + + +## Get Started + + +### Without Accuracy Aware Tuning + + +This means user could leverage Intel(R) Neural Compressor to directly generate a fully quantized model without accuracy aware tuning. It's user responsibility to ensure the accuracy of the quantized model meets expectation. + +``` python +# main.py + +# Original code +model = tf.keras.applications.resnet50.ResNet50(weights='imagenet') +val_dataset = ... +val_dataloader = MyDataloader(dataset=val_dataset) + +# Quantization code +from neural_compressor.tensorflow import quantize_model, StaticQuantConfig + +quant_config = StaticQuantConfig() +qmodel = quantize_model( + model=model, + quant_config=quant_config, + calib_dataloader=val_dataloader, +) +qmodel.save("./output") +``` + +### With Accuracy Aware Tuning + +This means user could leverage the advance feature of Intel(R) Neural Compressor to tune out a best quantized model which has best accuracy and good performance. User should provide either `eval_fn` and `eval_args`. + +``` python +# main.py + +# Original code +model = tf.keras.applications.resnet50.ResNet50(weights='imagenet') +val_dataset = ... +val_dataloader = MyDataloader(dataset=val_dataset) + +def eval_acc_fn(model) -> float: + ... 
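    # Illustrative sketch only (not part of the original example): a typical
    # body runs the candidate model over the validation dataloader and
    # returns a scalar metric such as top-1 accuracy, e.g.:
    #     correct = total = 0
    #     for images, labels in val_dataloader:
    #         preds = tf.argmax(model(images, training=False), axis=-1)
    #         correct += int(tf.reduce_sum(tf.cast(preds == tf.cast(labels, preds.dtype), tf.int32)))
    #         total += int(labels.shape[0])
    #     acc = correct / total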
+ return acc +# Quantization code +from neural_compressor.common.base_tuning import TuningConfig +from neural_compressor.tensorflow import autotune + +# it's also supported to define custom_tune_config as: +# TuningConfig(StaticQuantConfig(weight_sym=[True, False], act_sym=[True, False])) +custom_tune_config = TuningConfig( + config_set=[ + StaticQuantConfig(weight_sym=True, act_sym=True), + StaticQuantConfig(weight_sym=False, act_sym=False), + ] +) +best_model = autotune( + model=model, + tune_config=custom_tune_config, + eval_fn=eval_acc_fn, + calib_dataloader=val_dataloader, +) +best_model.save("./output") +``` + +### Specify Quantization Rules +Intel(R) Neural Compressor support specify quantization rules by operator name or operator type. Users can set `local` in dict or use `set_local` method of config class to achieve the above purpose. + +1. Example of setting `local` from a dict +```python +quant_config = { + "static_quant": { + "global": { + "weight_dtype": "int8", + "weight_sym": True, + "weight_granularity": "per_tensor", + "act_dtype": "int8", + "act_sym": True, + "act_granularity": "per_tensor", + }, + "local": { + "conv1": { + "weight_dtype": "fp32", + "act_dtype": "fp32", + } + }, + } + } +config = StaticQuantConfig.from_dict(quant_config) +``` +2. Example of using `set_local` +```python +quant_config = StaticQuantConfig() +conv2d_config = StaticQuantConfig( + weight_dtype="fp32", + act_dtype="fp32", +) +quant_config.set_local("conv1", conv2d_config) +``` diff --git a/docs/3x/SQ/TF_SQ.md b/docs/3x/SQ/TF_SQ.md new file mode 100644 index 00000000000..0e669f3e9ce --- /dev/null +++ b/docs/3x/SQ/TF_SQ.md @@ -0,0 +1,162 @@ +# Smooth Quant + +1. [Introduction](#Introduction) +2. [Quantization Fundamentals](#Quantization-Fundamentals) +3. [SmoothQuant and Our Enhancement](#SmoothQuant-and-Our-Enhancement) +4. [Usage](#Usage) +5. [Reference](#reference) + + +## Introduction + +Quantization is a common compression operation to reduce memory and accelerate inference by converting the floating point matrix to an integer matrix. For large language models (LLMs) with gigantic parameters, the systematic outliers make quantification of activations difficult. [SmoothQuant](https://arxiv.org/abs/2211.10438), a training free post-training quantization (PTQ) solution, offline migrates this difficulty from activations to weights with a mathematically equivalent transformation. + +## Quantization Fundamentals + +Quantization is a common compression operation to reduce memory and accelerate inference; therefore, the difficulty of LLM deployment can be alleviated. Quantization converts the floating point matrix to an integer matrix. + +The equation of quantization is as follows: + +$$ +X_{int8} = round(X_{fp32}/S) + Z \tag{1} +$$ + +where $X_{fp32}$ is the input matrix, $S$ is the scale factor, $Z$ is the integer zero point. + +### Per-tensor & Per-channel + +There are several choices of sharing quantization parameters among tensor elements, also called quantization granularity. The coarsest level, per-tensor granularity, is that all elements in the tensor share the same quantization parameters. Finer granularity means sharing quantization parameters per row or per column for 2D matrices and per channel for 3D matrices. Similarly, the finest granularity is that each element has an individual set of quantization parameters. + + +However, due to the model accuracy and computational consumption, per-tensor or per-channel are usually adopted. 
**Through mathematical calculations, per-channel could bring lower quantization loss but has some limitations, that is why normally we use per-channel for weight quantization and per-tensor for activation/input quantization** + +#### Per-channel limitation + +Though per-channel quantization could bring lower quantization error, we could not apply it for activations due to the difficulty of the dequantization. We would prove it in the following image and the zero point of quantization would be ignored for simplicity. + +The image on the left presents a normal linear forward with 1x2 input $x$ and 2x2 weight $w$. The results $y$ could be easily obtained by simple mathematics. In the middle image, we apply per-tensor quantization for activations and per-channel quantization for weights; the results after quantization that are denoted by $y_1$ and $y_2$, could be easily dequantized to the float results $y_{fp1}$ and $y_{fp2}$ by per channel scale $1.0/s_1s_x$ and $1.0/s_2s_x$. However, after applying per-channel quantization for activations (right image), we could not dequantize the $y_1$ and $y_2$ to float results. + +
+ +
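To make the dequantization argument concrete, here is a small standalone NumPy check (added for illustration; the array values and names are arbitrary and not taken from the INC code base): with one activation scale per tensor and one weight scale per output column, the integer accumulator can be rescaled back to float, which is exactly what breaks once the activation scale also varies per input channel.

```python
import numpy as np

x = np.array([[1.5, -2.0]], dtype=np.float32)               # 1x2 activation
w = np.array([[0.4, 1.2], [-0.8, 0.3]], dtype=np.float32)   # 2x2 weight
y_fp = x @ w                                                 # float reference

s_x = np.abs(x).max() / 127                                  # per-tensor activation scale
s_w = np.abs(w).max(axis=0) / 127                            # per-output-channel weight scales
x_q = np.round(x / s_x)                                      # quantized activation (zero point ignored)
w_q = np.round(w / s_w)                                      # quantized weight, scaled per column

y_q = x_q @ w_q                                              # integer-domain accumulation
y_dq = y_q * (s_x * s_w)                                     # one factor per output column works
print(np.abs(y_fp - y_dq).max())                             # only small rounding error remains

# If the activation were quantized per input channel with scales s_x[i], each
# term x_q[0, i] * w_q[i, j] would need its own factor s_x[i] * s_w[j], so the
# summed y_q[0, j] could no longer be mapped back to float with a single scale.
```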
+ + +## SmoothQuant and Our Enhancement + +### SmoothQuant + +In the previous subsection, we have explained why per-channel quantization could not be applied for s, even though it could lead to lower quantization loss. However, the quantization error loss of activations plays an important role in the accuracy loss of model quantization[^2][^3][^4]. + + + +To reduce the quantization loss of activations, lots of methods have been proposed. In the following, we briefly introduce SPIQ[^2], Outlier Suppression[^3] and Smoothquant[^4]. All these three methods share a similar idea to migrate the difficulty from activation quantization to weight quantization but differ in how much difficulty to be transferred. + + +So **the first question is how to migrate the difficulty from activation to weights?** The solution is straightforward, that is to convert the network to an output equivalent network that is presented in the image below and apply quantization to this equivalent network. The intuition is that each channel of activations could be scaled to make it more quantization-friendly, similar to a fake per-channel activation quantization. + +
+ +
+ + +Please note that this conversion will make the quantization of weights more difficult, because the scales attached to weights shown above are per-input-channel, while quantization of weights is per-output-channel or per-tensor. + +So **the second question is how much difficulty to be migrated**, that is how to choose the **conversion per-channel scale** $s_{x1}$ and $s_{x2}$ from the above image. Different works adopt different ways. + +*SPIQ* just adopts the quantization scale of activations as the conversion per-channel scale. + +*Outlier suppression* adopts the scale of the preceding layernorm as the conversion per-channel scale. + +*Smoothquant* introduces a hyperparameter $\alpha$ as a smooth factor to calculate the conversion per-channel scale and balance the quantization difficulty of activations and weights. + +$$ +s_j = max(|X_j|)^\alpha/max(|W_j|)^{1-\alpha} \tag{4} +$$ + +j is the index of the input channels. + + + +
+ +
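Equation (4) is simple enough to write out directly. The following standalone NumPy sketch (for illustration only; `smooth` and the tensor shapes are placeholders, not INC internals) computes the per-input-channel scale $s_j$ for one linear layer and applies the equivalent transformation, so the smoothed pair produces the same output while the activation outlier channel is damped.

```python
import numpy as np


def smooth(X, W, alpha=0.5, eps=1e-8):
    """SmoothQuant-style equivalent transform for one linear layer.

    X: activations, shape (tokens, in_channels); W: weights, shape
    (in_channels, out_channels). Returns smoothed X, W and the scales s_j.
    """
    s = np.abs(X).max(axis=0) ** alpha / np.maximum(np.abs(W).max(axis=1) ** (1 - alpha), eps)
    s = np.maximum(s, eps)              # avoid dividing by zero for dead channels
    return X / s, W * s[:, None], s     # scale input channels of X down, matching rows of W up


rng = np.random.default_rng(0)
X = rng.standard_normal((16, 8)).astype(np.float32)
X[:, -1] *= 50.0                        # inject an outlier input channel
W = rng.standard_normal((8, 4)).astype(np.float32)

X_s, W_s, s = smooth(X, W, alpha=0.5)
print(np.abs(X @ W - X_s @ W_s).max())  # output-equivalent up to float rounding
print(np.abs(X).max(axis=0)[-1], np.abs(X_s).max(axis=0)[-1])  # outlier channel damped
```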
+ + + +For most of the models such as OPT and BLOOM, $\alpha = 0.5$ is a well-balanced value to split the difficulty of weight and activation quantization. A larger $\alpha$ value could be used on models with more significant activation outliers to migrate more quantization difficulty to weights. + + +### Our enhancement: + +#### Algorithm: Auto-tuning of $\alpha$. + +SmoothQuant method aims to split the quantization difficulty of weights and activations by using a fixed-value $\alpha$ for an entire model. However, as the distributions of activation outliers vary not only across different models but also across different layers within a model, we hereby propose a method to obtain layer-wise optimal $\alpha$ values with the ability to tune automatically. + +Our proposed method consists of 7 major steps: + +- Calculate input minimum and maximum values of operators to be smoothed. +- Find a list of operators on which smoothquant could be performed. +- Set a $\alpha$ value based on user-defined $\alpha$ values. +- Calculate smoothing factor using the current $\alpha$ value, adjust parameters accordingly and forward the adjusted model given an input sample. +- Perform per-channel quantization_dequantization of weights and per-tensor quantization_dequantization of activations to predict output. +- Calculate the accuracy loss with respect to FP32 output, iterate the previous three steps given each $\alpha$ value and save the loss per alpha. +- Stop iterating if the maximum times of trial is reached and output the quantized model with a minimum accuracy loss. + + + +Multiple criteria (e.g min, max and mean) are supported to determine the $\alpha$ value of an input LayerNorm op of a transformer block. Both alpha range and criterion could be configured in auto_alpha_args. + +In our experiments, an $\alpha$ range of [0.0, 1.0] with a step_size of 0.1 is found to be well-balanced one for the majority of models. + + +## Usage +There are two ways to apply smooth quantization: 1) using a fixed `alpha` for the entire model or 2) determining the `alpha` through auto-tuning. + +### Using a fixed `alpha` +To set a fixed alpha for the entire model, users can follow this example: + +```python +from neural_compressor.tensorflow import SmoothQuantConfig, StaticQuantConfig + +quant_config = [SmoothQuantConfig(alpha=0.5), StaticQuantConfig()] +q_model = quantize_model( + output_graph_def, + [sq_config, static_config], + calib_dataloader +) +``` +The `SmoothQuantConfig` should be combined with `StaticQuantConfig` in a list because we still need to insert QDQ and apply pattern fusion after the smoothing process. + + +### Determining the `alpha` through auto-tuning +Users can search for the best `alpha` for the entire model.The tuning process looks for the optimal `alpha` value from a list of `alpha` values provided by the user. + +Here is an example: + +```python +from neural_compressor.tensorflow import StaticQuantConfig, SmoothQuantConfig + +custom_tune_config = TuningConfig( + config_set=[SmoothQuantConfig(alpha=[0.5, 0.6, 0.7]), StaticQuantConfig()] +) +best_model = autotune( + model="fp32_model", + tune_config=custom_tune_config, + eval_fn=eval_fn_wrapper, + calib_dataloader=calib_dataloader, +) +``` +> Please note that, it may a considerable amount of time as the tuning process applies each `alpha` to the entire model and uses the evaluation result on the entire dataset as the metric to determine the best `alpha`. + +## Reference + +[^1]: Jason, Wei, et al. "Emergent Abilities of Large Language Models". 
Published in Transactions on Machine Learning Research (2022). + + +[^2]: Yvinec, Edouard, et al. "SPIQ: Data-Free Per-Channel Static Input Quantization." Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2023. + + +[^3]: Wei, Xiuying, et al. "Outlier suppression: Pushing the limit of low-bit transformer language models." arXiv preprint arXiv:2209.13325 (2022). + + +[^4]: Xiao, Guangxuan, et al. "Smoothquant: Accurate and efficient post-training quantization for large language models." arXiv preprint arXiv:2211.10438 (2022). diff --git a/docs/3x/TensorFlow.md b/docs/3x/TensorFlow.md new file mode 100644 index 00000000000..843e2010b2c --- /dev/null +++ b/docs/3x/TensorFlow.md @@ -0,0 +1,211 @@ +TensorFlow +=============== + +- [TensorFlow](#tensorflow) + - [Introduction](#introduction) + - [API for TensorFlow](#api-for-tensorflow) + - [Support Matrix](#support-matrix) + - [Quantization Scheme](#quantization-scheme) + - [Quantization Approaches](#quantization-approaches) + - [Post Training Static Quantization](#post-training-static-quantization) + - [Post Training Static Quantization](#post-training-static-quantization-1) + - [Backend and Device](#backend-and-device) + - [Examples](#examples) + +## Introduction + +
+ +
+ +[TensorFlow](https://www.tensorflow.org/) is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of [tools](https://www.tensorflow.org/resources/tools), [libraries](https://www.tensorflow.org/resources/libraries-extensions), and [community](https://www.tensorflow.org/community) resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML-powered applications. It provides stable [Python](https://www.tensorflow.org/api_docs/python) and [C++](https://www.tensorflow.org/api_docs/cc) APIs, as well as a non-guaranteed backward compatible API for [other languages](https://www.tensorflow.org/api_docs). + +Keras is a multi-backend deep learning framework , supporting JAX, TensorFlow, and PyTorch. It serves as a dependency of TensorFlow, providing high-level API. Effortlessly build and train models for computer vision, natural language processing, audio processing, timeseries forecasting, recommender systems, etc. + + + +## API for TensorFlow + +Intel(R) Neural Compressor provides `quantize_model` and `autotune` as main interfaces for supported algorithms on TensorFlow framework. + + +**quantize_model** + +The design philosophy of the `quantize_model` interface is easy-of-use. With minimal parameters requirement, including `model`, `quant_config`, `calib_dataloader` and `calib_iteration`, it offers a straightforward choice of quantizing TF model in one-shot. + +```python +def quantize_model( + model: Union[str, tf.keras.Model, BaseModel], + quant_config: Union[BaseConfig, list], + calib_dataloader: Callable = None, + calib_iteration: int = 100, +): +``` +`model` should be a string of the model's location, the object of Keras model or INC TF model wrapper class. + +`quant_config` is either the `StaticQuantConfig` object or a list contains `SmoothQuantConfig` and `StaticQuantConfig` to indicate what algorithm should be used and what specific quantization rules should be applied. + +`calib_dataloader` is used to load the data samples for calibration phase. In most cases, it could be the partial samples of the evaluation dataset. + +`calib_iteration` is used to decide how many iterations the calibration process will be run. + +Here is a simple example of using `quantize_model` interface with a dummy calibration dataloader and the default `StaticQuantConfig`: +```python +from neural_compressor.tensorflow import StaticQuantConfig, quantize_model +from neural_compressor.tensorflow.utils import DummyDataset + +dataset = DummyDataset(shape=(100, 32, 32, 3), label=True) +calib_dataloader = MyDataLoader(dataset=dataset) +quant_config = StaticQuantConfig() + +qmodel = quantize_model("fp32_model.pb", quant_config, calib_dataloader) +``` +**autotune** + +The `autotune` interface, on the other hand, provides greater flexibility and power. It's particularly useful when accuracy is a critical factor. If the initial quantization doesn't meet the tolerance of accuracy loss, `autotune` will iteratively try quantization rules according to the `tune_config`. + +Just like `quantize_model`, `autotune` requires `model`, `calib_dataloader` and `calib_iteration`. And the `eval_fn`, `eval_args` are used to build evaluation process. 
+ + + +```python +def autotune( + model: Union[str, tf.keras.Model, BaseModel], + tune_config: TuningConfig, + eval_fn: Callable, + eval_args: Optional[Tuple[Any]] = None, + calib_dataloader: Callable = None, + calib_iteration: int = 100, +) -> Optional[BaseModel]: +``` +`model` should be a string of the model's location, the object of Keras model or INC TF model wrapper class. + +`tune_config` is the `TuningConfig` object which contains multiple quantization rules. + +`eval_fn` is the evaluation function that measures the accuracy of a model. + +`eval_args` is the supplemental arguments required by the defined evaluation function. + +`calib_dataloader` is used to load the data samples for calibration phase. In most cases, it could be the partial samples of the evaluation dataset. + +`calib_iteration` is used to decide how many iterations the calibration process will be run. + +Here is a simple example of using `autotune` interface with different quantization rules defined by a list of `StaticQuantConfig`: +```python +from neural_compressor.common.base_tuning import TuningConfig +from neural_compressor.tensorflow import StaticQuantConfig, autotune + +calib_dataloader = MyDataloader(dataset=Dataset()) +custom_tune_config = TuningConfig( + config_set=[ + StaticQuantConfig(weight_sym=True, act_sym=True), + StaticQuantConfig(weight_sym=False, act_sym=False), + ] +) +best_model = autotune( + model="baseline_model", + tune_config=custom_tune_config, + eval_fn=eval_acc_fn, + calib_dataloader=calib_dataloader, + ) +``` + +### Support Matrix + +#### Quantization Scheme + +| Framework | Backend Library | Symmetric Quantization | Asymmetric Quantization | +| :-------------- |:---------------:| ---------------:|---------------:| +| TensorFlow | [oneDNN](https://github.com/oneapi-src/oneDNN) | Activation (int8/uint8), Weight (int8) | - | +| Keras | [ITEX](https://github.com/intel/intel-extension-for-tensorflow) | Activation (int8/uint8), Weight (int8) | - | + + ++ Symmetric Quantization + + int8: scale = 2 * max(abs(rmin), abs(rmax)) / (max(int8) - min(int8) - 1) + + uint8: scale = max(rmin, rmax) / (max(uint8) - min(uint8)) + + ++ oneDNN: [Lower Numerical Precision Deep Learning Inference and Training](https://software.intel.com/content/www/us/en/develop/articles/lower-numerical-precision-deep-learning-inference-and-training.html) + +#### Quantization Approaches + +The supported Quantization methods for TensorFlow and Keras are listed below: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Types | Quantization | Dataset Requirements | Framework | Backend |
|------------------------------------------|--------------------------|----------------------|------------|------------------------------|
| Post-Training Static Quantization (PTQ)  | weights and activations  | calibration          | Keras      | ITEX                         |
| Post-Training Static Quantization (PTQ)  | weights and activations  | calibration          | TensorFlow | TensorFlow/Intel TensorFlow  |
| Smooth Quantization (SQ)                 | weights                  | calibration          | TensorFlow | TensorFlow/Intel TensorFlow  |
+
+
+ +##### Post Training Static Quantization + +The min/max range in weights and activations are collected offline on a so-called `calibration` dataset. This dataset should be able to represent the data distribution of those unseen inference dataset. The `calibration` process runs on the original fp32 model and dumps out all the tensor distributions for `Scale` and `ZeroPoint` calculations. Usually preparing 100 samples are enough for calibration. +Refer to the [PTQ Guide](./Quant/TF_Quant.md) for detailed information. + +##### Smooth Quantization + +Smooth Quantization (SQ) is an advanced quantization technique designed to optimize model performance while maintaining high accuracy. Unlike traditional quantization methods that can lead to significant accuracy loss, SQ focuses on a more refined approach by taking a balance between the scale of activations and weights. +Refer to the [SQ Guide](./SQ/TF_SQ.md) for detailed information. + +#### Backend and Device +Intel(R) Neural Compressor supports TF GPU with [ITEX-XPU](https://github.com/intel/intel-extension-for-tensorflow). We will automatically run model on GPU by checking if it has been installed. + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Framework  | Backend    | Backend Library | Backend Value | Support Device(cpu as default) |
|------------|------------|-----------------|---------------|--------------------------------|
| TensorFlow | TensorFlow | OneDNN          | "default"     | cpu                            |
| TensorFlow | ITEX       | OneDNN          | "itex"        | cpu \| gpu                     |
+
+
+ +## Examples + +Users can refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api) on how to quantize a new model with INC 3x API. diff --git a/neural_compressor/tensorflow/__init__.py b/neural_compressor/tensorflow/__init__.py index e1e987f0a0d..678a02c83ba 100644 --- a/neural_compressor/tensorflow/__init__.py +++ b/neural_compressor/tensorflow/__init__.py @@ -14,6 +14,7 @@ from neural_compressor.tensorflow.utils import register_algo, Model from neural_compressor.tensorflow.quantization import ( + autotune, quantize_model, StaticQuantConfig, SmoothQuantConfig, From 4574891380a973da2a03d2ad8b871ac5dd92c85c Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Tue, 28 May 2024 06:45:03 +0000 Subject: [PATCH 02/13] [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --- docs/3x/Quant/TF_Quant.md | 39 +++++++++++++++++++++------------------ docs/3x/SQ/TF_SQ.md | 10 ++-------- docs/3x/TensorFlow.md | 2 +- 3 files changed, 24 insertions(+), 27 deletions(-) diff --git a/docs/3x/Quant/TF_Quant.md b/docs/3x/Quant/TF_Quant.md index d08e03e3809..f2cd6e27b74 100644 --- a/docs/3x/Quant/TF_Quant.md +++ b/docs/3x/Quant/TF_Quant.md @@ -94,7 +94,7 @@ This means user could leverage Intel(R) Neural Compressor to directly generate a # main.py # Original code -model = tf.keras.applications.resnet50.ResNet50(weights='imagenet') +model = tf.keras.applications.resnet50.ResNet50(weights="imagenet") val_dataset = ... val_dataloader = MyDataloader(dataset=val_dataset) @@ -118,13 +118,16 @@ This means user could leverage the advance feature of Intel(R) Neural Compressor # main.py # Original code -model = tf.keras.applications.resnet50.ResNet50(weights='imagenet') +model = tf.keras.applications.resnet50.ResNet50(weights="imagenet") val_dataset = ... val_dataloader = MyDataloader(dataset=val_dataset) + def eval_acc_fn(model) -> float: ... return acc + + # Quantization code from neural_compressor.common.base_tuning import TuningConfig from neural_compressor.tensorflow import autotune @@ -152,23 +155,23 @@ Intel(R) Neural Compressor support specify quantization rules by operator name o 1. Example of setting `local` from a dict ```python quant_config = { - "static_quant": { - "global": { - "weight_dtype": "int8", - "weight_sym": True, - "weight_granularity": "per_tensor", - "act_dtype": "int8", - "act_sym": True, - "act_granularity": "per_tensor", - }, - "local": { - "conv1": { - "weight_dtype": "fp32", - "act_dtype": "fp32", - } - }, + "static_quant": { + "global": { + "weight_dtype": "int8", + "weight_sym": True, + "weight_granularity": "per_tensor", + "act_dtype": "int8", + "act_sym": True, + "act_granularity": "per_tensor", + }, + "local": { + "conv1": { + "weight_dtype": "fp32", + "act_dtype": "fp32", } - } + }, + } +} config = StaticQuantConfig.from_dict(quant_config) ``` 2. 
Example of using `set_local` diff --git a/docs/3x/SQ/TF_SQ.md b/docs/3x/SQ/TF_SQ.md index 0e669f3e9ce..66d92fce2f9 100644 --- a/docs/3x/SQ/TF_SQ.md +++ b/docs/3x/SQ/TF_SQ.md @@ -119,11 +119,7 @@ To set a fixed alpha for the entire model, users can follow this example: from neural_compressor.tensorflow import SmoothQuantConfig, StaticQuantConfig quant_config = [SmoothQuantConfig(alpha=0.5), StaticQuantConfig()] -q_model = quantize_model( - output_graph_def, - [sq_config, static_config], - calib_dataloader -) +q_model = quantize_model(output_graph_def, [sq_config, static_config], calib_dataloader) ``` The `SmoothQuantConfig` should be combined with `StaticQuantConfig` in a list because we still need to insert QDQ and apply pattern fusion after the smoothing process. @@ -136,9 +132,7 @@ Here is an example: ```python from neural_compressor.tensorflow import StaticQuantConfig, SmoothQuantConfig -custom_tune_config = TuningConfig( - config_set=[SmoothQuantConfig(alpha=[0.5, 0.6, 0.7]), StaticQuantConfig()] -) +custom_tune_config = TuningConfig(config_set=[SmoothQuantConfig(alpha=[0.5, 0.6, 0.7]), StaticQuantConfig()]) best_model = autotune( model="fp32_model", tune_config=custom_tune_config, diff --git a/docs/3x/TensorFlow.md b/docs/3x/TensorFlow.md index 843e2010b2c..8709919848e 100644 --- a/docs/3x/TensorFlow.md +++ b/docs/3x/TensorFlow.md @@ -107,7 +107,7 @@ best_model = autotune( tune_config=custom_tune_config, eval_fn=eval_acc_fn, calib_dataloader=calib_dataloader, - ) +) ``` ### Support Matrix From 163e13b4c6955a5abb10947bb0299811eb9a45a4 Mon Sep 17 00:00:00 2001 From: zehao-intel Date: Tue, 28 May 2024 14:49:12 +0800 Subject: [PATCH 03/13] add enter before link to sq and ptq Signed-off-by: zehao-intel --- docs/3x/TensorFlow.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/3x/TensorFlow.md b/docs/3x/TensorFlow.md index 8709919848e..e4dd9df0fee 100644 --- a/docs/3x/TensorFlow.md +++ b/docs/3x/TensorFlow.md @@ -8,7 +8,7 @@ TensorFlow - [Quantization Scheme](#quantization-scheme) - [Quantization Approaches](#quantization-approaches) - [Post Training Static Quantization](#post-training-static-quantization) - - [Post Training Static Quantization](#post-training-static-quantization-1) + - [Smooth Quantization](#smooth-quantization) - [Backend and Device](#backend-and-device) - [Examples](#examples) @@ -167,11 +167,13 @@ The supported Quantization methods for TensorFlow and Keras are listed below: ##### Post Training Static Quantization The min/max range in weights and activations are collected offline on a so-called `calibration` dataset. This dataset should be able to represent the data distribution of those unseen inference dataset. The `calibration` process runs on the original fp32 model and dumps out all the tensor distributions for `Scale` and `ZeroPoint` calculations. Usually preparing 100 samples are enough for calibration. + Refer to the [PTQ Guide](./Quant/TF_Quant.md) for detailed information. ##### Smooth Quantization Smooth Quantization (SQ) is an advanced quantization technique designed to optimize model performance while maintaining high accuracy. Unlike traditional quantization methods that can lead to significant accuracy loss, SQ focuses on a more refined approach by taking a balance between the scale of activations and weights. + Refer to the [SQ Guide](./SQ/TF_SQ.md) for detailed information. 
#### Backend and Device From ec67776e02afbec4bb60999451a764a96ce42507 Mon Sep 17 00:00:00 2001 From: zehao-intel Date: Thu, 30 May 2024 17:18:17 +0800 Subject: [PATCH 04/13] add quantization.md for fundamental knowledge Signed-off-by: zehao-intel --- docs/3x/Quant/TF_Quant.md | 92 +++++---------------------- docs/3x/SQ/TF_SQ.md | 122 +++--------------------------------- docs/3x/TensorFlow.md | 20 +++--- docs/3x/quantization.md | 115 +++++++++++++++++++++++++++++++++ docs/source/quantization.md | 2 +- 5 files changed, 147 insertions(+), 204 deletions(-) create mode 100644 docs/3x/quantization.md diff --git a/docs/3x/Quant/TF_Quant.md b/docs/3x/Quant/TF_Quant.md index f2cd6e27b74..9330915a728 100644 --- a/docs/3x/Quant/TF_Quant.md +++ b/docs/3x/Quant/TF_Quant.md @@ -1,85 +1,19 @@ -Quantization +TensorFlow Quantization =============== -1. [Quantization Introduction](#quantization-introduction) -2. [Quantization Fundamentals](#quantization-fundamentals) -3. [Accuracy Aware Tuning](#accuracy-aware-tuning) +1. [Introduction](#introduction) +2. [Usage](#usage) + 2.1 [Without Accuracy Aware Tuning](#without-accuracy-aware-tuning) + 2.2 [With Accuracy Aware Tuning](#with-accuracy-aware-tuning) + 2.3 [Specify Quantization Rules](#specify-quantization-rules) +3. [Examples](#examples) -4. [Get Started](#get-started) - 5.1 [Without Accuracy Aware Tuning](#without-accuracy-aware-tuning) - 5.2 [With Accuracy Aware Tuning](#with-accuracy-aware-tuning) - 5.3 [Specify Quantization Rules](#specify-quantization-rules) +## Introduction -## Quantization Introduction +The INC 3x New API supports quantizing both TensorFlow and Keras model with or without accuracy aware tuning. -Quantization is a very popular deep learning model optimization technique invented for improving the speed of inference. It minimizes the number of bits required by converting a set of real-valued numbers into the lower bit data representation, such as int8 and int4, mainly on inference phase with minimal to no loss in accuracy. This way reduces the memory requirement, cache miss rate, and computational cost of using neural networks and finally achieve the goal of higher inference performance. On Intel 3rd Gen Intel® Xeon® Scalable Processors, user could expect up to 4x theoretical performance speedup. We expect further performance improvement with [Intel® Advanced Matrix Extensions](https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/overview.html) on 4th Gen Intel® Xeon® Scalable Processors. - -## Quantization Fundamentals - -`Affine quantization` and `Scale quantization` are two common range mapping techniques used in tensor conversion between different data types. - -For TensorFlow, all quantizable operators support `Scale quantization`, while a parts of operators support `Affine quantization`. For Keras, the quantizable layers only support `Scale quantization`. - -The math equation is like: $$X_{int8} = round(Scale \times X_{fp32} + ZeroPoint)$$. - -**Affine Quantization** - -This is so-called `asymmetric quantization`, in which we map the min/max range in the float tensor to the integer range. Here int8 range is [-128, 127], uint8 range is [0, 255]. - -here: - -If INT8 is specified, $Scale = (|X_{f_{max}} - X_{f_{min}}|) / 127$ and $ZeroPoint = -128 - X_{f_{min}} / Scale$. - -or - -If UINT8 is specified, $Scale = (|X_{f_{max}} - X_{f_{min}}|) / 255$ and $ZeroPoint = - X_{f_{min}} / Scale$. 
- -**Scale Quantization** - -This is so-called `Symmetric quantization`, in which we use the maximum absolute value in the float tensor as float range and map to the corresponding integer range. - -The math equation is like: - -here: - -If INT8 is specified, $Scale = max(abs(X_{f_{max}}), abs(X_{f_{min}})) / 127$ and $ZeroPoint = 0$. - -or - -If UINT8 is specified, $Scale = max(abs(X_{f_{max}}), abs(X_{f_{min}})) / 255$ and $ZeroPoint = 128$. - - -> ***Note*** -> Sometimes the reduce_range feature, that's using 7 bit width (1 sign bit + 6 data bits) to represent int8 range, may be needed on some early Xeon platforms, it's because those platforms may have overflow issues due to fp16 intermediate calculation result when executing int8 dot product operation. After AVX512_VNNI instruction is introduced, this issue gets solved by supporting fp32 intermediate data. - - - -### Quantization Approaches - -Quantization has three different approaches: -1) post training dynamic quantization -2) post training static quantization -3) quantization aware training. - -Currently, only `post training static quantization` is supported by INC TF 3X API. For this approach, the min/max range in weights and activations are collected offline on a so-called `calibration` dataset. This dataset should be able to represent the data distribution of those unseen inference dataset. The `calibration` process runs on the original fp32 model and dumps out all the tensor distributions for `Scale` and `ZeroPoint` calculations. Usually preparing 100 samples are enough for calibration. - -This approach is major quantization approach people should try because it could provide the better performance comparing with `post training dynamic quantization`. - - -## Accuracy Aware Tuning - -Accuracy aware tuning is one of unique features provided by Intel(R) Neural Compressor, compared with other 3rd party model compression tools. This feature can be used to solve accuracy loss pain points brought by applying low precision quantization and other lossy optimization methods. - -This tuning algorithm creates a tuning space by querying framework quantization capability and model structure, selects the ops to be quantized by the tuning strategy, generates quantized graph, and evaluates the accuracy of this quantized graph. The optimal model will be yielded if the pre-defined accuracy goal is met. - -Neural compressor also support to quantize all quantizable ops without accuracy tuning, user can decide whether to tune the model accuracy or not. Please refer to "Get Start" below. - -### Working Flow - -User could refer to below chart to understand the whole tuning flow. - -accuracy aware tuning working flow +For the detialed quantization fundamentals, please refer to the document for [Quantization](../quantization.md). ## Get Started @@ -112,7 +46,7 @@ qmodel.save("./output") ### With Accuracy Aware Tuning -This means user could leverage the advance feature of Intel(R) Neural Compressor to tune out a best quantized model which has best accuracy and good performance. User should provide either `eval_fn` and `eval_args`. +This means user could leverage the advance feature of Intel(R) Neural Compressor to tune out a best quantized model which has best accuracy and good performance. User should provide `eval_fn` and `eval_args`. 
``` python # main.py @@ -183,3 +117,7 @@ conv2d_config = StaticQuantConfig( ) quant_config.set_local("conv1", conv2d_config) ``` + +## Examples + +Users can also refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/tensorflow) on how to quantize a TensorFlow model with INC 3x API. \ No newline at end of file diff --git a/docs/3x/SQ/TF_SQ.md b/docs/3x/SQ/TF_SQ.md index 66d92fce2f9..315bb48c5ab 100644 --- a/docs/3x/SQ/TF_SQ.md +++ b/docs/3x/SQ/TF_SQ.md @@ -1,118 +1,23 @@ # Smooth Quant -1. [Introduction](#Introduction) -2. [Quantization Fundamentals](#Quantization-Fundamentals) -3. [SmoothQuant and Our Enhancement](#SmoothQuant-and-Our-Enhancement) -4. [Usage](#Usage) -5. [Reference](#reference) +1. [Introduction](#introduction) +2. [Usage](#usage) + 2.1 [Using a Fixed alpha](#using-a-fixed-alpha) + 2.2 [Determining the alpha through auto-tuning](#determining-the-alpha-through-auto-tuning) +3. [Examples](#examples) ## Introduction Quantization is a common compression operation to reduce memory and accelerate inference by converting the floating point matrix to an integer matrix. For large language models (LLMs) with gigantic parameters, the systematic outliers make quantification of activations difficult. [SmoothQuant](https://arxiv.org/abs/2211.10438), a training free post-training quantization (PTQ) solution, offline migrates this difficulty from activations to weights with a mathematically equivalent transformation. -## Quantization Fundamentals - -Quantization is a common compression operation to reduce memory and accelerate inference; therefore, the difficulty of LLM deployment can be alleviated. Quantization converts the floating point matrix to an integer matrix. - -The equation of quantization is as follows: - -$$ -X_{int8} = round(X_{fp32}/S) + Z \tag{1} -$$ - -where $X_{fp32}$ is the input matrix, $S$ is the scale factor, $Z$ is the integer zero point. - -### Per-tensor & Per-channel - -There are several choices of sharing quantization parameters among tensor elements, also called quantization granularity. The coarsest level, per-tensor granularity, is that all elements in the tensor share the same quantization parameters. Finer granularity means sharing quantization parameters per row or per column for 2D matrices and per channel for 3D matrices. Similarly, the finest granularity is that each element has an individual set of quantization parameters. - - -However, due to the model accuracy and computational consumption, per-tensor or per-channel are usually adopted. **Through mathematical calculations, per-channel could bring lower quantization loss but has some limitations, that is why normally we use per-channel for weight quantization and per-tensor for activation/input quantization** - -#### Per-channel limitation - -Though per-channel quantization could bring lower quantization error, we could not apply it for activations due to the difficulty of the dequantization. We would prove it in the following image and the zero point of quantization would be ignored for simplicity. - -The image on the left presents a normal linear forward with 1x2 input $x$ and 2x2 weight $w$. The results $y$ could be easily obtained by simple mathematics. In the middle image, we apply per-tensor quantization for activations and per-channel quantization for weights; the results after quantization that are denoted by $y_1$ and $y_2$, could be easily dequantized to the float results $y_{fp1}$ and $y_{fp2}$ by per channel scale $1.0/s_1s_x$ and $1.0/s_2s_x$. 
However, after applying per-channel quantization for activations (right image), we could not dequantize the $y_1$ and $y_2$ to float results. - -
- -
- - -## SmoothQuant and Our Enhancement - -### SmoothQuant - -In the previous subsection, we have explained why per-channel quantization could not be applied for s, even though it could lead to lower quantization loss. However, the quantization error loss of activations plays an important role in the accuracy loss of model quantization[^2][^3][^4]. - - - -To reduce the quantization loss of activations, lots of methods have been proposed. In the following, we briefly introduce SPIQ[^2], Outlier Suppression[^3] and Smoothquant[^4]. All these three methods share a similar idea to migrate the difficulty from activation quantization to weight quantization but differ in how much difficulty to be transferred. - - -So **the first question is how to migrate the difficulty from activation to weights?** The solution is straightforward, that is to convert the network to an output equivalent network that is presented in the image below and apply quantization to this equivalent network. The intuition is that each channel of activations could be scaled to make it more quantization-friendly, similar to a fake per-channel activation quantization. - -
- -
- - -Please note that this conversion will make the quantization of weights more difficult, because the scales attached to weights shown above are per-input-channel, while quantization of weights is per-output-channel or per-tensor. - -So **the second question is how much difficulty to be migrated**, that is how to choose the **conversion per-channel scale** $s_{x1}$ and $s_{x2}$ from the above image. Different works adopt different ways. - -*SPIQ* just adopts the quantization scale of activations as the conversion per-channel scale. - -*Outlier suppression* adopts the scale of the preceding layernorm as the conversion per-channel scale. - -*Smoothquant* introduces a hyperparameter $\alpha$ as a smooth factor to calculate the conversion per-channel scale and balance the quantization difficulty of activations and weights. - -$$ -s_j = max(|X_j|)^\alpha/max(|W_j|)^{1-\alpha} \tag{4} -$$ - -j is the index of the input channels. - - - -
- -
- - - -For most of the models such as OPT and BLOOM, $\alpha = 0.5$ is a well-balanced value to split the difficulty of weight and activation quantization. A larger $\alpha$ value could be used on models with more significant activation outliers to migrate more quantization difficulty to weights. - - -### Our enhancement: - -#### Algorithm: Auto-tuning of $\alpha$. - -SmoothQuant method aims to split the quantization difficulty of weights and activations by using a fixed-value $\alpha$ for an entire model. However, as the distributions of activation outliers vary not only across different models but also across different layers within a model, we hereby propose a method to obtain layer-wise optimal $\alpha$ values with the ability to tune automatically. - -Our proposed method consists of 7 major steps: - -- Calculate input minimum and maximum values of operators to be smoothed. -- Find a list of operators on which smoothquant could be performed. -- Set a $\alpha$ value based on user-defined $\alpha$ values. -- Calculate smoothing factor using the current $\alpha$ value, adjust parameters accordingly and forward the adjusted model given an input sample. -- Perform per-channel quantization_dequantization of weights and per-tensor quantization_dequantization of activations to predict output. -- Calculate the accuracy loss with respect to FP32 output, iterate the previous three steps given each $\alpha$ value and save the loss per alpha. -- Stop iterating if the maximum times of trial is reached and output the quantized model with a minimum accuracy loss. - - - -Multiple criteria (e.g min, max and mean) are supported to determine the $\alpha$ value of an input LayerNorm op of a transformer block. Both alpha range and criterion could be configured in auto_alpha_args. - -In our experiments, an $\alpha$ range of [0.0, 1.0] with a step_size of 0.1 is found to be well-balanced one for the majority of models. +Please refer to the document of [Smooth Quant](../quantization.md/#smooth-quant) for detailed fundamental knowledge. ## Usage There are two ways to apply smooth quantization: 1) using a fixed `alpha` for the entire model or 2) determining the `alpha` through auto-tuning. -### Using a fixed `alpha` +### Using a Fixed `alpha` To set a fixed alpha for the entire model, users can follow this example: ```python @@ -142,15 +47,6 @@ best_model = autotune( ``` > Please note that, it may a considerable amount of time as the tuning process applies each `alpha` to the entire model and uses the evaluation result on the entire dataset as the metric to determine the best `alpha`. -## Reference - -[^1]: Jason, Wei, et al. "Emergent Abilities of Large Language Models". Published in Transactions on Machine Learning Research (2022). - - -[^2]: Yvinec, Edouard, et al. "SPIQ: Data-Free Per-Channel Static Input Quantization." Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2023. - - -[^3]: Wei, Xiuying, et al. "Outlier suppression: Pushing the limit of low-bit transformer language models." arXiv preprint arXiv:2209.13325 (2022). - +## Examples -[^4]: Xiao, Guangxuan, et al. "Smoothquant: Accurate and efficient post-training quantization for large language models." arXiv preprint arXiv:2211.10438 (2022). +Users can also refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/tensorflow/nlp/large_language_models\quantization\ptq\smoothquant) on how to apply smooth quant to a TensorFlow model with INC 3x API. 
diff --git a/docs/3x/TensorFlow.md b/docs/3x/TensorFlow.md index e4dd9df0fee..0f6edcf0e7a 100644 --- a/docs/3x/TensorFlow.md +++ b/docs/3x/TensorFlow.md @@ -1,16 +1,13 @@ TensorFlow =============== -- [TensorFlow](#tensorflow) - - [Introduction](#introduction) - - [API for TensorFlow](#api-for-tensorflow) - - [Support Matrix](#support-matrix) - - [Quantization Scheme](#quantization-scheme) - - [Quantization Approaches](#quantization-approaches) - - [Post Training Static Quantization](#post-training-static-quantization) - - [Smooth Quantization](#smooth-quantization) - - [Backend and Device](#backend-and-device) - - [Examples](#examples) + +1. [Introduction](#introduction) +2. [API for TensorFlow](#api-for-tensorflow) +3. [Support Matrix](#support-matrix) + 3.1 [Quantization Scheme](#quantization-scheme) + 3.2 [Quantization Approaches](#quantization-approaches) + 3.3 [Backend and Device](#backend-and-device) ## Introduction @@ -208,6 +205,3 @@ Intel(R) Neural Compressor supports TF GPU with [ITEX-XPU](https://github.com/in

-## Examples - -Users can refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api) on how to quantize a new model with INC 3x API. diff --git a/docs/3x/quantization.md b/docs/3x/quantization.md new file mode 100644 index 00000000000..cae764022ea --- /dev/null +++ b/docs/3x/quantization.md @@ -0,0 +1,115 @@ +Quantization +=============== + +1. Quantization + 1.1 [Quantization Introduction](#quantization-introduction) + 1.2 [Quantization Fundamentals](#quantization-fundamentals) + 1.3 [Accuracy Aware Tuning](#accuracy-aware-tuning) +1. [Smooth Quant](#smooth-quant) +2. [WOQ](#woq) + +## Quantization Introduction + +Quantization is a very popular deep learning model optimization technique invented for improving the speed of inference. It minimizes the number of bits required by converting a set of real-valued numbers into the lower bit data representation, such as int8 and int4, mainly on inference phase with minimal to no loss in accuracy. This way reduces the memory requirement, cache miss rate, and computational cost of using neural networks and finally achieve the goal of higher inference performance. On Intel 3rd Gen Intel® Xeon® Scalable Processors, user could expect up to 4x theoretical performance speedup. We expect further performance improvement with [Intel® Advanced Matrix Extensions](https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/overview.html) on 4th Gen Intel® Xeon® Scalable Processors. + +## Quantization Fundamentals + +`Affine quantization` and `Scale quantization` are two common range mapping techniques used in tensor conversion between different data types. + +The math equation is like: $$X_{int8} = round(Scale \times X_{fp32} + ZeroPoint)$$. + +**Affine Quantization** + +This is so-called `asymmetric quantization`, in which we map the min/max range in the float tensor to the integer range. Here int8 range is [-128, 127], uint8 range is [0, 255]. + +here: + +If INT8 is specified, $Scale = (|X_{f_{max}} - X_{f_{min}}|) / 127$ and $ZeroPoint = -128 - X_{f_{min}} / Scale$. + +or + +If UINT8 is specified, $Scale = (|X_{f_{max}} - X_{f_{min}}|) / 255$ and $ZeroPoint = - X_{f_{min}} / Scale$. + +**Scale Quantization** + +This is so-called `Symmetric quantization`, in which we use the maximum absolute value in the float tensor as float range and map to the corresponding integer range. + +The math equation is like: + +here: + +If INT8 is specified, $Scale = max(abs(X_{f_{max}}), abs(X_{f_{min}})) / 127$ and $ZeroPoint = 0$. + +or + +If UINT8 is specified, $Scale = max(abs(X_{f_{max}}), abs(X_{f_{min}})) / 255$ and $ZeroPoint = 128$. + +*NOTE* + +Sometimes the reduce_range feature, that's using 7 bit width (1 sign bit + 6 data bits) to represent int8 range, may be needed on some early Xeon platforms, it's because those platforms may have overflow issues due to fp16 intermediate calculation result when executing int8 dot product operation. After AVX512_VNNI instruction is introduced, this issue gets solved by supporting fp32 intermediate data. 
+ + + +#### Quantization Scheme in TensorFlow ++ Symmetric Quantization + + int8: scale = 2 * max(abs(rmin), abs(rmax)) / (max(int8) - min(int8) - 1) + + uint8: scale = max(rmin, rmax) / (max(uint8) - min(uint8)) + +#### Quantization Scheme in PyTorch ++ Symmetric Quantization + + int8: scale = max(abs(rmin), abs(rmax)) / (float(max(int8) - min(int8)) / 2) + + uint8: scale = max(abs(rmin), abs(rmax)) / (float(max(int8) - min(int8)) / 2) ++ Asymmetric Quantization + + uint8: scale = (rmax - rmin) / (max(uint8) - min(uint8)); zero_point = min(uint8) - round(rmin / scale) + +#### Quantization Scheme in IPEX ++ Symmetric Quantization + + int8: scale = 2 * max(abs(rmin), abs(rmax)) / (max(int8) - min(int8) - 1) + + uint8: scale = max(rmin, rmax) / (max(uint8) - min(uint8)) + +### Quantization Approaches + +Quantization has three different approaches: +1) post training dynamic quantization +2) post training static quantization +3) quantization aware training. + +The first two approaches belong to optimization on inference. The last belongs to optimization during training. + +#### Post Training Dynamic Quantization + +The weights of the neural network get quantized into int8 format from float32 format offline. The activations of the neural network is quantized as well with the min/max range collected during inference runtime. + +This approach is widely used in dynamic length neural networks, like NLP model. + +#### Post Training Static Quantization + +Compared with `post training dynamic quantization`, the min/max range in weights and activations are collected offline on a so-called `calibration` dataset. This dataset should be able to represent the data distribution of those unseen inference dataset. The `calibration` process runs on the original fp32 model and dumps out all the tensor distributions for `Scale` and `ZeroPoint` calculations. Usually preparing 100 samples are enough for calibration. + +This approach is major quantization approach people should try because it could provide the better performance comparing with `post training dynamic quantization`. + +#### Quantization Aware Training + +Quantization aware training emulates inference-time quantization in the forward pass of the training process by inserting `fake quant` ops before those quantizable ops. With `quantization aware training`, all weights and activations are `fake quantized` during both the forward and backward passes of training: that is, float values are rounded to mimic int8 values, but all computations are still done with floating point numbers. Thus, all the weight adjustments during training are made while aware of the fact that the model will ultimately be quantized; after quantizing, therefore, this method will usually yield higher accuracy than either dynamic quantization or post-training static quantization. + +## Accuracy Aware Tuning + +Accuracy aware tuning is one of unique features provided by Intel(R) Neural Compressor, compared with other 3rd party model compression tools. This feature can be used to solve accuracy loss pain points brought by applying low precision quantization and other lossy optimization methods. + +This tuning algorithm creates a tuning space by querying framework quantization capability and model structure, selects the ops to be quantized by the tuning strategy, generates quantized graph, and evaluates the accuracy of this quantized graph. The optimal model will be yielded if the pre-defined accuracy goal is met. The `autotune` serves as a main interface of this algorithm. 
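As a rough mental model only (this is not the actual tuning strategy, API, or internals of Intel Neural Compressor), the loop described above can be sketched like this: quantize with each candidate configuration from the tuning space, evaluate it, and return as soon as the accuracy goal relative to the FP32 baseline is met.

```python
# Conceptual sketch of an accuracy-aware tuning loop; every name below is an
# illustrative placeholder rather than Intel Neural Compressor code.
def accuracy_aware_tune(fp32_model, config_set, quantize_fn, eval_fn, relative_loss=0.01):
    baseline = eval_fn(fp32_model)                     # accuracy of the original fp32 model
    best_model, best_acc = None, float("-inf")
    for config in config_set:                          # candidate quantization rules (tuning space)
        candidate = quantize_fn(fp32_model, config)    # generate a quantized graph
        acc = eval_fn(candidate)                       # evaluate the quantized graph
        if acc >= baseline * (1 - relative_loss):      # pre-defined accuracy goal met
            return candidate
        if acc > best_acc:                             # otherwise keep the best found so far
            best_model, best_acc = candidate, acc
    return best_model
```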
+ +Neural compressor also support to quantize all quantizable ops without accuracy tuning, using `quantize_model` interace to achieve that. + +### Working Flow + +For supported quantization methods for `accuracy aware tuning` and the detailed API usage, please refer to the document of [PyTorch](./pytorch.md) or [TensorFlow](./tensorflow.md) respectively. + +User could refer to below chart to understand the whole tuning flow. + +accuracy aware tuning working flow + + +# Smooth Quant + + +# WOQ diff --git a/docs/source/quantization.md b/docs/source/quantization.md index eac1d5dce6f..69d8d71c022 100644 --- a/docs/source/quantization.md +++ b/docs/source/quantization.md @@ -121,7 +121,7 @@ This approach is major quantization approach people should try because it could Quantization aware training emulates inference-time quantization in the forward pass of the training process by inserting `fake quant` ops before those quantizable ops. With `quantization aware training`, all weights and activations are `fake quantized` during both the forward and backward passes of training: that is, float values are rounded to mimic int8 values, but all computations are still done with floating point numbers. Thus, all the weight adjustments during training are made while aware of the fact that the model will ultimately be quantized; after quantizing, therefore, this method will usually yield higher accuracy than either dynamic quantization or post-training static quantization. -## With or Without Accuracy Aware Tuning +## Accuracy Aware Tuning Accuracy aware tuning is one of unique features provided by Intel(R) Neural Compressor, compared with other 3rd party model compression tools. This feature can be used to solve accuracy loss pain points brought by applying low precision quantization and other lossy optimization methods. From 2c46ca4e61498cb05c759f3cff77bd51b57a0c2d Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Thu, 30 May 2024 09:19:47 +0000 Subject: [PATCH 05/13] [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --- docs/3x/Quant/TF_Quant.md | 4 ++-- docs/3x/TensorFlow.md | 1 - 2 files changed, 2 insertions(+), 3 deletions(-) diff --git a/docs/3x/Quant/TF_Quant.md b/docs/3x/Quant/TF_Quant.md index 9330915a728..fe50be0dec5 100644 --- a/docs/3x/Quant/TF_Quant.md +++ b/docs/3x/Quant/TF_Quant.md @@ -13,7 +13,7 @@ TensorFlow Quantization The INC 3x New API supports quantizing both TensorFlow and Keras model with or without accuracy aware tuning. -For the detialed quantization fundamentals, please refer to the document for [Quantization](../quantization.md). +For the detailed quantization fundamentals, please refer to the document for [Quantization](../quantization.md). ## Get Started @@ -120,4 +120,4 @@ quant_config.set_local("conv1", conv2d_config) ## Examples -Users can also refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/tensorflow) on how to quantize a TensorFlow model with INC 3x API. \ No newline at end of file +Users can also refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/tensorflow) on how to quantize a TensorFlow model with INC 3x API. diff --git a/docs/3x/TensorFlow.md b/docs/3x/TensorFlow.md index 0f6edcf0e7a..73b624097b8 100644 --- a/docs/3x/TensorFlow.md +++ b/docs/3x/TensorFlow.md @@ -204,4 +204,3 @@ Intel(R) Neural Compressor supports TF GPU with [ITEX-XPU](https://github.com/in

- From b0b274bc582e53c9878a65aef89d45ed7b01f32c Mon Sep 17 00:00:00 2001 From: zehao-intel Date: Fri, 31 May 2024 13:24:52 +0800 Subject: [PATCH 06/13] modify structure of folders Signed-off-by: zehao-intel --- docs/3x/{Quant => }/TF_Quant.md | 0 docs/3x/{SQ => }/TF_SQ.md | 0 docs/3x/TensorFlow.md | 19 +++++++++++-------- 3 files changed, 11 insertions(+), 8 deletions(-) rename docs/3x/{Quant => }/TF_Quant.md (100%) rename docs/3x/{SQ => }/TF_SQ.md (100%) diff --git a/docs/3x/Quant/TF_Quant.md b/docs/3x/TF_Quant.md similarity index 100% rename from docs/3x/Quant/TF_Quant.md rename to docs/3x/TF_Quant.md diff --git a/docs/3x/SQ/TF_SQ.md b/docs/3x/TF_SQ.md similarity index 100% rename from docs/3x/SQ/TF_SQ.md rename to docs/3x/TF_SQ.md diff --git a/docs/3x/TensorFlow.md b/docs/3x/TensorFlow.md index 0f6edcf0e7a..a563310282c 100644 --- a/docs/3x/TensorFlow.md +++ b/docs/3x/TensorFlow.md @@ -2,12 +2,15 @@ TensorFlow =============== -1. [Introduction](#introduction) -2. [API for TensorFlow](#api-for-tensorflow) -3. [Support Matrix](#support-matrix) - 3.1 [Quantization Scheme](#quantization-scheme) - 3.2 [Quantization Approaches](#quantization-approaches) - 3.3 [Backend and Device](#backend-and-device) +- [TensorFlow](#tensorflow) + - [Introduction](#introduction) + - [API for TensorFlow](#api-for-tensorflow) + - [Support Matrix](#support-matrix) + - [Quantization Scheme](#quantization-scheme) + - [Quantization Approaches](#quantization-approaches) + - [Post Training Static Quantization](#post-training-static-quantization) + - [Smooth Quantization](#smooth-quantization) + - [Backend and Device](#backend-and-device) ## Introduction @@ -165,13 +168,13 @@ The supported Quantization methods for TensorFlow and Keras are listed below: The min/max range in weights and activations are collected offline on a so-called `calibration` dataset. This dataset should be able to represent the data distribution of those unseen inference dataset. The `calibration` process runs on the original fp32 model and dumps out all the tensor distributions for `Scale` and `ZeroPoint` calculations. Usually preparing 100 samples are enough for calibration. -Refer to the [PTQ Guide](./Quant/TF_Quant.md) for detailed information. +Refer to the [PTQ Guide](./TF_Quant.md) for detailed information. ##### Smooth Quantization Smooth Quantization (SQ) is an advanced quantization technique designed to optimize model performance while maintaining high accuracy. Unlike traditional quantization methods that can lead to significant accuracy loss, SQ focuses on a more refined approach by taking a balance between the scale of activations and weights. -Refer to the [SQ Guide](./SQ/TF_SQ.md) for detailed information. +Refer to the [SQ Guide](./TF_SQ.md) for detailed information. #### Backend and Device Intel(R) Neural Compressor supports TF GPU with [ITEX-XPU](https://github.com/intel/intel-extension-for-tensorflow). We will automatically run model on GPU by checking if it has been installed. 
From 855256f14a473e790937f59b1b92c8e840dfbcdb Mon Sep 17 00:00:00 2001 From: zehao-intel Date: Fri, 31 May 2024 14:14:55 +0800 Subject: [PATCH 07/13] fix typo Signed-off-by: zehao-intel --- docs/3x/TensorFlow.md | 16 +++++++--------- docs/3x/quantization.md | 2 +- 2 files changed, 8 insertions(+), 10 deletions(-) diff --git a/docs/3x/TensorFlow.md b/docs/3x/TensorFlow.md index 86df8de8a23..092a9f76fa5 100644 --- a/docs/3x/TensorFlow.md +++ b/docs/3x/TensorFlow.md @@ -2,15 +2,12 @@ TensorFlow =============== -- [TensorFlow](#tensorflow) - - [Introduction](#introduction) - - [API for TensorFlow](#api-for-tensorflow) - - [Support Matrix](#support-matrix) - - [Quantization Scheme](#quantization-scheme) - - [Quantization Approaches](#quantization-approaches) - - [Post Training Static Quantization](#post-training-static-quantization) - - [Smooth Quantization](#smooth-quantization) - - [Backend and Device](#backend-and-device) +1. [Introduction](#introduction) +2. [API for TensorFlow](#api-for-tensorflow) +3. [Support Matrix](#support-matrix) + 3.1 [Quantization Scheme](#quantization-scheme) + 3.2 [Quantization Approaches](#quantization-approaches) + 3.3[Backend and Device](#backend-and-device) ## Introduction @@ -207,3 +204,4 @@ Intel(R) Neural Compressor supports TF GPU with [ITEX-XPU](https://github.com/in

+ diff --git a/docs/3x/quantization.md b/docs/3x/quantization.md index cae764022ea..e21d530993b 100644 --- a/docs/3x/quantization.md +++ b/docs/3x/quantization.md @@ -98,7 +98,7 @@ Accuracy aware tuning is one of unique features provided by Intel(R) Neural Comp This tuning algorithm creates a tuning space by querying framework quantization capability and model structure, selects the ops to be quantized by the tuning strategy, generates quantized graph, and evaluates the accuracy of this quantized graph. The optimal model will be yielded if the pre-defined accuracy goal is met. The `autotune` serves as a main interface of this algorithm. -Neural compressor also support to quantize all quantizable ops without accuracy tuning, using `quantize_model` interace to achieve that. +Neural compressor also support to quantize all quantizable ops without accuracy tuning, using `quantize_model` interface to achieve that. ### Working Flow From 9853a921818520cf3819679c093e054f4c890996 Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Fri, 31 May 2024 06:16:22 +0000 Subject: [PATCH 08/13] [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --- docs/3x/TensorFlow.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/3x/TensorFlow.md b/docs/3x/TensorFlow.md index 092a9f76fa5..ec77726a223 100644 --- a/docs/3x/TensorFlow.md +++ b/docs/3x/TensorFlow.md @@ -204,4 +204,3 @@ Intel(R) Neural Compressor supports TF GPU with [ITEX-XPU](https://github.com/in

- From 6ddb040d2951bb90ec3d91188055c1150e0332aa Mon Sep 17 00:00:00 2001 From: zehao-intel Date: Fri, 31 May 2024 14:26:37 +0800 Subject: [PATCH 09/13] fix heading Signed-off-by: zehao-intel --- docs/3x/TF_SQ.md | 4 ++-- docs/3x/TensorFlow.md | 8 ++++---- docs/3x/quantization.md | 6 +++--- 3 files changed, 9 insertions(+), 9 deletions(-) diff --git a/docs/3x/TF_SQ.md b/docs/3x/TF_SQ.md index 315bb48c5ab..3955ca10f6c 100644 --- a/docs/3x/TF_SQ.md +++ b/docs/3x/TF_SQ.md @@ -2,8 +2,8 @@ 1. [Introduction](#introduction) 2. [Usage](#usage) - 2.1 [Using a Fixed alpha](#using-a-fixed-alpha) - 2.2 [Determining the alpha through auto-tuning](#determining-the-alpha-through-auto-tuning) + 2.1 [Using a Fixed alpha](#using-a-fixed-alpha) + 2.2 [Determining the alpha through auto-tuning](#determining-the-alpha-through-auto-tuning) 3. [Examples](#examples) diff --git a/docs/3x/TensorFlow.md b/docs/3x/TensorFlow.md index 092a9f76fa5..d0969ece93b 100644 --- a/docs/3x/TensorFlow.md +++ b/docs/3x/TensorFlow.md @@ -4,10 +4,10 @@ TensorFlow 1. [Introduction](#introduction) 2. [API for TensorFlow](#api-for-tensorflow) -3. [Support Matrix](#support-matrix) - 3.1 [Quantization Scheme](#quantization-scheme) - 3.2 [Quantization Approaches](#quantization-approaches) - 3.3[Backend and Device](#backend-and-device) +3. [Support Matrix](#support-matrix) + 3.1 [Quantization Scheme](#quantization-scheme) + 3.2 [Quantization Approaches](#quantization-approaches) + 3.3 [Backend and Device](#backend-and-device) ## Introduction diff --git a/docs/3x/quantization.md b/docs/3x/quantization.md index e21d530993b..e533a7c8a52 100644 --- a/docs/3x/quantization.md +++ b/docs/3x/quantization.md @@ -2,9 +2,9 @@ Quantization =============== 1. Quantization - 1.1 [Quantization Introduction](#quantization-introduction) - 1.2 [Quantization Fundamentals](#quantization-fundamentals) - 1.3 [Accuracy Aware Tuning](#accuracy-aware-tuning) + 1.1 [Quantization Introduction](#quantization-introduction) + 1.2 [Quantization Fundamentals](#quantization-fundamentals) + 1.3 [Accuracy Aware Tuning](#accuracy-aware-tuning) 1. [Smooth Quant](#smooth-quant) 2. [WOQ](#woq) From ebb23fa4357e365b50646308330791d6ea56340f Mon Sep 17 00:00:00 2001 From: "Huang, Tai" Date: Fri, 31 May 2024 14:40:58 +0800 Subject: [PATCH 10/13] Update TensorFlow.md outline heading --- docs/3x/TensorFlow.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/3x/TensorFlow.md b/docs/3x/TensorFlow.md index 27577f99192..52287dedcf0 100644 --- a/docs/3x/TensorFlow.md +++ b/docs/3x/TensorFlow.md @@ -5,9 +5,9 @@ TensorFlow 1. [Introduction](#introduction) 2. [API for TensorFlow](#api-for-tensorflow) 3. [Support Matrix](#support-matrix) - 3.1 [Quantization Scheme](#quantization-scheme) - 3.2 [Quantization Approaches](#quantization-approaches) - 3.3 [Backend and Device](#backend-and-device) + 3.1 [Quantization Scheme](#quantization-scheme) + 3.2 [Quantization Approaches](#quantization-approaches) + 3.3 [Backend and Device](#backend-and-device) ## Introduction From 488abb805a65ea3084556f6effb72f67d8e71f73 Mon Sep 17 00:00:00 2001 From: "Huang, Tai" Date: Fri, 31 May 2024 14:41:55 +0800 Subject: [PATCH 11/13] Update TF_SQ.md update outline heading --- docs/3x/TF_SQ.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/3x/TF_SQ.md b/docs/3x/TF_SQ.md index 3955ca10f6c..b74a114329d 100644 --- a/docs/3x/TF_SQ.md +++ b/docs/3x/TF_SQ.md @@ -1,9 +1,9 @@ # Smooth Quant 1. [Introduction](#introduction) -2. 
[Usage](#usage) - 2.1 [Using a Fixed alpha](#using-a-fixed-alpha) - 2.2 [Determining the alpha through auto-tuning](#determining-the-alpha-through-auto-tuning) +2. [Usage](#usage) + 2.1 [Using a Fixed alpha](#using-a-fixed-alpha) + 2.2 [Determining the alpha through auto-tuning](#determining-the-alpha-through-auto-tuning) 3. [Examples](#examples) From b1fffd7ea5c500f9e0cd954611ca17fae8755470 Mon Sep 17 00:00:00 2001 From: "Huang, Tai" Date: Fri, 31 May 2024 14:42:39 +0800 Subject: [PATCH 12/13] Update quantization.md update outline heading --- docs/3x/quantization.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/3x/quantization.md b/docs/3x/quantization.md index e533a7c8a52..190b9d33da3 100644 --- a/docs/3x/quantization.md +++ b/docs/3x/quantization.md @@ -1,12 +1,12 @@ Quantization =============== -1. Quantization - 1.1 [Quantization Introduction](#quantization-introduction) - 1.2 [Quantization Fundamentals](#quantization-fundamentals) - 1.3 [Accuracy Aware Tuning](#accuracy-aware-tuning) -1. [Smooth Quant](#smooth-quant) -2. [WOQ](#woq) +1. Quantization + 1.1 [Quantization Introduction](#quantization-introduction) + 1.2 [Quantization Fundamentals](#quantization-fundamentals) + 1.3 [Accuracy Aware Tuning](#accuracy-aware-tuning) +1. [Smooth Quant](#smooth-quant) +2. [WOQ](#woq) ## Quantization Introduction From bce2abf3c1db976c8ae3c2c4fa6131cdf5c80fae Mon Sep 17 00:00:00 2001 From: zehao-intel Date: Fri, 31 May 2024 15:55:44 +0800 Subject: [PATCH 13/13] refine index order of quantization.md Signed-off-by: zehao-intel --- docs/3x/quantization.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/3x/quantization.md b/docs/3x/quantization.md index 190b9d33da3..de8e43828a3 100644 --- a/docs/3x/quantization.md +++ b/docs/3x/quantization.md @@ -5,8 +5,8 @@ Quantization 1.1 [Quantization Introduction](#quantization-introduction) 1.2 [Quantization Fundamentals](#quantization-fundamentals) 1.3 [Accuracy Aware Tuning](#accuracy-aware-tuning) -1. [Smooth Quant](#smooth-quant) -2. [WOQ](#woq) +2. [Smooth Quant](#smooth-quant) +3. [WOQ](#woq) ## Quantization Introduction