From 4dbf71e412a370f09809db89db27a0b7c7b56d14 Mon Sep 17 00:00:00 2001
From: zehao-intel
Date: Fri, 31 May 2024 16:18:51 +0800
Subject: [PATCH] Upload Documents of INC TF 3x New API (#1822)

* Documents of INC TF 3x New API

Signed-off-by: zehao-intel

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add enter before link to sq and ptq

Signed-off-by: zehao-intel

* add quantization.md for fundamental knowledge

Signed-off-by: zehao-intel

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* modify structure of folders

Signed-off-by: zehao-intel

* fix typo

Signed-off-by: zehao-intel

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix heading

Signed-off-by: zehao-intel

* Update TensorFlow.md

outline heading

* Update TF_SQ.md

update outline heading

* Update quantization.md

update outline heading

* refine index order of quantization.md

Signed-off-by: zehao-intel

---------

Signed-off-by: zehao-intel
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Huang, Tai
---
 docs/3x/TF_Quant.md                      | 123 ++++++++++++++
 docs/3x/TF_SQ.md                         |  52 ++++++
 docs/3x/TensorFlow.md                    | 206 +++++++++++++++++++++++
 docs/3x/quantization.md                  | 115 +++++++++++++
 docs/source/quantization.md              |   2 +-
 neural_compressor/tensorflow/__init__.py |   1 +
 6 files changed, 498 insertions(+), 1 deletion(-)
 create mode 100644 docs/3x/TF_Quant.md
 create mode 100644 docs/3x/TF_SQ.md
 create mode 100644 docs/3x/TensorFlow.md
 create mode 100644 docs/3x/quantization.md

diff --git a/docs/3x/TF_Quant.md b/docs/3x/TF_Quant.md
new file mode 100644
index 00000000000..fe50be0dec5
--- /dev/null
+++ b/docs/3x/TF_Quant.md
@@ -0,0 +1,123 @@

TensorFlow Quantization
===============

1. [Introduction](#introduction)
2. [Usage](#usage)
   2.1 [Without Accuracy Aware Tuning](#without-accuracy-aware-tuning)
   2.2 [With Accuracy Aware Tuning](#with-accuracy-aware-tuning)
   2.3 [Specify Quantization Rules](#specify-quantization-rules)
3. [Examples](#examples)

## Introduction

The INC 3x new API supports quantizing both TensorFlow and Keras models, with or without accuracy aware tuning.

For detailed quantization fundamentals, please refer to the [Quantization](../quantization.md) document.


## Usage


### Without Accuracy Aware Tuning


In this mode, users leverage Intel(R) Neural Compressor to directly generate a fully quantized model without accuracy aware tuning. It is the user's responsibility to ensure that the accuracy of the quantized model meets expectations.

``` python
# main.py

# Original code
model = tf.keras.applications.resnet50.ResNet50(weights="imagenet")
val_dataset = ...
val_dataloader = MyDataloader(dataset=val_dataset)

# Quantization code
from neural_compressor.tensorflow import quantize_model, StaticQuantConfig

quant_config = StaticQuantConfig()
qmodel = quantize_model(
    model=model,
    quant_config=quant_config,
    calib_dataloader=val_dataloader,
)
qmodel.save("./output")
```

### With Accuracy Aware Tuning

In this mode, users leverage the advanced features of Intel(R) Neural Compressor to tune out the best quantized model, one that keeps accuracy while delivering good performance. Users should provide `eval_fn` and `eval_args`.

``` python
# main.py

# Original code
model = tf.keras.applications.resnet50.ResNet50(weights="imagenet")
val_dataset = ...
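# NOTE: MyDataloader is a user-defined helper rather than an INC API; it should
# batch val_dataset into (input, label) pairs for the calibration step below.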
val_dataloader = MyDataloader(dataset=val_dataset)


def eval_acc_fn(model) -> float:
    ...
    return acc


# Quantization code
from neural_compressor.common.base_tuning import TuningConfig
from neural_compressor.tensorflow import StaticQuantConfig, autotune

# custom_tune_config can also be defined as:
# TuningConfig(StaticQuantConfig(weight_sym=[True, False], act_sym=[True, False]))
custom_tune_config = TuningConfig(
    config_set=[
        StaticQuantConfig(weight_sym=True, act_sym=True),
        StaticQuantConfig(weight_sym=False, act_sym=False),
    ]
)
best_model = autotune(
    model=model,
    tune_config=custom_tune_config,
    eval_fn=eval_acc_fn,
    calib_dataloader=val_dataloader,
)
best_model.save("./output")
```

### Specify Quantization Rules
Intel(R) Neural Compressor supports specifying quantization rules by operator name or operator type. Users can set the `local` key in a config dict or use the `set_local` method of a config class for this purpose.

1. Example of setting `local` from a dict:
```python
quant_config = {
    "static_quant": {
        "global": {
            "weight_dtype": "int8",
            "weight_sym": True,
            "weight_granularity": "per_tensor",
            "act_dtype": "int8",
            "act_sym": True,
            "act_granularity": "per_tensor",
        },
        "local": {
            "conv1": {
                "weight_dtype": "fp32",
                "act_dtype": "fp32",
            }
        },
    }
}
config = StaticQuantConfig.from_dict(quant_config)
```
2. Example of using `set_local`:
```python
quant_config = StaticQuantConfig()
conv2d_config = StaticQuantConfig(
    weight_dtype="fp32",
    act_dtype="fp32",
)
quant_config.set_local("conv1", conv2d_config)
```

## Examples

Users can also refer to the [examples](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/tensorflow) for how to quantize a TensorFlow model with the INC 3x API.

diff --git a/docs/3x/TF_SQ.md b/docs/3x/TF_SQ.md
new file mode 100644
index 00000000000..b74a114329d
--- /dev/null
+++ b/docs/3x/TF_SQ.md
@@ -0,0 +1,52 @@
# Smooth Quant

1. [Introduction](#introduction)
2. [Usage](#usage)
   2.1 [Using a Fixed alpha](#using-a-fixed-alpha)
   2.2 [Determining the alpha through auto-tuning](#determining-the-alpha-through-auto-tuning)
3. [Examples](#examples)


## Introduction

Quantization is a common compression technique that reduces memory and accelerates inference by converting floating point matrices to integer matrices. For large language models (LLMs) with gigantic parameters, systematic outliers make quantization of activations difficult. [SmoothQuant](https://arxiv.org/abs/2211.10438), a training-free post-training quantization (PTQ) solution, migrates this difficulty from activations to weights offline with a mathematically equivalent transformation.

Please refer to the [Smooth Quant](../quantization.md#smooth-quant) document for detailed fundamental knowledge.


## Usage
There are two ways to apply smooth quantization: 1) using a fixed `alpha` for the entire model or 2) determining the `alpha` through auto-tuning.

### Using a Fixed `alpha`
To set a fixed alpha for the entire model, users can follow this example:

```python
from neural_compressor.tensorflow import SmoothQuantConfig, StaticQuantConfig, quantize_model

quant_config = [SmoothQuantConfig(alpha=0.5), StaticQuantConfig()]
q_model = quantize_model(output_graph_def, quant_config, calib_dataloader)
```
The `SmoothQuantConfig` should be combined with `StaticQuantConfig` in a list because we still need to insert QDQ ops and apply pattern fusion after the smoothing process.
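The snippet above assumes `output_graph_def` (the fp32 model) and `calib_dataloader` are already defined. Throughout these documents the examples pass a user-defined loader such as `MyDataloader`; the following is a minimal sketch of what such a class might look like (illustrative only, not part of the INC API):

```python
import math

import numpy as np


class MyDataloader:
    """Minimal calibration/evaluation loader yielding (input_batch, label_batch) tuples."""

    def __init__(self, dataset, batch_size=1):
        self.dataset = dataset
        self.batch_size = batch_size
        self.length = math.ceil(len(dataset) / batch_size)

    def __iter__(self):
        images, labels = [], []
        for image, label in self.dataset:
            images.append(image)
            labels.append(label)
            if len(images) == self.batch_size:
                yield np.stack(images), np.stack(labels)
                images, labels = [], []
        if images:  # yield the last, possibly smaller, batch
            yield np.stack(images), np.stack(labels)

    def __len__(self):
        return self.length
```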
### Determining the `alpha` through auto-tuning
Users can search for the best `alpha` for the entire model. The tuning process looks for the optimal `alpha` value from a list of `alpha` values provided by the user.

Here is an example:

```python
from neural_compressor.common.base_tuning import TuningConfig
from neural_compressor.tensorflow import StaticQuantConfig, SmoothQuantConfig, autotune

custom_tune_config = TuningConfig(config_set=[SmoothQuantConfig(alpha=[0.5, 0.6, 0.7]), StaticQuantConfig()])
best_model = autotune(
    model="fp32_model",
    tune_config=custom_tune_config,
    eval_fn=eval_fn_wrapper,
    calib_dataloader=calib_dataloader,
)
```
> Please note that this may take a considerable amount of time, as the tuning process applies each `alpha` to the entire model and uses the evaluation result on the entire dataset as the metric to determine the best `alpha`.

## Examples

Users can also refer to the [examples](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/tensorflow/nlp/large_language_models/quantization/ptq/smoothquant) for how to apply smooth quant to a TensorFlow model with the INC 3x API.

diff --git a/docs/3x/TensorFlow.md b/docs/3x/TensorFlow.md
new file mode 100644
index 00000000000..52287dedcf0
--- /dev/null
+++ b/docs/3x/TensorFlow.md
@@ -0,0 +1,206 @@
TensorFlow
===============


1. [Introduction](#introduction)
2. [API for TensorFlow](#api-for-tensorflow)
3. [Support Matrix](#support-matrix)
   3.1 [Quantization Scheme](#quantization-scheme)
   3.2 [Quantization Approaches](#quantization-approaches)
   3.3 [Backend and Device](#backend-and-device)

## Introduction
[TensorFlow](https://www.tensorflow.org/) is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of [tools](https://www.tensorflow.org/resources/tools), [libraries](https://www.tensorflow.org/resources/libraries-extensions), and [community](https://www.tensorflow.org/community) resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML-powered applications. It provides stable [Python](https://www.tensorflow.org/api_docs/python) and [C++](https://www.tensorflow.org/api_docs/cc) APIs, as well as a non-guaranteed backward compatible API for [other languages](https://www.tensorflow.org/api_docs).

Keras is a multi-backend deep learning framework, supporting JAX, TensorFlow, and PyTorch. It serves as a dependency of TensorFlow and provides a high-level API for effortlessly building and training models for computer vision, natural language processing, audio processing, timeseries forecasting, recommender systems, etc.



## API for TensorFlow

Intel(R) Neural Compressor provides `quantize_model` and `autotune` as the main interfaces for the supported algorithms on the TensorFlow framework.


**quantize_model**

The design philosophy of the `quantize_model` interface is ease of use. With a minimal set of parameters, namely `model`, `quant_config`, `calib_dataloader` and `calib_iteration`, it offers a straightforward way to quantize a TF model in one shot.

```python
def quantize_model(
    model: Union[str, tf.keras.Model, BaseModel],
    quant_config: Union[BaseConfig, list],
    calib_dataloader: Callable = None,
    calib_iteration: int = 100,
):
```
`model` should be a string of the model's location, a Keras model object, or an INC TF model wrapper object.

`quant_config` is either a `StaticQuantConfig` object or a list containing a `SmoothQuantConfig` and a `StaticQuantConfig`; it indicates which algorithm should be used and which specific quantization rules should be applied.

`calib_dataloader` is used to load the data samples for the calibration phase. In most cases, a subset of the evaluation dataset is sufficient.

`calib_iteration` is used to decide how many iterations the calibration process runs.

Here is a simple example of using the `quantize_model` interface with a dummy calibration dataloader and the default `StaticQuantConfig`:
```python
from neural_compressor.tensorflow import StaticQuantConfig, quantize_model
from neural_compressor.tensorflow.utils import DummyDataset

dataset = DummyDataset(shape=(100, 32, 32, 3), label=True)
calib_dataloader = MyDataLoader(dataset=dataset)
quant_config = StaticQuantConfig()

qmodel = quantize_model("fp32_model.pb", quant_config, calib_dataloader)
```
**autotune**

The `autotune` interface, on the other hand, provides greater flexibility and power. It is particularly useful when accuracy is a critical factor. If the initial quantization doesn't meet the accuracy-loss tolerance, `autotune` will iteratively try quantization rules according to the `tune_config`.

Just like `quantize_model`, `autotune` requires `model`, `calib_dataloader` and `calib_iteration`, while `eval_fn` and `eval_args` are used to build the evaluation process.
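For illustration, a minimal `eval_fn` could simply compute top-1 accuracy over a validation dataloader. The sketch below assumes the candidate model is callable like a Keras model and that `val_dataloader` yields `(inputs, labels)` batches; the names are illustrative, not part of the INC API:

```python
import numpy as np


def eval_acc_fn(model) -> float:
    """Return top-1 accuracy of `model` over a user-defined validation dataloader."""
    correct, total = 0, 0
    for inputs, labels in val_dataloader:
        preds = np.argmax(model(inputs), axis=-1)
        correct += int(np.sum(preds == np.asarray(labels)))
        total += len(labels)
    return correct / total
```

`autotune` calls such a function on every candidate quantized model; its full signature is shown below.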
```python
def autotune(
    model: Union[str, tf.keras.Model, BaseModel],
    tune_config: TuningConfig,
    eval_fn: Callable,
    eval_args: Optional[Tuple[Any]] = None,
    calib_dataloader: Callable = None,
    calib_iteration: int = 100,
) -> Optional[BaseModel]:
```
`model` should be a string of the model's location, a Keras model object, or an INC TF model wrapper object.

`tune_config` is the `TuningConfig` object which contains multiple quantization rules.

`eval_fn` is the evaluation function that measures the accuracy of a model.

`eval_args` holds the supplemental arguments required by the defined evaluation function.

`calib_dataloader` is used to load the data samples for the calibration phase. In most cases, a subset of the evaluation dataset is sufficient.

`calib_iteration` is used to decide how many iterations the calibration process runs.

Here is a simple example of using the `autotune` interface with different quantization rules defined by a list of `StaticQuantConfig`:
```python
from neural_compressor.common.base_tuning import TuningConfig
from neural_compressor.tensorflow import StaticQuantConfig, autotune

calib_dataloader = MyDataloader(dataset=Dataset())
custom_tune_config = TuningConfig(
    config_set=[
        StaticQuantConfig(weight_sym=True, act_sym=True),
        StaticQuantConfig(weight_sym=False, act_sym=False),
    ]
)
best_model = autotune(
    model="baseline_model",
    tune_config=custom_tune_config,
    eval_fn=eval_acc_fn,
    calib_dataloader=calib_dataloader,
)
```

## Support Matrix

### Quantization Scheme

| Framework | Backend Library | Symmetric Quantization | Asymmetric Quantization |
| :-------------- |:---------------:| ---------------:|---------------:|
| TensorFlow | [oneDNN](https://github.com/oneapi-src/oneDNN) | Activation (int8/uint8), Weight (int8) | - |
| Keras | [ITEX](https://github.com/intel/intel-extension-for-tensorflow) | Activation (int8/uint8), Weight (int8) | - |


+ Symmetric Quantization
    + int8: scale = 2 * max(abs(rmin), abs(rmax)) / (max(int8) - min(int8) - 1)
    + uint8: scale = max(rmin, rmax) / (max(uint8) - min(uint8))


+ oneDNN: [Lower Numerical Precision Deep Learning Inference and Training](https://software.intel.com/content/www/us/en/develop/articles/lower-numerical-precision-deep-learning-inference-and-training.html)

### Quantization Approaches

The supported quantization methods for TensorFlow and Keras are listed below:
| Types | Quantization | Dataset Requirements | Framework | Backend |
| :--- | :--- | :--- | :--- | :--- |
| Post-Training Static Quantization (PTQ) | weights and activations | calibration | Keras | ITEX |
| Post-Training Static Quantization (PTQ) | weights and activations | calibration | TensorFlow | TensorFlow/Intel TensorFlow |
| Smooth Quantization (SQ) | weights | calibration | TensorFlow | TensorFlow/Intel TensorFlow |
#### Post Training Static Quantization

The min/max ranges of weights and activations are collected offline on a so-called `calibration` dataset. This dataset should represent the data distribution of the unseen inference datasets. The `calibration` process runs on the original fp32 model and dumps out all the tensor distributions for `Scale` and `ZeroPoint` calculations. Usually, around 100 samples are enough for calibration.

Refer to the [PTQ Guide](./TF_Quant.md) for detailed information.

#### Smooth Quantization

Smooth Quantization (SQ) is an advanced quantization technique designed to optimize model performance while maintaining high accuracy. Unlike traditional quantization methods that can lead to significant accuracy loss, SQ takes a more refined approach by balancing the scales of activations and weights.

Refer to the [SQ Guide](./TF_SQ.md) for detailed information.

### Backend and Device

Intel(R) Neural Compressor supports TF GPU through [ITEX-XPU](https://github.com/intel/intel-extension-for-tensorflow). Models are automatically run on GPU when ITEX-XPU is detected as installed.
| Framework | Backend | Backend Library | Backend Value | Support Device (cpu as default) |
| :--- | :--- | :--- | :--- | :--- |
| TensorFlow | TensorFlow | OneDNN | "default" | cpu |
| TensorFlow | ITEX | OneDNN | "itex" | cpu \| gpu |
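For the Keras/ITEX path listed above, the same `quantize_model` interface accepts a Keras model object directly. Below is a minimal, self-contained sketch; `RandomCalibLoader` is an illustrative stand-in for a real calibration dataloader and is not part of the INC API:

```python
import numpy as np
import tensorflow as tf

from neural_compressor.tensorflow import StaticQuantConfig, quantize_model


class RandomCalibLoader:
    """Illustrative calibration loader yielding random (image, label) batches."""

    def __init__(self, batch_size=10, num_batches=10):
        self.batch_size = batch_size
        self.num_batches = num_batches

    def __iter__(self):
        for _ in range(self.num_batches):
            images = np.random.rand(self.batch_size, 224, 224, 3).astype("float32")
            labels = np.random.randint(0, 1000, size=self.batch_size)
            yield images, labels

    def __len__(self):
        return self.num_batches


model = tf.keras.applications.MobileNetV2(weights="imagenet")  # any Keras model
qmodel = quantize_model(model, StaticQuantConfig(), RandomCalibLoader())
qmodel.save("./quantized_keras_model")
```

Real calibration data should of course come from the target dataset; random inputs only exercise the code path.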
diff --git a/docs/3x/quantization.md b/docs/3x/quantization.md
new file mode 100644
index 00000000000..de8e43828a3
--- /dev/null
+++ b/docs/3x/quantization.md
@@ -0,0 +1,115 @@
Quantization
===============

1. Quantization
   1.1 [Quantization Introduction](#quantization-introduction)
   1.2 [Quantization Fundamentals](#quantization-fundamentals)
   1.3 [Accuracy Aware Tuning](#accuracy-aware-tuning)
2. [Smooth Quant](#smooth-quant)
3. [WOQ](#woq)

## Quantization Introduction

Quantization is a very popular deep learning model optimization technique invented for improving the speed of inference. It minimizes the number of bits required by converting a set of real-valued numbers into a lower-bit data representation, such as int8 and int4, mainly during the inference phase, with minimal to no loss in accuracy. This reduces the memory requirement, cache miss rate, and computational cost of using neural networks, and ultimately achieves higher inference performance. On 3rd Gen Intel® Xeon® Scalable Processors, users can expect up to a 4x theoretical performance speedup. We expect further performance improvements with [Intel® Advanced Matrix Extensions](https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/overview.html) on 4th Gen Intel® Xeon® Scalable Processors.

## Quantization Fundamentals

`Affine quantization` and `Scale quantization` are two common range mapping techniques used in tensor conversion between different data types.

The mapping is: $$X_{int8} = round(X_{fp32}/Scale + ZeroPoint)$$.

**Affine Quantization**

This is so-called `asymmetric quantization`, in which we map the min/max range of the float tensor to the integer range. Here the int8 range is [-128, 127] and the uint8 range is [0, 255],

where:

If INT8 is specified, $Scale = (|X_{f_{max}} - X_{f_{min}}|) / 127$ and $ZeroPoint = -128 - X_{f_{min}} / Scale$,

or

If UINT8 is specified, $Scale = (|X_{f_{max}} - X_{f_{min}}|) / 255$ and $ZeroPoint = - X_{f_{min}} / Scale$.

**Scale Quantization**

This is so-called `symmetric quantization`, in which we use the maximum absolute value in the float tensor as the float range and map it to the corresponding integer range.

The mapping equation is the same as above, with the `Scale` and `ZeroPoint` computed as follows:

If INT8 is specified, $Scale = max(abs(X_{f_{max}}), abs(X_{f_{min}})) / 127$ and $ZeroPoint = 0$,

or

If UINT8 is specified, $Scale = max(abs(X_{f_{max}}), abs(X_{f_{min}})) / 255$ and $ZeroPoint = 128$.

*NOTE*

Sometimes the reduce_range feature (using a 7-bit width, 1 sign bit + 6 data bits, to represent the int8 range) may be needed on some early Xeon platforms, because those platforms may have overflow issues caused by the fp16 intermediate calculation results when executing int8 dot product operations. After the AVX512_VNNI instruction set was introduced, this issue was solved by supporting fp32 intermediate data.
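To make the affine (asymmetric) uint8 formulas above concrete, here is a small numeric sketch in NumPy. It is illustrative only; real backends additionally round `ZeroPoint` to an integer:

```python
import numpy as np

# Toy fp32 tensor with min = -1.0 and max = 3.0
x = np.array([-1.0, -0.5, 0.0, 1.5, 3.0], dtype=np.float32)

# Affine (asymmetric) uint8 quantization, following the definitions above:
#   Scale     = |x_max - x_min| / 255
#   ZeroPoint = -x_min / Scale
scale = np.abs(x.max() - x.min()) / 255.0
zero_point = -x.min() / scale  # 63.75 here; backends typically round this value

x_uint8 = np.clip(np.round(x / scale + zero_point), 0, 255).astype(np.uint8)
# x_uint8 -> [  0,  32,  64, 159, 255]

# Dequantize and check that the round-trip error stays within one quantization step
x_dequant = (x_uint8.astype(np.float32) - zero_point) * scale
assert np.all(np.abs(x_dequant - x) <= scale)
```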
### Quantization Scheme in TensorFlow
+ Symmetric Quantization
    + int8: scale = 2 * max(abs(rmin), abs(rmax)) / (max(int8) - min(int8) - 1)
    + uint8: scale = max(rmin, rmax) / (max(uint8) - min(uint8))

### Quantization Scheme in PyTorch
+ Symmetric Quantization
    + int8: scale = max(abs(rmin), abs(rmax)) / (float(max(int8) - min(int8)) / 2)
    + uint8: scale = max(abs(rmin), abs(rmax)) / (float(max(int8) - min(int8)) / 2)
+ Asymmetric Quantization
    + uint8: scale = (rmax - rmin) / (max(uint8) - min(uint8)); zero_point = min(uint8) - round(rmin / scale)

### Quantization Scheme in IPEX
+ Symmetric Quantization
    + int8: scale = 2 * max(abs(rmin), abs(rmax)) / (max(int8) - min(int8) - 1)
    + uint8: scale = max(rmin, rmax) / (max(uint8) - min(uint8))

### Quantization Approaches

Quantization has three different approaches:
1) post training dynamic quantization
2) post training static quantization
3) quantization aware training

The first two approaches are optimizations applied at inference time. The last one is applied during training.

#### Post Training Dynamic Quantization

The weights of the neural network get quantized into int8 format from float32 format offline. The activations of the neural network are quantized as well, with the min/max range collected during inference runtime.

This approach is widely used in dynamic-length neural networks, like NLP models.

#### Post Training Static Quantization

Compared with `post training dynamic quantization`, the min/max ranges of weights and activations are collected offline on a so-called `calibration` dataset. This dataset should represent the data distribution of the unseen inference datasets. The `calibration` process runs on the original fp32 model and dumps out all the tensor distributions for `Scale` and `ZeroPoint` calculations. Usually, around 100 samples are enough for calibration.

This is the main quantization approach users should try first, because it usually provides better performance than `post training dynamic quantization`.

#### Quantization Aware Training

Quantization aware training emulates inference-time quantization in the forward pass of the training process by inserting `fake quant` ops before those quantizable ops. With `quantization aware training`, all weights and activations are `fake quantized` during both the forward and backward passes of training: that is, float values are rounded to mimic int8 values, but all computations are still done with floating point numbers. Thus, all the weight adjustments during training are made while aware of the fact that the model will ultimately be quantized; after quantizing, therefore, this method will usually yield higher accuracy than either dynamic quantization or post-training static quantization.

## Accuracy Aware Tuning

Accuracy aware tuning is one of the unique features provided by Intel(R) Neural Compressor compared with other 3rd party model compression tools. This feature can be used to solve the accuracy-loss pain points brought by applying low precision quantization and other lossy optimization methods.

This tuning algorithm creates a tuning space by querying the framework quantization capability and the model structure, selects the ops to be quantized by the tuning strategy, generates a quantized graph, and evaluates the accuracy of this quantized graph. The optimal model is returned once the pre-defined accuracy goal is met. The `autotune` function serves as the main interface of this algorithm.
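Conceptually, the tuning loop described above can be sketched as follows. This is an illustrative simplification, not the actual Neural Compressor implementation; `quantize_with` is a hypothetical helper standing in for the framework-specific quantization step:

```python
def accuracy_aware_tuning(fp32_model, config_set, eval_fn, relative_loss_tolerance=0.01):
    """Illustrative sketch of the accuracy aware tuning loop."""
    baseline = eval_fn(fp32_model)
    best_model, best_acc = None, -1.0

    for quant_config in config_set:  # tuning space built from framework capability
        candidate = quantize_with(fp32_model, quant_config)  # hypothetical helper
        acc = eval_fn(candidate)
        if acc >= baseline * (1 - relative_loss_tolerance):
            return candidate  # accuracy goal met, stop tuning
        if acc > best_acc:
            best_model, best_acc = candidate, acc

    return best_model  # otherwise return the most accurate candidate found
```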
Neural Compressor also supports quantizing all quantizable ops without accuracy aware tuning; the `quantize_model` interface is used to achieve that.

### Working Flow

For the quantization methods that support `accuracy aware tuning` and the detailed API usage, please refer to the [PyTorch](./pytorch.md) or [TensorFlow](./tensorflow.md) documents respectively.

Users can refer to the chart below to understand the whole tuning flow.

*Figure: accuracy aware tuning working flow*


# Smooth Quant


# WOQ

diff --git a/docs/source/quantization.md b/docs/source/quantization.md
index eac1d5dce6f..69d8d71c022 100644
--- a/docs/source/quantization.md
+++ b/docs/source/quantization.md
@@ -121,7 +121,7 @@ This approach is major quantization approach people should try because it could
 
 Quantization aware training emulates inference-time quantization in the forward pass of the training process by inserting `fake quant` ops before those quantizable ops. With `quantization aware training`, all weights and activations are `fake quantized` during both the forward and backward passes of training: that is, float values are rounded to mimic int8 values, but all computations are still done with floating point numbers. Thus, all the weight adjustments during training are made while aware of the fact that the model will ultimately be quantized; after quantizing, therefore, this method will usually yield higher accuracy than either dynamic quantization or post-training static quantization.
 
-## With or Without Accuracy Aware Tuning
+## Accuracy Aware Tuning
 
 Accuracy aware tuning is one of unique features provided by Intel(R) Neural Compressor, compared with other 3rd party model compression tools. This feature can be used to solve accuracy loss pain points brought by applying low precision quantization and other lossy optimization methods.

diff --git a/neural_compressor/tensorflow/__init__.py b/neural_compressor/tensorflow/__init__.py
index e1e987f0a0d..678a02c83ba 100644
--- a/neural_compressor/tensorflow/__init__.py
+++ b/neural_compressor/tensorflow/__init__.py
@@ -14,6 +14,7 @@ from neural_compressor.tensorflow.utils import register_algo, Model
 from neural_compressor.tensorflow.quantization import (
+    autotune,
     quantize_model,
     StaticQuantConfig,
     SmoothQuantConfig,