diff --git a/.coverage b/.coverage new file mode 100644 index 00000000000..02b5b52790b Binary files /dev/null and b/.coverage differ diff --git a/docs/3x/PT_DynamicQuant.md b/docs/3x/PT_DynamicQuant.md new file mode 100644 index 00000000000..a907cae93b7 --- /dev/null +++ b/docs/3x/PT_DynamicQuant.md @@ -0,0 +1,42 @@ +Dynamic Quantization +=============== + +1. [Introduction](#introduction) +2. [Getting Started with Dynamic Quantization](#Getting-Started-with-Dynamic-Quantization) +3. [Examples](#examples) + + +## Introduction +Quantization is the process of converting floating point weights and activations to lower bitwidth tensors by multiplying the floating point values by a scale factor and rounding the results to whole numbers. Dynamic quantization determines the scale factor for activations dynamically based on the data range observed at runtime. We support W8A8 (quantizing weights and activations into 8 bits) dynamic quantization by leveraging torch's [`X86InductorQuantizer`](https://pytorch.org/tutorials/prototype/pt2e_quant_x86_inductor.html?highlight=x86inductorquantizer). + + +## Getting Started with Dynamic Quantization +There are four steps to perform W8A8 dynamic quantization: `export`, `prepare`, `convert` and `compile`. + +```python +import torch +from neural_compressor.torch.export import export +from neural_compressor.torch.quantization import DynamicQuantConfig, prepare, convert + +# Prepare the float model and example inputs for export model +model = UserFloatModel() +example_inputs = ... + +# Export eager model into FX graph model +exported_model = export(model=model, example_inputs=example_inputs) +# Quantize the model +quant_config = DynamicQuantConfig() +prepared_model = prepare(exported_model, quant_config=quant_config) +q_model = convert(prepared_model) +# Compile the quantized model and replace the Q/DQ pattern with Q-operator +from torch._inductor import config + +config.freezing = True +opt_model = torch.compile(q_model) +``` + +> Note: The `set_local` of `DynamicQuantConfig` will be supported after the torch 2.4 release. + + +## Examples +Example will be added later. diff --git a/docs/3x/PT_MXQuant.md b/docs/3x/PT_MXQuant.md new file mode 100644 index 00000000000..1cfb17ff30b --- /dev/null +++ b/docs/3x/PT_MXQuant.md @@ -0,0 +1,107 @@ +Microscaling Quantization +=============== + +1. [Introduction](#introduction) +2. [Get Started with Microscaling Quantization API](#get-start-with-microscaling-quantization-api) +3. [Examples](#examples) +4. [Reference](#reference) + +## Introduction + +Numerous breakthroughs have emerged across various fields, such as text analysis, language translation and chatbot technologies, fueled by the development of large language models (LLMs). Nevertheless, their increasing power comes with the challenge of explosive growth in parameters, posing obstacles for practical use. To balance memory limits and accuracy preservation for AI models, the Microscaling (MX) specification was promoted from the well-known Microsoft Floating Point (MSFP) data type [1, 2]: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Format Name | Element Data Type | Element Bits | Scaling Block Size | Scale Data Type | Scale Bits |
|-------------|-------------------|--------------|--------------------|-----------------|------------|
| MXFP8       | FP8 (E5M2)        | 8            | 32                 | E8M0            | 8          |
| MXFP8       | FP8 (E4M3)        | 8            | 32                 | E8M0            | 8          |
| MXFP6       | FP6 (E3M2)        | 6            | 32                 | E8M0            | 8          |
| MXFP6       | FP6 (E2M3)        | 6            | 32                 | E8M0            | 8          |
| MXFP4       | FP4 (E2M1)        | 4            | 32                 | E8M0            | 8          |
| MXINT8      | INT8              | 8            | 32                 | E8M0            | 8          |
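To make the block scaling in the table above concrete, below is a minimal sketch (an illustration only, not the Neural Compressor implementation) of how a shared power-of-two E8M0 scale could be derived for one 32-element block, following the recipe `scale = 2^floor(log2(amax))` described later in this document:

```python
import torch

# One block of 32 elements shares a single E8M0 (power-of-two) scale.
block = torch.randn(32)

amax = block.abs().max()             # max absolute value of the block
exp = torch.floor(torch.log2(amax))  # shared exponent
scale = 2.0**exp                     # E8M0 scale = 2^exp

# Elements are divided by the shared scale before being cast to the low-bit
# element type (e.g. FP4 E2M1 for MXFP4); the cast itself is omitted here.
scaled_block = block / scale
```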
+ + +At an equivalent accuracy level, the MX data type demonstrates the ability to occupy a smaller area and incur lower energy costs for multiply-accumulate compared to other conventional data types on the same silicon [1]. + +Neural Compressor seamlessly applies the MX data type to post-training quantization, offering meticulously crafted recipes to empower users to quantize LLMs without sacrificing accuracy. The workflow is shown as below. + + + + Workflow of MX Quant (source [3]) + + + +The memory and computational limits of LLMs are more severe than other general neural networks, so our exploration focuses on LLMs first. The following table shows the basic MX quantization recipes in Neural Compressor and enumerates distinctions among various data types. The MX data type replaces general float scale with powers of two to be more hardware-friendly. It adapts a granularity falling between per-channel and per-tensor to balance accuracy and memory consumption. + +| | MX Format | INT8 | FP8 | +|------------|--------------|------------|------------| +| Scale | $2^{exp}$ | $\frac{MAX}{amax}$ | $\frac{MAX}{amax}$ | +| Zero point | 0 (None) | $2^{bits - 1}$ or $-min * scale$ | 0 (None) | +| Granularity | per-block (default blocksize is 32) | per-channel or per-tensor | per-channel or per-tensor | + +The exponent (exp) is equal to torch.floor(torch.log2(amax)), MAX is the representation range of the data type, amax is the max absolute value of per-block tensor, and rmin is the minimum value of the per-block tensor. + + +## Get Started with Microscaling Quantization API + +To get a model quantized with Microscaling Data Types, users can use the Microscaling Quantization API as follows. + +```python +from neural_compressor.torch.quantization import MXQuantConfig, prepare, convert + +quant_config = MXQuantConfig(w_dtype=args.w_dtype, act_dtype=args.act_dtype, weight_only=args.woq) +user_model = prepare(model=user_model, quant_config=quant_config) +user_model = convert(model=user_model) +``` + +## Examples + +- PyTorch [huggingface models](/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/mx) + + +## Reference + +[1]: Darvish Rouhani, Bita, et al. "Pushing the limits of narrow precision inferencing at cloud scale with microsoft floating point." Advances in neural information processing systems 33 (2020): 10271-10281 + +[2]: OCP Microscaling Formats (MX) Specification + +[3]: Rouhani, Bita Darvish, et al. "Microscaling Data Formats for Deep Learning." arXiv preprint arXiv:2310.10537 (2023). diff --git a/docs/3x/PT_MixPrecision.md b/docs/3x/PT_MixPrecision.md new file mode 100644 index 00000000000..c1cd198049b --- /dev/null +++ b/docs/3x/PT_MixPrecision.md @@ -0,0 +1,103 @@ +PyTorch Mixed Precision +======================================== + +1. [Introduction](#introduction) +2. [Mixed Precision Support Matrix](#mixed-precision-support-matrix) +3. [Get Started](#get-start) +4. [Examples](#examples) + +## Introduction + +The recent growth of Deep Learning has driven the development of more complex models that require significantly more compute and memory capabilities. Several low precision numeric formats have been proposed to address the problem. Google's [bfloat16](https://cloud.google.com/tpu/docs/bfloat16) and the [FP16: IEEE](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) half-precision format are two of the most widely used sixteen bit formats. 
[Mixed precision](https://arxiv.org/abs/1710.03740) training and inference using low precision formats have been developed to reduce compute and bandwidth requirements. + +The 3rd Gen Intel® Xeon® Scalable processor (codenamed Cooper Lake), featuring Intel® Deep Learning Boost, is the first general-purpose x86 CPU to support the bfloat16 format. Specifically, three new bfloat16 instructions are added as a part of the AVX512_BF16 extension within Intel Deep Learning Boost: VCVTNE2PS2BF16, VCVTNEPS2BF16, and VDPBF16PS. The first two instructions allow converting to and from bfloat16 data type, while the last one performs a dot product of bfloat16 pairs. Further details can be found in the [hardware numerics document](https://www.intel.com/content/www/us/en/developer/articles/technical/intel-deep-learning-boost-new-instruction-bfloat16.html) published by Intel. + +The 4th Gen Intel® Xeon® Scalable processor supports FP16 instruction set architecture (ISA) for Intel® +Advanced Vector Extensions 512 (Intel® AVX-512). The new ISA supports a wide range of general-purpose numeric +operations for 16-bit half-precision IEEE-754 floating-point and complements the existing 32-bit and 64-bit floating-point instructions already available in the Intel Xeon processor based products. Further details can be found in the [hardware numerics document](https://www.intel.com/content/www/us/en/content-details/669773/intel-avx-512-fp16-instruction-set-for-intel-xeon-processor-based-products-technology-guide.html) published by Intel. + +

+ *Figure: Architecture* +

+ +## Mixed Precision Support Matrix + + + + + + + + + + + + + + + + + + + + + + + + +
| Framework | Backend | Backend Library | Backend Value | Support Device (cpu as default) | Support BF16 | Support FP16 |
|-----------|---------|-----------------|---------------|---------------------------------|--------------|--------------|
| PyTorch   | FX      | FBGEMM          | "default"     | cpu                             | &#10004;     | &#10004;     |
+ + +### Hardware and Software requests for **BF16** +- PyTorch + 1. Hardware: CPU supports `avx512_bf16` instruction set. + 2. Software: torch >= [1.11.0](https://download.pytorch.org/whl/torch_stable.html). + + +### Hardware and Software requests for **FP16** +- PyTorch + 1. Hardware: CPU supports `avx512_fp16` instruction set. + 2. Software: torch >= [1.11.0](https://download.pytorch.org/whl/torch_stable.html). + + +### Accuracy-driven mixed precision +BF16/FP16 conversion may lead to accuracy drop. Intel® Neural Compressor provides an accuracy-driven tuning function to reduce accuracy loss, +which could fallback converted ops to FP32, if set in config, to get better accuracy. To enable this function, users only to provide +`eval_fn` and `eval_args` for `autotune`. +To be noticed, IPEX backend doesn't support accuracy-driven mixed precision. + +## Get Started with autotune API + +To get a bf16/fp16 model, users can use the `autotune` interface with `MixPrecisionConfig` as follows. + +- BF16: + +```python +from neural_compressor.torch.quantization import MixPrecisionConfig, TuningConfig, autotune + +def eval_acc_fn(model): + ...... + return acc + +# modules might be fallback to fp32 to get better accuracy +custom_tune_config = TuningConfig(config_set=[MixPrecisionConfig(dtype=["bf16", "fp32"])], max_trials=3) +best_model = autotune(model=build_torch_model(), tune_config=custom_tune_config, eval_fn=eval_acc_fn) +``` + +- FP16: + +```python +from neural_compressor.torch.quantization import MixPrecisionConfig, TuningConfig, autotune + +def eval_acc_fn(model): + ...... + return acc + +# modules might be fallback to fp32 to get better accuracy +custom_tune_config = TuningConfig(config_set=[MixPrecisionConfig(dtype=["fp16", "fp32"])], max_trials=3) +best_model = autotune(model=build_torch_model(), tune_config=custom_tune_config, eval_fn=eval_acc_fn) +``` + +## Examples + +Example will be added later. diff --git a/docs/3x/PT_SmoothQuant.md b/docs/3x/PT_SmoothQuant.md new file mode 100644 index 00000000000..9e4ae3eb62f --- /dev/null +++ b/docs/3x/PT_SmoothQuant.md @@ -0,0 +1,112 @@ +PyTorch Smooth Quantization +======================================== + +1. [Introduction](#Introduction) +2. [Usage](#Usage) +3. [Validated Models](#Validated-Models) +4. [Supported Framework Matrix](#Supported-Framework-Matrix) + + +## Introduction +Quantization is a common compression operation to reduce memory and accelerate inference by converting the floating point matrix to an integer matrix. For large language models (LLMs) with gigantic parameters, the systematic outliers make quantification of activations difficult. [SmoothQuant](https://arxiv.org/abs/2211.10438), a training free post-training quantization (PTQ) solution, offline migrates this difficulty from activations to weights with a mathematically equivalent transformation. + + +## Usage +### Fixed Alpha +To set a fixed alpha for the entire model, users can follow this example: + +```python +from neural_compressor.torch.quantization import SmoothQuantConfig, convert, prepare + + +def run_fn(model): + model(example_inputs) + + +quant_config = SmoothQuantConfig(alpha=0.5) +prepared_model = prepare(fp32_model, quant_config=quant_config, example_inputs=example_inputs) +run_fn(prepared_model) +q_model = convert(prepared_model) +``` +`SmoothQuantConfig` description: + +`alpha`: a smooth factor to calculate the conversion per-channel scale and balance the quantization difficulty of activation and weight. Float value, default is 0.5. 
+ +> **Note:** Alpha="auto" and alpha auto-tuning was supported in old API, please stay tuned for the new API's support for auto alpha. + +### Specify Quantization Rules +Intel(R) Neural Compressor support specify quantization rules by operator type for Smooth Quantization. Users can use `set_local` to fallback op type in `SmoothQuantConfig` to achieve the above purpose. + +Here we don't quantize `Linear` layers. +```python +# fallback by op_type +quant_config.set_local("Linear", SmoothQuantConfig(w_dtype="fp32", act_dtype="fp32")) +prepared_model = prepare(model, quant_config=quant_config, example_inputs=example_inputs) +run_fn(prepared_model) +q_model = convert(prepared_model) +``` + +To get more information, please refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/llm). + + +## Validated Models +Neural Compressor: 2.1 + +IPEX (Intel Extension for PyTorch): 2.0/2.1 + +Dataset: lambada_openai + +Task: text-generation provided by [ITREX](https://github.com/intel/intel-extension-for-transformers/tree/main/examples/huggingface/pytorch/text-generation/quantization) + +alpha [0.4, 0.6] is sweet spot region in SmoothQuant paper. + +A list of models that achieved a <1% accuracy drop is shown below. + +| Model/Last token accuracy | FP32 Accuracy | INT8 (w/ SmoothQuant) | Notes | +|:----------:|:------:|:------:|-----------------------------------| +| bigscience/bloom-560m | 0.354 | 0.3542 | alpha=0.5, Ipex 2.1 | +| bigscience/bloom-1b7 | 0.4634 | 0.4936 | alpha=0.5, Ipex 2.0 | +| bigscience/bloom-3b | 0.518 | 0.5185 | alpha=0.8, Ipex 2.1 | +| bigscience/bloom-7b1 | 0.5764 | 0.5977 | alpha=0.5, Ipex 2.0 | +| bigscience/bloomz-560m | 0.3947 | 0.3930 | alpha=0.8, Ipex 2.1 | +| bigscience/bloomz-1b7 | 0.4828 | 0.4906 | alpha=0.5, Ipex 2.1 | +| bigscience/bloomz-3b | 0.5018 | 0.4980 | alpha=0.5, Ipex 2.1 | +| bigscience/bloomz-7b1 | 0.5593 | 0.5552 | alpha=0.5, Ipex 2.1 | +| facebook/opt-125m | 0.379 | 0.3757 | alpha=0.5, Ipex 2.1 | +| facebook/opt-350m | 0.4516 | 0.4533 | alpha=0.8, Ipex 2.1 | +| facebook/opt-1.3b | 0.5789 | 0.5742 | alpha=0.8, Ipex 2.0 | +| facebook/opt-2.7b | 0.6365 | 0.6404 | alpha=0.5, Ipex 2.0 | +| facebook/opt-6.7b | 0.6769 | 0.6804 | alpha=0.5, Ipex 2.0 | +| facebook/opt-13b | 0.6872 | 0.6814 | alpha=0.5, Ipex 2.1 | +| facebook/opt-30b | 0.7149 | 0.7128 | alpha=0.5, Ipex 2.1 | +| facebook/opt-66b | 0.7398 | 0.7326 | alpha=0.5, Ipex 2.1 | +| LLaMa-7b | 0.7361 | 0.7357 | alpha=0.8, Ipex 2.1 | +| LLaMa-13b | 0.7627 | 0.7590 | alpha=0.7, Ipex 2.1 | +| LLaMa-30b | 0.7759 | 0.7840 | alpha=0.7, Ipex 2.1 | +| LLaMa-65b | 0.7908 | 0.7957 | alpha=0.9, Ipex 2.1 | +| EleutherAI/gpt-j-6B* | 0.6831 | 0.6821 | alpha=1.0, Ipex 2.1 | +| MBZUAI/LaMini-GPT-124m | 0.3804 | 0.3887 | alpha=0.5, Ipex 2.1 | +| MBZUAI/LaMini-GPT-774m | 0.5048 | 0.5057 | alpha=0.5, Ipex 2.1 | +| MBZUAI/LaMini-GPT-1.5b | 0.5443 | 0.5436 | alpha=0.5, Ipex 2.1 | +| mosaicml/mpt-7b-chat | 0.655 | 0.6499 | alpha=0.7, Ipex 2.1 | +| stabilityai/stablelm-base-alpha-3b | 0.4172 | 0.4149 | alpha=0.6, Ipex 2.1 | +| togethercomputer/RedPajama-INCITE-Base-3B-v1 | 0.6542 | 0.6735 | alpha=0.5, Ipex 2.1 | +| togethercomputer/RedPajama-INCITE-Chat-3B-v1* | 0.6718 | 0.6740 | alpha=0.5, Ipex 2.0 | +| togethercomputer/RedPajama-INCITE-Instruct-3B-v1* | 0.6569 | 0.6621 | alpha=0.5, Ipex 2.0 | +| togethercomputer/RedPajama-INCITE-Base-7B-v0.1* | 0.7143 | 0.7221 | alpha=0.5, Ipex 2.0 | +| togethercomputer/RedPajama-INCITE-Instruct-7B-v0.1* 
| 0.6895 | 0.6953 | alpha=0.5, Ipex 2.0 | +| databricks/dolly-v1-6b* | 0.6866 | 0.6895 | alpha=0.8, Ipex 2.1 | +| databricks/dolly-v2-3b* | 0.6297 | 0.6247 | alpha=0.5, Ipex 2.1 | +| tiiuae/falcon-7b-instruct | 0.6437 | 0.6392 | alpha=0.7, Pytorch | + +Please refer to the step-by-step [instruction](../../examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm/ipex/README.md) for details. + +Please note that for models with asterisk(*), we have set all add ops to FP32 during quantization step to achieve desirable results. + + +## Supported Framework Matrix + +| Framework | Alpha | Folding | +|:---------:|--------------|------------| +| PyTorch | [0-1] | False | +| IPEX | [0-1] | True / False(Version>2.1) | diff --git a/docs/3x/PT_StaticQuant.md b/docs/3x/PT_StaticQuant.md new file mode 100644 index 00000000000..ec967a780d4 --- /dev/null +++ b/docs/3x/PT_StaticQuant.md @@ -0,0 +1,104 @@ +PyTorch Static Quantization +======================================== + +1. [Introduction](#introduction) +2. [Get Started](#get-started) \ + 2.1 [Static Quantization with IPEX Backend](#static-quantization-with-ipex-backend) \ + 2.1.1 [Usage Sample with IPEX](#usage-sample-with-ipex) \ + 2.1.2 [Specify Quantization Rules](#specify-quantization-rules) \ + 2.1.3 [Model Examples](#model-examples) \ + 2.2 [Static Quantization with PT2E Backend](#static-quantization-with-pt2e-backend) \ + 2.2.1 [Usage Sample with PT2E](#usage-sample-with-pt2e) + + +## Introduction + +Post-Training Quantization (PTQ) is a technique used to convert a pre-trained floating-point model to a quantized model. This approach does not require model retraining. Instead, it uses calibration data to determine the optimal quantization parameters. Static quantization involves calibrating both weights and activations during the quantization process. Currently, we support two paths to perform static PTQ: [Intel Extension for PyTorch (IPEX)](https://github.com/intel/intel-extension-for-pytorch) and [PyTorch 2 Export Quantization (PT2E)](https://pytorch.org/tutorials/prototype/pt2e_quant_x86_inductor.html). + +## Get Started + +### Static Quantization with IPEX Backend + +Intel Extension for PyTorch (IPEX) provides optimizations specifically for Intel hardware, improving the performance of PyTorch models through efficient execution on CPUs. IPEX supports PTQ, allowing users to quantize models to lower precision to reduce model size and inference time while maintaining accuracy. + +The design philosophy of the quantization interface of Intel(R) Neural Compressor is easy-of-use. It requests user to provide `model`, `calibration function`, and `example inputs`. Those parameters would be used to quantize and tune the model. + +`model` is the framework model location or the framework model object. + +`calibration function` is used to determine the appropriate quantization parameters, such as `scale` and `zero-point`, for the model's weights and activations. This process is crucial for minimizing the loss of accuracy that can occur when converting from floating-point to lower-precision format. + +IPEX leverages just-in-time (JIT) compilation techniques for optimizing the model. `example inputs` is used to trace the computational graph of the model, enabling various optimizations and transformations that are specific to IPEX. This tracing process captures the operations performed by the model, allowing IPEX to apply quantization optimizations effectively. 
`example inputs` should be representative of the actual data the model will process to ensure accurate calibration. + + +#### Usage Sample with IPEX +```python +import intel_extension_for_pytorch as ipex +from neural_compressor.torch.quantization import StaticQuantConfig, convert, prepare + +quant_config = StaticQuantConfig(act_sym=True, act_algo="minmax") +prepared_model = prepare(model, quant_config=quant_config, example_inputs=example_inputs) +run_fn(prepared_model) +q_model = convert(prepared_model) +``` + +> [!IMPORTANT] +> To use static quantization with the IPEX backend, please explicitly import IPEX at the beginning of your program. + +#### Specify Quantization Rules +Intel(R) Neural Compressor support specify quantization rules by operator name or operator type. Users can use `set_local` to fallback either `op_name` or `op_type` in `StaticQuantConfig` to achieve the above purpose. + +1. Example of `op_name_dict` +Here we don't quantize the layer named `fc1`. +```python +# fallback by op_name +quant_config.set_local("fc1", StaticQuantConfig(w_dtype="fp32", act_dtype="fp32")) +prepared_model = prepare(fp32_model, quant_config=quant_config, example_inputs=example_inputs) +run_fn(prepared_model) +q_model = convert(prepared_model) +``` +2. Example of `op_type_dict` +Here we don't quantize `Linear` layers. +```python +# fallback by op_type +quant_config.set_local("Linear", StaticQuantConfig(w_dtype="fp32", act_dtype="fp32")) +prepared_model = prepare(model, quant_config=quant_config, example_inputs=example_inputs) +run_fn(prepared_model) +q_model = convert(prepared_model) +``` + +#### Model Examples + +Users could refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/llm) on how to quantize a new model. + + +### Static Quantization with PT2E Backend +Compared to the IPEX backend, which uses JIT compilation to capture the eager model, the PT2E path uses `torch.dynamo` to capture the eager model into an FX graph model, and then inserts the observers and Q/QD pairs on it. Finally it uses the `torch.compile` to perform the pattern matching and replace the Q/DQ pairs with optimized quantized operators. + +#### Usage Sample with PT2E +There are four steps to perform W8A8 static quantization with PT2E backend: `export`, `prepare`, `convert` and `compile`. + +```python +import torch +from neural_compressor.torch.export import export +from neural_compressor.torch.quantization import StaticQuantConfig, prepare, convert + +# Prepare the float model and example inputs for export model +model = UserFloatModel() +example_inputs = ... + +# Export eager model into FX graph model +exported_model = export(model=model, example_inputs=example_inputs) +# Quantize the model +quant_config = StaticQuantConfig() +prepared_model = prepare(exported_model, quant_config=quant_config) +# Calibrate +run_fn(prepared_model) +q_model = convert(prepared_model) +# Compile the quantized model and replace the Q/DQ pattern with Q-operator +from torch._inductor import config + +config.freezing = True +opt_model = torch.compile(q_model) +``` + +> Note: The `set_local` of `StaticQuantConfig` will be supported after the torch 2.4 release. 
diff --git a/docs/3x/PT_WeightOnlyQuant.md b/docs/3x/PT_WeightOnlyQuant.md new file mode 100644 index 00000000000..e7e5c543215 --- /dev/null +++ b/docs/3x/PT_WeightOnlyQuant.md @@ -0,0 +1,277 @@ + +PyTorch Weight Only Quantization +=============== +- [Introduction](#introduction) +- [Supported Matrix](#supported-matrix) +- [Usage](#usage) + - [Get Started](#get-started) + - [Common arguments](#common-arguments) + - [RTN](#rtn) + - [GPTQ](#gptq) + - [AutoRound](#autoround) + - [AWQ](#awq) + - [TEQ](#teq) + - [HQQ](#hqq) + - [Specify Quantization Rules](#specify-quantization-rules) + - [Saving and Loading](#saving-and-loading) +- [Examples](#examples) + +## Introduction + +As large language models (LLMs) become more prevalent, there is a growing need for new and improved quantization methods that can meet the computational demands of these modern architectures while maintaining the accuracy. Compared to normal quantization like W8A8, weight only quantization is probably a better trade-off to balance the performance and the accuracy, since we will see below that the bottleneck of deploying LLMs is the memory bandwidth and normally weight only quantization could lead to better accuracy. + +Model inference: Roughly speaking , two key steps are required to get the model's result. The first one is moving the model from the memory to the cache piece by piece, in which, memory bandwidth $B$ and parameter count $P$ are the key factors, theoretically the time cost is $P*4 /B$. The second one is computation, in which, the device's computation capacity $C$ measured in FLOPS and the forward FLOPs $F$ play the key roles, theoretically the cost is $F/C$. + +Text generation: The most famous application of LLMs is text generation, which predicts the next token/word based on the inputs/context. To generate a sequence of texts, we need to predict them one by one. In this scenario, $F\approx P$ if some operations like bmm are ignored and past key values have been saved. However, the $C/B$ of the modern device could be to **100X,** that makes the memory bandwidth as the bottleneck in this scenario. + +Besides, as mentioned in many papers[1][2], activation quantization is the main reason to cause the accuracy drop. So for text generation task, weight only quantization is a preferred option in most cases. + +Theoretically, round-to-nearest (RTN) is the most straightforward way to quantize weight using scale maps. However, when the number of bits is small (e.g. 3), the MSE loss is larger than expected. A group size is introduced to reduce elements using the same scale to improve accuracy. + + +## Supported Matrix + +| Algorithms/Backend | PyTorch eager mode | +|--------------|----------| +| RTN | ✔ | +| GPTQ | ✔ | +| AutoRound| ✔ | +| AWQ | ✔ | +| TEQ | ✔ | +| HQQ | ✔ | +> **RTN:** A quantification method that we can think of very intuitively. It does not require additional datasets and is a very fast quantization method. Generally speaking, RTN will convert the weight into a uniformly distributed integer data type, but some algorithms, such as Qlora, propose a non-uniform NF4 data type and prove its theoretical optimality. + +> **GPTQ:** A new one-shot weight quantization method based on approximate second-order information, that is both highly-accurate and highly efficient[4]. The weights of each column are updated based on the fixed-scale pseudo-quantization error and the inverse of the Hessian matrix calculated from the activations. 
The updated columns sharing the same scale may generate a new max/min value, so the scale needs to be saved for restoration. + +> **AutoRound:** AutoRound is an advanced weight-only quantization algorithm for low-bits LLM inference. It's tailored for a wide range of models and consistently delivers noticeable improvements, often significantly outperforming SignRound[5] with the cost of more tuning time for quantization. + +> **AWQ:** Proved that protecting only 1% of salient weights can greatly reduce quantization error. the salient weight channels are selected by observing the distribution of activation and weight per channel. The salient weights are also quantized after multiplying a big scale factor before quantization for preserving. + +> **TEQ:** A trainable equivalent transformation that preserves the FP32 precision in weight-only quantization. It is inspired by AWQ while providing a new solution to search for the optimal per-channel scaling factor between activations and weights. + +> **HQQ:** The HQQ[6] method focuses specifically on minimizing errors in the weights rather than the layer activation. Additionally, by incorporating a sparsity-promoting loss, such as the $l_{p<1}$-norm, we effectively model outliers through a hyper-Laplacian distribution. This distribution more accurately captures the heavy-tailed nature of outlier errors compared to the squared error, resulting in a more nuanced representation of error distribution. + +## Usage + +### Get Started + +WeightOnlyQuant quantization for PyTorch is using prepare and convert [APIs](./PyTorch.md#quantization-apis). + +#### Common arguments +| Config | Capability | +|---|---| +| dtype (str)| ['int', 'nf4', 'fp4'] | +| bits (int)| [1, ..., 8] | +| group_size (int)| [-1, 1, ..., $C_{in}$] | +| use_sym (bool)| [True, False] | +| use_double_quant (bool) | [True, False] | +| double_quant_dtype (str) | ['int'] | +| double_quant_bits (int) | [1, ..., bits] | +| double_quant_use_sym (bool) | [True, False] | +| double_quant_group_size (int) | [-1, 1, ..., $C_{in}$] | + +Notes: +- *group_size = -1* refers to **per output channel quantization**. Taking a linear layer (input channel = $C_{in}$, output channel = $C_{out}$) for instance, when *group size = -1*, quantization will calculate total $C_{out}$ quantization parameters. Otherwise, when *group_size = gs* quantization parameters are calculate with every $gs$ elements along with the input channel, leading to total $C_{out} \times (C_{in} / gs)$ quantization parameters. +- 4-bit NormalFloat(NF4) is proposed in QLoRA[7]. 'fp4' includes [fp4_e2m1](../../neural_compressor/adaptor/torch_utils/weight_only.py#L37) and [fp4_e2m1_bnb](https://github.com/TimDettmers/bitsandbytes/blob/18e827d666fa2b70a12d539ccedc17aa51b2c97c/bitsandbytes/functional.py#L735). By default, fp4 refers to fp4_e2m1_bnb. +- Only RTN and GPTQ support double quant. + + +#### RTN +| rtn_args | comments | default value | +|----------|-------------|-------------------------------------------------------------------| +| group_dim (int) | Dimension for grouping | 1 | +| use_full_range (bool) | Enables full range for activations | False | +| use_mse_search (bool) | Enables mean squared error (MSE) search | False | +| use_layer_wise (bool) | Enables quantize model per layer | False | +| model_path (str) | Model path that is used to load state_dict per layer | | + +> **Notes:** `model_path` is only used when use_layer_wise=True. `layer-wise` is stay-tuned. 
+``` python +# Quantization code +from neural_compressor.torch.quantization import prepare, convert, RTNConfig + +quant_config = RTNConfig() +model = prepare(model, quant_config) +model = convert(model) +``` + +#### GPTQ +| gptq_args | comments | default value | +|----------|-------------|-------------------------------------------------------------------| +| use_mse_search (bool) | Enables mean squared error (MSE) search | False +| use_layer_wise (bool) | Enables quantize model per layer | False | +| model_path (str) | Model path that is used to load state_dict per layer | | +| use_double_quant (bool) | Enables double quantization | False | +| act_order (bool) | Whether to sort Hessian's diagonal values to rearrange channel-wise quantization order | False | +| percdamp (float) | Percentage of Hessian's diagonal values' average, which will be added to Hessian's diagonal to increase numerical stability | 0.01. | +| block_size (int) | Execute GPTQ quantization per block, block shape = [C_out, block_size] | 128 | +| static_groups (bool) | Whether to calculate group wise quantization parameters in advance. This option mitigate actorder's extra computational requirements. | False. | +> **Note:** `model_path` is only used when use_layer_wise=True. `layer-wise` is stay-tuned. +``` python +# Quantization code +from neural_compressor.torch.quantization import prepare, convert, GPTQConfig + +quant_config = GPTQConfig() +model = prepare(model, quant_config) +run_fn(model) # calibration +model = convert(model) +``` + +#### AutoRound +| autoround_args | comments | default value | +|----------|-------------|-------------------------------------------------------------------| +| enable_full_range (bool) | Whether to enable full range quantization | False +| batch_size (int) | Batch size for training | 8 | +| lr_scheduler | The learning rate scheduler to be used | None | +| enable_quanted_input (bool) | Whether to use quantized input data | True | +| enable_minmax_tuning (bool) | Whether to enable min-max tuning | True | +| lr (float) | The learning rate | 0 | +| minmax_lr (float) | The learning rate for min-max tuning | None | +| low_gpu_mem_usage (bool) | Whether to use low GPU memory | True | +| iters (int) | Number of iterations | 200 | +| seqlen (int) | Length of the sequence | 2048 | +| n_samples (int) | Number of samples | 512 | +| sampler (str) | The sampling method | "rand" | +| seed (int) | The random seed | 42 | +| n_blocks (int) | Number of blocks | 1 | +| gradient_accumulate_steps (int) | Number of gradient accumulation steps | 1 | +| not_use_best_mse (bool) | Whether to use mean squared error | False | +| dynamic_max_gap (int) | The dynamic maximum gap | -1 | +| scale_dtype (str) | The data type of quantization scale to be used, different kernels have different choices | "float16" | +``` python +# Quantization code +from neural_compressor.torch.quantization import prepare, convert, AutoRoundConfig + +quant_config = AutoRoundConfig() +model = prepare(model, quant_config) +run_fn(model) # calibration +model = convert(model) +``` + +#### AWQ +| awq_args | comments | default value | +|----------|-------------|-------------------------------------------------------------------| +| group_dim (int) | Dimension for grouping | 1 | +| use_full_range (bool) | Enables full range for activations | False | +| use_mse_search (bool) | Enables mean squared error (MSE) search | False | +| use_layer_wise (bool) | Enables quantize model per layer | False | +| use_auto_scale (bool) | Enables best scales search 
based on activation distribution | True | +| use_auto_clip (bool) | Enables clip range search | True | +| folding(bool) | Allow insert mul before linear when the scale cannot be absorbed by last layer | False. | +> **Notes:** `layer-wise` is stay-tuned. +``` python +# Quantization code +from neural_compressor.torch.quantization import prepare, convert, AWQConfig + +quant_config = AWQConfig() +model = prepare(model, quant_config, example_inputs=example_inputs) +run_fn(model) # calibration +model = convert(model) +``` + +#### TEQ +| teq_args | comments | default value | +|----------|-------------|-------------------------------------------------------------------| +| group_dim (int) | Dimension for grouping | 1 | +| use_full_range (bool) | Enables full range for activations | False | +| use_mse_search (bool) | Enables mean squared error (MSE) search | False | +| use_layer_wise (bool) | Enables quantize model per layer | False | +| use_double_quant (bool) | Enables double quantization | False | +| folding(bool) | Allow insert mul before linear when the scale cannot be absorbed by last layer | False | +> **Notes:** `layer-wise` is stay-tuned. +``` python +# Quantization code +from neural_compressor.torch.quantization import prepare, convert, TEQConfig + +quant_config = TEQConfig() +model = prepare(model, quant_config, example_inputs=example_inputs) +train_fn(model) # calibration +model = convert(model) +``` + +#### HQQ +| hqq_args | comments | default value | +|----------|-------------|-------------------------------------------------------------------| +| quant_zero (bool) | Whether to quantize zero point | True | +| quant_scale: (bool) | Whether to quantize scale: point | False | +| scale_quant_group_size (int) | The group size for quantizing scale | 128 | +| skip_lm_head (bool) | Whether to skip for quantizing lm_head | True | +``` python +# Quantization code +from neural_compressor.torch.quantization import prepare, convert, HQQConfig + +quant_config = HQQConfig() +model = prepare(model, quant_config) +run_fn(model) # calibration +model = convert(model) +``` +### Specify Quantization Rules +Intel(R) Neural Compressor support specify quantization rules by operator name or operator type. Users can set `local` in dict or use `set_local` method of config class to achieve the above purpose. + +1. Example of setting `local` from a dict +```python +quant_config = { + "rtn": { + "global": { + "dtype": "int", + "bits": 4, + "group_size": -1, + "use_sym": True, + }, + "local": { + "lm_head": { + "dtype": "fp32", + }, + }, + } +} +``` +2. Example of using `set_local` +```python +quant_config = RTNConfig() +lm_head_config = RTNConfig(dtype="fp32") +quant_config.set_local("lm_head", lm_head_config) +``` + +### Saving and Loading +The saved_results folder contains two files: quantized_model.pt and qconfig.json, and the generated model is a quantized model. The quantitative model will include WeightOnlyLinear. To support low memory inference, Intel(R) Neural Compressor implemented WeightOnlyLinear, a torch.nn.Module, to compress the fake quantized fp32 model. Since torch does not provide flexible data type storage, WeightOnlyLinear combines low bits data into a long date type, such as torch.int8 and torch.int32. Low bits data includes weights and zero points. When using WeightOnlyLinear for inference, it will restore the compressed data to float32 and run torch linear function. 
+```python +# Quantization code +from neural_compressor.torch.quantization import prepare, convert, RTNConfig + +quant_config = RTNConfig() +model = prepare(model, quant_config) +model = convert(model) + +# save +model.save("saved_results") + +# load +from neural_compressor.torch.quantization import load + +orig_model = YOURMODEL() +loaded_model = load( + "saved_results", model=orig_model +) # Please note that the model parameter passes the original model. +``` + + +## Examples + +Users can also refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/llm) on how to quantize a model with WeightOnlyQuant. + +## Reference + +[1]. Xiao, Guangxuan, et al. "Smoothquant: Accurate and efficient post-training quantization for large language models." arXiv preprint arXiv:2211.10438 (2022). + +[2]. Wei, Xiuying, et al. "Outlier suppression: Pushing the limit of low-bit transformer language models." arXiv preprint arXiv:2209.13325 (2022). + +[3]. Lin, Ji, et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." arXiv preprint arXiv:2306.00978 (2023). + +[4]. Frantar, Elias, et al. "Gptq: Accurate post-training quantization for generative pre-trained transformers." arXiv preprint arXiv:2210.17323 (2022). + +[5]. Cheng, Wenhua, et al. "Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs" arXiv preprint arXiv:2309.05516 (2023). + +[6]. Badri, Hicham and Shaji, Appu. "Half-Quadratic Quantization of Large Machine Learning Models." [Online] Available: https://mobiusml.github.io/hqq_blog/ (2023). + +[7]. Dettmers, Tim, et al. "Qlora: Efficient finetuning of quantized llms." arXiv preprint arXiv:2305.14314 (2023). diff --git a/docs/3x/PyTorch.md b/docs/3x/PyTorch.md new file mode 100644 index 00000000000..b8c4ea2c7c5 --- /dev/null +++ b/docs/3x/PyTorch.md @@ -0,0 +1,225 @@ +Torch +================================================= + +1. [Introduction](#introduction) +2. [Torch-like APIs](#torch-like-apis) +3. [Support matrix](#supported-matrix) +4. [Common Problems](#common-problems) + +## Introduction + +`neural_compressor.torch` provides a Torch-like API and integrates various model compression methods fine-grained to the torch.nn.Module. Supports a comprehensive range of models, including but not limited to CV models, NLP models, and large language models. A variety of quantization methods are available, including classic INT8 quantization, SmoothQuant, and the popular weight-only quantization. Neural compressor also provides the latest research in simulation work, such as FP8 emulation quantization, MX data type emulation quantization. + +In terms of ease of use, neural compressor is committed to providing an easy-to-use user interface and easy to extend the structure design, on the one hand, reuse the PyTorch prepare, convert API, on the other hand, through the Quantizer base class for prepare and convert customization to provide a convenient. + +For more details, please refer to [link](https://github.com/intel/neural-compressor/discussions/1527) in Neural Compressor discussion space. + +So far, `neural_compressor.torch` still relies on the backend to generate the quantized model and run it on the corresponding backend, but in the future, neural_compressor is planned to provide generalized device-agnostic Q-DQ model, so as to achieve one-time quantization and arbitrary deployment. 
+ +## Torch-like APIs + +Currently, we provide below three user scenarios, through `prepare`&`convert`, `autotune` and `load` APIs. + +- One-time quantization of the model +- Get the best quantized model by setting the search scope and target +- Direct deployment of the quantized model + +### Quantization APIs + +```python +def prepare( + model: torch.nn.Module, + quant_config: BaseConfig, + inplace: bool = True, + example_inputs: Any = None, +): + """Prepare the model for calibration. + + Insert observers into the model so that it can monitor the input and output tensors during calibration. + + Args: + model (torch.nn.Module): origin model + quant_config (BaseConfig): path to quantization config + inplace (bool, optional): It will change the given model in-place if True. + example_inputs (tensor/tuple/dict, optional): used to trace torch model. + + Returns: + prepared and calibrated module. + """ +``` + +```python +def convert( + model: torch.nn.Module, + quant_config: BaseConfig = None, + inplace: bool = True, +): + """Convert the prepared model to a quantized model. + + Args: + model (torch.nn.Module): the prepared model + quant_config (BaseConfig, optional): path to quantization config, for special usage. + inplace (bool, optional): It will change the given model in-place if True. + + Returns: + The quantized model. + """ +``` + +### Autotune API + +```python +def autotune( + model: torch.nn.Module, + tune_config: TuningConfig, + eval_fn: Callable, + eval_args=None, + run_fn=None, + run_args=None, + example_inputs=None, +): + """The main entry of auto-tune. + + Args: + model (torch.nn.Module): _description_ + tune_config (TuningConfig): _description_ + eval_fn (Callable): for evaluation of quantized models. + eval_args (tuple, optional): arguments used by eval_fn. Defaults to None. + run_fn (Callable, optional): for calibration to quantize model. Defaults to None. + run_args (tuple, optional): arguments used by run_fn. Defaults to None. + example_inputs (tensor/tuple/dict, optional): used to trace torch model. Defaults to None. + + Returns: + The quantized model. + """ +``` + +### Load API + +`neural_compressor.torch` links the save function to the quantized model. If `model.save` already exists, Neural Compressor renames the previous function to `model.orig_save`. + +```python +def save(self, output_dir="./saved_results"): +""" + Args: + self (torch.nn.Module): the quantized model. + output_dir (str, optional): path to save the quantized model +""" +``` + +```python +def load(output_dir="./saved_results", model=None): + """The main entry of load for all algorithms. + + Args: + output_dir (str, optional): path to quantized model folder. Defaults to "./saved_results". + model (torch.nn.Module, optional): original model, suggest to use empty tensor. + + Returns: + The quantized model + """ +``` + +## Supported Matrix + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Method | Algorithm | Backend | Support Status | Usage Link |
|--------|-----------|---------|----------------|------------|
| Weight Only Quantization | Round to Nearest (RTN) | PyTorch eager mode | &#10004; | link |
| Weight Only Quantization | GPTQ | PyTorch eager mode | &#10004; | link |
| Weight Only Quantization | AWQ | PyTorch eager mode | &#10004; | link |
| Weight Only Quantization | AutoRound | PyTorch eager mode | &#10004; | link |
| Weight Only Quantization | TEQ | PyTorch eager mode | &#10004; | link |
| Weight Only Quantization | HQQ | PyTorch eager mode | &#10004; | link |
| Smooth Quantization | SmoothQuant | intel-extension-for-pytorch | &#10004; | link |
| Static Quantization | Post-training Static Quantization | intel-extension-for-pytorch | &#10004; | link |
| Static Quantization | Post-training Static Quantization | TorchDynamo | &#10004; | link |
| Dynamic Quantization | Post-training Dynamic Quantization | TorchDynamo | &#10004; | link |
| Quantization Aware Training | Quantization Aware Training | TorchDynamo | stay tuned | stay tuned |
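As a quick illustration of how the `prepare` and `convert` APIs above fit together, here is a minimal sketch using `RTNConfig` (one of the weight-only configs listed in the matrix); it shows one example flow, not the only supported one:

```python
import torch
from neural_compressor.torch.quantization import RTNConfig, prepare, convert

# A toy float model; any torch.nn.Module can be used here.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())

quant_config = RTNConfig()            # weight-only RTN with default settings
model = prepare(model, quant_config)  # attach the config / insert observers if needed
model = convert(model)                # produce the quantized model
```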
+ +## Common Problems + +1. How to choose backend between `intel-extension-for-pytorch` and `PyTorchDynamo`? + > Neural Compressor provides automatic logic to detect which backend should be used. + > + + + + + + + + + + + + + + +
| Environment | Automatic Backend |
|:-----------:|:-----------------:|
| import torch | torch.dynamo |
| import torch<br>import intel_extension_for_pytorch | intel_extension_for_pytorch |
diff --git a/docs/3x/imgs/data_format.png b/docs/3x/imgs/data_format.png new file mode 100644 index 00000000000..36a8839cbe7 Binary files /dev/null and b/docs/3x/imgs/data_format.png differ diff --git a/docs/3x/imgs/mx_workflow.png b/docs/3x/imgs/mx_workflow.png new file mode 100644 index 00000000000..28d34a18ee0 Binary files /dev/null and b/docs/3x/imgs/mx_workflow.png differ diff --git a/docs/3x/imgs/smoothquant.png b/docs/3x/imgs/smoothquant.png new file mode 100644 index 00000000000..730716146ac Binary files /dev/null and b/docs/3x/imgs/smoothquant.png differ diff --git a/docs/3x/imgs/sq_convert.png b/docs/3x/imgs/sq_convert.png new file mode 100644 index 00000000000..d80600d6054 Binary files /dev/null and b/docs/3x/imgs/sq_convert.png differ diff --git a/docs/3x/imgs/sq_pc.png b/docs/3x/imgs/sq_pc.png new file mode 100644 index 00000000000..5ca0d4e158b Binary files /dev/null and b/docs/3x/imgs/sq_pc.png differ diff --git a/docs/3x/quantization.md b/docs/3x/quantization.md index de8e43828a3..b26c49470a9 100644 --- a/docs/3x/quantization.md +++ b/docs/3x/quantization.md @@ -1,26 +1,43 @@ Quantization -=============== +========================================== +1. Introduction -1. Quantization - 1.1 [Quantization Introduction](#quantization-introduction) - 1.2 [Quantization Fundamentals](#quantization-fundamentals) - 1.3 [Accuracy Aware Tuning](#accuracy-aware-tuning) -2. [Smooth Quant](#smooth-quant) -3. [WOQ](#woq) +2. Quantization Fundamentals -## Quantization Introduction +3. Quantization methods -Quantization is a very popular deep learning model optimization technique invented for improving the speed of inference. It minimizes the number of bits required by converting a set of real-valued numbers into the lower bit data representation, such as int8 and int4, mainly on inference phase with minimal to no loss in accuracy. This way reduces the memory requirement, cache miss rate, and computational cost of using neural networks and finally achieve the goal of higher inference performance. On Intel 3rd Gen Intel® Xeon® Scalable Processors, user could expect up to 4x theoretical performance speedup. We expect further performance improvement with [Intel® Advanced Matrix Extensions](https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/overview.html) on 4th Gen Intel® Xeon® Scalable Processors. + - [Dynamic Quantization](#dynamic-quantization) + + - [Static Quantization](#static-quantization) + + - [Smooth Quantization](#smooth-quantization) + + - [Weight Only Quantization](#weight-only-quantization) + + - [Quantization Aware Training](#quantization-aware-training) + + - [Accuracy Aware Tuning](#accuracy-aware-tuning) + + +## Introduction + +Quantization is a very popular deep learning model optimization technique invented for improving the speed of inference. It minimizes the number of bits required by converting a set of real-valued numbers into the lower bit data representation, such as int8 and int4, mainly on inference phase with minimal to no loss in accuracy. This way reduces the memory requirement, cache miss rate, and computational cost of using neural networks and finally achieve the goal of higher inference performance. On Intel 3rd Gen Intel Xeon Scalable Processors, user could expect up to 4x theoretical performance speedup. 
We expect further performance improvement with [Intel Advanced Matrix Extensions](https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/overview.html) on 4th Gen Intel Xeon Scalable Processors. ## Quantization Fundamentals -`Affine quantization` and `Scale quantization` are two common range mapping techniques used in tensor conversion between different data types. +The equation of quantization is as follows: + +$$ +X_{int8} = round(X_{fp32}/S) + Z \tag{1} +$$ + +where $X_{fp32}$ is the input matrix, $S$ is the scale factor, $Z$ is the integer zero point. -The math equation is like: $$X_{int8} = round(Scale \times X_{fp32} + ZeroPoint)$$. +### Symmetric & Asymmetric -**Affine Quantization** +---------------------------------------------- -This is so-called `asymmetric quantization`, in which we map the min/max range in the float tensor to the integer range. Here int8 range is [-128, 127], uint8 range is [0, 255]. +asymmetric quantization, in which we map the min/max range in the float tensor to the integer range. Here int8 range is [-128, 127], uint8 range is [0, 255]. here: @@ -30,9 +47,9 @@ or If UINT8 is specified, $Scale = (|X_{f_{max}} - X_{f_{min}}|) / 255$ and $ZeroPoint = - X_{f_{min}} / Scale$. -**Scale Quantization** +---------------------------------------------- -This is so-called `Symmetric quantization`, in which we use the maximum absolute value in the float tensor as float range and map to the corresponding integer range. +Symmetric quantization, in which we use the maximum absolute value in the float tensor as float range and map to the corresponding integer range. The math equation is like: @@ -48,13 +65,15 @@ If UINT8 is specified, $Scale = max(abs(X_{f_{max}}), abs(X_{f_{min}})) / 255$ a Sometimes the reduce_range feature, that's using 7 bit width (1 sign bit + 6 data bits) to represent int8 range, may be needed on some early Xeon platforms, it's because those platforms may have overflow issues due to fp16 intermediate calculation result when executing int8 dot product operation. After AVX512_VNNI instruction is introduced, this issue gets solved by supporting fp32 intermediate data. - +---------------------------------------------- #### Quantization Scheme in TensorFlow + Symmetric Quantization + int8: scale = 2 * max(abs(rmin), abs(rmax)) / (max(int8) - min(int8) - 1) + uint8: scale = max(rmin, rmax) / (max(uint8) - min(uint8)) +---------------------------------------------- + #### Quantization Scheme in PyTorch + Symmetric Quantization + int8: scale = max(abs(rmin), abs(rmax)) / (float(max(int8) - min(int8)) / 2) @@ -62,36 +81,309 @@ Sometimes the reduce_range feature, that's using 7 bit width (1 sign bit + 6 dat + Asymmetric Quantization + uint8: scale = (rmax - rmin) / (max(uint8) - min(uint8)); zero_point = min(uint8) - round(rmin / scale) +---------------------------------------------- + #### Quantization Scheme in IPEX + Symmetric Quantization + int8: scale = 2 * max(abs(rmin), abs(rmax)) / (max(int8) - min(int8) - 1) + uint8: scale = max(rmin, rmax) / (max(uint8) - min(uint8)) -### Quantization Approaches +---------------------------------------------- + +### Per-tensor & Per-channel + +---------------------------------------------- + +There are several choices of sharing quantization parameters among tensor elements, also called quantization granularity. The coarsest level, per-tensor granularity, is that all elements in the tensor share the same quantization parameters. 
Finer granularity means sharing quantization parameters per row or per column for 2D matrices and per channel for 3D matrices. Similarly, the finest granularity is that each element has an individual set of quantization parameters. + + +However, due to the model accuracy and computational consumption, per-tensor or per-channel are usually adopted. **In the following part, We will show that per-channel could bring lower quantization loss but has some limitations, that is why normally we use per-channel for weight quantization and per-tensor for activation/input quantization** + +#### Per-tensor example + +---------------------------------------------- -Quantization has three different approaches: -1) post training dynamic quantization -2) post training static quantization -3) quantization aware training. +Suppose the weight tensor is: -The first two approaches belong to optimization on inference. The last belongs to optimization during training. +```python +import torch -#### Post Training Dynamic Quantization +W = torch.Tensor( + [ + [0.6839, 0.4741, 0.7451], + [0.9301, 0.1742, 0.6835], + ] +) +``` + +According to the formula (1), we need scale $S$ and zero point $Z$ to calculate the integer matrix. + +$$ +S = \frac{X_{max} - X{min}}{2^b -1} \tag{2} +$$ + +$$ +Z = -round(X_{min/}/S) \tag{3} +$$ + +The per-tensor quantization function is: + +```python +def quantize(x, num_bits=8): + q_min, q_max = 0, 2.0**num_bits - 1.0 + scale = (torch.max(x) - torch.min(x)) / (2**num_bits - 1) + scale = torch.clip(scale, min=1e-5) + zp = torch.round(0 - (torch.min(x)) / scale) + q_x = x / scale + zp + q_x.clamp_(q_min, q_max).round_() + print(f"scale = {scale}, zp = {zp}") + return q_x, scale, zp +``` + +Then we can get the quantized $W_{q}$ + +```bash +>>> W_q, scale, zp = quantize(W) +scale = 0.00296431384049356, zp = -59.0 +>>> W_q +tensor([[172., 101., 192.], + [255., 0., 172.]]) +``` + +With the value of scale and zp, we can dequantize the tensor. + +```python +def dequantize(q_x, scale, zp): + return scale * (q_x - zp) +``` + +```bash +>>> W_dq = dequantize(W_q, 0.001, -50) +>>> W_dq +tensor([[0.2220, 0.1510, 0.2420], + [0.2570, 0.0500, 0.1890]]) +>>> loss = torch.nn.MSELoss()(W_dq, W) +>>> loss.item() +0.1983354538679123 + +>>> W_dq = dequantize(W_q, scale, zp) +>>> W_dq +tensor([[0.6848, 0.4743, 0.7440], + [0.9308, 0.1749, 0.6848]]) +>>> loss = torch.nn.MSELoss()(W_dq, W) +>>> loss.item() +7.385297635664756e-07 +``` + +The difference between $W$ and $W_{dq}$ shows that quantization affects precision and appropriate values of scale and zero point will reduce the loss of precision. 
+ +#### Per-channel example + +---------------------------------------------- + +Similarly, the example of per-channel quantization is as follows: + +```python +def quantize_per_channel(x, num_bits=8): + q_min, q_max = 0, 2.0**num_bits - 1.0 + x_tmp = x.detach().reshape(x.shape[0], -1) + scales = x_tmp.max(dim=-1, keepdim=True)[0] / (2**num_bits - 1) + zp = torch.round(0 - x_tmp.min(dim=-1, keepdim=True)[0].divide(scales)) + q_x = x_tmp.divide(scales) + zp + q_x.clamp_(q_min, q_max).round_() + print(f"scales = {scales}, \n zp = {zp}") + return q_x, scales, zp + + +def dequantize_per_channel(q_x, scales, zp): + print(q_x, scales, zp) + print(scales * (q_x - zp)) + return scales * (q_x - zp) +``` + +```bash +>>>W_q, scales, zp = quantize_per_channel(W) +scale = tensor([[0.0029], + [0.0036]]), +zp = tensor([[-162.], + [ -48.]]) +>>>W_q +tensor([[ 72., 0., 93.], + [207., 0., 139.]]) + +>>>W_dq = dequantize_per_channel(W_q, scales, zp) +>>>W_dq +tensor([[0.6837, 0.4734, 0.7451], + [0.9301, 0.1751, 0.6821]]) +``` + +And the loss is + +```bash +>>> loss = torch.nn.MSELoss()(W_dq, W) +>>> loss.item() +5.637690492221736e-07 +``` + +Through this example, we can see that per-channel quantization has finer granularity and has lower loss (loss 5.6376e-07 for per-channel quantization and 7.3852e-07 for per-tensor quantization). + +#### Matmul quantization example + +---------------------------------------------- + +For a linear layer in most model, $Y=X \cdot W$, we can quantize both the weights and activations in order to reduce the storage and accelerate inference. +Using per-tensor scale quantization to show the process. + +```python +def quantize_per_tensor_absmax(x, n_bits=8): + scales = x.abs().max() + q_max = 2 ** (n_bits - 1) - 1 + scales.clamp_(min=1e-5).div_(q_max) + q_x = x / scales + q_x = q_x.clamp_(-q_max, q_max).round_() + return q_x, scales + + +def dequantize(q_x, scale): + return scale * q_x +``` + +Randomly initialize the $W$ and $Y$, then calculate the result of $Y=X \cdot W$ + +```bash +>>>W = torch.rand(2, 3, dtype=torch.float32) +>>>X = torch.rand(3, 4, dtype=torch.float32) +>>>W +tensor([[0.0806, 0.7589, 0.6038], + [0.3815, 0.5040, 0.7174]]) +>>>X +tensor([[0.5444, 0.5826, 0.7772, 0.5555], + [0.3740, 0.3253, 0.0698, 0.1381], + [0.5972, 0.0086, 0.0737, 0.8298]]) +>>>Y = torch.matmul(W, X) +>>>Y +tensor([[0.6883, 0.2991, 0.1601, 0.6506], + [0.8246, 0.3924, 0.3845, 0.8768]]) +``` + +Quantize weight and activation, matmul(quantize(X), quantize(Y)) + +```bash +>>>W_q, W_scale = quantize_per_tensor_absmax(W) +>>>X_q, X_scale = quantize_per_tensor_absmax(X) +>>>print(f'{W_q}\n{W_scale.item()}') +>>>print(f'{X_q}\n{X_scale.item()}') +tensor([[ 13., 127., 101.], + [ 64., 84., 120.]]) +0.0059755356051027775 +tensor([[ 83., 89., 119., 85.], + [ 57., 50., 11., 21.], + [ 91., 1., 11., 127.]]) +0.006533813662827015 + +>>>Y_q = torch.matmul(W_q, X_q) +>>>Y_q +tensor([[17509., 7608., 4055., 16599.], + [21020., 10016., 9860., 22444.]]) +>>>Y_dq = dequantize(Y_q, W_scale * X_scale) +>>>Y_dq +tensor([[0.6836, 0.2970, 0.1583, 0.6481], + [0.8207, 0.3911, 0.3850, 0.8763]]) +``` + + +## Dynamic Quantization The weights of the neural network get quantized into int8 format from float32 format offline. The activations of the neural network is quantized as well with the min/max range collected during inference runtime. This approach is widely used in dynamic length neural networks, like NLP model. 
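For reference, the idea can be illustrated with stock PyTorch's dynamic quantization API (a minimal sketch; the Neural Compressor flow for PyTorch is documented in PT_DynamicQuant.md):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Linear weights are quantized to int8 offline; activation scales are
# determined at runtime from the observed min/max of each input.
dq_model = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

y = dq_model(torch.randn(1, 128))
```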
-#### Post Training Static Quantization + +## Static Quantization Compared with `post training dynamic quantization`, the min/max range in weights and activations are collected offline on a so-called `calibration` dataset. This dataset should be able to represent the data distribution of those unseen inference dataset. The `calibration` process runs on the original fp32 model and dumps out all the tensor distributions for `Scale` and `ZeroPoint` calculations. Usually preparing 100 samples are enough for calibration. This approach is major quantization approach people should try because it could provide the better performance comparing with `post training dynamic quantization`. -#### Quantization Aware Training + +## Smooth Quantization + +#### Per-channel limitation + +---------------------------------------------- + +Though per-channel quantization could bring lower quantization error, we could not apply it for activations due to the difficulty of the dequantization. We would prove it in the following image and the zero point of quantization would be ignored for simplicity. + +The image on the left presents a normal linear forward with 1x2 input $x$ and 2x2 weight $w$. The results $y$ could be easily obtained by simple mathematics. In the middle image, we apply per-tensor quantization for activations and per-channel quantization for weights; the results after quantization that are denoted by $y_1$ and $y_2$, could be easily dequantized to the float results $y_{fp1}$ and $y_{fp2}$ by per channel scale $1.0/s_1s_x$ and $1.0/s_2s_x$. However, after applying per-channel quantization for activation (right image), we could not dequantize the $y_1$ and $y_2$ to float results. + +
+ +
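To see the limitation numerically, the sketch below (an illustration of the argument above, not library code; the shapes and the outlier factor are made up) quantizes the input with one scale per input channel. Once the integer products are accumulated, no single factor can dequantize the result; the per-channel scales would have to be applied before the accumulation, which defeats the purpose of an integer matmul.

```python
import torch

torch.manual_seed(0)
q_max = 127.0

# Activations with 2 input channels; make channel 1 an "outlier" channel.
x = torch.rand(4, 2) * torch.tensor([1.0, 10.0])
w = torch.rand(2, 2)
y_ref = x @ w  # float reference

# Per-(input-)channel activation scales and a per-tensor weight scale.
s_x = x.abs().amax(dim=0, keepdim=True) / q_max  # shape (1, 2): one scale per channel
s_w = w.abs().max() / q_max
x_q = (x / s_x).round()
w_q = (w / s_w).round()

# The integer accumulation mixes terms carrying different activation scales,
# so no single post-matmul factor recovers the float result.
y_acc = x_q @ w_q
y_single_scale = y_acc * (s_x.mean() * s_w)  # wrong for any single choice of scale

# Correct dequantization needs the per-channel scales inside the sum:
y_exact = (x_q * s_x) @ (w_q * s_w)

print((y_single_scale - y_ref).abs().max())  # large error
print((y_exact - y_ref).abs().max())         # only rounding error remains
```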
+ + +---------------------------------------------- + +In the previous subsection, we explained why per-channel quantization cannot be applied to activations, even though it could lead to lower quantization loss. However, the quantization error of activations plays an important role in the accuracy loss of model quantization[1][6][7]. + + + +To reduce the quantization loss of activations, many methods have been proposed. In the following, we briefly introduce SPIQ[6], Outlier Suppression[7] and SmoothQuant[1]. All three methods share a similar idea, migrating the difficulty from activation quantization to weight quantization, but they differ in how much difficulty is transferred. + + +So **the first question is: how do we migrate the difficulty from activations to weights?** The solution is straightforward: convert the network to an output-equivalent network, as presented in the image below, and apply quantization to this equivalent network. The intuition is that each channel of the activation can be scaled to make it more quantization-friendly, similar to a fake per-channel activation quantization. The equivalence is written out right after the image. + +
+ +
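Written out for a row vector $x$ and per-channel scales $s = (s_{x1}, s_{x2}, \dots)$ (notation ours, matching the figure above), the output equivalence is:

$$
y = x \cdot W = \left(x \, \mathrm{diag}(s)^{-1}\right) \cdot \left(\mathrm{diag}(s) \, W\right) = \hat{x} \cdot \hat{W}, \qquad \hat{x}_j = x_j / s_j, \quad \hat{W}_{j,:} = s_j \cdot W_{j,:}
$$

Each activation channel is divided by its scale and the corresponding input-channel row of $W$ absorbs that scale, so the output is unchanged while $\hat{x}$ has a flatter per-channel range and is easier to quantize per-tensor.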
+ + +Please note that this conversion makes the quantization of weights more difficult, because the scales attached to the weights shown above are per-input-channel, while weight quantization is per-output-channel or per-tensor. + +So **the second question is: how much difficulty should be migrated**, that is, how to choose the **conversion per-channel scales** $s_{x1}$ and $s_{x2}$ in the image above. Different works adopt different strategies. + +*SPIQ* simply adopts the quantization scale of the activations as the conversion per-channel scale. + +*Outlier Suppression* adopts the scale of the preceding layernorm as the conversion per-channel scale. + +*SmoothQuant* introduces a hyperparameter $\alpha$ as a smoothing factor to calculate the conversion per-channel scale and balance the quantization difficulty of activations and weights. + +$$ +s_j = \max(|X_j|)^\alpha / \max(|W_j|)^{1-\alpha} \tag{4} +$$ + +where $j$ is the index of the input channels. A short code sketch after the image below shows how these scales can be computed and folded into the weights. + + + +
+ +
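As a minimal sketch of formula (4) (illustrative only, not the Neural Compressor implementation; the calibration statistic `act_absmax`, the helper `smooth_scales`, and the toy shapes are assumptions made for this example), the conversion scales can be computed from per-channel absolute maxima and folded into an activation/weight pair as follows:

```python
import torch


def smooth_scales(act_absmax, weight, alpha=0.5, eps=1e-5):
    # s_j = max(|X_j|)^alpha / max(|W_j|)^(1 - alpha), with j indexing input channels.
    w_absmax = weight.abs().amax(dim=0)  # max over output channels -> shape (in_features,)
    return act_absmax.clamp(min=eps).pow(alpha) / w_absmax.clamp(min=eps).pow(1 - alpha)


in_features, out_features = 4, 3
weight = torch.rand(out_features, in_features)    # nn.Linear layout: (out, in)
act_absmax = torch.tensor([0.5, 8.0, 0.7, 12.0])  # pretend channels 1 and 3 carry outliers
x = torch.rand(2, in_features) * act_absmax       # toy activations with those ranges

s = smooth_scales(act_absmax, weight, alpha=0.5)

# Fold the scales: activations are divided by s (in practice often fused into the
# preceding op), and the weight absorbs s along its input-channel dimension.
y_ref = x @ weight.t()
y_smoothed = (x / s) @ (weight * s).t()
assert torch.allclose(y_ref, y_smoothed, atol=1e-5)  # output-equivalent network
```

Both `x / s` and `weight * s` are then quantized with the usual schemes; the choice of $\alpha$ decides how much of the outlier range ends up on the weight side, which is exactly the trade-off discussed next.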
+ + + +For most models, such as OPT and BLOOM, $\alpha = 0.5$ is a well-balanced value to split the difficulty of weight and activation quantization. A larger $\alpha$ value could be used on models with more significant activation outliers to migrate more quantization difficulty to the weights. + + +## Weight Only Quantization + +As large language models (LLMs) become more prevalent, there is a growing need for new and improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. Compared to normal quantization like W8A8, weight only quantization is probably a better trade-off between performance and accuracy, since, as we will see below, the bottleneck of deploying LLMs is the memory bandwidth, and weight only quantization normally leads to better accuracy. + +Model inference: Roughly speaking, two key steps are required to get the model's result. The first one is moving the model from memory to cache piece by piece, in which memory bandwidth $B$ and parameter count $P$ are the key factors; theoretically the time cost is $4P/B$ (assuming fp32 parameters, i.e. 4 bytes each). The second one is computation, in which the device's computation capacity $C$ measured in FLOPS and the forward FLOPs $F$ play the key roles; theoretically the cost is $F/C$. + +Text generation: The most famous application of LLMs is text generation, which predicts the next token/word based on the inputs/context. To generate a sequence of texts, we need to predict the tokens one by one. In this scenario, $F\approx P$ if some operations like bmm are ignored and past key values have been saved. However, the $C/B$ ratio of a modern device can be up to **100X**, which makes memory bandwidth the bottleneck in this scenario. + +Besides, as mentioned in many papers[1][2], activation quantization is the main cause of the accuracy drop. So for text generation tasks, weight only quantization is the preferred option in most cases. + +Theoretically, round-to-nearest (RTN) is the most straightforward way to quantize weights using scale maps. However, when the number of bits is small (e.g. 3), the MSE loss is larger than expected. A group size is introduced to reduce the number of elements sharing the same scale and thus improve accuracy. + +There are many excellent works on weight only quantization that improve its accuracy, such as AWQ[3], GPTQ[4] and AutoRound[8]. Neural Compressor integrates these popular algorithms in time to help customers leverage them and deploy them to their own tasks. + + +## Quantization Aware Training Quantization aware training emulates inference-time quantization in the forward pass of the training process by inserting `fake quant` ops before those quantizable ops. With `quantization aware training`, all weights and activations are `fake quantized` during both the forward and backward passes of training: that is, float values are rounded to mimic int8 values, but all computations are still done with floating point numbers. Thus, all the weight adjustments during training are made while aware of the fact that the model will ultimately be quantized; after quantizing, therefore, this method will usually yield higher accuracy than either dynamic quantization or post-training static quantization. + ## Accuracy Aware Tuning Accuracy aware tuning is one of the unique features provided by Intel(R) Neural Compressor, compared with other 3rd party model compression tools.
This feature can be used to address the accuracy-loss pain points brought by applying low-precision quantization and other lossy optimization methods. @@ -100,16 +392,28 @@ This tuning algorithm creates a tuning space by querying framework quantization Neural Compressor also supports quantizing all quantizable ops without accuracy tuning; the `quantize_model` interface can be used to achieve that. -### Working Flow - -For supported quantization methods for `accuracy aware tuning` and the detailed API usage, please refer to the document of [PyTorch](./pytorch.md) or [TensorFlow](./tensorflow.md) respectively. +For the quantization methods supported by `accuracy aware tuning` and the detailed API usage, please refer to the [PyTorch](PyTorch.md) or [TensorFlow](TensorFlow.md) documentation respectively. Users can refer to the chart below to understand the whole tuning flow. accuracy aware tuning working flow -# Smooth Quant +## Reference + +[1]. Xiao, Guangxuan, et al. "SmoothQuant: Accurate and efficient post-training quantization for large language models." arXiv preprint arXiv:2211.10438 (2022). + +[2]. Wei, Xiuying, et al. "Outlier suppression: Pushing the limit of low-bit transformer language models." arXiv preprint arXiv:2209.13325 (2022). + +[3]. Lin, Ji, et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." arXiv preprint arXiv:2306.00978 (2023). + +[4]. Frantar, Elias, et al. "GPTQ: Accurate post-training quantization for generative pre-trained transformers." arXiv preprint arXiv:2210.17323 (2022). + +[5]. Dettmers, Tim, et al. "QLoRA: Efficient finetuning of quantized LLMs." arXiv preprint arXiv:2305.14314 (2023). + +[6]. Yvinec, Edouard, et al. "SPIQ: Data-Free Per-Channel Static Input Quantization." Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2023. + +[7]. Wei, Xiuying, et al. "Outlier suppression: Pushing the limit of low-bit transformer language models." arXiv preprint arXiv:2209.13325 (2022). -# WOQ +[8]. Cheng, Wenhua, et al. "Optimize weight rounding via signed gradient descent for the quantization of LLMs." arXiv preprint arXiv:2309.05516 (2023). diff --git a/neural_compressor/torch/quantization/autotune.py b/neural_compressor/torch/quantization/autotune.py index 2d0aa2bd2e0..279b6be4633 100644 --- a/neural_compressor/torch/quantization/autotune.py +++ b/neural_compressor/torch/quantization/autotune.py @@ -51,8 +51,21 @@ def autotune( run_fn=None, run_args=None, example_inputs=None, -) -> Optional[torch.nn.Module]: - """The main entry of auto-tune.""" +): + """The main entry of auto-tune. + + Args: + model (torch.nn.Module): the source model to be tuned and quantized. + tune_config (TuningConfig): tuning configuration that provides the set of quantization configs to try. + eval_fn (Callable): for evaluation of quantized models. + eval_args (tuple, optional): arguments used by eval_fn. Defaults to None. + run_fn (Callable, optional): for calibration to quantize model. Defaults to None. + run_args (tuple, optional): arguments used by run_fn. Defaults to None. + example_inputs (tensor/tuple/dict, optional): used to trace torch model. Defaults to None. + + Returns: + The quantized model.
+ """ best_quant_model = None eval_func_wrapper = EvaluationFuncWrapper(eval_fn, eval_args) config_loader, tuning_logger, tuning_monitor = init_tuning(tuning_config=tune_config) diff --git a/neural_compressor/torch/quantization/load_entry.py b/neural_compressor/torch/quantization/load_entry.py index 35e5fd1208e..fb870a92e77 100644 --- a/neural_compressor/torch/quantization/load_entry.py +++ b/neural_compressor/torch/quantization/load_entry.py @@ -32,6 +32,15 @@ def load(output_dir="./saved_results", model=None): + """The main entry of load for all algorithms. + + Args: + output_dir (str, optional): path to the quantized model folder. Defaults to "./saved_results". + model (torch.nn.Module, optional): the original model; it is suggested to use a model with empty (uninitialized) weights. + + Returns: + The quantized model. + """ from neural_compressor.common.base_config import ConfigRegistry qconfig_file_path = os.path.join(os.path.abspath(os.path.expanduser(output_dir)), "qconfig.json") diff --git a/neural_compressor/torch/quantization/quantize.py b/neural_compressor/torch/quantization/quantize.py index d694123b359..57197a91972 100644 --- a/neural_compressor/torch/quantization/quantize.py +++ b/neural_compressor/torch/quantization/quantize.py @@ -114,8 +114,8 @@ def prepare( Args: model (torch.nn.Module): origin model quant_config (BaseConfig): path to quantization config - inplace (bool): It will change the given model in-place if True. - example_inputs: used to trace torch model. + inplace (bool, optional): It will change the given model in-place if True. + example_inputs (tensor/tuple/dict, optional): used to trace torch model. Returns: prepared and calibrated module.