Extend weight compression with INT8 symmetric scheme #2288

Merged (6 commits) on Dec 7, 2023
Changes from 3 commits
32 changes: 20 additions & 12 deletions docs/compression_algorithms/CompressWeights.md
@@ -8,22 +8,30 @@ The Weights Compression algorithm is aimed at compressing the weights of the mod

#### Supported modes

By default, weights are compressed to 8-bit integer data type - "INT8" mode.
By default, weights are compressed asymmetrically to 8-bit integer data type - "INT8_ASYM" mode.
The OpenVINO backend also supports 3 modes of mixed-precision weight quantization with a 4-bit data type as a primary precision - INT4_SYM, INT4_ASYM and NF4. The primary precision in case of INT4_SYM mode is an unsigned 4-bit integer and weights are quantized to it [symmetrically](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization) with a fixed zero point equal to 8. In case of INT4_ASYM mode it is also an unsigned 4-bit integer, but weights are quantized to it [asymmetrically](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization) with a typical non-fixed zero point. In case of NF4 mode it is the [nf4](https://arxiv.org/pdf/2305.14314v1.pdf) data type without a zero point.
All 4-bit modes support grouped quantization, where a small group of weights (e.g. 128) in the channel dimension shares quantization parameters (scale).
All embeddings and last linear layers are always compressed to 8-bit integer data type.
The percentage of the remaining layers compressed to 4-bit can be configured by the "ratio" parameter. E.g. ratio=0.9 means 90% of layers are compressed to the corresponding 4-bit data type and the rest to 8-bit integer data type.
The percentage of the remaining layers compressed to 4-bit can be configured by the "ratio" parameter. E.g. ratio=0.9 means 90% of layers are compressed to the corresponding 4-bit data type and the rest to 8-bit asymmetric integer data type.
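
For intuition, here is a minimal NumPy sketch of how the symmetric and asymmetric 8-bit schemes described above typically differ (illustrative only; the exact formulas, clipping and per-channel handling inside NNCF may differ, and the function names are not NNCF API):

```python
import numpy as np

def int8_sym_quantize(w: np.ndarray):
    # Symmetric: fixed zero point (128 for unsigned 8-bit), scale from the max absolute value.
    level_low, level_high, zero_point = 0, 255, 128
    scale = np.max(np.abs(w)) / (level_high - zero_point)
    q = np.clip(np.round(w / scale) + zero_point, level_low, level_high).astype(np.uint8)
    return q, scale, zero_point

def int8_asym_quantize(w: np.ndarray):
    # Asymmetric: non-fixed zero point derived from the min/max of the weights.
    level_low, level_high = 0, 255
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / (level_high - level_low)
    zero_point = int(np.round(-w_min / scale))
    q = np.clip(np.round(w / scale) + zero_point, level_low, level_high).astype(np.uint8)
    return q, scale, zero_point

w = np.random.randn(16, 32).astype(np.float32)
q_sym, s_sym, zp_sym = int8_sym_quantize(w)
q_asym, s_asym, zp_asym = int8_asym_quantize(w)
# Dequantization in both cases: w_hat = (q.astype(np.float32) - zero_point) * scale
```

In practice the scale (and the zero point for the asymmetric scheme) is computed per output channel, or per group for the 4-bit grouped modes, rather than per tensor as in this sketch.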

#### User guide

- Compress weights to 8-bit integer data type.
- Compress weights asymmetrically to 8-bit integer data type.

```python
from nncf import compress_weights
compressed_model = compress_weights(model)
```

- Compress weights symmetrically to 4-bit integer data type with group size = 128, except embeddings and last linear layers - they are compressed to 8-bit integer data type.
- Compress weights symmetrically to 8-bit integer data type.

```python
from nncf import compress_weights
from nncf import CompressWeightsMode
compressed_model = compress_weights(model, mode=CompressWeightsMode.INT8_SYM)
```

- Compress weights symmetrically to 4-bit integer data type with group size = 128, except embeddings and last linear layers - they are compressed asymmetrically to 8-bit integer data type.

```python
from nncf import compress_weights
@@ -36,7 +44,7 @@ compressed_model = compress_weights(model, mode=CompressWeightsMode.INT4_SYM)
If the accuracy or perplexity is still not satisfactory, there are two more hyper-parameters to tune: `group_size` and `ratio`.
A lower group size and a smaller ratio of 4-bit layers usually improve accuracy at the cost of inference speed.
Below is an example of how to compress the weights of 90% of layers to 4-bit integer asymmetrically with group size 64, and
the rest of the layers to 8-bit integer data type. The same parametrization is applicable for `INT4_SYM` mode.
the rest of the layers to 8-bit asymmetric integer data type. The same parametrization is applicable for `INT4_SYM` mode.

```python
from nncf import compress_weights
@@ -45,7 +53,7 @@ compressed_model = compress_weights(model, mode=CompressWeightsMode.INT4_ASYM, g
```

- `NF4` mode can be considered for improving accuracy, but currently models quantized to nf4 should not be faster than models
quantized to 8-bit integer. Here's an example of how to compress weights to the nf4 data type with group size = 128.
quantized to 8-bit asymmetric integer. Here's an example of how to compress weights to the nf4 data type with group size = 128.
Different `group_size` and `ratio` are also supported.

```python
@@ -79,7 +87,7 @@ Here is the perplexity and model size before and after weight compression for di
</tr>
<tr>
<td class="tg-0pky">databricks/dolly-v2-3b</td>
<td class="tg-0pky">int8</td>
<td class="tg-0pky">int8_asym</td>
<td class="tg-0pky">5.07</td>
<td class="tg-0pky">0.05</td>
<td class="tg-0pky">2.6</td>
@@ -107,7 +115,7 @@ Here is the perplexity and model size before and after weight compression for di
</tr>
<tr>
<td class="tg-0pky">facebook/opt-6.7b</td>
<td class="tg-0pky">int8</td>
<td class="tg-0pky">int8_asym</td>
<td class="tg-0pky">4.27</td>
<td class="tg-0pky">0.01</td>
<td class="tg-0pky">6.2</td>
@@ -135,7 +143,7 @@ Here is the perplexity and model size before and after weight compression for di
</tr>
<tr>
<td class="tg-0pky">meta-llama/Llama-2-7b-chat-hf</td>
<td class="tg-0pky">int8</td>
<td class="tg-0pky">int8_asym</td>
<td class="tg-0pky">3.29</td>
<td class="tg-0pky">0.01</td>
<td class="tg-0pky">6.3</td>
@@ -163,7 +171,7 @@ Here is the perplexity and model size before and after weight compression for di
</tr>
<tr>
<td class="tg-0pky">togethercomputer/RedPajama-INCITE-7B-Instruct</td>
<td class="tg-0pky">int8</td>
<td class="tg-0pky">int8_asym</td>
<td class="tg-0pky">4.17</td>
<td class="tg-0pky">0.02</td>
<td class="tg-0pky">6.4</td>
@@ -191,7 +199,7 @@ Here is the perplexity and model size before and after weight compression for di
</tr>
<tr>
<td class="tg-0pky">meta-llama/Llama-2-13b-chat-hf</td>
<td class="tg-0pky">int8</td>
<td class="tg-0pky">int8_asym</td>
<td class="tg-0pky">2.91</td>
<td class="tg-0pky">0</td>
<td class="tg-0pky">12.1</td>
@@ -218,7 +226,7 @@ Here is the perplexity and model size before and after weight compression for di
- The algorithm is supported for OpenVINO and PyTorch models.
- The compression applies in-place.
- The compressed model is not trainable.
- INT4_SYM, INT4_ASYM and NF4 modes, grouped quantization and mixed precision selection are available for the OpenVINO backend only.
- INT8_SYM, INT4_SYM, INT4_ASYM and NF4 modes, grouped quantization and mixed precision selection are available for the OpenVINO backend only.
- NF4 support is experimental - models quantized to nf4 should not be faster than models quantized to 8-bit integer.

#### Additional resources
11 changes: 9 additions & 2 deletions nncf/parameters.py
@@ -62,7 +62,11 @@ class DropType(Enum):
class CompressWeightsMode(Enum):
"""
Defines a mode for weight compression.
:param INT8: Stands for 8-bit integer quantization of all weights.
:param INT8_SYM: Stands for 8-bit integer symmetric quantization of all weights.
https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization
:param INT8_ASYM: The same as INT8_SYM mode, but weights are quantized to a primary precision asymmetrically
with a typical non-fixed zero point.
https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization
:param INT4_SYM: Stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
Weights are quantized to a primary precision symmetrically with a fixed zero point equal to 8.
All embeddings and the last layer are always compressed to a backup precision, which is 8-bit integer,
@@ -73,9 +77,12 @@ class CompressWeightsMode(Enum):
with a typical non-fixed zero point.
https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization
:param NF4: The same as INT4_SYM mode, but the primary precision is the NF4 data type without a zero point.
:param INT8: Mode is deprecated and will be removed in future releases. Please use `INT8_ASYM` instead.
"""

INT8 = "int8"
INT8_SYM = "int8_sym"
INT8_ASYM = "int8_asym"
INT4_SYM = "int4_sym"
INT4_ASYM = "int4_asym"
NF4 = "nf4"
INT8 = "int8" # Deprecated mode
Contributor comment: What do you think about leaving INT8 as an alias for INT8_ASYM?
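
For reference, a minimal sketch of what that alias could look like; this is not what the PR implements, since the PR keeps INT8 as a separate deprecated member and remaps it inside compress_weights:

```python
from enum import Enum

class CompressWeightsMode(Enum):
    INT8_SYM = "int8_sym"
    INT8_ASYM = "int8_asym"
    INT4_SYM = "int4_sym"
    INT4_ASYM = "int4_asym"
    NF4 = "nf4"
    # Reusing the value of INT8_ASYM turns INT8 into an Enum alias:
    # CompressWeightsMode.INT8 is CompressWeightsMode.INT8_ASYM -> True,
    # so no remapping or deprecation branch is needed downstream.
    INT8 = "int8_asym"
```

One trade-off of the alias is that the literal value "int8" no longer round-trips through the enum and there is no natural place to emit a deprecation warning, which is why the PR keeps INT8 as its own member and warns at the compress_weights entry point instead.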

@@ -54,7 +54,9 @@ def __init__(
):
"""
:param mode: Defines a mode for weight compression.
INT8 stands for 8-bit integer quantization of all weights.
INT8_SYM stands for 8-bit integer symmetric quantization of all weights.
INT8_ASYM is the same as INT8_SYM mode, but weights are quantized to a primary precision asymmetrically
with a typical non-fixed zero point.
INT4_SYM stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
Weights are quantized to a primary precision symmetrically with a fixed zero point equal to 8.
All embeddings and the last layer are always compressed to a backup precision, which is 8-bit integer,
8 changes: 6 additions & 2 deletions nncf/quantization/algorithms/weight_compression/backend.py
@@ -47,7 +47,9 @@ def validate_params(mode: CompressWeightsMode, ignored_scope: Optional[IgnoredSc
parameters. Should be called on early algorithm steps to prevent execution of time-consuming operations.

:param mode: Defines a mode for weight compression.
INT8 stands for 8-bit integer quantization of all weights.
INT8_SYM stands for 8-bit integer symmetric quantization of all weights.
INT8_ASYM is the same as INT8_SYM mode, but weights are quantized to a primary precision asymmetrically
with a typical non-fixed zero point.
INT4_SYM stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
Weights are quantized to a primary precision symmetrically with a fixed zero point equal to 8.
All embeddings and the last layer are always compressed to a backup precision, which is 8-bit integer,
@@ -77,7 +79,9 @@ def do_compression(
:param nodes_to_compress: List of nodes in the model's graph,
corresponding to the layers for weight compression.
:param mode: Defines a mode for weight compression.
INT8 stands for 8-bit integer quantization of all weights.
INT8_SYM stands for 8-bit integer symmetric quantization of all weights.
INT8_ASYM is the same as INT8_SYM mode, but weights are quantized to a primary precision asymmetrically
with a typical non-fixed zero point.
INT4_SYM stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
Weights are quantized to a primary precision symmetrically with a fixed zero point equal to 8.
All embeddings and the last layer are always compressed to a backup precision, which is 8-bit integer,
@@ -103,12 +103,12 @@ def do_compression(
quantized_nodes_ids.add(id(weight_node))

internal_weight_params = all_weight_params
if mode != CompressWeightsMode.INT8:
if mode not in [CompressWeightsMode.INT8_SYM, CompressWeightsMode.INT8_ASYM]:
internal_weight_params = list(filter(lambda wp: wp.metatype != OVEmbeddingMetatype, all_weight_params))
if not is_last_layer_compressed:
internal_weight_params = internal_weight_params[:-1]
primary_config = WeightCompressionConfig(mode=mode, group_size=group_size)
_assign_mixed_precision(internal_weight_params, ratio, primary_config)
primary_config = WeightCompressionConfig(mode=mode, group_size=group_size)
_assign_mixed_precision(internal_weight_params, ratio, primary_config)
nncf_logger.info(_get_bitwidth_distribution_str(all_weight_params, internal_weight_params))

for wp in track(all_weight_params, description="Applying Weight Compression"):
@@ -172,15 +172,15 @@ class WeightCompressionConfig:
The value -1 means no grouping. Defaults to -1.
"""

mode: Optional[CompressWeightsMode] = CompressWeightsMode.INT8
mode: Optional[CompressWeightsMode] = CompressWeightsMode.INT8_ASYM
group_size: Optional[int] = -1

@property
def num_bits(self):
"""
:return: number of bits that is used for storing a single quantized value in the given mode.
"""
return 8 if self.mode == CompressWeightsMode.INT8 else 4
return 8 if self.mode in [CompressWeightsMode.INT8_SYM, CompressWeightsMode.INT8_ASYM] else 4


@dataclass
@@ -212,7 +212,10 @@ def _do_integer_quantization(
"""
The method quantizes the given weights to integer data type in accordance with the compression config.
The config defines a quantization mode:
INT8 mode refers to unsigned int8 asymmetric weight compression - quantization to [0, 255] range.
INT8_SYM mode refers to unsigned int8 symmetric weight compression with a fixed zero point equal to 128 -
quantization to [0, 255] range.
INT8_ASYM mode refers to unsigned int8 asymmetric weight compression with a typical non-fixed zero-point -
quantization to [0, 255] range.
INT4_ASYM mode refers to unsigned int4 asymmetric weight compression with a typical non-fixed zero-point -
quantization to [0, 15] range.
INT4_SYM mode refers to unsigned int4 symmetric weight compression with a fixed zero point equal to 8 -
@@ -239,7 +242,7 @@
# weights are reshaped from [a1, r, a2] to [a1, r//gs, gs, a2]
weight, reduction_axis = _reshape_weights_for_grouped_quantization(weight, reduction_axis, group_size)

if mode in [CompressWeightsMode.INT8, CompressWeightsMode.INT4_ASYM]:
if mode in [CompressWeightsMode.INT8_ASYM, CompressWeightsMode.INT4_ASYM]:
min_values = np.min(weight, axis=reduction_axis, keepdims=True) # [a1, r, a2] -> [a1, 1, a2]
max_values = np.max(weight, axis=reduction_axis, keepdims=True) # [a1, r, a2] -> [a1, 1, a2]
scale, zero_point = calculate_scale_zero_point(
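
To make the grouped-quantization comment above concrete, here is a small, hypothetical NumPy sketch of the [a1, r, a2] -> [a1, r // gs, gs, a2] reshape, so that scales are later computed per group of gs weights along the reduction axis (this mirrors the idea only, not the exact NNCF helper):

```python
import numpy as np

def reshape_for_grouped_quantization(weight: np.ndarray, reduction_axis: int, group_size: int):
    # Split the reduction axis r into (r // group_size, group_size) so that
    # min/max/scale are later computed per group instead of per whole channel.
    assert weight.shape[reduction_axis] % group_size == 0
    shape = list(weight.shape)
    r = shape[reduction_axis]
    shape[reduction_axis:reduction_axis + 1] = [r // group_size, group_size]
    reshaped = weight.reshape(shape)
    # The new reduction axis is the inserted group dimension.
    return reshaped, reduction_axis + 1

w = np.random.randn(4, 256, 8).astype(np.float32)   # [a1, r, a2]
w_g, new_axis = reshape_for_grouped_quantization(w, reduction_axis=1, group_size=128)
print(w_g.shape)  # (4, 2, 128, 8) -> [a1, r // gs, gs, a2]
# min/max over `new_axis` now yields one scale per group of 128 weights:
mins = w_g.min(axis=new_axis, keepdims=True)   # shape (4, 2, 1, 8)
```
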
13 changes: 11 additions & 2 deletions nncf/quantization/quantize_model.py
@@ -12,6 +12,7 @@
from typing import Any, Callable, Iterable, List, Optional, Tuple, TypeVar, Union

from nncf.api.compression import TModel
from nncf.common.deprecation import warning_deprecated
from nncf.common.factory import NNCFGraphFactory
from nncf.common.quantization.structs import QuantizationPreset
from nncf.common.utils.api_marker import api
@@ -241,7 +242,7 @@ def quantize_with_accuracy_control(
@api(canonical_alias="nncf.compress_weights")
def compress_weights(
model: TModel,
mode=CompressWeightsMode.INT8,
mode=CompressWeightsMode.INT8_ASYM,
ratio: Optional[float] = None,
group_size: Optional[int] = None,
ignored_scope: Optional[IgnoredScope] = None,
@@ -251,7 +252,9 @@ def compress_weights(

:param model: A model to be compressed.
:param mode: Defines a mode for weight compression.
INT8 stands for 8-bit integer quantization of all weights.
INT8_SYM stands for 8-bit integer symmetric quantization of all weights.
INT8_ASYM is the same as INT8_SYM mode, but weights are quantized to a primary precision asymmetrically
with a typical non-fixed zero point.
INT4_SYM stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
Weights are quantized to a primary precision symmetrically with a fixed zero point equal to 8.
All embeddings and the last layer are always compressed to a backup precision, which is 8-bit integer,
@@ -269,6 +272,12 @@ def compress_weights(
:return: The non-trainable model with compressed weights.
"""
if mode == CompressWeightsMode.INT8:
warning_deprecated(
    "`CompressWeightsMode.INT8` is deprecated. Please use `CompressWeightsMode.INT8_ASYM` instead."
)
mode = CompressWeightsMode.INT8_ASYM

if mode in [CompressWeightsMode.INT8_ASYM, CompressWeightsMode.INT8_SYM]:
if ratio is None:
ratio = 1
if group_size is None:
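
Illustrative usage of the deprecation path added in this hunk, assuming `model` is an OpenVINO or Torch model loaded elsewhere: the old INT8 mode still works, but it emits a deprecation warning and is remapped to INT8_ASYM.

```python
from nncf import CompressWeightsMode, compress_weights

# Old spelling: still accepted, but warns and behaves as INT8_ASYM.
compressed_model = compress_weights(model, mode=CompressWeightsMode.INT8)

# Equivalent, warning-free spelling going forward.
compressed_model = compress_weights(model, mode=CompressWeightsMode.INT8_ASYM)
```
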
12 changes: 8 additions & 4 deletions nncf/torch/quantization/quantize_model.py
@@ -74,7 +74,7 @@ def quantize_impl(

def compress_weights_impl(
model: torch.nn.Module,
mode=CompressWeightsMode.INT8,
mode=CompressWeightsMode.INT8_ASYM,
ratio: Optional[float] = None,
group_size: Optional[int] = None,
ignored_scope: Optional[IgnoredScope] = None,
@@ -85,7 +85,9 @@ def compress_weights_impl(

:param model: a Torch model for compression.
:param mode: Defines a mode for weight compression.
INT8 stands for 8-bit integer quantization of all weights.
INT8_SYM stands for 8-bit integer symmetric quantization of all weights.
INT8_ASYM is the same as INT8_SYM mode, but weights are quantized to a primary precision asymmetrically
with a typical non-fixed zero point.
INT4_SYM stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
Weights are quantized to a primary precision symmetrically with a fixed zero point equal to 8.
All embeddings and the last layer are always compressed to a backup precision, which is 8-bit integer,
@@ -104,8 +106,10 @@
"""
if ignored_scope is not None:
raise AttributeError("Torch backend does not support ignored scope.")
if mode != CompressWeightsMode.INT8:
raise AttributeError(f"Torch backend supports only INT8 mode for weight compression, but given {mode} mode.")
if mode != CompressWeightsMode.INT8_ASYM:
raise AttributeError(
f"Torch backend supports only INT8_ASYM mode for weight compression, but given {mode} mode."
)
compressed_model, _ = replace_modules_by_nncf_modules(model)
insert_pre_compression_operations(model)
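
A short, hypothetical sketch of what the Torch-backend restriction above means for callers (the toy model is only for illustration):

```python
import torch
from nncf import CompressWeightsMode, compress_weights

# A toy model; any torch.nn.Module containing Linear/Embedding layers would do.
torch_model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))

# Works: the default INT8_ASYM mode is the only mode the Torch backend accepts here.
compressed = compress_weights(torch_model)

# Expected to raise AttributeError with the Torch backend in this version:
# compress_weights(torch_model, mode=CompressWeightsMode.INT4_SYM)
```
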
