Represent symmetrically quantized weights in signed data type (#2434)
### Changes

Represent symmetrically quantized weights in a signed data type with no zero point.

### Reason for changes

* To detect the quantization type without analyzing zero-point values
* A signed data type for symmetrically quantized weights leads to a smaller footprint, especially in the case of grouped quantization, since no zero-point tensors need to be stored.
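
To make the distinction concrete, here is a minimal NumPy sketch (illustrative only, not NNCF code) of the two schemes: symmetric quantization to a signed range, which needs only a scale, versus asymmetric quantization to an unsigned range, which also has to store a zero point.

```python
import numpy as np

def quantize_symmetric_int4(w: np.ndarray):
    # Signed 4-bit range [-8, 7]; the zero point is implicitly 0, so only a scale is stored.
    level_low, level_high = -8, 7
    scale = np.max(np.abs(w)) / level_high
    w_q = np.clip(np.round(w / scale), level_low, level_high).astype(np.int8)
    return w_q, scale  # dequantize: w_q * scale

def quantize_asymmetric_uint4(w: np.ndarray):
    # Unsigned 4-bit range [0, 15]; a non-fixed zero point must be stored next to the scale.
    level_low, level_high = 0, 15
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / (level_high - level_low)
    zero_point = np.clip(np.round(-w_min / scale), level_low, level_high)
    w_q = np.clip(np.round(w / scale + zero_point), level_low, level_high).astype(np.uint8)
    return w_q, scale, zero_point  # dequantize: (w_q - zero_point) * scale

w = np.random.randn(16).astype(np.float32)
print(quantize_symmetric_int4(w)[0])    # values in [-8, 7], no zero point needed
print(quantize_asymmetric_uint4(w)[0])  # values in [0, 15], zero point required
```

With per-group parameters, the asymmetric variant stores a zero point for every group on top of every scale, which is exactly the extra footprint the symmetric signed representation avoids.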

### Related tickets

130625

### Tests

Updated: `tests/torch/ptq/test_weights_compression.py` and
`tests/openvino/native/quantization/test_weights_compression.py`


Merge after: openvinotoolkit/openvino#24457

Model | Backend | Metric name | Metric value | Metric diff | Num int4 | Num int8 | RAM MiB | Compr. time | Total time
-- | -- | -- | -- | -- | -- | -- | -- | -- | --
tinyllama_data_aware_awq_scale_estimation | OV | Similarity | 0.84048 | -0.15952 | 94 | 124 | 35560 | 0:06:31 | 0:08:36
tinyllama_data_aware_awq_scale_estimation_stateful | OV | Similarity | 0.84048 | -0.15952 | 94 | 124 | 36612 | 0:06:12 | 0:07:40
tinyllama_data_aware_awq_stateful | OV | Similarity | 0.85259 | -0.14741 | 94 | 124 | 34824 | 0:01:50 | 0:03:17
tinyllama_data_aware | OV | Similarity | 0.83853 | -0.16147 | 94 | 124 | 30604 | 0:01:25 | 0:03:30
tinyllama_data_aware_gptq | OV | Similarity | 0.82187 | -0.17813 | 94 | 124 | 39624 | 0:25:09 | 0:27:10
tinyllama_data_free | OV | Similarity | 0.72057 | -0.27943 | 114 | 84 | 6671 | 0:00:42 | 0:02:46
tinyllama_int8_data_free | TORCH | Similarity | 0.95624 | -0.04376 | 0 | 312 | 30161 | 0:00:09 | 0:02:54
l-bat authored Jun 13, 2024
1 parent 9200a22 commit 85b3263
Showing 16 changed files with 378 additions and 305 deletions.
@@ -9,7 +9,7 @@ The Weights Compression algorithm is aimed at compressing the weights of the model
 #### Supported modes
 
 By default, weights are compressed asymmetrically to 8-bit integer data type - "INT8_ASYM" mode.
-OpenVINO backend also supports 3 modes of mixed precision weight quantization with a 4-bit data type as a primary precision - INT4_SYM, INT4_ASYM and NF4. The primary precision in case of INT4_SYM mode is unsigned 4-bit integer and weights are quantized to it [symmetrically](/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#symmetric-quantization) with a fixed zero point equals to 8. In case of INT4_ASYM mode - also unsigned 4-bit integer, but weight are quantized to it [asymmetrically](/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#asymmetric-quantization) with a typical non-fixed zero point. In case of NF4 mode - [nf4](https://arxiv.org/pdf/2305.14314v1.pdf) data type without zero point.
+OpenVINO backend also supports 3 modes of mixed precision weight quantization with a 4-bit data type as a primary precision - INT4_SYM, INT4_ASYM and NF4. The primary precision in case of INT4_SYM mode is signed 4-bit integer and weights are quantized to it [symmetrically](/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#symmetric-quantization) without zero point. In case of INT4_ASYM mode - unsigned 4-bit integer, and weights are quantized to it [asymmetrically](/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#asymmetric-quantization) with a typical non-fixed zero point. In case of NF4 mode - [nf4](https://arxiv.org/pdf/2305.14314v1.pdf) data type without zero point.
 All 4-bit modes support grouped quantization, where a small group of weights (e.g. 128) in the channel dimension shares quantization parameters (scale).
 All embeddings, convolutions and last linear layers are always compressed to 8-bit integer data type. To quantize embeddings and last linear layers to 4-bit, use `all_layers=True`.
 The percentage of the remaining layers compressed to 4-bit can be configured by the "ratio" parameter, e.g. ratio=0.9 means 90% of layers are compressed to the corresponding 4-bit data type and the rest to 8-bit asymmetric integer data type.
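
As an illustration of the grouped quantization described in the documentation above, the following standalone NumPy sketch (not the NNCF implementation; group size and shapes are arbitrary) quantizes one weight row symmetrically to the signed INT4 range with one scale per group of 128 values:

```python
import numpy as np

GROUP_SIZE = 128  # e.g. 128 weights in the channel dimension share one scale

def quantize_int4_sym_grouped(w_row: np.ndarray, group_size: int = GROUP_SIZE):
    # Split the channel dimension into groups and compute one scale per group.
    groups = w_row.reshape(-1, group_size)
    scales = np.max(np.abs(groups), axis=1, keepdims=True) / 7  # signed INT4 range [-8, 7]
    w_q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return w_q, scales

w_row = np.random.randn(4096).astype(np.float32)
w_q, scales = quantize_int4_sym_grouped(w_row)
print(w_q.shape, scales.shape)  # (32, 128) quantized groups, (32, 1) per-group scales
```

Because the representation is symmetric and signed, only the per-group scales are stored; an asymmetric grouped scheme would additionally need one zero point per group.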
@@ -484,7 +484,7 @@ Here is the perplexity and accuracy with data-free and data-aware mixed-precision
 - The algorithm is supported for OpenVINO and PyTorch models.
 - The compression applies in-place.
 - The compressed model is not trainable.
-- INT8_SYM, INT4_SYM, INT4_ASYM and NF4 modes, grouped quantization and mixed precision selection is available for OpenVINO backend only.
+- INT4_SYM, INT4_ASYM and NF4 modes, grouped quantization and mixed precision selection is available for OpenVINO backend only.
 - NF4 support is experimental - models quantized to nf4 should not be faster than models quantized to 8-bit integer.
 
 #### Additional resources
4 changes: 2 additions & 2 deletions nncf/parameters.py
@@ -68,13 +68,13 @@ class CompressWeightsMode(StrEnum):
     """
     Defines a mode for weight compression.
     :param INT8_SYM: Stands for 8-bit integer symmetric quantization of all weights.
-        Weights are quantized symmetrically with a fixed zero point equals to 128.
+        Weights are quantized symmetrically without zero point.
         https://github.com/openvinotoolkit/nncf/blob/develop/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#symmetric-quantization
     :param INT8_ASYM: The same as INT8_SYM mode, but weights are quantized to a primary precision asymmetrically
         with a typical non-fixed zero point.
         https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization
     :param INT4_SYM: Stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
-        Weights are quantized to a primary precision symmetrically with a fixed zero point equals to 8.
+        Weights are quantized to a primary precision symmetrically without zero point.
         All embeddings and the last layer are always compressed to a backup precision, which is INT8_ASYM,
         by default. All others are quantized either to 4-bit integer or to a backup precision depending on
         criteria and the given ratio.
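
For context, a typical call that exercises these modes could look like the sketch below (the model path is hypothetical; `nncf.compress_weights` and `CompressWeightsMode` are the public API this docstring documents):

```python
import nncf
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")  # illustrative path

# INT4_SYM now maps to a signed 4-bit representation with no zero point;
# 90% of eligible layers go to 4-bit, the rest stay in the 8-bit backup precision.
compressed_model = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    ratio=0.9,
    group_size=128,
)
ov.save_model(compressed_model, "compressed_model.xml")
```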
4 changes: 2 additions & 2 deletions nncf/quantization/algorithms/weight_compression/algorithm.py
@@ -70,11 +70,11 @@ def __init__(
         """
         :param mode: Defines a mode for weight compression.
             INT8_SYM stands for 8-bit integer symmetric quantization of all weights.
-                Weights are quantized symmetrically with a fixed zero point equals to 128.
+                Weights are quantized symmetrically without zero point.
             INT8_ASYM is the same as INT8_SYM mode, but weights are quantized to a primary precision asymmetrically
                 with a typical non-fixed zero point.
             INT4_SYM stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
-                Weights are quantized to a primary precision symmetrically with a fixed zero point equals to 8.
+                Weights are quantized to a primary precision symmetrically without zero point.
                 All embeddings and the last layer are always compressed to a backup precision, which is INT8_ASYM,
                 by default. All others are quantized either to 4-bit integer or to a backup precision depending on
                 criteria and the given ratio.
14 changes: 6 additions & 8 deletions nncf/quantization/algorithms/weight_compression/gptq.py
@@ -263,7 +263,7 @@ def _quantize_weights(
                     quantized_col = decompress_nf4_weight(compressed_weights, scales[-1])
                 else:
                     compressed_weights = calculate_quantized_weight(
-                        fns.unsqueeze(weight_col, 1), scales[-1], zero_points[-1], block_compression_config
+                        fns.unsqueeze(weight_col, 1), block_compression_config, scales[-1], zero_points[-1]
                     )
                     quantized_col = do_dequantization(compressed_weights, scales[-1], zero_points[-1])
                     quantized_col = fns.flatten(quantized_col)
@@ -287,13 +287,11 @@ def _quantize_weights(
             )
 
         scales = fns.stack(scales, axis=1)
-        if wc_params.compression_config.mode == CompressWeightsMode.NF4:
-            zero_points = None
-        elif wc_params.compression_config.mode in [
-            CompressWeightsMode.INT8_SYM,
-            CompressWeightsMode.INT4_SYM,
+        if wc_params.compression_config.mode in [
+            CompressWeightsMode.INT8_ASYM,
+            CompressWeightsMode.INT4_ASYM,
         ]:
-            zero_points = fns.squeeze(zero_points[0])
-        else:
             zero_points = fns.stack(zero_points, axis=1)
+        else:
+            zero_points = None
         return scales, zero_points
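
To visualize the new return convention of this helper, here is a small NumPy sketch with made-up shapes (not NNCF code): scales are stacked along axis 1 for every mode, while zero points are kept only for the asymmetric modes and dropped otherwise.

```python
import numpy as np

def pack_gptq_outputs(per_group_scales, per_group_zero_points, asymmetric: bool):
    # Scales are always stacked along axis 1, one entry per quantization group.
    scales = np.stack(per_group_scales, axis=1)
    # Zero points are kept only for asymmetric modes; symmetric and NF4 return None.
    zero_points = np.stack(per_group_zero_points, axis=1) if asymmetric else None
    return scales, zero_points

out_channels, num_groups = 16, 4
scales, zps = pack_gptq_outputs(
    [np.random.rand(out_channels, 1) for _ in range(num_groups)],
    [np.random.randint(0, 16, (out_channels, 1)) for _ in range(num_groups)],
    asymmetric=False,
)
print(scales.shape, zps)  # (16, 4, 1) None
```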
@@ -22,6 +22,7 @@
 from nncf.quantization.algorithms.weight_compression.backend import WeightCompressionAlgoBackend
 from nncf.quantization.algorithms.weight_compression.config import WeightCompressionConfig
 from nncf.quantization.algorithms.weight_compression.config import WeightCompressionParameters
+from nncf.quantization.algorithms.weight_compression.weight_lowering import do_dequantization
 from nncf.quantization.algorithms.weight_compression.weight_lowering import do_integer_quantization
 from nncf.quantization.algorithms.weight_compression.weight_lowering import get_integer_quantization_error
 
@@ -176,7 +177,7 @@ def _calc_weight_sensitivity(self, weight_param: WeightCompressionParameters) ->
         weight = weight.astype(TensorDataType.float32)
 
         compressed_weights, scale, zero_point = do_integer_quantization(weight, reduction_axes, backup_config)
-        decompressed_weight = (compressed_weights - zero_point).astype(weight.dtype) * scale
+        decompressed_weight = do_dequantization(compressed_weights, scale, zero_point)
         decompressed_weight = decompressed_weight.reshape(orig_shape)
         return fns.linalg.norm(decompressed_weight - weight, ord="fro").item()
 
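To make the sensitivity metric concrete, here is a self-contained NumPy sketch (an approximation of the logic above, not the NNCF implementation) that quantizes a weight asymmetrically to 8 bits, dequantizes it, and measures the Frobenius norm of the introduced error:

```python
import numpy as np

def weight_sensitivity_int8_asym(weight: np.ndarray) -> float:
    # Asymmetric 8-bit backup quantization with per-output-channel parameters.
    w_min = weight.min(axis=1, keepdims=True)
    w_max = weight.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 255
    zero_point = np.round(-w_min / scale)
    w_q = np.clip(np.round(weight / scale + zero_point), 0, 255)

    # Dequantize and measure how much error the 8-bit representation introduces.
    decompressed = (w_q - zero_point) * scale
    return float(np.linalg.norm(decompressed - weight, ord="fro"))

weight = np.random.randn(64, 256).astype(np.float32)
print(weight_sensitivity_int8_asym(weight))
```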
60 changes: 32 additions & 28 deletions nncf/quantization/algorithms/weight_compression/openvino_backend.py
@@ -20,6 +20,7 @@
 from nncf.common.graph.transformations.commands import TargetType
 from nncf.common.graph.utils import get_reduction_axes
 from nncf.experimental.common.tensor_statistics.collectors import TensorCollector
+from nncf.experimental.tensor.definitions import TensorDataType
 from nncf.experimental.tensor.tensor import Tensor
 from nncf.openvino.graph.metatypes import openvino_metatypes as om
 from nncf.openvino.graph.model_transformer import OVModelTransformer
@@ -134,17 +135,14 @@ def transform_model(
             compression_config = wc_params.compression_config
             if compression_config.mode == CompressWeightsMode.NF4:
                 compression_dtype = ov.Type.nf4
-            elif compression_config.mode in [
-                CompressWeightsMode.INT8_ASYM,
-                CompressWeightsMode.INT8_SYM,
-                CompressWeightsMode.INT8,
-                CompressWeightsMode.INT4_ASYM,
-                CompressWeightsMode.INT4_SYM,
-            ]:
-                if compression_config.mode in [CompressWeightsMode.INT4_ASYM, CompressWeightsMode.INT4_SYM]:
-                    compression_dtype = ov.Type.u4
-                else:
-                    compression_dtype = ov.Type.u8
+            elif compression_config.mode == CompressWeightsMode.INT4_SYM:
+                compression_dtype = ov.Type.i4
+            elif compression_config.mode == CompressWeightsMode.INT4_ASYM:
+                compression_dtype = ov.Type.u4
+            elif compression_config.mode == CompressWeightsMode.INT8_SYM:
+                compression_dtype = ov.Type.i8
+            elif compression_config.mode == CompressWeightsMode.INT8_ASYM:
+                compression_dtype = ov.Type.u8
             else:
                 raise ValueError(f"{compression_config.mode.value} is not supported.")
 
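For reference, a rough sketch of the decompression subgraph that such a symmetric INT8 constant ends up with (illustrative values and names; the asymmetric branch would additionally subtract the zero-point constant before the multiply):

```python
import numpy as np
import openvino as ov
from openvino.runtime import opset13 as opset  # any recent opset provides these ops

# Hypothetical 2x4 weight already quantized to signed 8-bit and its per-row scales.
w_q = np.array([[-128, -64, 0, 127], [-12, 3, 45, -7]], dtype=np.int8)
scale = np.array([[0.02], [0.01]], dtype=np.float16)

compressed_const = opset.constant(w_q, dtype=ov.Type.i8, name="weight_i8")
converted_const = opset.convert(compressed_const, ov.Type.f16)
scale_const = opset.constant(scale, dtype=ov.Type.f16, name="weight_scale")
decompressed = opset.multiply(converted_const, scale_const)  # no zero-point subtract for symmetric mode
```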
@@ -175,7 +173,7 @@ def transform_model(
                 compressed_weight.tensor.data, dtype=compression_dtype, name=const_node_name
             )
             converted_const = opset.convert(compressed_const, ov.Type.f16)
-            if compressed_weight.zero_point is not None:
+            if compressed_weight.zero_point is not None and compressed_weight.tensor.dtype == TensorDataType.uint8:
                 zero_point_const = opset.constant(
                     compressed_weight.zero_point.data,
                     dtype=compression_dtype,
@@ -220,27 +218,28 @@ def dump_parameters(
 
     @staticmethod
     def get_compress_decompress_pipeline(
-        weight_compression_parameter: WeightCompressionParameters, w_shape, s_shape, z_p_shape
+        weight_compression_parameter: WeightCompressionParameters, w_shape, s_shape, z_p_shape=None
     ):
-        (
-            w,
-            s,
-            zp,
-            clamp,
-        ) = OVWeightCompressionAlgoBackend.get_compress_pipeline(
+        parameters, clamp = OVWeightCompressionAlgoBackend.get_compress_pipeline(
             weight_compression_parameter, w_shape, s_shape, z_p_shape, True
         )
 
-        result = (clamp - zp) * s
-        model = ov.Model([result], [w, s, zp])
+        if len(parameters) == 3:
+            _, s, zp = parameters
+            result = (clamp - zp) * s
+        else:
+            s = parameters[1]
+            result = clamp * s
+
+        model = ov.Model([result], parameters)
 
         compiled_model = ov.compile_model(model)
 
-        return lambda w, s, zp: compiled_model([w, s, zp])[0]
+        return lambda parameters: compiled_model(parameters)[0]
 
     @staticmethod
     def get_compress_pipeline(
-        weight_compression_parameter: WeightCompressionParameters, w_shape, s_shape, z_p_shape, return_nodes=False
+        weight_compression_parameter: WeightCompressionParameters, w_shape, s_shape, z_p_shape=None, return_nodes=False
     ):
         config = weight_compression_parameter.compression_config
         mode = config.mode
@@ -252,18 +251,23 @@
 
         w = opset.parameter(w_shape, name="w")
         s = opset.parameter(s_shape, name="s")
-        zp = opset.parameter(z_p_shape, name="zp")
+        parameters = [w, s]
+        compressed_w = w / s
+        if z_p_shape is not None:
+            zp = opset.parameter(z_p_shape, name="zp")
+            parameters.append(zp)
+            compressed_w += zp
 
-        result = opset.clamp(opset.round(w / s + zp), level_low, level_high, name="compressed_weights")
+        result = opset.clamp(opset.round(compressed_w), level_low, level_high, name="compressed_weights")
 
         if return_nodes:
-            return w, s, zp, result
+            return parameters, result
 
-        model = ov.Model([result], [w, s, zp])
+        model = ov.Model([result], parameters)
 
         compiled_model = ov.compile_model(model)
 
-        return lambda w, s, zp: compiled_model([w, s, zp])[0]
+        return lambda parameters: compiled_model(parameters)[0]


class OVAWQAlgoAlgoBackend(OVWeightCompressionAlgoBackend):
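Putting the optional-zero-point pipeline together, here is a standalone sketch of how such a compiled compression model can be built and called for the symmetric case (made-up shapes; this mirrors, but is not, the backend helper above):

```python
import numpy as np
import openvino as ov
from openvino.runtime import opset13 as opset  # any recent opset provides these ops

w_shape, s_shape = [16, 128], [16, 1]
level_low, level_high = -8, 7  # signed INT4 range used by the symmetric mode

# Build the compression graph: no zero-point parameter is created for symmetric quantization.
w = opset.parameter(w_shape, name="w")
s = opset.parameter(s_shape, name="s")
result = opset.clamp(opset.round(w / s), level_low, level_high, name="compressed_weights")
compiled = ov.compile_model(ov.Model([result], [w, s]))

weight = np.random.randn(*w_shape).astype(np.float32)
scale = np.abs(weight).max(axis=1, keepdims=True) / level_high
compressed = compiled([weight, scale.astype(np.float32)])[0]
print(compressed.min(), compressed.max())  # stays within [-8, 7]
```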
