int4 support for weight compression for OV backend (#2218)
### Changes

- The API is unchanged; two new modes were added:
`CompressWeightsMode.INT4_SYM` (u4 weights with a fake zero point of 8) and
`CompressWeightsMode.INT4_ASYM` (u4 weights with a real zero point).
The fake "ZP" is represented in the IR by a single u4 value (8) so that the
weights can be stored in u4 format. The runtime reports that processing u4
weights with a -8 shift is faster than processing i4 weights without a shift.

- Changed the mixed-precision criterion from "nf4_error/int8_error" to
"1/int8_error".
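
The two changes above can be sketched in NumPy. The helper names and the scale choices (`max|w| / 7` for u4, `max|w| / 127` for i8) are assumptions for illustration, not NNCF's actual implementation:

```python
import numpy as np

def int4_sym_quantize(w):
    """Illustrative symmetric 4-bit scheme: weights stored as u4 codes with a
    fixed zero point of 8, so the runtime dequantizes with a -8 shift."""
    scale = np.abs(w).max() / 7  # map the widest weight to code 15 (= 7 + 8)
    q = np.clip(np.round(w / scale) + 8, 0, 15).astype(np.uint8)  # u4 codes
    deq = (q.astype(np.float32) - 8) * scale  # runtime view: (u4 - 8) * scale
    return q, deq

def int8_error(w):
    """Hypothetical per-layer INT8 quantization error; the updated
    mixed-precision criterion ranks layers by 1 / int8_error."""
    scale = np.abs(w).max() / 127
    q = np.clip(np.round(w / scale), -128, 127)
    return float(np.mean((w - q * scale) ** 2))
```

With this scheme the reconstruction error per weight is bounded by half the scale, which the sketch makes easy to check.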

### Reason for changes

A faster 4-bit format for LLMs on CPU.

### Related tickets

123382

### Tests

- `test_compare_compressed_weights`
- `test_quantization_error_calculation`

- [x] openvino pre-commit with nightly package (build#8)

Accuracy, peak memory, and elapsed time for compression with the latest OV
master: 2023.2.0-13001-7720135f58d

![image](https://github.com/openvinotoolkit/nncf/assets/4014476/40770809-f539-423c-b25c-eb240db6fa9e)


INT4_ASYM with group size=3

![image](https://github.com/openvinotoolkit/nncf/assets/4014476/84a27616-0a98-473c-a34e-027e66f46f45)


INT4_SYM with group size=3

![image](https://github.com/openvinotoolkit/nncf/assets/4014476/fe8d833b-ec6f-4a41-989b-c8f085ba7b65)

INT8 per-channel

![image](https://github.com/openvinotoolkit/nncf/assets/4014476/65a5c8ae-8cbe-4cea-9264-6906bfd7930d)
ljaljushkin authored Oct 31, 2023
1 parent 62e73f4 commit c732de0
Showing 22 changed files with 1,076 additions and 417 deletions.
4 changes: 2 additions & 2 deletions nncf/common/graph/patterns/patterns.py

```diff
@@ -91,7 +91,7 @@ def __add__(self, other: "GraphPattern") -> "GraphPattern":
         Add DiGraph nodes of other to self and add edge between
         last node of self's graph and first node of other's graph.
-        The first and last nodes are found by nx.lexicographical_topological_sort().
+        The first and the last nodes are found by nx.lexicographical_topological_sort().
         For more complex cases that are not covered by this function, use `join_patterns()`.
@@ -195,7 +195,7 @@ def join_patterns(self, other: "GraphPattern", edges: Optional[List[Tuple[Hashab
         If edges is None, connect all weakly connected components of self and other by adding edges between
         the last nodes of every weakly component of self and the first nodes of every weakly component other.
-        The first and last nodes are found by nx.lexicographical_topological_sort().
+        The first and the last nodes are found by nx.lexicographical_topological_sort().
         # A: (a) (b)
         # B: (c) (d)
```
16 changes: 12 additions & 4 deletions nncf/parameters.py

```diff
@@ -62,12 +62,20 @@ class DropType(Enum):
 class CompressWeightsMode(Enum):
     """
     Defines a mode for weight compression.
     :param INT8: Stands for 8-bit integer quantization of all weights.
-    :param NF4: Stands for a mixed-precision weights quantization to NF4 data type. The first and last
-        layers are always compressed to a backup precision which is 8-bit integer by default. All others are quantized
-        whether to NF4 or to a backup precision depending on criteria and the given ratio.
+    :param INT4_SYM: Stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
+        Weights are quantized to a primary precision symmetrically with a fixed zero point equal to 8.
+        The first and the last layers are always compressed to a backup precision, which is 8-bit integer,
+        by default. All others are quantized either to 4-bit integer or to a backup precision depending on
+        criteria and the given ratio.
+        https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization
+    :param INT4_ASYM: The same as INT4_SYM mode, but weights are quantized to a primary precision asymmetrically
+        with a typical non-fixed zero point.
+        https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization
+    :param NF4: The same as INT4_SYM mode, but primary precision is NF4 data type without zero point.
     """

     INT8 = "int8"
+    INT4_SYM = "int4_sym"
+    INT4_ASYM = "int4_asym"
     NF4 = "nf4"
```
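
As a companion to the INT4_ASYM description in the docstring above, a minimal NumPy sketch of an asymmetric u4 scheme with a real, non-fixed zero point. The scale and zero-point formulas here are common textbook choices and are assumptions, not the committed NNCF code:

```python
import numpy as np

def int4_asym_quantize(w):
    """Illustrative asymmetric 4-bit scheme: u4 codes with a zero point
    derived from the weight range, so zero need not map to code 8."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 15 or 1.0  # guard against constant weights
    zp = int(np.clip(np.round(-lo / scale), 0, 15))  # real, non-fixed ZP
    q = np.clip(np.round(w / scale) + zp, 0, 15).astype(np.uint8)
    deq = (q.astype(np.float32) - zp) * scale
    return q, zp, deq
```

Unlike the symmetric mode, the zero point shifts with the weight range, which lets the full u4 code range cover skewed distributions.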
11 changes: 8 additions & 3 deletions nncf/quantization/algorithms/weight_compression/algorithm.py

```diff
@@ -55,9 +55,14 @@ def __init__(
         """
         :param mode: Defines a mode for weight compression.
             INT8 stands for 8-bit integer quantization of all weights.
-            NF4 stands for a mixed-precision weights quantization to NF4 data type. The first and last layers
-            are always compressed to a backup precision which is 8-bit integer by default. All others are quantized
-            whether to NF4 or to a backup precision depending on criteria and the given ratio.
+            INT4_SYM stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
+            Weights are quantized to a primary precision symmetrically with a fixed zero point equal to 8.
+            The first and the last layers are always compressed to a backup precision, which is 8-bit integer,
+            by default. All others are quantized either to 4-bit integer or to a backup precision depending on
+            criteria and the given ratio.
+            INT4_ASYM is the same as INT4_SYM mode, but weights are quantized to a primary precision asymmetrically
+            with a typical non-fixed zero point.
+            NF4 is the same as INT4_SYM mode, but primary precision is NF4 data type without zero point.
         :param ratio: the ratio between baseline and backup precisions (e.g. 0.9 means 90% of layers quantized to NF4
             and the rest to INT8).
         :param group_size: number of weights (e.g. 128) in the channel dimension
```
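
The `group_size` parameter documented above can be illustrated with a small sketch: each run of `group_size` consecutive weights along the channel dimension shares one scale. The function name and the max-abs/7 scale rule are assumptions for the sketch, not NNCF's implementation:

```python
import numpy as np

def groupwise_int4_scales(w, group_size=128):
    """Illustrative group-wise symmetric int4 quantization: one scale per
    group of `group_size` weights instead of one per channel or tensor."""
    assert w.size % group_size == 0, "channel dim must divide into groups"
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7  # one per group
    q = np.clip(np.round(groups / scales) + 8, 0, 15).astype(np.uint8)
    return q, scales
```

Smaller groups track local weight statistics more closely at the cost of storing more scales, which is the trade-off the parameter exposes.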
22 changes: 16 additions & 6 deletions nncf/quantization/algorithms/weight_compression/backend.py

```diff
@@ -48,9 +48,14 @@ def validate_params(mode: CompressWeightsMode, ignored_scope: Optional[IgnoredSc
         :param mode: Defines a mode for weight compression.
             INT8 stands for 8-bit integer quantization of all weights.
-            NF4 stands for a mixed-precision weights quantization to NF4 data type. The first and last layers
-            are always compressed to a backup precision which is 8-bit integer by default. All others are quantized
-            whether to NF4 or to a backup precision depending on criteria and the given ratio.
+            INT4_SYM stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
+            Weights are quantized to a primary precision symmetrically with a fixed zero point equal to 8.
+            The first and the last layers are always compressed to a backup precision, which is 8-bit integer,
+            by default. All others are quantized either to 4-bit integer or to a backup precision depending on
+            criteria and the given ratio.
+            INT4_ASYM is the same as INT4_SYM mode, but weights are quantized to a primary precision asymmetrically
+            with a typical non-fixed zero point.
+            NF4 is the same as INT4_SYM mode, but primary precision is NF4 data type without zero point.
         :param ignored_scope: An ignored scope that defines the list of model control
             flow graph nodes to be ignored during quantization.
         """
@@ -73,9 +78,14 @@ def do_compression(
             corresponding to the layers for weight compression.
         :param mode: Defines a mode for weight compression.
             INT8 stands for 8-bit integer quantization of all weights.
-            NF4 stands for a mixed-precision weights quantization to NF4 data type. The first and last layers
-            are always compressed to a backup precision which is 8-bit integer by default. All others are quantized
-            whether to NF4 or to a backup precision depending on criteria and the given ratio.
+            INT4_SYM stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
+            Weights are quantized to a primary precision symmetrically with a fixed zero point equal to 8.
+            The first and the last layers are always compressed to a backup precision, which is 8-bit integer,
+            by default. All others are quantized either to 4-bit integer or to a backup precision depending on
+            criteria and the given ratio.
+            INT4_ASYM is the same as INT4_SYM mode, but weights are quantized to a primary precision asymmetrically
+            with a typical non-fixed zero point.
+            NF4 is the same as INT4_SYM mode, but primary precision is NF4 data type without zero point.
         :param ratio: The ratio between baseline and backup precisions (e.g. 0.9 means 90% of layers quantized to NF4
             and the rest to INT8).
         :param group_size: Number of weights (e.g. 128) in the channel dimension
```
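
The `ratio` parameter in the docstrings above can be illustrated similarly. Below is a hypothetical split of layers between the 4-bit primary precision and the INT8 backup precision based on a per-layer error ranking; the actual NNCF ordering and criterion may differ:

```python
import numpy as np

def split_by_ratio(errors, ratio=0.9):
    """Illustrative layer split: the `ratio` share of layers with the lowest
    quantization error goes to the 4-bit primary precision, and the most
    error-sensitive remainder falls back to INT8 (hypothetical helper)."""
    order = np.argsort(errors)  # least error first
    k = int(round(ratio * len(errors)))
    primary = set(order[:k].tolist())
    backup = set(order[k:].tolist())
    return primary, backup
```

For example, with `ratio=0.9` and 100 layers, 90 would receive the 4-bit primary precision and 10 would stay in INT8.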