int4 support for weight compression for OV backend (#2218)
### Changes

- The API is unchanged; two new modes were added:
`CompressWeightsMode.INT4_SYM` (u4 weights with a fake zero point of 8) and
`CompressWeightsMode.INT4_ASYM` (u4 weights with a real zero point).
The fake "ZP" is represented in the IR by a single u4 value (8) so that the
weights can be stored in u4 format. The runtime reports that processing u4
weights with a -8 shift is faster than processing i4 weights without a shift.

- Changed the mixed-precision criterion from "nf4_error/int8_error" to
"1/int8_error".
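
The two changes above can be sketched in NumPy. The helper names and the scale choices (`max|w| / 7` for u4, `max|w| / 127` for i8) are assumptions for illustration, not NNCF's actual implementation:

```python
import numpy as np

def int4_sym_quantize(w):
    """Illustrative symmetric 4-bit scheme: weights stored as u4 codes with a
    fixed zero point of 8, so the runtime dequantizes with a -8 shift."""
    scale = np.abs(w).max() / 7  # map the widest weight to code 15 (= 7 + 8)
    q = np.clip(np.round(w / scale) + 8, 0, 15).astype(np.uint8)  # u4 codes
    deq = (q.astype(np.float32) - 8) * scale  # runtime view: (u4 - 8) * scale
    return q, deq

def int8_error(w):
    """Hypothetical per-layer INT8 quantization error; the updated
    mixed-precision criterion ranks layers by 1 / int8_error."""
    scale = np.abs(w).max() / 127
    q = np.clip(np.round(w / scale), -128, 127)
    return float(np.mean((w - q * scale) ** 2))
```

With this scheme the reconstruction error per weight is bounded by half the scale, which the sketch makes easy to check.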

### Reason for changes

A faster 4-bit format for LLMs on CPU.

### Related tickets

123382

### Tests

- `test_compare_compressed_weights`
- `test_quantization_error_calculation`

- [x] openvino pre-commit with nightly package (build#8)

Accuracy, peak memory, and elapsed time for compression with the latest OV
master: 2023.2.0-13001-7720135f58d

![image](https://github.com/openvinotoolkit/nncf/assets/4014476/40770809-f539-423c-b25c-eb240db6fa9e)


INT4_ASYM with group size=3

![image](https://github.com/openvinotoolkit/nncf/assets/4014476/84a27616-0a98-473c-a34e-027e66f46f45)


INT4_SYM with group size=3

![image](https://github.com/openvinotoolkit/nncf/assets/4014476/fe8d833b-ec6f-4a41-989b-c8f085ba7b65)

INT8 per-channel

![image](https://github.com/openvinotoolkit/nncf/assets/4014476/65a5c8ae-8cbe-4cea-9264-6906bfd7930d)
ljaljushkin authored Oct 31, 2023
1 parent 62e73f4 commit c732de0
Showing 22 changed files with 1,076 additions and 417 deletions.
4 changes: 2 additions & 2 deletions nncf/common/graph/patterns/patterns.py

```diff
@@ -91,7 +91,7 @@ def __add__(self, other: "GraphPattern") -> "GraphPattern":
         Add DiGraph nodes of other to self and add edge between
         last node of self's graph and first node of other's graph.
-        The first and last nodes are found by nx.lexicographical_topological_sort().
+        The first and the last nodes are found by nx.lexicographical_topological_sort().
         For more complex cases that are not covered by this function, use `join_patterns()`.
@@ -195,7 +195,7 @@ def join_patterns(self, other: "GraphPattern", edges: Optional[List[Tuple[Hashab
         If edges is None, connect all weakly connected components of self and other by adding edges between
         the last nodes of every weakly component of self and the first nodes of every weakly component other.
-        The first and last nodes are found by nx.lexicographical_topological_sort().
+        The first and the last nodes are found by nx.lexicographical_topological_sort().
         # A: (a) (b)
         # B: (c) (d)
```
16 changes: 12 additions & 4 deletions nncf/parameters.py

```diff
@@ -62,12 +62,20 @@ class DropType(Enum):
 class CompressWeightsMode(Enum):
     """
     Defines a mode for weight compression.
     :param INT8: Stands for 8-bit integer quantization of all weights.
-    :param NF4: Stands for a mixed-precision weights quantization to NF4 data type. The first and last
-        layers are always compressed to a backup precision which is 8-bit integer by default. All others are quantized
-        whether to NF4 or to a backup precision depending on criteria and the given ratio.
+    :param INT4_SYM: Stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
+        Weights are quantized to a primary precision symmetrically with a fixed zero point equal to 8.
+        The first and the last layers are always compressed to a backup precision, which is 8-bit integer,
+        by default. All others are quantized either to 4-bit integer or to a backup precision depending on
+        criteria and the given ratio.
+        https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization
+    :param INT4_ASYM: The same as INT4_SYM mode, but weights are quantized to a primary precision asymmetrically
+        with a typical non-fixed zero point.
+        https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization
+    :param NF4: The same as INT4_SYM mode, but primary precision is NF4 data type without zero point.
     """

     INT8 = "int8"
+    INT4_SYM = "int4_sym"
+    INT4_ASYM = "int4_asym"
     NF4 = "nf4"
```
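
As a companion to the INT4_ASYM description in the docstring above, a minimal NumPy sketch of an asymmetric u4 scheme with a real, non-fixed zero point. The scale and zero-point formulas here are common textbook choices and are assumptions, not the committed NNCF code:

```python
import numpy as np

def int4_asym_quantize(w):
    """Illustrative asymmetric 4-bit scheme: u4 codes with a zero point
    derived from the weight range, so zero need not map to code 8."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 15 or 1.0  # guard against constant weights
    zp = int(np.clip(np.round(-lo / scale), 0, 15))  # real, non-fixed ZP
    q = np.clip(np.round(w / scale) + zp, 0, 15).astype(np.uint8)
    deq = (q.astype(np.float32) - zp) * scale
    return q, zp, deq
```

Unlike the symmetric mode, the zero point shifts with the weight range, which lets the full u4 code range cover skewed distributions.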
11 changes: 8 additions & 3 deletions nncf/quantization/algorithms/weight_compression/algorithm.py

```diff
@@ -55,9 +55,14 @@ def __init__(
         """
         :param mode: Defines a mode for weight compression.
             INT8 stands for 8-bit integer quantization of all weights.
-            NF4 stands for a mixed-precision weights quantization to NF4 data type. The first and last layers
-            are always compressed to a backup precision which is 8-bit integer by default. All others are quantized
-            whether to NF4 or to a backup precision depending on criteria and the given ratio.
+            INT4_SYM stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
+            Weights are quantized to a primary precision symmetrically with a fixed zero point equal to 8.
+            The first and the last layers are always compressed to a backup precision, which is 8-bit integer,
+            by default. All others are quantized either to 4-bit integer or to a backup precision depending on
+            criteria and the given ratio.
+            INT4_ASYM is the same as INT4_SYM mode, but weights are quantized to a primary precision asymmetrically
+            with a typical non-fixed zero point.
+            NF4 is the same as INT4_SYM mode, but primary precision is NF4 data type without zero point.
         :param ratio: the ratio between baseline and backup precisions (e.g. 0.9 means 90% of layers quantized to NF4
             and the rest to INT8).
         :param group_size: number of weights (e.g. 128) in the channel dimension
```
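
The `group_size` parameter documented above can be illustrated with a small sketch: each run of `group_size` consecutive weights along the channel dimension shares one scale. The function name and the max-abs/7 scale rule are assumptions for the sketch, not NNCF's implementation:

```python
import numpy as np

def groupwise_int4_scales(w, group_size=128):
    """Illustrative group-wise symmetric int4 quantization: one scale per
    group of `group_size` weights instead of one per channel or tensor."""
    assert w.size % group_size == 0, "channel dim must divide into groups"
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7  # one per group
    q = np.clip(np.round(groups / scales) + 8, 0, 15).astype(np.uint8)
    return q, scales
```

Smaller groups track local weight statistics more closely at the cost of storing more scales, which is the trade-off the parameter exposes.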
22 changes: 16 additions & 6 deletions nncf/quantization/algorithms/weight_compression/backend.py

```diff
@@ -48,9 +48,14 @@ def validate_params(mode: CompressWeightsMode, ignored_scope: Optional[IgnoredSc
         :param mode: Defines a mode for weight compression.
             INT8 stands for 8-bit integer quantization of all weights.
-            NF4 stands for a mixed-precision weights quantization to NF4 data type. The first and last layers
-            are always compressed to a backup precision which is 8-bit integer by default. All others are quantized
-            whether to NF4 or to a backup precision depending on criteria and the given ratio.
+            INT4_SYM stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
+            Weights are quantized to a primary precision symmetrically with a fixed zero point equal to 8.
+            The first and the last layers are always compressed to a backup precision, which is 8-bit integer,
+            by default. All others are quantized either to 4-bit integer or to a backup precision depending on
+            criteria and the given ratio.
+            INT4_ASYM is the same as INT4_SYM mode, but weights are quantized to a primary precision asymmetrically
+            with a typical non-fixed zero point.
+            NF4 is the same as INT4_SYM mode, but primary precision is NF4 data type without zero point.
         :param ignored_scope: An ignored scope that defines the list of model control
             flow graph nodes to be ignored during quantization.
         """
@@ -73,9 +78,14 @@ def do_compression(
             corresponding to the layers for weight compression.
         :param mode: Defines a mode for weight compression.
             INT8 stands for 8-bit integer quantization of all weights.
-            NF4 stands for a mixed-precision weights quantization to NF4 data type. The first and last layers
-            are always compressed to a backup precision which is 8-bit integer by default. All others are quantized
-            whether to NF4 or to a backup precision depending on criteria and the given ratio.
+            INT4_SYM stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
+            Weights are quantized to a primary precision symmetrically with a fixed zero point equal to 8.
+            The first and the last layers are always compressed to a backup precision, which is 8-bit integer,
+            by default. All others are quantized either to 4-bit integer or to a backup precision depending on
+            criteria and the given ratio.
+            INT4_ASYM is the same as INT4_SYM mode, but weights are quantized to a primary precision asymmetrically
+            with a typical non-fixed zero point.
+            NF4 is the same as INT4_SYM mode, but primary precision is NF4 data type without zero point.
         :param ratio: The ratio between baseline and backup precisions (e.g. 0.9 means 90% of layers quantized to NF4
             and the rest to INT8).
         :param group_size: Number of weights (e.g. 128) in the channel dimension
```
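
The `ratio` parameter in the docstrings above can be illustrated similarly. Below is a hypothetical split of layers between the 4-bit primary precision and the INT8 backup precision based on a per-layer error ranking; the actual NNCF ordering and criterion may differ:

```python
import numpy as np

def split_by_ratio(errors, ratio=0.9):
    """Illustrative layer split: the `ratio` share of layers with the lowest
    quantization error goes to the 4-bit primary precision, and the most
    error-sensitive remainder falls back to INT8 (hypothetical helper)."""
    order = np.argsort(errors)  # least error first
    k = int(round(ratio * len(errors)))
    primary = set(order[:k].tolist())
    backup = set(order[k:].tolist())
    return primary, backup
```

For example, with `ratio=0.9` and 100 layers, 90 would receive the 4-bit primary precision and 10 would stay in INT8.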