Extend weight compression with INT8 symmetric scheme #2288

Merged (6 commits) on Dec 7, 2023
Changes from 3 commits
32 changes: 20 additions & 12 deletions docs/compression_algorithms/CompressWeights.md
@@ -8,22 +8,30 @@ The Weights Compression algorithm is aimed at compressing the weights of the mod

#### Supported modes

By default, weights are compressed to 8-bit integer data type - "INT8" mode.
By default, weights are compressed asymmetrically to 8-bit integer data type - "INT8_ASYM" mode.
The OpenVINO backend also supports 3 modes of mixed-precision weight quantization with a 4-bit data type as a primary precision - INT4_SYM, INT4_ASYM and NF4. The primary precision in case of INT4_SYM mode is an unsigned 4-bit integer and weights are quantized to it [symmetrically](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization) with a fixed zero point equal to 8. In case of INT4_ASYM mode it is also an unsigned 4-bit integer, but weights are quantized to it [asymmetrically](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization) with a typical non-fixed zero point. In case of NF4 mode it is the [nf4](https://arxiv.org/pdf/2305.14314v1.pdf) data type without a zero point.
All 4-bit modes support grouped quantization, where a small group of weights (e.g. 128) in the channel dimension shares quantization parameters (scale).
All embeddings and last linear layers are always compressed to 8-bit integer data type.
The percentage of the remaining layers compressed to 4-bit can be configured by the "ratio" parameter. E.g. ratio=0.9 means 90% of layers are compressed to the corresponding 4-bit data type and the rest to 8-bit integer data type.
The percentage of the remaining layers compressed to 4-bit can be configured by the "ratio" parameter. E.g. ratio=0.9 means 90% of layers are compressed to the corresponding 4-bit data type and the rest to 8-bit asymmetric integer data type.
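
For intuition, here is a minimal NumPy sketch of how the symmetric and asymmetric 8-bit schemes described above typically differ (illustrative only; the exact formulas, clipping and per-channel handling inside NNCF may differ, and the function names are not NNCF API):

```python
import numpy as np

def int8_sym_quantize(w: np.ndarray):
    # Symmetric: fixed zero point (128 for unsigned 8-bit), scale from the max absolute value.
    level_low, level_high, zero_point = 0, 255, 128
    scale = np.max(np.abs(w)) / (level_high - zero_point)
    q = np.clip(np.round(w / scale) + zero_point, level_low, level_high).astype(np.uint8)
    return q, scale, zero_point

def int8_asym_quantize(w: np.ndarray):
    # Asymmetric: non-fixed zero point derived from the min/max of the weights.
    level_low, level_high = 0, 255
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / (level_high - level_low)
    zero_point = int(np.round(-w_min / scale))
    q = np.clip(np.round(w / scale) + zero_point, level_low, level_high).astype(np.uint8)
    return q, scale, zero_point

w = np.random.randn(16, 32).astype(np.float32)
q_sym, s_sym, zp_sym = int8_sym_quantize(w)
q_asym, s_asym, zp_asym = int8_asym_quantize(w)
# Dequantization in both cases: w_hat = (q.astype(np.float32) - zero_point) * scale
```

In practice the scale (and the zero point for the asymmetric scheme) is computed per output channel, or per group for the 4-bit grouped modes, rather than per tensor as in this sketch.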

#### User guide

- Compress weights to 8-bit integer data type.
- Compress weights asymmetrically to 8-bit integer data type.

```python
from nncf import compress_weights
compressed_model = compress_weights(model)
```

- Compress weights symmetrically to 4-bit integer data type with group size = 128, except embeddings and last linear layers - they are compressed to 8-bit integer data type.
- Compress weights symmetrically to 8-bit integer data type.

```python
from nncf import compress_weights
from nncf import CompressWeightsMode
compressed_model = compress_weights(model, mode=CompressWeightsMode.INT8_SYM)
```

- Compress weights symmetrically to 4-bit integer data type with group size = 128, except embeddings and last linear layers - they are compressed asymmetrically to 8-bit integer data type.

```python
from nncf import compress_weights
@@ -36,7 +44,7 @@ compressed_model = compress_weights(model, mode=CompressWeightsMode.INT4_SYM)
If the accuracy or perplexity is still not satisfactory, there are two more hyper-parameters to tune: `group_size` and `ratio`.
A lower group size and a smaller ratio of 4-bit layers usually improve accuracy at the cost of inference speed.
Below is an example of how to compress the weights of 90% of layers to 4-bit integer asymmetrically with group size 64, and
the rest of the layers to 8-bit integer data type. The same parametrization is applicable for `INT4_SYM` mode.
the rest of the layers to 8-bit asymmetric integer data type. The same parametrization is applicable for `INT4_SYM` mode.

```python
from nncf import compress_weights
@@ -45,7 +53,7 @@ compressed_model = compress_weights(model, mode=CompressWeightsMode.INT4_ASYM, g
```

- `NF4` mode can be considered for improving accuracy, but currently models quantized to nf4 should not be faster than models
quantized to 8-bit integer. Here's an example of how to compress weights to the nf4 data type with group size = 128.
quantized to 8-bit asymmetric integer. Here's an example of how to compress weights to the nf4 data type with group size = 128.
Different `group_size` and `ratio` are also supported.

```python
@@ -79,7 +87,7 @@ Here is the perplexity and model size before and after weight compression for di
</tr>
<tr>
<td class="tg-0pky">databricks/dolly-v2-3b</td>
<td class="tg-0pky">int8</td>
<td class="tg-0pky">int8_asym</td>
<td class="tg-0pky">5.07</td>
<td class="tg-0pky">0.05</td>
<td class="tg-0pky">2.6</td>
@@ -107,7 +115,7 @@ Here is the perplexity and model size before and after weight compression for di
</tr>
<tr>
<td class="tg-0pky">facebook/opt-6.7b</td>
<td class="tg-0pky">int8</td>
<td class="tg-0pky">int8_asym</td>
<td class="tg-0pky">4.27</td>
<td class="tg-0pky">0.01</td>
<td class="tg-0pky">6.2</td>
@@ -135,7 +143,7 @@ Here is the perplexity and model size before and after weight compression for di
</tr>
<tr>
<td class="tg-0pky">meta-llama/Llama-2-7b-chat-hf</td>
<td class="tg-0pky">int8</td>
<td class="tg-0pky">int8_asym</td>
<td class="tg-0pky">3.29</td>
<td class="tg-0pky">0.01</td>
<td class="tg-0pky">6.3</td>
@@ -163,7 +171,7 @@ Here is the perplexity and model size before and after weight compression for di
</tr>
<tr>
<td class="tg-0pky">togethercomputer/RedPajama-INCITE-7B-Instruct</td>
<td class="tg-0pky">int8</td>
<td class="tg-0pky">int8_asym</td>
<td class="tg-0pky">4.17</td>
<td class="tg-0pky">0.02</td>
<td class="tg-0pky">6.4</td>
@@ -191,7 +199,7 @@ Here is the perplexity and model size before and after weight compression for di
</tr>
<tr>
<td class="tg-0pky">meta-llama/Llama-2-13b-chat-hf</td>
<td class="tg-0pky">int8</td>
<td class="tg-0pky">int8_asym</td>
<td class="tg-0pky">2.91</td>
<td class="tg-0pky">0</td>
<td class="tg-0pky">12.1</td>
@@ -218,7 +226,7 @@ Here is the perplexity and model size before and after weight compression for di
- The algorithm is supported for OpenVINO and PyTorch models.
- The compression applies in-place.
- The compressed model is not trainable.
- INT4_SYM, INT4_ASYM and NF4 modes, grouped quantization and mixed precision selection are available for the OpenVINO backend only.
- INT8_SYM, INT4_SYM, INT4_ASYM and NF4 modes, grouped quantization and mixed precision selection are available for the OpenVINO backend only.
- NF4 support is experimental - models quantized to nf4 should not be faster than models quantized to 8-bit integer.

#### Additional resources
11 changes: 9 additions & 2 deletions nncf/parameters.py
@@ -62,7 +62,11 @@ class DropType(Enum):
class CompressWeightsMode(Enum):
"""
Defines a mode for weight compression.
:param INT8: Stands for 8-bit integer quantization of all weights.
:param INT8_SYM: Stands for 8-bit integer symmetric quantization of all weights.
https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization
:param INT8_ASYM: The same as INT8_SYM mode, but weights are quantized to a primary precision asymmetrically
with a typical non-fixed zero point.
https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization
:param INT4_SYM: Stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
Weights are quantized to a primary precision symmetrically with a fixed zero point equal to 8.
All embeddings and the last layer are always compressed to a backup precision, which is 8-bit integer,
@@ -73,9 +77,12 @@ class CompressWeightsMode(Enum):
with a typical non-fixed zero point.
https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization
:param NF4: The same as INT4_SYM mode, but the primary precision is the NF4 data type without a zero point.
:param INT8: Mode is deprecated and will be removed in future releases. Please use `INT8_ASYM` instead.
"""

INT8 = "int8"
INT8_SYM = "int8_sym"
INT8_ASYM = "int8_asym"
INT4_SYM = "int4_sym"
INT4_ASYM = "int4_asym"
NF4 = "nf4"
INT8 = "int8" # Deprecated mode
Contributor comment: What do you think about leaving INT8 as an alias for INT8_ASYM?
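
For reference, a minimal sketch of what that alias could look like; this is not what the PR implements, since the PR keeps INT8 as a separate deprecated member and remaps it inside compress_weights:

```python
from enum import Enum

class CompressWeightsMode(Enum):
    INT8_SYM = "int8_sym"
    INT8_ASYM = "int8_asym"
    INT4_SYM = "int4_sym"
    INT4_ASYM = "int4_asym"
    NF4 = "nf4"
    # Reusing the value of INT8_ASYM turns INT8 into an Enum alias:
    # CompressWeightsMode.INT8 is CompressWeightsMode.INT8_ASYM -> True,
    # so no remapping or deprecation branch is needed downstream.
    INT8 = "int8_asym"
```

One trade-off of the alias is that the literal value "int8" no longer round-trips through the enum and there is no natural place to emit a deprecation warning, which is why the PR keeps INT8 as its own member and warns at the compress_weights entry point instead.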

@@ -54,7 +54,9 @@ def __init__(
):
"""
:param mode: Defines a mode for weight compression.
INT8 stands for 8-bit integer quantization of all weights.
INT8_SYM stands for 8-bit integer symmetric quantization of all weights.
INT8_ASYM is the same as INT8_SYM mode, but weights are quantized to a primary precision asymmetrically
with a typical non-fixed zero point.
INT4_SYM stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
Weights are quantized to a primary precision symmetrically with a fixed zero point equal to 8.
All embeddings and the last layer are always compressed to a backup precision, which is 8-bit integer,
8 changes: 6 additions & 2 deletions nncf/quantization/algorithms/weight_compression/backend.py
@@ -47,7 +47,9 @@ def validate_params(mode: CompressWeightsMode, ignored_scope: Optional[IgnoredSc
parameters. Should be called on early algorithm steps to prevent execution of time-consuming operations.

:param mode: Defines a mode for weight compression.
INT8 stands for 8-bit integer quantization of all weights.
INT8_SYM stands for 8-bit integer symmetric quantization of all weights.
INT8_ASYM is the same as INT8_SYM mode, but weights are quantized to a primary precision asymmetrically
with a typical non-fixed zero point.
INT4_SYM stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
Weights are quantized to a primary precision symmetrically with a fixed zero point equal to 8.
All embeddings and the last layer are always compressed to a backup precision, which is 8-bit integer,
@@ -77,7 +79,9 @@ def do_compression(
:param nodes_to_compress: List of nodes in the model's graph,
corresponding to the layers for weight compression.
:param mode: Defines a mode for weight compression.
INT8 stands for 8-bit integer quantization of all weights.
INT8_SYM stands for 8-bit integer symmetric quantization of all weights.
INT8_ASYM is the same as INT8_SYM mode, but weights are quantized to a primary precision asymmetrically
with a typical non-fixed zero point.
INT4_SYM stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
Weights are quantized to a primary precision symmetrically with a fixed zero point equal to 8.
All embeddings and the last layer are always compressed to a backup precision, which is 8-bit integer,
@@ -103,12 +103,12 @@ def do_compression(
quantized_nodes_ids.add(id(weight_node))

internal_weight_params = all_weight_params
if mode != CompressWeightsMode.INT8:
if mode not in [CompressWeightsMode.INT8_SYM, CompressWeightsMode.INT8_ASYM]:
internal_weight_params = list(filter(lambda wp: wp.metatype != OVEmbeddingMetatype, all_weight_params))
if not is_last_layer_compressed:
internal_weight_params = internal_weight_params[:-1]
primary_config = WeightCompressionConfig(mode=mode, group_size=group_size)
_assign_mixed_precision(internal_weight_params, ratio, primary_config)
primary_config = WeightCompressionConfig(mode=mode, group_size=group_size)
_assign_mixed_precision(internal_weight_params, ratio, primary_config)
nncf_logger.info(_get_bitwidth_distribution_str(all_weight_params, internal_weight_params))

for wp in track(all_weight_params, description="Applying Weight Compression"):
@@ -172,15 +172,15 @@ class WeightCompressionConfig:
The value -1 means no grouping. Defaults to -1.
"""

mode: Optional[CompressWeightsMode] = CompressWeightsMode.INT8
mode: Optional[CompressWeightsMode] = CompressWeightsMode.INT8_ASYM
group_size: Optional[int] = -1

@property
def num_bits(self):
"""
:return: number of bits that is used for storing a single quantized value in the given mode.
"""
return 8 if self.mode == CompressWeightsMode.INT8 else 4
return 8 if self.mode in [CompressWeightsMode.INT8_SYM, CompressWeightsMode.INT8_ASYM] else 4


@dataclass
@@ -212,7 +212,10 @@ def _do_integer_quantization(
"""
The method quantizes the given weights to integer data type in accordance with the compression config.
The config defines a quantization mode:
INT8 mode refers to unsigned int8 asymmetric weight compression - quantization to [0, 255] range.
INT8_SYM mode refers to unsigned int8 symmetric weight compression with a fixed zero point equal to 128 -
quantization to [0, 255] range.
INT8_ASYM mode refers to unsigned int8 asymmetric weight compression with a typical non-fixed zero-point -
quantization to [0, 255] range.
INT4_ASYM mode refers to unsigned int4 asymmetric weight compression with a typical non-fixed zero-point -
quantization to [0, 15] range.
INT4_SYM mode refers to unsigned int4 symmetric weight compression with a fixed zero point equal to 8 -
@@ -239,7 +242,7 @@
# weights are reshaped from [a1, r, a2] to [a1, r//gs, gs, a2]
weight, reduction_axis = _reshape_weights_for_grouped_quantization(weight, reduction_axis, group_size)

if mode in [CompressWeightsMode.INT8, CompressWeightsMode.INT4_ASYM]:
if mode in [CompressWeightsMode.INT8_ASYM, CompressWeightsMode.INT4_ASYM]:
min_values = np.min(weight, axis=reduction_axis, keepdims=True) # [a1, r, a2] -> [a1, 1, a2]
max_values = np.max(weight, axis=reduction_axis, keepdims=True) # [a1, r, a2] -> [a1, 1, a2]
scale, zero_point = calculate_scale_zero_point(
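
To make the grouped-quantization comment above concrete, here is a small, hypothetical NumPy sketch of the [a1, r, a2] -> [a1, r // gs, gs, a2] reshape, so that scales are later computed per group of gs weights along the reduction axis (this mirrors the idea only, not the exact NNCF helper):

```python
import numpy as np

def reshape_for_grouped_quantization(weight: np.ndarray, reduction_axis: int, group_size: int):
    # Split the reduction axis r into (r // group_size, group_size) so that
    # min/max/scale are later computed per group instead of per whole channel.
    assert weight.shape[reduction_axis] % group_size == 0
    shape = list(weight.shape)
    r = shape[reduction_axis]
    shape[reduction_axis:reduction_axis + 1] = [r // group_size, group_size]
    reshaped = weight.reshape(shape)
    # The new reduction axis is the inserted group dimension.
    return reshaped, reduction_axis + 1

w = np.random.randn(4, 256, 8).astype(np.float32)   # [a1, r, a2]
w_g, new_axis = reshape_for_grouped_quantization(w, reduction_axis=1, group_size=128)
print(w_g.shape)  # (4, 2, 128, 8) -> [a1, r // gs, gs, a2]
# min/max over `new_axis` now yields one scale per group of 128 weights:
mins = w_g.min(axis=new_axis, keepdims=True)   # shape (4, 2, 1, 8)
```
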
13 changes: 11 additions & 2 deletions nncf/quantization/quantize_model.py
@@ -12,6 +12,7 @@
from typing import Any, Callable, Iterable, List, Optional, Tuple, TypeVar, Union

from nncf.api.compression import TModel
from nncf.common.deprecation import warning_deprecated
from nncf.common.factory import NNCFGraphFactory
from nncf.common.quantization.structs import QuantizationPreset
from nncf.common.utils.api_marker import api
@@ -241,7 +242,7 @@ def quantize_with_accuracy_control(
@api(canonical_alias="nncf.compress_weights")
def compress_weights(
model: TModel,
mode=CompressWeightsMode.INT8,
mode=CompressWeightsMode.INT8_ASYM,
ratio: Optional[float] = None,
group_size: Optional[int] = None,
ignored_scope: Optional[IgnoredScope] = None,
@@ -251,7 +252,9 @@ def compress_weights(

:param model: A model to be compressed.
:param mode: Defines a mode for weight compression.
INT8 stands for 8-bit integer quantization of all weights.
INT8_SYM stands for 8-bit integer symmetric quantization of all weights.
INT8_ASYM is the same as INT8_SYM mode, but weights are quantized to a primary precision asymmetrically
with a typical non-fixed zero point.
INT4_SYM stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
Weights are quantized to a primary precision symmetrically with a fixed zero point equal to 8.
All embeddings and the last layer are always compressed to a backup precision, which is 8-bit integer,
@@ -269,6 +272,12 @@ def compress_weights(
:return: The non-trainable model with compressed weights.
"""
if mode == CompressWeightsMode.INT8:
warning_deprecated(
    "`CompressWeightsMode.INT8` is deprecated. Please use `CompressWeightsMode.INT8_ASYM` instead."
)
mode = CompressWeightsMode.INT8_ASYM

if mode in [CompressWeightsMode.INT8_ASYM, CompressWeightsMode.INT8_SYM]:
if ratio is None:
ratio = 1
if group_size is None:
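
Illustrative usage of the deprecation path added in this hunk, assuming `model` is an OpenVINO or Torch model loaded elsewhere: the old INT8 mode still works, but it emits a deprecation warning and is remapped to INT8_ASYM.

```python
from nncf import CompressWeightsMode, compress_weights

# Old spelling: still accepted, but warns and behaves as INT8_ASYM.
compressed_model = compress_weights(model, mode=CompressWeightsMode.INT8)

# Equivalent, warning-free spelling going forward.
compressed_model = compress_weights(model, mode=CompressWeightsMode.INT8_ASYM)
```
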
12 changes: 8 additions & 4 deletions nncf/torch/quantization/quantize_model.py
@@ -74,7 +74,7 @@ def quantize_impl(

def compress_weights_impl(
model: torch.nn.Module,
mode=CompressWeightsMode.INT8,
mode=CompressWeightsMode.INT8_ASYM,
ratio: Optional[float] = None,
group_size: Optional[int] = None,
ignored_scope: Optional[IgnoredScope] = None,
@@ -85,7 +85,9 @@ def compress_weights_impl(

:param model: a Torch model for compression.
:param mode: Defines a mode for weight compression.
INT8 stands for 8-bit integer quantization of all weights.
INT8_SYM stands for 8-bit integer symmetric quantization of all weights.
INT8_ASYM is the same as INT8_SYM mode, but weights are quantized to a primary precision asymmetrically
with a typical non-fixed zero point.
INT4_SYM stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
Weights are quantized to a primary precision symmetrically with a fixed zero point equal to 8.
All embeddings and the last layer are always compressed to a backup precision, which is 8-bit integer,
@@ -104,8 +106,10 @@
"""
if ignored_scope is not None:
raise AttributeError("Torch backend does not support ignored scope.")
if mode != CompressWeightsMode.INT8:
raise AttributeError(f"Torch backend supports only INT8 mode for weight compression, but given {mode} mode.")
if mode != CompressWeightsMode.INT8_ASYM:
raise AttributeError(
f"Torch backend supports only INT8_ASYM mode for weight compression, but given {mode} mode."
)
compressed_model, _ = replace_modules_by_nncf_modules(model)
insert_pre_compression_operations(model)
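
A short, hypothetical sketch of what the Torch-backend restriction above means for callers (the toy model is only for illustration):

```python
import torch
from nncf import CompressWeightsMode, compress_weights

# A toy model; any torch.nn.Module containing Linear/Embedding layers would do.
torch_model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))

# Works: the default INT8_ASYM mode is the only mode the Torch backend accepts here.
compressed = compress_weights(torch_model)

# Expected to raise AttributeError with the Torch backend in this version:
# compress_weights(torch_model, mode=CompressWeightsMode.INT4_SYM)
```
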
