Extend weight compression with INT8 symmetric scheme #2288

Merged: 6 commits, Dec 7, 2023
Changes from all commits
32 changes: 20 additions & 12 deletions docs/compression_algorithms/CompressWeights.md
@@ -8,22 +8,30 @@ The Weights Compression algorithm is aimed at compressing the weights of the mod

#### Supported modes

-By default, weights are compressed to 8-bit integer data type - "INT8" mode.
+By default, weights are compressed asymmetrically to 8-bit integer data type - "INT8_ASYM" mode.
The OpenVINO backend also supports 3 modes of mixed-precision weight quantization with a 4-bit data type as the primary precision - INT4_SYM, INT4_ASYM and NF4. In INT4_SYM mode the primary precision is an unsigned 4-bit integer and weights are quantized to it [symmetrically](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization) with a fixed zero point equal to 8. In INT4_ASYM mode the primary precision is also an unsigned 4-bit integer, but weights are quantized to it [asymmetrically](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization) with a typical non-fixed zero point. In NF4 mode the primary precision is the [nf4](https://arxiv.org/pdf/2305.14314v1.pdf) data type without a zero point.
All 4-bit modes support grouped quantization, where a small group of weights (e.g. 128) in the channel dimension shares quantization parameters (scale).
All embeddings and last linear layers are always compressed to 8-bit integer data type.
-Percent of the rest layers compressed to 4-bit can be configured by "ratio" parameter. E.g. ratio=0.9 means 90% of layers compressed to the corresponding 4-bit data type and the rest to 8-bit integer data type.
+Percent of the rest layers compressed to 4-bit can be configured by "ratio" parameter. E.g. ratio=0.9 means 90% of layers compressed to the corresponding 4-bit data type and the rest to 8-bit asymmetric integer data type.
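
To make the two 8-bit schemes concrete, below is a small numpy sketch of per-channel quantization: symmetric with the fixed zero point of 128 versus asymmetric with a computed zero point. It is an illustration only, not NNCF code, and the max-based scale choice and per-output-channel axis are assumptions:

```python
import numpy as np

def int8_sym(w: np.ndarray):
    # Symmetric: zero point fixed at 128, scale taken from the largest magnitude per output channel.
    # (Degenerate all-zero channels are ignored in this sketch.)
    scale = np.max(np.abs(w), axis=-1, keepdims=True) / 127
    q = np.clip(np.round(w / scale) + 128, 0, 255).astype(np.uint8)
    return q, scale

def int8_asym(w: np.ndarray):
    # Asymmetric: zero point computed from the per-channel minimum, so the full [0, 255] range is used.
    w_min = np.min(w, axis=-1, keepdims=True)
    w_max = np.max(w, axis=-1, keepdims=True)
    scale = (w_max - w_min) / 255
    zero_point = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point
```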

#### User guide

-- Compress weights to 8-bit integer data type.
+- Compress weights asymmetrically to 8-bit integer data type.

```python
from nncf import compress_weights
compressed_model = compress_weights(model)
```
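
For context, a minimal end-to-end sketch around this call for an OpenVINO model is shown below; the IR path, the output file name, and the availability of `ov.save_model` are assumptions to check against your OpenVINO version:

```python
import openvino as ov
from nncf import compress_weights

core = ov.Core()
model = core.read_model("model.xml")        # placeholder path to an OpenVINO IR
compressed_model = compress_weights(model)  # INT8_ASYM by default; the model is modified in place
ov.save_model(compressed_model, "model_int8_asym.xml")
```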

- Compress weights symmetrically to 4-bit integer data type with group size = 128, except embeddings and last linear layers - they are compressed to 8-bit integer data type.
- Compress weights symmetrically to 8-bit integer data type.

```python
from nncf import compress_weights
from nncf import CompressWeightsMode
compressed_model = compress_weights(model, mode=CompressWeightsMode.INT8_SYM)
```

- Compress weights symmetrically to 4-bit integer data type with group size = 128, except embeddings and last linear layers - they are compressed asymmetrically to 8-bit integer data type.

```python
from nncf import compress_weights
from nncf import CompressWeightsMode
compressed_model = compress_weights(model, mode=CompressWeightsMode.INT4_SYM)
```

If the accuracy or perplexity is still not satisfactory, there are two more hyper-parameters to tune: `group_size` and `ratio`.
A lower group size and a smaller ratio of 4-bit layers usually improve accuracy at the cost of inference speed.
Below is an example of how to compress the weights of 90% of layers to 4-bit integer asymmetrically with group size 64, and
-the rest of layers to 8-bit integer data type. The same parametrization is applicable for `INT4_SYM` mode.
+the rest of layers to 8-bit asymmetric integer data type. The same parametrization is applicable for `INT4_SYM` mode.

```python
from nncf import compress_weights
from nncf import CompressWeightsMode
compressed_model = compress_weights(model, mode=CompressWeightsMode.INT4_ASYM, group_size=64, ratio=0.9)
```
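
If you want to search over `ratio` values automatically, a hypothetical sweep could look like the sketch below. Because compression happens in place, the original IR is re-read before each trial; `evaluate_perplexity` and `"model.xml"` are placeholders you would supply:

```python
import openvino as ov
from nncf import compress_weights
from nncf import CompressWeightsMode

core = ov.Core()
best_ppl, best_model = float("inf"), None
for ratio in (0.9, 0.8, 0.6):
    model = core.read_model("model.xml")  # re-read, since compress_weights modifies the model in place
    candidate = compress_weights(model, mode=CompressWeightsMode.INT4_ASYM, group_size=64, ratio=ratio)
    ppl = evaluate_perplexity(candidate)  # placeholder: your own perplexity evaluation routine
    if ppl < best_ppl:
        best_ppl, best_model = ppl, candidate
```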

- `NF4` mode can be considered for improving accuracy, but currently models quantized to nf4 should not be faster than models
-quantized to 8-bit integer. Here's the example how to compress weights to nf4 data type with group size = 128.
+quantized to 8-bit asymmetric integer. Here's the example how to compress weights to nf4 data type with group size = 128.
Different `group_size` and `ratio` are also supported.

```python
from nncf import compress_weights
from nncf import CompressWeightsMode
compressed_model = compress_weights(model, mode=CompressWeightsMode.NF4, group_size=128)
```

@@ -79,7 +87,7 @@ Here is the perplexity and model size before and after weight compression for di
</tr>
<tr>
<td class="tg-0pky">databricks/dolly-v2-3b</td>
-<td class="tg-0pky">int8</td>
+<td class="tg-0pky">int8_asym</td>
<td class="tg-0pky">5.07</td>
<td class="tg-0pky">0.05</td>
<td class="tg-0pky">2.6</td>
@@ -107,7 +115,7 @@ Here is the perplexity and model size before and after weight compression for di
</tr>
<tr>
<td class="tg-0pky">facebook/opt-6.7b</td>
-<td class="tg-0pky">int8</td>
+<td class="tg-0pky">int8_asym</td>
<td class="tg-0pky">4.27</td>
<td class="tg-0pky">0.01</td>
<td class="tg-0pky">6.2</td>
@@ -135,7 +143,7 @@ Here is the perplexity and model size before and after weight compression for di
</tr>
<tr>
<td class="tg-0pky">meta-llama/Llama-2-7b-chat-hf</td>
-<td class="tg-0pky">int8</td>
+<td class="tg-0pky">int8_asym</td>
<td class="tg-0pky">3.29</td>
<td class="tg-0pky">0.01</td>
<td class="tg-0pky">6.3</td>
@@ -163,7 +171,7 @@ Here is the perplexity and model size before and after weight compression for di
</tr>
<tr>
<td class="tg-0pky">togethercomputer/RedPajama-INCITE-7B-Instruct</td>
-<td class="tg-0pky">int8</td>
+<td class="tg-0pky">int8_asym</td>
<td class="tg-0pky">4.17</td>
<td class="tg-0pky">0.02</td>
<td class="tg-0pky">6.4</td>
@@ -191,7 +199,7 @@ Here is the perplexity and model size before and after weight compression for di
</tr>
<tr>
<td class="tg-0pky">meta-llama/Llama-2-13b-chat-hf</td>
-<td class="tg-0pky">int8</td>
+<td class="tg-0pky">int8_asym</td>
<td class="tg-0pky">2.91</td>
<td class="tg-0pky">0</td>
<td class="tg-0pky">12.1</td>
@@ -218,7 +226,7 @@ Here is the perplexity and model size before and after weight compression for di
- The algorithm is supported for OpenVINO and PyTorch models.
- The compression applies in-place.
- The compressed model is not trainable.
-- INT4_SYM, INT4_ASYM and NF4 modes, grouped quantization and mixed precision selection is available for OpenVINO backend only.
+- INT8_SYM, INT4_SYM, INT4_ASYM and NF4 modes, grouped quantization and mixed precision selection is available for OpenVINO backend only.
- NF4 support is experimental - models quantized to nf4 should not be faster than models quantized to 8-bit integer.

#### Additional resources
14 changes: 11 additions & 3 deletions nncf/parameters.py
@@ -62,20 +62,28 @@ class DropType(Enum):
class CompressWeightsMode(Enum):
"""
Defines a mode for weight compression.
-:param INT8: Stands for 8-bit integer quantization of all weights.
+:param INT8_SYM: Stands for 8-bit integer symmetric quantization of all weights.
Weights are quantized symmetrically with a fixed zero point equal to 128.
https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization
:param INT8_ASYM: The same as INT8_SYM mode, but weights are quantized to a primary precision asymmetrically
with a typical non-fixed zero point.
https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization
:param INT4_SYM: Stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
Weights are quantized to a primary precision symmetrically with a fixed zero point equal to 8.
-All embeddings and the last layer are always compressed to a backup precision, which is 8-bit integer,
+All embeddings and the last layer are always compressed to a backup precision, which is INT8_ASYM,
by default. All others are quantized either to 4-bit integer or to a backup precision, depending on
criteria and the given ratio.
https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization
:param INT4_ASYM: The same as INT4_SYM mode, but weights are quantized to a primary precision asymmetrically
with a typical non-fixed zero point.
https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization
:param NF4: The same as INT4_SYM mode, but primary precision is NF4 data type without zero point.
:param INT8: Mode is deprecated and will be removed in future releases. Please use `INT8_ASYM` instead.
"""

INT8 = "int8"
INT8_SYM = "int8_sym"
INT8_ASYM = "int8_asym"
INT4_SYM = "int4_sym"
INT4_ASYM = "int4_asym"
NF4 = "nf4"
INT8 = "int8" # Deprecated mode

Contributor comment: What do you think about leaving INT8 as an alias for INT8_ASYM?
@@ -54,17 +54,20 @@ def __init__(
):
"""
:param mode: Defines a mode for weight compression.
-INT8 stands for 8-bit integer quantization of all weights.
+INT8_SYM stands for 8-bit integer symmetric quantization of all weights.
Weights are quantized symmetrically with a fixed zero point equal to 128.
INT8_ASYM is the same as INT8_SYM mode, but weights are quantized to a primary precision asymmetrically
with a typical non-fixed zero point.
INT4_SYM stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
Weights are quantized to a primary precision symmetrically with a fixed zero point equal to 8.
-All embeddings and the last layer are always compressed to a backup precision, which is 8-bit integer,
+All embeddings and the last layer are always compressed to a backup precision, which is INT8_ASYM,
by default. All others are quantized either to 4-bit integer or to a backup precision, depending on
criteria and the given ratio.
INT4_ASYM is the same as INT4_SYM mode, but weights are quantized to a primary precision asymmetrically
with a typical non-fixed zero point.
NF4 is the same as INT4_SYM mode, but primary precision is NF4 data type without zero point.
:param ratio: the ratio between baseline and backup precisions (e.g. 0.9 means 90% of layers quantized to NF4
-and the rest to INT8).
+and the rest to INT8_ASYM).
:param group_size: number of weights (e.g. 128) in the channel dimension
that share quantization parameters (scale). The value -1 means no grouping.
:param ignored_scope: An ignored scope that defines the list of model control
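
Putting these parameters together, a hedged usage sketch is shown below; the ignored node pattern is a placeholder, and whether `ignored_scope` is exposed by your `compress_weights` version should be checked against your NNCF release:

```python
import nncf
from nncf import compress_weights
from nncf import CompressWeightsMode

# `model` is an OpenVINO model obtained elsewhere (e.g. ov.Core().read_model(...)).
compressed_model = compress_weights(
    model,
    mode=CompressWeightsMode.INT4_SYM,
    ratio=0.8,      # 80% of eligible layers in 4-bit, the rest in INT8_ASYM
    group_size=64,  # 64 weights share one scale
    ignored_scope=nncf.IgnoredScope(patterns=[".*lm_head.*"]),  # placeholder pattern
)
```
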
16 changes: 11 additions & 5 deletions nncf/quantization/algorithms/weight_compression/backend.py
@@ -47,10 +47,13 @@ def validate_params(mode: CompressWeightsMode, ignored_scope: Optional[IgnoredSc
parameters. Should be called on early algorithm steps to prevent execution of time-consuming operations.

:param mode: Defines a mode for weight compression.
-INT8 stands for 8-bit integer quantization of all weights.
+INT8_SYM stands for 8-bit integer symmetric quantization of all weights.
Weights are quantized symmetrically with a fixed zero point equal to 128.
INT8_ASYM is the same as INT8_SYM mode, but weights are quantized to a primary precision asymmetrically
with a typical non-fixed zero point.
INT4_SYM stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
Weights are quantized to a primary precision symmetrically with a fixed zero point equal to 8.
-All embeddings and the last layer are always compressed to a backup precision, which is 8-bit integer,
+All embeddings and the last layer are always compressed to a backup precision, which is INT8_ASYM,
by default. All others are quantized either to 4-bit integer or to a backup precision, depending on
criteria and the given ratio.
INT4_ASYM is the same as INT4_SYM mode, but weights are quantized to a primary precision asymmetrically
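
As an illustration of the kind of early check `validate_params` is meant to perform (the real rules live in the NNCF backends; this is only a hypothetical sketch based on the limitations listed above):

```python
from nncf import CompressWeightsMode

_OV_ONLY_MODES = {
    CompressWeightsMode.INT8_SYM,
    CompressWeightsMode.INT4_SYM,
    CompressWeightsMode.INT4_ASYM,
    CompressWeightsMode.NF4,
}

def check_mode_supported(mode: CompressWeightsMode, backend: str) -> None:
    # Hypothetical guard: per the documentation, these modes are available for the OpenVINO backend only.
    if backend != "openvino" and mode in _OV_ONLY_MODES:
        raise RuntimeError(f"{mode.value} weight compression is supported only by the OpenVINO backend.")
```
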
@@ -77,17 +80,20 @@ def do_compression(
:param nodes_to_compress: List of nodes in the model's graph,
corresponding to the layers for weight compression.
:param mode: Defines a mode for weight compression.
-INT8 stands for 8-bit integer quantization of all weights.
+INT8_SYM stands for 8-bit integer symmetric quantization of all weights.
Weights are quantized symmetrically with a fixed zero point equal to 128.
INT8_ASYM is the same as INT8_SYM mode, but weights are quantized to a primary precision asymmetrically
with a typical non-fixed zero point.
INT4_SYM stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
Weights are quantized to a primary precision symmetrically with a fixed zero point equal to 8.
-All embeddings and the last layer are always compressed to a backup precision, which is 8-bit integer,
+All embeddings and the last layer are always compressed to a backup precision, which is INT8_ASYM,
by default. All others are quantized either to 4-bit integer or to a backup precision, depending on
criteria and the given ratio.
INT4_ASYM is the same as INT4_SYM mode, but weights are quantized to a primary precision asymmetrically
with a typical non-fixed zero point.
NF4 is the same as INT4_SYM mode, but primary precision is NF4 data type without zero point.
:param ratio: The ratio between baseline and backup precisions (e.g. 0.9 means 90% of layers quantized to NF4
-and the rest to INT8).
+and the rest to INT8_ASYM).
:param group_size: Number of weights (e.g. 128) in the channel dimension
that share quantization parameters (scale). The value -1 means no grouping.
:return: A resulting model with compressed weights.
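
To illustrate what `group_size` means in practice, a small numpy sketch of per-group scale computation for the symmetric 4-bit scheme follows; it is illustrative only, not the NNCF implementation:

```python
import numpy as np

def int4_sym_group_scales(w: np.ndarray, group_size: int = 128) -> np.ndarray:
    # Split the channel dimension into groups of `group_size`; each group gets its own scale.
    out_ch, in_ch = w.shape
    assert in_ch % group_size == 0, "sketch assumes the channel dimension divides evenly into groups"
    groups = w.reshape(out_ch, in_ch // group_size, group_size)
    # Unsigned 4-bit values with a fixed zero point of 8 leave 7 levels on the positive side.
    return np.max(np.abs(groups), axis=-1, keepdims=True) / 7
```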