# Support backup precision option for WC (#2978)
### Changes

Add functionality to determine the backup precision to be used for
layers that are not quantized to the primary precision, which is set to
INT8_ASYM by default.

Example: compress weights to INT4_ASYM channel-wise (`group_size=-1`), except embeddings, convolutions and last linear layers, which remain in the original floating-point precision.

```python
from nncf import compress_weights, BackupMode, CompressWeightsMode
compressed_model = compress_weights(model, mode=CompressWeightsMode.INT4_ASYM, group_size=-1, backup_mode=BackupMode.NONE) # model is openvino.Model object
```

![image](https://github.com/user-attachments/assets/b99ce421-af1c-4a9d-83ad-a97841b44e41)

Compression statistics:

![image](https://github.com/user-attachments/assets/744f177c-f212-456c-b426-ba60ec747183)


For `mode=CompressWeightsMode.INT4_ASYM`, `backup_mode=BackupMode.INT8_ASYM`, and a non-empty `ignored_scope`, the statistics string contains three different precisions:


![image](https://github.com/user-attachments/assets/abab1a1b-c932-4001-9c84-858d10e23cda)

### Reason for changes

To define the backup mode for `compress_weights`.

### Related tickets

* 152056

### Tests

* test_data_free_compression_with_backup_mode
* test_data_based_compression_with_backup_mode
* tinyllama_awq_backup_mode_none
l-bat authored Oct 7, 2024
1 parent 3b2b8c3 commit 174bd03
Showing 17 changed files with 305 additions and 31 deletions.
@@ -11,8 +11,8 @@ The Weights Compression algorithm is aimed at compressing the weights of the mod…
By default, weights are compressed asymmetrically to 8-bit integer data type - "INT8_ASYM" mode.
OpenVINO backend also supports 4 modes of mixed-precision weight quantization with a 4-bit data type as a primary precision - INT4_SYM, INT4_ASYM, NF4, E2M1. The primary precision in case of INT4_SYM mode is signed 4-bit integer and weights are quantized to it [symmetrically](/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#symmetric-quantization) without zero point. In case of INT4_ASYM mode - unsigned 4-bit integer and weights are quantized to it [asymmetrically](/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#asymmetric-quantization) with a typical non-fixed zero point. In case of NF4 mode - [nf4](https://arxiv.org/pdf/2305.14314v1.pdf) data type without zero point. In case of E2M1 mode - [e2m1](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) data type without zero point and with an 8-bit [E8M0](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) scale.
All 4-bit modes support grouped quantization, where a small group of weights (e.g. 128) in the channel dimension shares quantization parameters (scale).
- All embeddings, convolutions and last linear layers are always compressed to 8-bit integer data type. To quantize embeddings and last linear layers to 4-bit, use `all_layers=True`.
- Percent of the rest layers compressed to 4-bit can be configured by "ratio" parameter. E.g. ratio=0.9 means 90% of layers compressed to the corresponding 4-bit data type and the rest to 8-bit asymmetric integer data type.
+ All embeddings, convolutions and last linear layers are always compressed to the backup mode, which is "INT8_ASYM" by default. To quantize embeddings and last linear layers to 4-bit, use `all_layers=True`.
+ The percentage of the remaining layers compressed to 4-bit can be configured by the "ratio" parameter. E.g. ratio=0.9 means 90% of layers are compressed to the corresponding 4-bit data type and the rest to the backup mode. OpenVINO backend supports 3 backup modes: INT8_SYM, INT8_ASYM, and NONE, which retains the original floating-point precision of the model weights. The backup mode is supported only for mixed-precision weight quantization.
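
An illustrative sketch combining these parameters (the values are examples rather than recommendations, and `model` is assumed to be an `openvino.Model`):

```python
from nncf import BackupMode, CompressWeightsMode, compress_weights

compressed_model = compress_weights(
    model,
    mode=CompressWeightsMode.INT4_SYM,
    group_size=128,  # 128 weights along the channel dimension share one scale
    ratio=0.9,       # 90% of eligible layers in INT4, the rest in the backup mode
    backup_mode=BackupMode.INT8_SYM,
)
```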

### User guide

@@ -37,6 +37,13 @@ from nncf import compress_weights, CompressWeightsMode
compressed_model = compress_weights(model, mode=CompressWeightsMode.INT4_SYM) # model is openvino.Model object
```

- Compress weights to NF4 with group size = 128, except embeddings, convolutions and last linear layers, which remain in the original floating-point precision.

```python
from nncf import compress_weights, BackupMode, CompressWeightsMode
compressed_model = compress_weights(model, mode=CompressWeightsMode.NF4, backup_mode=BackupMode.NONE) # model is openvino.Model object
```

- Generally, `INT4_SYM` mode is the fastest mixed-precision mode, but it may lead to a significant accuracy degradation or perplexity increase.
  Compressing weights asymmetrically (`INT4_ASYM` mode) is the way to increase accuracy; however, in turn, it slows down inference a bit.
  If the accuracy or perplexity is still not satisfying, there are 2 more hyper-parameters to tune: `group_size` and `ratio`. Please refer to the [example](https://github.com/openvinotoolkit/nncf/blob/develop/examples/llm_compression/openvino/tiny_llama_find_hyperparams) of how to automatically tune these parameters; a manual sketch follows this list.
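
A hedged sketch of tightening these two knobs manually (the values below are placeholders, not recommendations):

```python
from nncf import CompressWeightsMode, compress_weights

compressed_model = compress_weights(
    model,  # openvino.Model
    mode=CompressWeightsMode.INT4_ASYM,
    group_size=64,  # e.g. reduce from the default of 128 if accuracy is unsatisfactory
    ratio=0.8,      # keep 20% of eligible layers in the 8-bit backup precision
)
```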
1 change: 1 addition & 0 deletions nncf/__init__.py
@@ -32,6 +32,7 @@
from nncf.errors import UnsupportedModelError as UnsupportedModelError
from nncf.errors import UnsupportedVersionError as UnsupportedVersionError
from nncf.errors import ValidationError as ValidationError
from nncf.parameters import BackupMode as BackupMode
from nncf.parameters import CompressWeightsMode as CompressWeightsMode
from nncf.parameters import DropType as DropType
from nncf.parameters import ModelType as ModelType
3 changes: 3 additions & 0 deletions nncf/experimental/torch/fx/quantization/quantize_model.py
@@ -29,6 +29,7 @@
from nncf.experimental.torch.fx.transformations import apply_quantization_transformations
from nncf.experimental.torch.fx.transformations import revert_quantization_transformations
from nncf.experimental.torch.fx.transformations import shared_constants_unification_transformation
from nncf.parameters import BackupMode
from nncf.parameters import CompressWeightsMode
from nncf.parameters import ModelType
from nncf.parameters import QuantizationMode
@@ -124,6 +125,7 @@ def compress_weights_impl(
scale_estimation: bool,
gptq: bool,
lora_correction: bool,
backup_mode: BackupMode,
advanced_parameters: Optional[AdvancedCompressionParameters] = None,
) -> torch.fx.GraphModule:
"""
@@ -142,6 +144,7 @@
scale_estimation,
gptq,
lora_correction,
backup_mode,
advanced_parameters,
)
shared_constants_unification_transformation(model)
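
Since the FX backend now receives `backup_mode` as well, the public entry point should accept it for captured graphs too — a sketch under that assumption, where `fx_model` is a hypothetical `torch.fx.GraphModule`:

```python
import nncf

compressed = nncf.compress_weights(
    fx_model,  # assumption: a torch.fx.GraphModule captured from a PyTorch model
    mode=nncf.CompressWeightsMode.INT4_SYM,
    backup_mode=nncf.BackupMode.INT8_ASYM,
)
```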
3 changes: 3 additions & 0 deletions nncf/openvino/quantization/quantize_model.py
@@ -29,6 +29,7 @@
from nncf.openvino.quantization.backend_parameters import is_weight_compression_needed
from nncf.openvino.quantization.quantize_ifmodel import apply_algorithm_if_bodies
from nncf.openvino.rt_info import dump_parameters
from nncf.parameters import BackupMode
from nncf.parameters import CompressWeightsMode
from nncf.parameters import DropType
from nncf.parameters import ModelType
@@ -379,6 +380,7 @@ def compress_weights_impl(
scale_estimation: bool,
gptq: bool,
lora_correction: bool,
backup_mode: BackupMode,
advanced_parameters: Optional[AdvancedCompressionParameters] = None,
) -> ov.Model:
"""
@@ -398,6 +400,7 @@
scale_estimation,
gptq,
lora_correction,
backup_mode,
advanced_parameters,
)
graph = NNCFGraphFactory.create(model)
17 changes: 17 additions & 0 deletions nncf/parameters.py
@@ -96,6 +96,23 @@ class CompressWeightsMode(StrEnum):
E2M1 = "e2m1"


@api(canonical_alias="nncf.BackupMode")
class BackupMode(StrEnum):
"""
Defines a backup mode for weight compression.

:param NONE: Stands for original floating-point precision of the model weights.
In this mode, weights are retained in their original precision without any quantization.
:param INT8_SYM: Stands for 8-bit integer symmetric quantization without zero point.
https://github.com/openvinotoolkit/nncf/blob/develop/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#symmetric-quantization
:param INT8_ASYM: Stands for 8-bit integer asymmetric quantization with a typical non-fixed zero point.
https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization
"""

NONE = "none"
INT8_SYM = "int8_sym"
INT8_ASYM = "int8_asym"
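
Since `BackupMode` subclasses `StrEnum`, its members should behave like strings — a minimal sketch, assuming NNCF's `StrEnum` is a standard `str`-based `Enum`:

```python
from nncf import BackupMode

# str-based enum: members compare equal to their raw string values
assert BackupMode.INT8_ASYM == "int8_asym"
# constructing from the raw value yields the canonical member
assert BackupMode("none") is BackupMode.NONE
```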


@api(canonical_alias="nncf.SensitivityMetric")
class SensitivityMetric(StrEnum):
"""