Represent symmetrically quantized weights in signed data type (#2434)
### Changes

Represent symmetrically quantized weights in a signed data type with no zero point.

### Reason for changes

* To detect the quantization type without analyzing zero-point values
* A signed data type for symmetrically quantized weights leads to a smaller footprint, especially in the case of grouped quantization, since no zero-point tensors need to be stored.
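
To make the distinction concrete, here is a minimal NumPy sketch (illustrative only, not NNCF code) of the two schemes: symmetric quantization to a signed range, which needs only a scale, versus asymmetric quantization to an unsigned range, which also has to store a zero point.

```python
import numpy as np

def quantize_symmetric_int4(w: np.ndarray):
    # Signed 4-bit range [-8, 7]; the zero point is implicitly 0, so only a scale is stored.
    level_low, level_high = -8, 7
    scale = np.max(np.abs(w)) / level_high
    w_q = np.clip(np.round(w / scale), level_low, level_high).astype(np.int8)
    return w_q, scale  # dequantize: w_q * scale

def quantize_asymmetric_uint4(w: np.ndarray):
    # Unsigned 4-bit range [0, 15]; a non-fixed zero point must be stored next to the scale.
    level_low, level_high = 0, 15
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / (level_high - level_low)
    zero_point = np.clip(np.round(-w_min / scale), level_low, level_high)
    w_q = np.clip(np.round(w / scale + zero_point), level_low, level_high).astype(np.uint8)
    return w_q, scale, zero_point  # dequantize: (w_q - zero_point) * scale

w = np.random.randn(16).astype(np.float32)
print(quantize_symmetric_int4(w)[0])    # values in [-8, 7], no zero point needed
print(quantize_asymmetric_uint4(w)[0])  # values in [0, 15], zero point required
```

With per-group parameters, the asymmetric variant stores a zero point for every group on top of every scale, which is exactly the extra footprint the symmetric signed representation avoids.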

### Related tickets

130625

### Tests

Updated: `tests/torch/ptq/test_weights_compression.py` and
`tests/openvino/native/quantization/test_weights_compression.py`


Merge after: openvinotoolkit/openvino#24457

Model | Backend | Metric name | Metric value | Metric diff | Num int4 | Num int8 | RAM MiB | Compr. time | Total time
-- | -- | -- | -- | -- | -- | -- | -- | -- | --
tinyllama_data_aware_awq_scale_estimation | OV | Similarity | 0.84048 | -0.15952 | 94 | 124 | 35560 | 0:06:31 | 0:08:36
tinyllama_data_aware_awq_scale_estimation_stateful | OV | Similarity | 0.84048 | -0.15952 | 94 | 124 | 36612 | 0:06:12 | 0:07:40
tinyllama_data_aware_awq_stateful | OV | Similarity | 0.85259 | -0.14741 | 94 | 124 | 34824 | 0:01:50 | 0:03:17
tinyllama_data_aware | OV | Similarity | 0.83853 | -0.16147 | 94 | 124 | 30604 | 0:01:25 | 0:03:30
tinyllama_data_aware_gptq | OV | Similarity | 0.82187 | -0.17813 | 94 | 124 | 39624 | 0:25:09 | 0:27:10
tinyllama_data_free | OV | Similarity | 0.72057 | -0.27943 | 114 | 84 | 6671 | 0:00:42 | 0:02:46
tinyllama_int8_data_free | TORCH | Similarity | 0.95624 | -0.04376 | 0 | 312 | 30161 | 0:00:09 | 0:02:54
l-bat authored Jun 13, 2024
1 parent 9200a22 commit 85b3263
Showing 16 changed files with 378 additions and 305 deletions.
@@ -9,7 +9,7 @@ The Weights Compression algorithm is aimed at compressing the weights of the model
 #### Supported modes
 
 By default, weights are compressed asymmetrically to 8-bit integer data type - "INT8_ASYM" mode.
-OpenVINO backend also supports 3 modes of mixed precision weight quantization with a 4-bit data type as a primary precision - INT4_SYM, INT4_ASYM and NF4. The primary precision in case of INT4_SYM mode is unsigned 4-bit integer and weights are quantized to it [symmetrically](/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#symmetric-quantization) with a fixed zero point equals to 8. In case of INT4_ASYM mode - also unsigned 4-bit integer, but weight are quantized to it [asymmetrically](/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#asymmetric-quantization) with a typical non-fixed zero point. In case of NF4 mode - [nf4](https://arxiv.org/pdf/2305.14314v1.pdf) data type without zero point.
+OpenVINO backend also supports 3 modes of mixed precision weight quantization with a 4-bit data type as a primary precision - INT4_SYM, INT4_ASYM and NF4. The primary precision in case of INT4_SYM mode is signed 4-bit integer and weights are quantized to it [symmetrically](/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#symmetric-quantization) without zero point. In case of INT4_ASYM mode - unsigned 4-bit integer, and weights are quantized to it [asymmetrically](/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#asymmetric-quantization) with a typical non-fixed zero point. In case of NF4 mode - [nf4](https://arxiv.org/pdf/2305.14314v1.pdf) data type without zero point.
 All 4-bit modes support grouped quantization, where a small group of weights (e.g. 128) in the channel dimension shares quantization parameters (scale).
 All embeddings, convolutions and last linear layers are always compressed to 8-bit integer data type. To quantize embeddings and last linear layers to 4-bit, use `all_layers=True`.
 The percentage of the remaining layers compressed to 4-bit can be configured by the "ratio" parameter, e.g. ratio=0.9 means 90% of layers are compressed to the corresponding 4-bit data type and the rest to 8-bit asymmetric integer data type.
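
As an illustration of the grouped quantization described in the documentation above, the following standalone NumPy sketch (not the NNCF implementation; group size and shapes are arbitrary) quantizes one weight row symmetrically to the signed INT4 range with one scale per group of 128 values:

```python
import numpy as np

GROUP_SIZE = 128  # e.g. 128 weights in the channel dimension share one scale

def quantize_int4_sym_grouped(w_row: np.ndarray, group_size: int = GROUP_SIZE):
    # Split the channel dimension into groups and compute one scale per group.
    groups = w_row.reshape(-1, group_size)
    scales = np.max(np.abs(groups), axis=1, keepdims=True) / 7  # signed INT4 range [-8, 7]
    w_q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return w_q, scales

w_row = np.random.randn(4096).astype(np.float32)
w_q, scales = quantize_int4_sym_grouped(w_row)
print(w_q.shape, scales.shape)  # (32, 128) quantized groups, (32, 1) per-group scales
```

Because the representation is symmetric and signed, only the per-group scales are stored; an asymmetric grouped scheme would additionally need one zero point per group.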
@@ -484,7 +484,7 @@ Here is the perplexity and accuracy with data-free and data-aware mixed-precision
 - The algorithm is supported for OpenVINO and PyTorch models.
 - The compression applies in-place.
 - The compressed model is not trainable.
-- INT8_SYM, INT4_SYM, INT4_ASYM and NF4 modes, grouped quantization and mixed precision selection is available for OpenVINO backend only.
+- INT4_SYM, INT4_ASYM and NF4 modes, grouped quantization and mixed precision selection is available for OpenVINO backend only.
 - NF4 support is experimental - models quantized to nf4 should not be faster than models quantized to 8-bit integer.
 
 #### Additional resources
4 changes: 2 additions & 2 deletions nncf/parameters.py
@@ -68,13 +68,13 @@ class CompressWeightsMode(StrEnum):
     """
     Defines a mode for weight compression.
     :param INT8_SYM: Stands for 8-bit integer symmetric quantization of all weights.
-        Weights are quantized symmetrically with a fixed zero point equals to 128.
+        Weights are quantized symmetrically without zero point.
         https://github.com/openvinotoolkit/nncf/blob/develop/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#symmetric-quantization
     :param INT8_ASYM: The same as INT8_SYM mode, but weights are quantized to a primary precision asymmetrically
         with a typical non-fixed zero point.
         https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization
     :param INT4_SYM: Stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
-        Weights are quantized to a primary precision symmetrically with a fixed zero point equals to 8.
+        Weights are quantized to a primary precision symmetrically without zero point.
         All embeddings and the last layer are always compressed to a backup precision, which is INT8_ASYM,
         by default. All others are quantized either to 4-bit integer or to a backup precision depending on
         criteria and the given ratio.
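
For context, a typical call that exercises these modes could look like the sketch below (the model path is hypothetical; `nncf.compress_weights` and `CompressWeightsMode` are the public API this docstring documents):

```python
import nncf
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")  # illustrative path

# INT4_SYM now maps to a signed 4-bit representation with no zero point;
# 90% of eligible layers go to 4-bit, the rest stay in the 8-bit backup precision.
compressed_model = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    ratio=0.9,
    group_size=128,
)
ov.save_model(compressed_model, "compressed_model.xml")
```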
4 changes: 2 additions & 2 deletions nncf/quantization/algorithms/weight_compression/algorithm.py
@@ -70,11 +70,11 @@ def __init__(
         """
         :param mode: Defines a mode for weight compression.
             INT8_SYM stands for 8-bit integer symmetric quantization of all weights.
-                Weights are quantized symmetrically with a fixed zero point equals to 128.
+                Weights are quantized symmetrically without zero point.
             INT8_ASYM is the same as INT8_SYM mode, but weights are quantized to a primary precision asymmetrically
                 with a typical non-fixed zero point.
             INT4_SYM stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
-                Weights are quantized to a primary precision symmetrically with a fixed zero point equals to 8.
+                Weights are quantized to a primary precision symmetrically without zero point.
                 All embeddings and the last layer are always compressed to a backup precision, which is INT8_ASYM,
                 by default. All others are quantized either to 4-bit integer or to a backup precision depending on
                 criteria and the given ratio.
14 changes: 6 additions & 8 deletions nncf/quantization/algorithms/weight_compression/gptq.py
@@ -263,7 +263,7 @@ def _quantize_weights(
                     quantized_col = decompress_nf4_weight(compressed_weights, scales[-1])
                 else:
                     compressed_weights = calculate_quantized_weight(
-                        fns.unsqueeze(weight_col, 1), scales[-1], zero_points[-1], block_compression_config
+                        fns.unsqueeze(weight_col, 1), block_compression_config, scales[-1], zero_points[-1]
                     )
                     quantized_col = do_dequantization(compressed_weights, scales[-1], zero_points[-1])
                     quantized_col = fns.flatten(quantized_col)
@@ -287,13 +287,11 @@ def _quantize_weights(
             )
 
         scales = fns.stack(scales, axis=1)
-        if wc_params.compression_config.mode == CompressWeightsMode.NF4:
-            zero_points = None
-        elif wc_params.compression_config.mode in [
-            CompressWeightsMode.INT8_SYM,
-            CompressWeightsMode.INT4_SYM,
+        if wc_params.compression_config.mode in [
+            CompressWeightsMode.INT8_ASYM,
+            CompressWeightsMode.INT4_ASYM,
         ]:
-            zero_points = fns.squeeze(zero_points[0])
-        else:
             zero_points = fns.stack(zero_points, axis=1)
+        else:
+            zero_points = None
         return scales, zero_points
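
To visualize the new return convention of this helper, here is a small NumPy sketch with made-up shapes (not NNCF code): scales are stacked along axis 1 for every mode, while zero points are kept only for the asymmetric modes and dropped otherwise.

```python
import numpy as np

def pack_gptq_outputs(per_group_scales, per_group_zero_points, asymmetric: bool):
    # Scales are always stacked along axis 1, one entry per quantization group.
    scales = np.stack(per_group_scales, axis=1)
    # Zero points are kept only for asymmetric modes; symmetric and NF4 return None.
    zero_points = np.stack(per_group_zero_points, axis=1) if asymmetric else None
    return scales, zero_points

out_channels, num_groups = 16, 4
scales, zps = pack_gptq_outputs(
    [np.random.rand(out_channels, 1) for _ in range(num_groups)],
    [np.random.randint(0, 16, (out_channels, 1)) for _ in range(num_groups)],
    asymmetric=False,
)
print(scales.shape, zps)  # (16, 4, 1) None
```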
@@ -22,6 +22,7 @@
 from nncf.quantization.algorithms.weight_compression.backend import WeightCompressionAlgoBackend
 from nncf.quantization.algorithms.weight_compression.config import WeightCompressionConfig
 from nncf.quantization.algorithms.weight_compression.config import WeightCompressionParameters
+from nncf.quantization.algorithms.weight_compression.weight_lowering import do_dequantization
 from nncf.quantization.algorithms.weight_compression.weight_lowering import do_integer_quantization
 from nncf.quantization.algorithms.weight_compression.weight_lowering import get_integer_quantization_error
 
@@ -176,7 +177,7 @@ def _calc_weight_sensitivity(self, weight_param: WeightCompressionParameters) ->
         weight = weight.astype(TensorDataType.float32)
 
         compressed_weights, scale, zero_point = do_integer_quantization(weight, reduction_axes, backup_config)
-        decompressed_weight = (compressed_weights - zero_point).astype(weight.dtype) * scale
+        decompressed_weight = do_dequantization(compressed_weights, scale, zero_point)
         decompressed_weight = decompressed_weight.reshape(orig_shape)
         return fns.linalg.norm(decompressed_weight - weight, ord="fro").item()
 
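To make the sensitivity metric concrete, here is a self-contained NumPy sketch (an approximation of the logic above, not the NNCF implementation) that quantizes a weight asymmetrically to 8 bits, dequantizes it, and measures the Frobenius norm of the introduced error:

```python
import numpy as np

def weight_sensitivity_int8_asym(weight: np.ndarray) -> float:
    # Asymmetric 8-bit backup quantization with per-output-channel parameters.
    w_min = weight.min(axis=1, keepdims=True)
    w_max = weight.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 255
    zero_point = np.round(-w_min / scale)
    w_q = np.clip(np.round(weight / scale + zero_point), 0, 255)

    # Dequantize and measure how much error the 8-bit representation introduces.
    decompressed = (w_q - zero_point) * scale
    return float(np.linalg.norm(decompressed - weight, ord="fro"))

weight = np.random.randn(64, 256).astype(np.float32)
print(weight_sensitivity_int8_asym(weight))
```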
60 changes: 32 additions & 28 deletions nncf/quantization/algorithms/weight_compression/openvino_backend.py
@@ -20,6 +20,7 @@
 from nncf.common.graph.transformations.commands import TargetType
 from nncf.common.graph.utils import get_reduction_axes
 from nncf.experimental.common.tensor_statistics.collectors import TensorCollector
+from nncf.experimental.tensor.definitions import TensorDataType
 from nncf.experimental.tensor.tensor import Tensor
 from nncf.openvino.graph.metatypes import openvino_metatypes as om
 from nncf.openvino.graph.model_transformer import OVModelTransformer
@@ -134,17 +135,14 @@ def transform_model(
             compression_config = wc_params.compression_config
             if compression_config.mode == CompressWeightsMode.NF4:
                 compression_dtype = ov.Type.nf4
-            elif compression_config.mode in [
-                CompressWeightsMode.INT8_ASYM,
-                CompressWeightsMode.INT8_SYM,
-                CompressWeightsMode.INT8,
-                CompressWeightsMode.INT4_ASYM,
-                CompressWeightsMode.INT4_SYM,
-            ]:
-                if compression_config.mode in [CompressWeightsMode.INT4_ASYM, CompressWeightsMode.INT4_SYM]:
-                    compression_dtype = ov.Type.u4
-                else:
-                    compression_dtype = ov.Type.u8
+            elif compression_config.mode == CompressWeightsMode.INT4_SYM:
+                compression_dtype = ov.Type.i4
+            elif compression_config.mode == CompressWeightsMode.INT4_ASYM:
+                compression_dtype = ov.Type.u4
+            elif compression_config.mode == CompressWeightsMode.INT8_SYM:
+                compression_dtype = ov.Type.i8
+            elif compression_config.mode == CompressWeightsMode.INT8_ASYM:
+                compression_dtype = ov.Type.u8
             else:
                 raise ValueError(f"{compression_config.mode.value} is not supported.")
 
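For reference, a rough sketch of the decompression subgraph that such a symmetric INT8 constant ends up with (illustrative values and names; the asymmetric branch would additionally subtract the zero-point constant before the multiply):

```python
import numpy as np
import openvino as ov
from openvino.runtime import opset13 as opset  # any recent opset provides these ops

# Hypothetical 2x4 weight already quantized to signed 8-bit and its per-row scales.
w_q = np.array([[-128, -64, 0, 127], [-12, 3, 45, -7]], dtype=np.int8)
scale = np.array([[0.02], [0.01]], dtype=np.float16)

compressed_const = opset.constant(w_q, dtype=ov.Type.i8, name="weight_i8")
converted_const = opset.convert(compressed_const, ov.Type.f16)
scale_const = opset.constant(scale, dtype=ov.Type.f16, name="weight_scale")
decompressed = opset.multiply(converted_const, scale_const)  # no zero-point subtract for symmetric mode
```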
@@ -175,7 +173,7 @@ def transform_model(
                 compressed_weight.tensor.data, dtype=compression_dtype, name=const_node_name
             )
             converted_const = opset.convert(compressed_const, ov.Type.f16)
-            if compressed_weight.zero_point is not None:
+            if compressed_weight.zero_point is not None and compressed_weight.tensor.dtype == TensorDataType.uint8:
                 zero_point_const = opset.constant(
                     compressed_weight.zero_point.data,
                     dtype=compression_dtype,
@@ -220,27 +218,28 @@ def dump_parameters(
 
     @staticmethod
     def get_compress_decompress_pipeline(
-        weight_compression_parameter: WeightCompressionParameters, w_shape, s_shape, z_p_shape
+        weight_compression_parameter: WeightCompressionParameters, w_shape, s_shape, z_p_shape=None
     ):
-        (
-            w,
-            s,
-            zp,
-            clamp,
-        ) = OVWeightCompressionAlgoBackend.get_compress_pipeline(
+        parameters, clamp = OVWeightCompressionAlgoBackend.get_compress_pipeline(
             weight_compression_parameter, w_shape, s_shape, z_p_shape, True
         )
 
-        result = (clamp - zp) * s
-        model = ov.Model([result], [w, s, zp])
+        if len(parameters) == 3:
+            _, s, zp = parameters
+            result = (clamp - zp) * s
+        else:
+            s = parameters[1]
+            result = clamp * s
+
+        model = ov.Model([result], parameters)
 
         compiled_model = ov.compile_model(model)
 
-        return lambda w, s, zp: compiled_model([w, s, zp])[0]
+        return lambda parameters: compiled_model(parameters)[0]
 
     @staticmethod
     def get_compress_pipeline(
-        weight_compression_parameter: WeightCompressionParameters, w_shape, s_shape, z_p_shape, return_nodes=False
+        weight_compression_parameter: WeightCompressionParameters, w_shape, s_shape, z_p_shape=None, return_nodes=False
     ):
         config = weight_compression_parameter.compression_config
         mode = config.mode
@@ -252,18 +251,23 @@
 
         w = opset.parameter(w_shape, name="w")
         s = opset.parameter(s_shape, name="s")
-        zp = opset.parameter(z_p_shape, name="zp")
+        parameters = [w, s]
+        compressed_w = w / s
+        if z_p_shape is not None:
+            zp = opset.parameter(z_p_shape, name="zp")
+            parameters.append(zp)
+            compressed_w += zp
 
-        result = opset.clamp(opset.round(w / s + zp), level_low, level_high, name="compressed_weights")
+        result = opset.clamp(opset.round(compressed_w), level_low, level_high, name="compressed_weights")
 
         if return_nodes:
-            return w, s, zp, result
+            return parameters, result
 
-        model = ov.Model([result], [w, s, zp])
+        model = ov.Model([result], parameters)
 
         compiled_model = ov.compile_model(model)
 
-        return lambda w, s, zp: compiled_model([w, s, zp])[0]
+        return lambda parameters: compiled_model(parameters)[0]


class OVAWQAlgoAlgoBackend(OVWeightCompressionAlgoBackend):
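Putting the optional-zero-point pipeline together, here is a standalone sketch of how such a compiled compression model can be built and called for the symmetric case (made-up shapes; this mirrors, but is not, the backend helper above):

```python
import numpy as np
import openvino as ov
from openvino.runtime import opset13 as opset  # any recent opset provides these ops

w_shape, s_shape = [16, 128], [16, 1]
level_low, level_high = -8, 7  # signed INT4 range used by the symmetric mode

# Build the compression graph: no zero-point parameter is created for symmetric quantization.
w = opset.parameter(w_shape, name="w")
s = opset.parameter(s_shape, name="s")
result = opset.clamp(opset.round(w / s), level_low, level_high, name="compressed_weights")
compiled = ov.compile_model(ov.Model([result], [w, s]))

weight = np.random.randn(*w_shape).astype(np.float32)
scale = np.abs(weight).max(axis=1, keepdims=True) / level_high
compressed = compiled([weight, scale.astype(np.float32)])[0]
print(compressed.min(), compressed.max())  # stays within [-8, 7]
```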
