Represent symmetrically quantized weights in signed data type #2434

Merged 10 commits on Jun 13, 2024
Changes from all commits
@@ -9,7 +9,7 @@ The Weights Compression algorithm is aimed at compressing the weights of the mod
#### Supported modes

By default, weights are compressed asymmetrically to 8-bit integer data type - "INT8_ASYM" mode.
OpenVINO backend also supports 3 modes of mixed precision weight quantization with a 4-bit data type as a primary precision - INT4_SYM, INT4_ASYM and NF4. The primary precision in case of INT4_SYM mode is unsigned 4-bit integer and weights are quantized to it [symmetrically](/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#symmetric-quantization) with a fixed zero point equals to 8. In case of INT4_ASYM mode - also unsigned 4-bit integer, but weight are quantized to it [asymmetrically](/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#asymmetric-quantization) with a typical non-fixed zero point. In case of NF4 mode - [nf4](https://arxiv.org/pdf/2305.14314v1.pdf) data type without zero point.
OpenVINO backend also supports 3 modes of mixed precision weight quantization with a 4-bit data type as a primary precision - INT4_SYM, INT4_ASYM and NF4. The primary precision in case of INT4_SYM mode is a signed 4-bit integer and weights are quantized to it [symmetrically](/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#symmetric-quantization) without zero point. In case of INT4_ASYM mode it is an unsigned 4-bit integer and weights are quantized to it [asymmetrically](/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#asymmetric-quantization) with a typical non-fixed zero point. In case of NF4 mode it is the [nf4](https://arxiv.org/pdf/2305.14314v1.pdf) data type without zero point.
All 4-bit modes support grouped quantization, where a small group of weights (e.g. 128) in the channel dimension share quantization parameters (scale).
All embeddings, convolutions and last linear layers are always compressed to the 8-bit integer data type. To quantize embeddings and last linear layers to 4-bit, use `all_layers=True`.
The percentage of the remaining layers compressed to 4-bit can be configured by the "ratio" parameter. E.g. ratio=0.9 means that 90% of the layers are compressed to the corresponding 4-bit data type and the rest to the 8-bit asymmetric integer data type.
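
For illustration, a minimal sketch of how these options are typically combined (the model path is a placeholder; parameter names follow the description above):

```python
import nncf
import openvino as ov

model = ov.Core().read_model("model.xml")  # placeholder model

# INT4_SYM primary precision: signed 4-bit integers, quantized symmetrically
# without a zero point; groups of 128 weights share one scale.
# ratio=0.9 sends 90% of eligible layers to 4-bit, the rest to INT8_ASYM.
compressed_model = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    ratio=0.9,
    group_size=128,
)
```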
@@ -484,7 +484,7 @@ Here is the perplexity and accuracy with data-free and data-aware mixed-precisio
- The algorithm is supported for OpenVINO and PyTorch models.
- The compression applies in-place.
- The compressed model is not trainable.
- INT8_SYM, INT4_SYM, INT4_ASYM and NF4 modes, grouped quantization and mixed precision selection is available for OpenVINO backend only.
- INT4_SYM, INT4_ASYM and NF4 modes, grouped quantization and mixed precision selection is available for OpenVINO backend only.
- NF4 support is experimental - models quantized to nf4 should not be faster than models quantized to 8-bit integer.

#### Additional resources
4 changes: 2 additions & 2 deletions nncf/parameters.py
@@ -68,13 +68,13 @@ class CompressWeightsMode(StrEnum):
"""
Defines a mode for weight compression.
:param INT8_SYM: Stands for 8-bit integer symmetric quantization of all weights.
Weights are quantized symmetrically with a fixed zero point equals to 128.
Weights are quantized symmetrically without zero point.
https://github.com/openvinotoolkit/nncf/blob/develop/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#symmetric-quantization
:param INT8_ASYM: The same as INT8_SYM mode, but weights are quantized to a primary precision asymmetrically
with a typical non-fixed zero point.
https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization
:param INT4_SYM: Stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
Weights are quantized to a primary precision symmetrically with a fixed zero point equals to 8.
Weights are quantized to a primary precision symmetrically without zero point.
All embeddings and the last layer are always compressed to a backup precision, which is INT8_ASYM,
by default. All others are quantized whether to 4-bit integer or to a backup precision depending on
criteria and the given ratio.
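
As a rough sketch of the two schemes contrasted in this docstring (a simplified NumPy illustration, not the exact NNCF formulas):

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int = 8):
    # Signed symmetric quantization: no zero point, levels centered around 0.
    level_low, level_high = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / level_high
    q = np.clip(np.round(w / scale), level_low, level_high)
    return q, scale  # dequantize as q * scale

def quantize_asymmetric(w: np.ndarray, bits: int = 8):
    # Unsigned asymmetric quantization: a non-fixed zero point shifts the grid.
    levels = 2 ** bits - 1
    scale = (w.max() - w.min()) / levels
    zero_point = np.clip(np.round(-w.min() / scale), 0, levels)
    q = np.clip(np.round(w / scale) + zero_point, 0, levels)
    return q, scale, zero_point  # dequantize as (q - zero_point) * scale
```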
@@ -70,11 +70,11 @@ def __init__(
"""
:param mode: Defines a mode for weight compression.
INT8_SYM stands for 8-bit integer symmetric quantization of all weights.
Weights are quantized symmetrically with a fixed zero point equals to 128.
Weights are quantized symmetrically without zero point.
INT8_ASYM is the same as INT8_SYM mode, but weights are quantized to a primary precision asymmetrically
with a typical non-fixed zero point.
INT4_SYM stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
Weights are quantized to a primary precision symmetrically with a fixed zero point equals to 8.
Weights are quantized to a primary precision symmetrically without zero point.
All embeddings and the last layer are always compressed to a backup precision, which is INT8_ASYM,
by default. All others are quantized whether to 4-bit integer or to a backup precision depending on
criteria and the given ratio.
14 changes: 6 additions & 8 deletions nncf/quantization/algorithms/weight_compression/gptq.py
@@ -263,7 +263,7 @@ def _quantize_weights(
quantized_col = decompress_nf4_weight(compressed_weights, scales[-1])
else:
compressed_weights = calculate_quantized_weight(
fns.unsqueeze(weight_col, 1), scales[-1], zero_points[-1], block_compression_config
fns.unsqueeze(weight_col, 1), block_compression_config, scales[-1], zero_points[-1]
)
quantized_col = do_dequantization(compressed_weights, scales[-1], zero_points[-1])
quantized_col = fns.flatten(quantized_col)
@@ -287,13 +287,11 @@ def _quantize_weights(
)

scales = fns.stack(scales, axis=1)
if wc_params.compression_config.mode == CompressWeightsMode.NF4:
zero_points = None
elif wc_params.compression_config.mode in [
CompressWeightsMode.INT8_SYM,
CompressWeightsMode.INT4_SYM,
if wc_params.compression_config.mode in [
CompressWeightsMode.INT8_ASYM,
CompressWeightsMode.INT4_ASYM,
]:
zero_points = fns.squeeze(zero_points[0])
else:
zero_points = fns.stack(zero_points, axis=1)
else:
zero_points = None
return scales, zero_points
@@ -22,6 +22,7 @@
from nncf.quantization.algorithms.weight_compression.backend import WeightCompressionAlgoBackend
from nncf.quantization.algorithms.weight_compression.config import WeightCompressionConfig
from nncf.quantization.algorithms.weight_compression.config import WeightCompressionParameters
from nncf.quantization.algorithms.weight_compression.weight_lowering import do_dequantization
from nncf.quantization.algorithms.weight_compression.weight_lowering import do_integer_quantization
from nncf.quantization.algorithms.weight_compression.weight_lowering import get_integer_quantization_error

@@ -176,7 +177,7 @@ def _calc_weight_sensitivity(self, weight_param: WeightCompressionParameters) ->
weight = weight.astype(TensorDataType.float32)

compressed_weights, scale, zero_point = do_integer_quantization(weight, reduction_axes, backup_config)
decompressed_weight = (compressed_weights - zero_point).astype(weight.dtype) * scale
decompressed_weight = do_dequantization(compressed_weights, scale, zero_point)
decompressed_weight = decompressed_weight.reshape(orig_shape)
return fns.linalg.norm(decompressed_weight - weight, ord="fro").item()
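
For context, a plausible sketch of what `do_dequantization` computes here (the real helper lives in `weight_lowering.py`; this assumed behaviour shows why the hard-coded `(q - zp) * scale` expression was replaced, since symmetric modes now carry no zero point):

```python
import numpy as np

def do_dequantization_sketch(compressed: np.ndarray, scale: np.ndarray,
                             zero_point: np.ndarray = None) -> np.ndarray:
    # Asymmetric modes: shift by the zero point, then scale.
    if zero_point is not None:
        return (compressed.astype(np.float32) - zero_point) * scale
    # Signed symmetric modes: no zero point, scale directly.
    return compressed.astype(np.float32) * scale
```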

@@ -20,6 +20,7 @@
from nncf.common.graph.transformations.commands import TargetType
from nncf.common.graph.utils import get_reduction_axes
from nncf.experimental.common.tensor_statistics.collectors import TensorCollector
from nncf.experimental.tensor.definitions import TensorDataType
from nncf.experimental.tensor.tensor import Tensor
from nncf.openvino.graph.metatypes import openvino_metatypes as om
from nncf.openvino.graph.model_transformer import OVModelTransformer
@@ -134,17 +135,14 @@ def transform_model(
compression_config = wc_params.compression_config
if compression_config.mode == CompressWeightsMode.NF4:
compression_dtype = ov.Type.nf4
elif compression_config.mode in [
CompressWeightsMode.INT8_ASYM,
CompressWeightsMode.INT8_SYM,
CompressWeightsMode.INT8,
CompressWeightsMode.INT4_ASYM,
CompressWeightsMode.INT4_SYM,
]:
if compression_config.mode in [CompressWeightsMode.INT4_ASYM, CompressWeightsMode.INT4_SYM]:
compression_dtype = ov.Type.u4
else:
compression_dtype = ov.Type.u8
elif compression_config.mode == CompressWeightsMode.INT4_SYM:
compression_dtype = ov.Type.i4
elif compression_config.mode == CompressWeightsMode.INT4_ASYM:
compression_dtype = ov.Type.u4
elif compression_config.mode == CompressWeightsMode.INT8_SYM:
compression_dtype = ov.Type.i8
elif compression_config.mode == CompressWeightsMode.INT8_ASYM:
compression_dtype = ov.Type.u8
else:
raise ValueError(f"{compression_config.mode.value} is not supported.")
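
For illustration, a sketch of the constants this mapping produces (placeholder data; the import path and the surrounding decompression subgraph are assumptions, the `opset.constant`/`opset.convert` calls mirror the ones used above): signed i4/i8 weights need no zero point, while unsigned u4/u8 weights keep one.

```python
import numpy as np
import openvino as ov
from openvino.runtime import opset13 as opset

# INT4_SYM: values stored as signed 4-bit integers in [-8, 7].
sym_const = opset.constant(np.array([[-8, -1, 0, 7]], dtype=np.int8),
                           dtype=ov.Type.i4, name="w_int4_sym")
sym_f16 = opset.convert(sym_const, ov.Type.f16)  # scaled directly, no zero point

# INT4_ASYM: unsigned 4-bit integers in [0, 15] plus a zero point constant.
asym_const = opset.constant(np.array([[0, 5, 10, 15]], dtype=np.uint8),
                            dtype=ov.Type.u4, name="w_int4_asym")
zp_const = opset.constant(np.array([[8]], dtype=np.uint8), dtype=ov.Type.u4, name="zp")
asym_f16 = opset.subtract(opset.convert(asym_const, ov.Type.f16),
                          opset.convert(zp_const, ov.Type.f16))
```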

@@ -175,7 +173,7 @@ def transform_model(
compressed_weight.tensor.data, dtype=compression_dtype, name=const_node_name
)
converted_const = opset.convert(compressed_const, ov.Type.f16)
if compressed_weight.zero_point is not None:
if compressed_weight.zero_point is not None and compressed_weight.tensor.dtype == TensorDataType.uint8:
zero_point_const = opset.constant(
compressed_weight.zero_point.data,
dtype=compression_dtype,
@@ -220,27 +218,28 @@ def dump_parameters(

@staticmethod
def get_compress_decompress_pipeline(
weight_compression_parameter: WeightCompressionParameters, w_shape, s_shape, z_p_shape
weight_compression_parameter: WeightCompressionParameters, w_shape, s_shape, z_p_shape=None
):
(
w,
s,
zp,
clamp,
) = OVWeightCompressionAlgoBackend.get_compress_pipeline(
parameters, clamp = OVWeightCompressionAlgoBackend.get_compress_pipeline(
weight_compression_parameter, w_shape, s_shape, z_p_shape, True
)

result = (clamp - zp) * s
model = ov.Model([result], [w, s, zp])
if len(parameters) == 3:
_, s, zp = parameters
result = (clamp - zp) * s
else:
s = parameters[1]
result = clamp * s

model = ov.Model([result], parameters)

compiled_model = ov.compile_model(model)

return lambda w, s, zp: compiled_model([w, s, zp])[0]
return lambda parameters: compiled_model(parameters)[0]

@staticmethod
def get_compress_pipeline(
weight_compression_parameter: WeightCompressionParameters, w_shape, s_shape, z_p_shape, return_nodes=False
weight_compression_parameter: WeightCompressionParameters, w_shape, s_shape, z_p_shape=None, return_nodes=False
):
config = weight_compression_parameter.compression_config
mode = config.mode
@@ -252,18 +251,23 @@ def get_compress_pipeline(

w = opset.parameter(w_shape, name="w")
s = opset.parameter(s_shape, name="s")
zp = opset.parameter(z_p_shape, name="zp")
parameters = [w, s]
compressed_w = w / s
if z_p_shape is not None:
zp = opset.parameter(z_p_shape, name="zp")
parameters.append(zp)
compressed_w += zp

result = opset.clamp(opset.round(w / s + zp), level_low, level_high, name="compressed_weights")
result = opset.clamp(opset.round(compressed_w), level_low, level_high, name="compressed_weights")

if return_nodes:
return w, s, zp, result
return parameters, result

model = ov.Model([result], [w, s, zp])
model = ov.Model([result], parameters)

compiled_model = ov.compile_model(model)

return lambda w, s, zp: compiled_model([w, s, zp])[0]
return lambda parameters: compiled_model(parameters)[0]
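
A hypothetical usage sketch of the refactored helper (shapes, data, and the `wc_params_*` objects are placeholders): symmetric configs omit `z_p_shape` and pass only weight and scale, asymmetric configs also pass the zero point.

```python
import numpy as np

w = np.random.rand(128, 256).astype(np.float32)  # placeholder weight
s = np.random.rand(128, 1).astype(np.float32)    # placeholder per-channel scales

# Symmetric mode: z_p_shape defaults to None, so the compiled model has two inputs.
compress_sym = OVWeightCompressionAlgoBackend.get_compress_pipeline(
    wc_params_sym, w.shape, s.shape
)
compressed = compress_sym([w, s])

# Asymmetric mode: the zero point becomes a third input.
zp = np.full((128, 1), 8, dtype=np.float32)      # placeholder zero points
compress_asym = OVWeightCompressionAlgoBackend.get_compress_pipeline(
    wc_params_asym, w.shape, s.shape, zp.shape
)
compressed_asym = compress_asym([w, s, zp])
```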


class OVAWQAlgoAlgoBackend(OVWeightCompressionAlgoBackend):