From f02e5efe947fb968d6d7bee44ab5d392a4e92564 Mon Sep 17 00:00:00 2001
From: andreyanufr
Date: Mon, 8 Jul 2024 16:12:16 +0200
Subject: [PATCH] Updated docs with MXFP4 (e2m1, e8m0) information. (#2797)

### Changes

Added short MXFP4 (e2m1, e8m0) description to docs.
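For context, here is a minimal usage sketch of the newly documented mode; the model path and the `ratio` value are illustrative (not defaults), and an NNCF build that already exposes `CompressWeightsMode.E2M1` is assumed:

```python
import openvino as ov
from nncf import compress_weights, CompressWeightsMode

# Load an OpenVINO IR model (path is illustrative).
model = ov.Core().read_model("model.xml")

# group_size=32 is the recommended group size for E2M1; ratio=0.8 keeps roughly
# 20% of the eligible layers in 8-bit asymmetric integer precision, and
# all_layers=True also compresses embeddings and the last linear layers to 4-bit.
compressed_model = compress_weights(
    model,
    mode=CompressWeightsMode.E2M1,
    group_size=32,
    ratio=0.8,
    all_layers=True,
)

ov.save_model(compressed_model, "compressed_model.xml")
```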
### Reason for changes

### Related tickets

### Tests

---
 docs/Algorithms.md                              |  1 +
 .../weights_compression/Usage.md                | 14 ++++++++++++--
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/docs/Algorithms.md b/docs/Algorithms.md
index b896df6dcf7..d6e23cebfdb 100644
--- a/docs/Algorithms.md
+++ b/docs/Algorithms.md
@@ -11,6 +11,7 @@
   - Symmetric 8 bit compression mode
   - Symmetric and asymmetric 4 bit compression mode
   - NF4 compression mode
+  - E2M1 weights with E8M0 scales compression mode
   - Mixed precision weights compression
   - Grouped weights compression

diff --git a/docs/usage/post_training_compression/weights_compression/Usage.md b/docs/usage/post_training_compression/weights_compression/Usage.md
index 528ed46ed15..fbbc6a23305 100644
--- a/docs/usage/post_training_compression/weights_compression/Usage.md
+++ b/docs/usage/post_training_compression/weights_compression/Usage.md
@@ -9,7 +9,7 @@ The Weights Compression algorithm is aimed at compressing the weights of the mod
 #### Supported modes

 By default, weights are compressed asymmetrically to 8-bit integer data type - "INT8_ASYM" mode.
-OpenVINO backend also supports 3 modes of mixed precision weight quantization with a 4-bit data type as a primary precision - INT4_SYM, INT4_ASYM and NF4. The primary precision in case of INT4_SYM mode is signed 4-bit integer and weights are quantized to it [symmetrically](/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#symmetric-quantization) without zero point. In case of INT4_ASYM mode - unsigned 4-bit integer and weight are quantized to it [asymmetrically](/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#asymmetric-quantization) with a typical non-fixed zero point. In case of NF4 mode - [nf4](https://arxiv.org/pdf/2305.14314v1.pdf) data type without zero point.
+OpenVINO backend also supports 4 modes of mixed precision weight quantization with a 4-bit data type as a primary precision - INT4_SYM, INT4_ASYM, NF4 and E2M1. The primary precision in case of INT4_SYM mode is signed 4-bit integer and weights are quantized to it [symmetrically](/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#symmetric-quantization) without zero point. In case of INT4_ASYM mode - unsigned 4-bit integer and weight are quantized to it [asymmetrically](/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#asymmetric-quantization) with a typical non-fixed zero point. In case of NF4 mode - [nf4](https://arxiv.org/pdf/2305.14314v1.pdf) data type without zero point. In case of E2M1 mode - [e2m1](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) data type without zero point and with an 8-bit [E8M0](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) scale.
 All 4-bit modes have a grouped quantization support, when small group of weights (e.g. 128) in the channel dimension share quantization parameters (scale).
 All embeddings, convolutions and last linear layers are always compressed to 8-bit integer data type. To quantize embeddings and last linear layers to 4-bit, use `all_layers=True`.
 Percent of the rest layers compressed to 4-bit can be configured by "ratio" parameter. E.g. ratio=0.9 means 90% of layers compressed to the corresponding 4-bit data type and the rest to 8-bit asymmetric integer data type.
@@ -144,6 +144,15 @@ from nncf import compress_weights, CompressWeightsMode
 compressed_model = compress_weights(model, mode=CompressWeightsMode.NF4)
 ```

+- `E2M1` mode can be considered for improving accuracy, but currently models quantized to e2m1 should not be faster than
+  models quantized to 8-bit asymmetric integer. Here is an example of how to compress weights to the e2m1 data type with
+  group size = 32 (recommended). Different `group_size` and `ratio` values are also supported.
+
+```python
+from nncf import compress_weights, CompressWeightsMode
+compressed_model = compress_weights(model, mode=CompressWeightsMode.E2M1, group_size=32, all_layers=True)
+```
+
 #### Evaluation results

 Here is the perplexity and model size before and after weight compression for different language models on the [Lambada OpenAI dataset](https://github.com/openai/gpt-2/issues/131#issuecomment-497136199).
@@ -512,8 +521,9 @@ Here is the perplexity and accuracy with data-free and data-aware mixed-precisio
 - The algorithm is supported for OpenVINO and PyTorch models.
 - The compression applies in-place.
 - The compressed model is not trainable.
-- INT4_SYM, INT4_ASYM and NF4 modes, grouped quantization and mixed precision selection is available for OpenVINO backend only.
+- INT4_SYM, INT4_ASYM, NF4 and E2M1 modes, grouped quantization and mixed precision selection are available for OpenVINO backend only.
 - NF4 support is experimental - models quantized to nf4 should not be faster models quantized to 8-bit integer.
+- E2M1 support is experimental - models quantized to e2m1 should not be faster than models quantized to 8-bit integer.

 #### Additional resources