From 81e31190a98b8cf1846127da96bfd6e1dca9c595 Mon Sep 17 00:00:00 2001 From: Maciej Smyk Date: Fri, 15 Sep 2023 14:55:32 +0200 Subject: [PATCH] 19680 & 19849 (#19879) --- .../images/DEVELOPMENT_FLOW_V3_crunch.svg | 4 +- docs/_static/images/WHAT_TO_USE.svg | 4 +- .../model_optimization_guide.md | 6 ++- .../nncf/code/weight_compression_openvino.py | 6 +++ .../nncf/weight_compression.md | 37 +++++++++++++++++++ 5 files changed, 52 insertions(+), 5 deletions(-) create mode 100644 docs/optimization_guide/nncf/code/weight_compression_openvino.py create mode 100644 docs/optimization_guide/nncf/weight_compression.md diff --git a/docs/_static/images/DEVELOPMENT_FLOW_V3_crunch.svg b/docs/_static/images/DEVELOPMENT_FLOW_V3_crunch.svg index 99023d14b6c4b1..c183f387509c1c 100644 --- a/docs/_static/images/DEVELOPMENT_FLOW_V3_crunch.svg +++ b/docs/_static/images/DEVELOPMENT_FLOW_V3_crunch.svg @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:d02f98d7e50d663e0f525366b59cac16175ad437ee54147950a62b9bccb85030 -size 413331 +oid sha256:3ea9a60d2b9a1f3056f46b995f9cf5fee83d04e57ada60c60ed7661ab09d08c7 +size 367000 diff --git a/docs/_static/images/WHAT_TO_USE.svg b/docs/_static/images/WHAT_TO_USE.svg index 5a87c4558221db..5cba27ea4ee0eb 100644 --- a/docs/_static/images/WHAT_TO_USE.svg +++ b/docs/_static/images/WHAT_TO_USE.svg @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:b71a90fd9ec78356eef5ef0c9d80831c1439fbfc05d42fc0ad648f4b5aa151aa -size 286982 +oid sha256:2a30e182191979bf0693d9060eee2596313e162042f68bce31859670e21ad0c4 +size 206149 diff --git a/docs/optimization_guide/model_optimization_guide.md b/docs/optimization_guide/model_optimization_guide.md index 718fd5310aaea3..02963573151d89 100644 --- a/docs/optimization_guide/model_optimization_guide.md +++ b/docs/optimization_guide/model_optimization_guide.md @@ -8,6 +8,7 @@ ptq_introduction tmo_introduction + weight_compression Model optimization is an optional offline step of improving the final model performance and reducing the model size by applying special optimization methods, such as 8-bit quantization, pruning, etc. OpenVINO offers two optimization paths implemented in `Neural Network Compression Framework (NNCF) `__: @@ -16,9 +17,11 @@ Model optimization is an optional offline step of improving the final model perf - :doc:`Training-time Optimization `, a suite of advanced methods for training-time model optimization within the DL framework, such as PyTorch and TensorFlow 2.x. It supports methods like Quantization-aware Training, Structured and Unstructured Pruning, etc. +- :doc:`Weight Compression `, an easy-to-use method for reducing the footprint of Large Language Models and accelerating their inference. + .. note:: OpenVINO also supports optimized models (for example, quantized) from source frameworks such as PyTorch, TensorFlow, and ONNX (in Q/DQ; Quantize/DeQuantize format). No special steps are required in this case and optimized models can be converted to the OpenVINO Intermediate Representation format (IR) right away. -Post-training Quantization is the fastest way to optimize a model and should be applied first, but it is limited in terms of achievable accuracy-performance trade-off. The recommended approach to obtain OpenVINO quantized model is to convert a model from original framework to ``ov.Model`` and ensure that the model works correctly in OpenVINO, for example, by calculating the model metrics.
Then, ``ov.Model`` can be used as input for the ``nncf.quantize()`` method to get the quantized model (see the diagram below). +Post-training Quantization is the fastest way to optimize an arbitrary DL model and should be applied first, but it is limited in terms of achievable accuracy-performance trade-off. The recommended approach to obtain an OpenVINO quantized model is to convert a model from the original framework to ``ov.Model`` and ensure that the model works correctly in OpenVINO, for example, by calculating the model metrics. Then, ``ov.Model`` can be used as input for the ``nncf.quantize()`` method to get the quantized model (see the diagram below). In case of unsatisfactory accuracy or performance after Post-training Quantization, Training-time Optimization can be used as an option. @@ -33,6 +36,7 @@ Additional Resources - :doc:`Post-training Quantization ` - :doc:`Training-time Optimization ` +- :doc:`Weight Compression ` - :doc:`Deployment optimization ` - `HuggingFace Optimum Intel `__ diff --git a/docs/optimization_guide/nncf/code/weight_compression_openvino.py b/docs/optimization_guide/nncf/code/weight_compression_openvino.py new file mode 100644 index 00000000000000..c9ab67efd5aa32 --- /dev/null +++ b/docs/optimization_guide/nncf/code/weight_compression_openvino.py @@ -0,0 +1,6 @@ +#! [compression_8bit] +from nncf import compress_weights + +... +model = compress_weights(model) # model is an openvino.Model object +#! [compression_8bit] \ No newline at end of file diff --git a/docs/optimization_guide/nncf/weight_compression.md b/docs/optimization_guide/nncf/weight_compression.md new file mode 100644 index 00000000000000..efec4839d47a1f --- /dev/null +++ b/docs/optimization_guide/nncf/weight_compression.md @@ -0,0 +1,37 @@ +# Weight Compression {#weight_compression} + +@sphinxdirective + +Enhancing Model Efficiency with Weight Compression +################################################################## + +Weight compression aims to reduce the memory footprint of a model. It can also lead to a significant performance improvement for large memory-bound models, such as Large Language Models (LLMs). LLMs and other models that require extensive memory to store the weights during inference can benefit from weight compression in the following ways: + +- enabling the inference of exceptionally large models that cannot be accommodated in the memory of the device; +- improving the inference performance of the models by reducing memory access latency when computing operations with weights, for example, in Linear layers. + +Currently, NNCF provides 8-bit weight quantization as a compression method primarily designed to optimize LLMs. The main difference between weight compression and full model quantization (post-training quantization) is that activations remain floating-point in the case of weight compression, which leads to better accuracy. Weight compression for LLMs provides a solid inference performance improvement that is on par with that of full model quantization. In addition, weight compression is data-free and does not require a calibration dataset, making it easy to use. + +Compress Model Weights +###################### + +The code snippet below shows how to compress the weights of a model represented in OpenVINO IR using NNCF: + +.. tab-set:: + + .. tab-item:: OpenVINO + :sync: openvino + + ..
doxygensnippet:: docs/optimization_guide/nncf/code/weight_compression_openvino.py + :language: python + :fragment: [compression_8bit] + +Now, the model is ready for compilation and inference. It can also be saved into a compressed format, resulting in a smaller binary file. + +Additional Resources +#################### + +- :doc:`Post-training Quantization ` +- :doc:`Training-time Optimization ` + +@endsphinxdirective
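For readers who want to try the documented snippet outside the documentation build, a fuller, self-contained sketch of the same 8-bit weight compression flow is shown below. It is only an illustration, not part of the patch above: the IR path ``model.xml``, the output name ``model_int8.xml``, and the ``CPU`` device are hypothetical placeholders, and the sketch assumes that the ``openvino`` and ``nncf`` Python packages are installed.

from nncf import compress_weights
from openvino.runtime import Core, serialize

core = Core()
# Read an existing OpenVINO IR model (the path is a placeholder).
model = core.read_model("model.xml")

# Compress weights to 8-bit; activations stay floating-point,
# so no calibration dataset is needed.
model = compress_weights(model)

# Save the compressed model as a smaller IR, then compile and infer as usual.
serialize(model, "model_int8.xml")
compiled_model = core.compile_model(model, "CPU")

The saved ``model_int8.xml`` and its accompanying ``.bin`` file can later be read back with ``core.read_model`` like any other IR, which is what the "ready for compilation and inference" paragraph in the new guide refers to.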