19680 & 19849 (openvinotoolkit#19879)
msmykx-intel authored and alvoron committed Nov 6, 2023
1 parent bbc563b commit 81e3119
Showing 5 changed files with 52 additions and 5 deletions.
4 changes: 2 additions & 2 deletions docs/_static/images/DEVELOPMENT_FLOW_V3_crunch.svg
4 changes: 2 additions & 2 deletions docs/_static/images/WHAT_TO_USE.svg
6 changes: 5 additions & 1 deletion docs/optimization_guide/model_optimization_guide.md
@@ -8,6 +8,7 @@

ptq_introduction
tmo_introduction
weight_compression


Model optimization is an optional offline step for improving the final model performance and reducing the model size by applying special optimization methods, such as 8-bit quantization, pruning, etc. OpenVINO offers the following optimization paths implemented in `Neural Network Compression Framework (NNCF) <https://github.com/openvinotoolkit/nncf>`__:
@@ -16,9 +17,11 @@

- :doc:`Training-time Optimization <tmo_introduction>`, a suite of advanced methods for training-time model optimization within DL frameworks such as PyTorch and TensorFlow 2.x. It supports methods like Quantization-aware Training, Structured and Unstructured Pruning, etc.

- :doc:`Weight Compression <weight_compression>`, an easy-to-use method for reducing the memory footprint of Large Language Models and accelerating their inference.

.. note:: OpenVINO also supports optimized models (for example, quantized) from source frameworks such as PyTorch, TensorFlow, and ONNX (in the Quantize/DeQuantize, Q/DQ, format). No special steps are required in this case, and optimized models can be converted to the OpenVINO Intermediate Representation (IR) format right away.

Post-training Quantization is the fastest way to optimize a model and should be applied first, but it is limited in terms of achievable accuracy-performance trade-off. The recommended approach to obtain OpenVINO quantized model is to convert a model from original framework to ``ov.Model`` and ensure that the model works correctly in OpenVINO, for example, by calculating the model metrics. Then, ``ov.Model`` can be used as input for the ``nncf.quantize()`` method to get the quantized model (see the diagram below).
Post-training Quantization is the fastest way to optimize an arbitrary DL model and should be applied first, but it is limited in terms of the achievable accuracy-performance trade-off. The recommended approach to obtain an OpenVINO quantized model is to convert a model from the original framework to ``ov.Model`` and ensure that the model works correctly in OpenVINO, for example, by calculating the model metrics. Then, ``ov.Model`` can be used as input for the ``nncf.quantize()`` method to get the quantized model (see the diagram below).
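
For illustration, a minimal sketch of this flow could look like the following, assuming a PyTorch source model ``torch_model``, an ``example_input`` tensor, and a user-provided ``calibration_loader`` (all hypothetical placeholders):

.. code-block:: python

   import nncf
   import openvino as ov

   # 1. Convert the source model to ov.Model and sanity-check it in OpenVINO.
   ov_model = ov.convert_model(torch_model, example_input=example_input)
   compiled_model = ov.compile_model(ov_model, "CPU")
   # ... run inference with compiled_model and verify the model metrics here ...

   # 2. Wrap the calibration data and quantize the validated ov.Model with NNCF.
   calibration_dataset = nncf.Dataset(calibration_loader, lambda item: item[0].numpy())
   quantized_model = nncf.quantize(ov_model, calibration_dataset)

   # 3. Save the quantized model to OpenVINO IR.
   ov.save_model(quantized_model, "quantized_model.xml")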

If the accuracy or performance achieved with Post-training Quantization is unsatisfactory, Training-time Optimization can be used as an option.

@@ -33,6 +36,7 @@ Additional Resources

- :doc:`Post-training Quantization <ptq_introduction>`
- :doc:`Training-time Optimization <tmo_introduction>`
- :doc:`Weight Compression <weight_compression>`
- :doc:`Deployment optimization <openvino_docs_deployment_optimization_guide_dldt_optimization_guide>`
- `HuggingFace Optimum Intel <https://huggingface.co/docs/optimum/intel/optimization_ov>`__

6 changes: 6 additions & 0 deletions docs/optimization_guide/nncf/code/weight_compression_openvino.py
@@ -0,0 +1,6 @@
#! [compression_8bit]
from nncf import compress_weights

...
model = compress_weights(model)  # model is an openvino.Model object
#! [compression_8bit]
37 changes: 37 additions & 0 deletions docs/optimization_guide/nncf/weight_compression.md
@@ -0,0 +1,37 @@
# Weight Compression {#weight_compression}

@sphinxdirective

Enhancing Model Efficiency with Weight Compression
##################################################################

Weight compression aims to reduce the memory footprint of a model. It can also lead to a significant performance improvement for large memory-bound models, such as Large Language Models (LLMs). LLMs and other models that require extensive memory to store their weights during inference can benefit from weight compression in the following ways:

- enabling the inference of exceptionally large models that cannot be accommodated in the memory of the device;
- improving inference performance by reducing the latency of memory access when computing operations with weights, for example, Linear layers.

Currently, NNCF provides 8-bit weight quantization as a compression method primarily designed to optimize LLMs. The main difference between weight compression and full model quantization (post-training quantization) is that activations remain floating-point in the case of weight compression, which leads to better accuracy. Weight compression for LLMs provides a solid inference performance improvement that is on par with that of full model quantization. In addition, weight compression is data-free and does not require a calibration dataset, making it easy to use.
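
For illustration only, the data-free nature of weight compression contrasts with post-training quantization roughly as follows (a sketch; ``model`` is an ``openvino.Model`` and ``calibration_dataset`` is an ``nncf.Dataset`` placeholder):

.. code-block:: python

   import nncf

   # Weight compression: weights are quantized to 8 bits, activations stay
   # floating-point, and no calibration data is required.
   compressed_model = nncf.compress_weights(model)

   # Full post-training quantization: both weights and activations are quantized,
   # which requires a calibration dataset.
   quantized_model = nncf.quantize(model, calibration_dataset)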

Compress Model Weights
######################

The code snippet below shows how to compress the weights of a model represented in OpenVINO IR using NNCF:

.. tab-set::

.. tab-item:: OpenVINO
:sync: openvino

.. doxygensnippet:: docs/optimization_guide/nncf/code/weight_compression_openvino.py
:language: python
:fragment: [compression_8bit]

Now, the model is ready for compilation and inference. It can also be saved into a compressed format, resulting in a smaller binary file.
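
For example, a minimal sketch of these two follow-up steps (assuming ``model`` is the compressed ``openvino.Model`` returned by ``compress_weights`` and ``"model_int8.xml"`` is an arbitrary output path) could look like this:

.. code-block:: python

   import openvino as ov

   # Compile the compressed model for inference on a target device ...
   compiled_model = ov.compile_model(model, "CPU")

   # ... or serialize it to OpenVINO IR; the 8-bit weights result in a smaller .bin file.
   ov.save_model(model, "model_int8.xml")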

Additional Resources
####################

- :doc:`Post-training Quantization <ptq_introduction>`
- :doc:`Training-time Optimization <tmo_introduction>`

@endsphinxdirective
