19680 & 19849 (openvinotoolkit#19879)
msmykx-intel authored and alvoron committed Nov 6, 2023
1 parent bbc563b commit 81e3119
Showing 5 changed files with 52 additions and 5 deletions.
4 changes: 2 additions & 2 deletions docs/_static/images/DEVELOPMENT_FLOW_V3_crunch.svg
4 changes: 2 additions & 2 deletions docs/_static/images/WHAT_TO_USE.svg
6 changes: 5 additions & 1 deletion docs/optimization_guide/model_optimization_guide.md
@@ -8,6 +8,7 @@

ptq_introduction
tmo_introduction
weight_compression


Model optimization is an optional offline step for improving the final model performance and reducing the model size by applying special optimization methods, such as 8-bit quantization, pruning, etc. OpenVINO offers the following optimization paths implemented in `Neural Network Compression Framework (NNCF) <https://github.com/openvinotoolkit/nncf>`__:
@@ -16,9 +17,11 @@

- :doc:`Training-time Optimization <tmo_introduction>`, a suite of advanced methods for training-time model optimization within DL frameworks such as PyTorch and TensorFlow 2.x. It supports methods like Quantization-aware Training, Structured and Unstructured Pruning, etc.

- :doc:`Weight Compression <weight_compression>`, an easy-to-use method for reducing the memory footprint of Large Language Models and accelerating their inference.

.. note:: OpenVINO also supports optimized models (for example, quantized) from source frameworks such as PyTorch, TensorFlow, and ONNX (in the Quantize/DeQuantize, Q/DQ, format). No special steps are required in this case, and optimized models can be converted to the OpenVINO Intermediate Representation (IR) format right away.

Post-training Quantization is the fastest way to optimize a model and should be applied first, but it is limited in terms of achievable accuracy-performance trade-off. The recommended approach to obtain OpenVINO quantized model is to convert a model from original framework to ``ov.Model`` and ensure that the model works correctly in OpenVINO, for example, by calculating the model metrics. Then, ``ov.Model`` can be used as input for the ``nncf.quantize()`` method to get the quantized model (see the diagram below).
Post-training Quantization is the fastest way to optimize an arbitrary DL model and should be applied first, but it is limited in terms of the achievable accuracy-performance trade-off. The recommended approach to obtain an OpenVINO quantized model is to convert a model from the original framework to ``ov.Model`` and ensure that the model works correctly in OpenVINO, for example, by calculating the model metrics. Then, ``ov.Model`` can be used as input for the ``nncf.quantize()`` method to get the quantized model (see the diagram below).
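
For illustration, a minimal sketch of this flow could look like the following, assuming a PyTorch source model ``torch_model``, an ``example_input`` tensor, and a user-provided ``calibration_loader`` (all hypothetical placeholders):

.. code-block:: python

   import nncf
   import openvino as ov

   # 1. Convert the source model to ov.Model and sanity-check it in OpenVINO.
   ov_model = ov.convert_model(torch_model, example_input=example_input)
   compiled_model = ov.compile_model(ov_model, "CPU")
   # ... run inference with compiled_model and verify the model metrics here ...

   # 2. Wrap the calibration data and quantize the validated ov.Model with NNCF.
   calibration_dataset = nncf.Dataset(calibration_loader, lambda item: item[0].numpy())
   quantized_model = nncf.quantize(ov_model, calibration_dataset)

   # 3. Save the quantized model to OpenVINO IR.
   ov.save_model(quantized_model, "quantized_model.xml")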

If the accuracy or performance achieved with Post-training Quantization is unsatisfactory, Training-time Optimization can be used as an option.

@@ -33,6 +36,7 @@ Additional Resources

- :doc:`Post-training Quantization <ptq_introduction>`
- :doc:`Training-time Optimization <tmo_introduction>`
- :doc:`Weight Compression <weight_compression>`
- :doc:`Deployment optimization <openvino_docs_deployment_optimization_guide_dldt_optimization_guide>`
- `HuggingFace Optimum Intel <https://huggingface.co/docs/optimum/intel/optimization_ov>`__

6 changes: 6 additions & 0 deletions docs/optimization_guide/nncf/code/weight_compression_openvino.py
@@ -0,0 +1,6 @@
#! [compression_8bit]
from nncf import compress_weights

...
model = compress_weights(model)  # model is an openvino.Model object
#! [compression_8bit]
37 changes: 37 additions & 0 deletions docs/optimization_guide/nncf/weight_compression.md
@@ -0,0 +1,37 @@
# Weight Compression {#weight_compression}

@sphinxdirective

Enhancing Model Efficiency with Weight Compression
##################################################################

Weight compression aims to reduce the memory footprint of a model. It can also lead to a significant performance improvement for large memory-bound models, such as Large Language Models (LLMs). LLMs and other models that require extensive memory to store their weights during inference can benefit from weight compression in the following ways:

- enabling the inference of exceptionally large models that cannot be accommodated in the memory of the device;
- improving inference performance by reducing the latency of memory access when computing operations with weights, for example, Linear layers.

Currently, NNCF provides 8-bit weight quantization as a compression method primarily designed to optimize LLMs. The main difference between weight compression and full model quantization (post-training quantization) is that activations remain floating-point in the case of weight compression, which leads to better accuracy. Weight compression for LLMs provides a solid inference performance improvement that is on par with that of full model quantization. In addition, weight compression is data-free and does not require a calibration dataset, making it easy to use.
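
For illustration only, the data-free nature of weight compression contrasts with post-training quantization roughly as follows (a sketch; ``model`` is an ``openvino.Model`` and ``calibration_dataset`` is an ``nncf.Dataset`` placeholder):

.. code-block:: python

   import nncf

   # Weight compression: weights are quantized to 8 bits, activations stay
   # floating-point, and no calibration data is required.
   compressed_model = nncf.compress_weights(model)

   # Full post-training quantization: both weights and activations are quantized,
   # which requires a calibration dataset.
   quantized_model = nncf.quantize(model, calibration_dataset)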

Compress Model Weights
######################

The code snippet below shows how to compress the weights of a model represented in OpenVINO IR using NNCF:

.. tab-set::

.. tab-item:: OpenVINO
:sync: openvino

.. doxygensnippet:: docs/optimization_guide/nncf/code/weight_compression_openvino.py
:language: python
:fragment: [compression_8bit]

Now, the model is ready for compilation and inference. It can also be saved into a compressed format, resulting in a smaller binary file.
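
For example, a minimal sketch of these two follow-up steps (assuming ``model`` is the compressed ``openvino.Model`` returned by ``compress_weights`` and ``"model_int8.xml"`` is an arbitrary output path) could look like this:

.. code-block:: python

   import openvino as ov

   # Compile the compressed model for inference on a target device ...
   compiled_model = ov.compile_model(model, "CPU")

   # ... or serialize it to OpenVINO IR; the 8-bit weights result in a smaller .bin file.
   ov.save_model(model, "model_int8.xml")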

Additional Resources
####################

- :doc:`Post-training Quantization <ptq_introduction>`
- :doc:`Training-time Optimization <tmo_introduction>`

@endsphinxdirective
