Updated Gen.AI and NNCF documentation (#22793)
Co-authored-by: Lyalyushkin Nikolay <[email protected]>
Co-authored-by: Liubov Talamanova <[email protected]>
Co-authored-by: Tatiana Savina <[email protected]>
4 people authored Feb 23, 2024
1 parent 1e5d0e5 commit c768c0e
Showing 2 changed files with 113 additions and 14 deletions.
61 changes: 52 additions & 9 deletions docs/articles_en/openvino-workflow/generative-ai-models-guide.rst
@@ -11,18 +11,22 @@ comes to generative models, OpenVINO supports:

* Conversion, optimization and inference for text, image and audio generative models, for
example, Llama 2, MPT, OPT, Stable Diffusion, Stable Diffusion XL, etc.
* 8-bit and 4-bit weight compression, including compression of Embedding layers.
* Storage format reduction (fp16 precision for non-compressed models and int8/int4 for compressed
  models), including GPTQ models from Hugging Face.
* Inference on CPU and GPU platforms, including integrated Intel® Processor Graphics,
discrete Intel® Arc™ A-Series Graphics, and discrete Intel® Data Center GPU Flex Series.
* Fused inference primitives, for example, Scaled Dot Product Attention, Rotary Positional Embedding,
Group Query Attention, Mixture of Experts, etc.
* In-place KV-cache, Dynamic quantization, KV-cache quantization and encapsulation.
* Dynamic beam size configuration and speculative sampling.


OpenVINO offers two main paths for Generative AI use cases:

* Using OpenVINO as a backend for Hugging Face frameworks (transformers, diffusers) through
  the `Optimum Intel <https://huggingface.co/docs/optimum/intel/inference>`__ extension
  (a minimal sketch of this path follows below).
* Using OpenVINO native APIs (Python and C++) with `custom pipeline code <https://github.com/openvinotoolkit/openvino.genai>`__.
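
For illustration, below is a minimal sketch of the first path (Optimum Intel as a Hugging Face backend), assuming the ``optimum[openvino]`` package is installed; the ``gpt2`` model ID and the prompt are placeholders only:

.. code-block:: python

   from optimum.intel import OVModelForCausalLM
   from transformers import AutoTokenizer

   model_id = "gpt2"  # placeholder; any supported causal LM from the Hugging Face Hub works

   # export=True converts the original Hugging Face model to OpenVINO IR on the fly
   model = OVModelForCausalLM.from_pretrained(model_id, export=True)
   tokenizer = AutoTokenizer.from_pretrained(model_id)

   inputs = tokenizer("OpenVINO is", return_tensors="pt")
   outputs = model.generate(**inputs, max_new_tokens=20)
   print(tokenizer.decode(outputs[0], skip_special_tokens=True))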


In both cases, OpenVINO runtime and tools are used; the difference is mostly in the preferred
@@ -144,15 +148,18 @@ also available for CLI interface as the ``--int8`` option.

8-bit weight compression is enabled by default for models larger than 1 billion parameters.
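
For instance, 8-bit weight compression can also be requested explicitly for smaller models through the ``load_in_8bit`` argument (a minimal sketch; the ``gpt2`` model ID is a placeholder):

.. code-block:: python

   from optimum.intel import OVModelForCausalLM

   # Explicitly request 8-bit weight compression; for models larger than
   # 1 billion parameters this is already the default behavior.
   model = OVModelForCausalLM.from_pretrained("gpt2", export=True, load_in_8bit=True)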

`Optimum Intel <https://huggingface.co/docs/optimum/intel/inference>`__ also provides 4-bit weight compression via the ``OVWeightQuantizationConfig`` class, which controls the weight quantization parameters.

.. code-block:: python

   from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

   model = OVModelForCausalLM.from_pretrained(
       model_id,
       export=True,
       quantization_config=OVWeightQuantizationConfig(bits=4, sym=False, ratio=0.8, dataset="ptb"),
   )

The optimized model can be saved as usual with a call to ``save_pretrained()``.
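
The compressed model can then be reloaded directly from the saved directory without repeating the export step, for example (the directory name below is arbitrary):

.. code-block:: python

   # Save the compressed OpenVINO IR and its configuration files
   model.save_pretrained("model-int4-ov")

   # Later, load the already-compressed IR directly; no export or
   # re-compression is needed at this point.
   model = OVModelForCausalLM.from_pretrained("model-int4-ov")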
@@ -192,6 +199,42 @@ The model's form matters when an OpenVINO IR model is exported from Optimum-Intel.
This is because stateful and stateless models have a different number of inputs and outputs.
Learn more about the `native OpenVINO API <Running-Generative-AI-Models-using-Native-OpenVINO-APIs>`__.
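
The difference in the number of inputs and outputs can be inspected directly on an exported IR, for example (the path below is a placeholder for your exported model):

.. code-block:: python

   import openvino as ov

   core = ov.Core()
   ov_model = core.read_model("openvino_model.xml")  # placeholder path to an exported IR

   # A stateful model keeps the KV-cache inside the model, so it exposes fewer
   # inputs and outputs than the stateless variant of the same LLM.
   print(len(ov_model.inputs), len(ov_model.outputs))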

Enabling OpenVINO Runtime Optimizations
+++++++++++++++++++++++++++++++++++++++
OpenVINO runtime provides a set of optimizations for more efficient LLM inference. This includes **Dynamic quantization** of activations of 4/8-bit quantized MatMuls and **KV-cache quantization**.

* **Dynamic quantization** enables quantization of activations of MatMul operations that have 4 or 8-bit quantized weights (see :doc:`LLM Weight Compression <weight_compression>`).
  It improves inference latency and throughput of LLMs, though it may cause an insignificant deviation in generation accuracy. Quantization is performed in a
  group-wise manner, with a configurable group size, which means that values in a group share quantization parameters. Larger group sizes lead to faster inference but lower accuracy. Recommended group size values are ``32``, ``64``, or ``128``. To enable Dynamic quantization, use the corresponding
  inference property as follows:


.. code-block:: python

   model = OVModelForCausalLM.from_pretrained(
       model_path,
       ov_config={"DYNAMIC_QUANTIZATION_GROUP_SIZE": "32", "PERFORMANCE_HINT": "LATENCY"},
   )

* **KV-cache quantization** allows lowering the precision of the Key and Value cache in LLMs. This helps reduce memory consumption during inference, improving latency and throughput. KV-cache can be quantized into the following precisions:
  ``u8``, ``bf16``, ``f16``. If ``u8`` is used, KV-cache quantization is also applied in a group-wise manner and uses the ``DYNAMIC_QUANTIZATION_GROUP_SIZE`` value if it is defined.
  Otherwise, a group size of ``32`` is used by default. KV-cache quantization can be enabled as follows:


.. code-block:: python

   model = OVModelForCausalLM.from_pretrained(
       model_path,
       ov_config={"KV_CACHE_PRECISION": "u8", "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32", "PERFORMANCE_HINT": "LATENCY"},
   )

.. note::

   Currently, both Dynamic quantization and KV-cache quantization are available for the CPU device.


Working with Models Tuned with LoRA
++++++++++++++++++++++++++++++++++++
@@ -54,6 +54,10 @@ Now, the model is ready for compilation and inference. It can also be saved into

* ``nncf.SensitivityMetric.MEAN_ACTIVATION_MAGNITUDE`` - requires dataset. The mean magnitude of the layers' inputs multiplied by inverted 8-bit quantization noise.

* ``all_layers`` - boolean parameter that enables INT4 weight quantization of all Fully-Connected and Embedding layers, including the first and last layers in the model.

* ``awq`` - boolean parameter that enables the AWQ method for more accurate INT4 weight quantization. Especially helpful when the weights of all the layers are quantized to 4 bits. The method can sometimes result in reduced accuracy when used with Dynamic Quantization of activations. Requires dataset.
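
For orientation, the following is a hedged sketch of how these parameters can be combined in a data-aware ``nncf.compress_weights()`` call; the model path and the calibration samples are placeholders only:

.. code-block:: python

   import nncf
   import openvino as ov

   core = ov.Core()
   ov_model = core.read_model("model.xml")  # placeholder path to an OpenVINO IR

   # Placeholder calibration set: a few samples in the model's input format.
   # Replace with real tokenized prompts for meaningful results.
   raw_samples = [{"input_ids": [[1, 2, 3]], "attention_mask": [[1, 1, 1]]}]
   calibration_dataset = nncf.Dataset(raw_samples, lambda sample: sample)

   compressed_model = nncf.compress_weights(
       ov_model,
       mode=nncf.CompressWeightsMode.INT4_SYM,
       ratio=0.8,
       group_size=128,
       dataset=calibration_dataset,
       awq=True,                 # AWQ method described above; requires the dataset
       all_layers=True,          # also quantize the first/last layers and Embeddings
       sensitivity_metric=nncf.SensitivityMetric.MEAN_ACTIVATION_MAGNITUDE,
   )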


The example below shows data-free 4-bit weight quantization applied on top of OpenVINO IR:

@@ -66,23 +70,24 @@ The example below shows data-free 4-bit weight quantization applied on top of OpenVINO IR:
:language: python
:fragment: [compression_4bit]

For data-aware weight compression, refer to the following `example <https://github.com/openvinotoolkit/nncf/tree/develop/examples/llm_compression/openvino/tiny_llama>`__.

.. note::

   OpenVINO also supports 4-bit models from the Hugging Face `Transformers <https://github.com/huggingface/transformers>`__ library optimized
   with `GPTQ <https://github.com/PanQiWei/AutoGPTQ>`__. In this case, there is no need for an additional model optimization step because model conversion will automatically preserve the INT4 optimization results, allowing model inference to benefit from it.
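
As an illustration, such a model is loaded through Optimum Intel in the same way as any other model; the GPTQ model ID below is a hypothetical example:

.. code-block:: python

   from optimum.intel import OVModelForCausalLM

   # Hypothetical GPTQ-quantized checkpoint; its INT4 weights are preserved
   # during conversion to OpenVINO IR, so no extra optimization step is needed.
   model = OVModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-Chat-GPTQ", export=True)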


The table below shows examples of text-generation Language Models with different optimization settings in a data-free setup, where no dataset is used at the optimization step.
The Perplexity metric is measured on the `Lambada OpenAI dataset <https://github.com/openai/gpt-2/issues/131#issuecomment-497136199>`__.

.. list-table::
:widths: 40 55 25 25
:header-rows: 1

* - Model
- Optimization
- Perplexity\*
- Model Size (Gb)
* - databricks/dolly-v2-3b
- FP32
@@ -144,13 +149,64 @@ The table below shows examples of Text Generation models with different optimization settings:
- INT4_SYM,group_size=64,ratio=0.8
- 2.98
- 8.0


The following table shows the accuracy metric in a data-aware 4-bit weight quantization setup, measured on the `Wikitext dataset <https://arxiv.org/pdf/1609.07843.pdf>`__.

.. list-table::
:widths: 40 55 25 25
:header-rows: 1

* - Model
- Optimization
- Word perplexity\*
- Model Size (Gb)
* - meta-llama/llama-7b-chat-hf
- FP32
- 11.57
- 12.61
* - meta-llama/llama-7b-chat-hf
- INT4_SYM,group_size=128,ratio=1.0,awq=True
- 12.34
- 2.6
* - stabilityai_stablelm-3b-4e1t
- FP32
- 10.17
- 10.41
* - stabilityai_stablelm-3b-4e1t
- INT4_SYM,group_size=64,ratio=1.0,awq=True
- 10.89
- 2.6
* - HuggingFaceH4/zephyr-7b-beta
- FP32
- 9.82
- 13.99
* - HuggingFaceH4/zephyr-7b-beta
- INT4_SYM,group_size=128,ratio=1.0
- 10.32
- 2.6


\*Perplexity metric in both tables was measured without the Dynamic Quantization feature enabled in the OpenVINO runtime.



Auto-tuning of Weight Compression Parameters
############################################

To find the optimal weight compression parameters for a particular model, refer to the `example <https://github.com/openvinotoolkit/nncf/tree/develop/examples/llm_compression/openvino/tiny_llama_find_hyperparams>`__, where weight compression parameters are searched over a subset of possible values. To speed up the search, a purpose-built
validation pipeline called `WhoWhatBench <https://github.com/openvinotoolkit/openvino.genai/tree/master/llm_bench/python/who_what_benchmark>`__ is used.
The pipeline can quickly evaluate the changes in the accuracy of the optimized model compared to the baseline.


Additional Resources
####################

- `Data-aware Weight Compression Example <https://github.com/openvinotoolkit/nncf/tree/develop/examples/llm_compression/openvino/tiny_llama>`__
- `Tune Weight Compression Parameters Example <https://github.com/openvinotoolkit/nncf/tree/develop/examples/llm_compression/openvino/tiny_llama_find_hyperparams>`__
- `WhoWhatBench <https://github.com/openvinotoolkit/openvino.genai/tree/master/llm_bench/python/who_what_benchmark>`__
- `NNCF GitHub <https://github.com/openvinotoolkit/nncf>`__
- :doc:`Post-training Quantization <ptq_introduction>`
- :doc:`Training-time Optimization <tmo_introduction>`

