Updated Gen.AI and NNCF documentation (#22793)
Co-authored-by: Lyalyushkin Nikolay <[email protected]>
Co-authored-by: Liubov Talamanova <[email protected]>
Co-authored-by: Tatiana Savina <[email protected]>
4 people authored Feb 23, 2024
1 parent 1e5d0e5 commit c768c0e
Showing 2 changed files with 113 additions and 14 deletions.
61 changes: 52 additions & 9 deletions docs/articles_en/openvino-workflow/generative-ai-models-guide.rst
@@ -11,18 +11,22 @@ comes to generative models, OpenVINO supports:

* Conversion, optimization and inference for text, image and audio generative models, for
example, Llama 2, MPT, OPT, Stable Diffusion, Stable Diffusion XL, etc.
* 8-bit and 4-bit weight compression, including compression of Embedding layers.
* Storage format reduction (fp16 precision for non-compressed models and int8/int4 for compressed
  models), including GPTQ models from Hugging Face.
* Inference on CPU and GPU platforms, including integrated Intel® Processor Graphics,
discrete Intel® Arc™ A-Series Graphics, and discrete Intel® Data Center GPU Flex Series.
* Fused inference primitives, for example, Scaled Dot Product Attention, Rotary Positional Embedding,
Group Query Attention, Mixture of Experts, etc.
* In-place KV-cache, Dynamic quantization, KV-cache quantization and encapsulation.
* Dynamic beam size configuration and speculative sampling.


OpenVINO offers two main paths for Generative AI use cases:

* Using OpenVINO as a backend for Hugging Face frameworks (transformers, diffusers) through
  the `Optimum Intel <https://huggingface.co/docs/optimum/intel/inference>`__ extension
  (a minimal sketch of this path follows below).
* Using OpenVINO native APIs (Python and C++) with `custom pipeline code <https://github.com/openvinotoolkit/openvino.genai>`__.
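
For illustration, below is a minimal sketch of the first path (Optimum Intel as a Hugging Face backend), assuming the ``optimum[openvino]`` package is installed; the ``gpt2`` model ID and the prompt are placeholders only:

.. code-block:: python

   from optimum.intel import OVModelForCausalLM
   from transformers import AutoTokenizer

   model_id = "gpt2"  # placeholder; any supported causal LM from the Hugging Face Hub works

   # export=True converts the original Hugging Face model to OpenVINO IR on the fly
   model = OVModelForCausalLM.from_pretrained(model_id, export=True)
   tokenizer = AutoTokenizer.from_pretrained(model_id)

   inputs = tokenizer("OpenVINO is", return_tensors="pt")
   outputs = model.generate(**inputs, max_new_tokens=20)
   print(tokenizer.decode(outputs[0], skip_special_tokens=True))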


In both cases, OpenVINO runtime and tools are used; the difference is mostly in the preferred
@@ -144,15 +148,18 @@ also available for CLI interface as the ``--int8`` option.

8-bit weight compression is enabled by default for models larger than 1 billion parameters.
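
For instance, 8-bit weight compression can also be requested explicitly for smaller models through the ``load_in_8bit`` argument (a minimal sketch; the ``gpt2`` model ID is a placeholder):

.. code-block:: python

   from optimum.intel import OVModelForCausalLM

   # Explicitly request 8-bit weight compression; for models larger than
   # 1 billion parameters this is already the default behavior.
   model = OVModelForCausalLM.from_pretrained("gpt2", export=True, load_in_8bit=True)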

`Optimum Intel <https://huggingface.co/docs/optimum/intel/inference>`__ also provides 4-bit weight compression via the ``OVWeightQuantizationConfig`` class, which controls the weight quantization parameters.

.. code-block:: python

   from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

   model = OVModelForCausalLM.from_pretrained(
       model_id,
       export=True,
       quantization_config=OVWeightQuantizationConfig(bits=4, sym=False, ratio=0.8, dataset="ptb"),
   )

The optimized model can be saved as usual with a call to ``save_pretrained()``.
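
The compressed model can then be reloaded directly from the saved directory without repeating the export step, for example (the directory name below is arbitrary):

.. code-block:: python

   # Save the compressed OpenVINO IR and its configuration files
   model.save_pretrained("model-int4-ov")

   # Later, load the already-compressed IR directly; no export or
   # re-compression is needed at this point.
   model = OVModelForCausalLM.from_pretrained("model-int4-ov")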
@@ -192,6 +199,42 @@ The model's form matters when an OpenVINO IR model is exported from Optimum-Intel.
This is because stateful and stateless models have a different number of inputs and outputs.
Learn more about the `native OpenVINO API <Running-Generative-AI-Models-using-Native-OpenVINO-APIs>`__.
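
The difference in the number of inputs and outputs can be inspected directly on an exported IR, for example (the path below is a placeholder for your exported model):

.. code-block:: python

   import openvino as ov

   core = ov.Core()
   ov_model = core.read_model("openvino_model.xml")  # placeholder path to an exported IR

   # A stateful model keeps the KV-cache inside the model, so it exposes fewer
   # inputs and outputs than the stateless variant of the same LLM.
   print(len(ov_model.inputs), len(ov_model.outputs))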

Enabling OpenVINO Runtime Optimizations
+++++++++++++++++++++++++++++++++++++++
OpenVINO runtime provides a set of optimizations for more efficient LLM inference. This includes **Dynamic quantization** of activations of 4/8-bit quantized MatMuls and **KV-cache quantization**.

* **Dynamic quantization** enables quantization of activations of MatMul operations that have 4 or 8-bit quantized weights (see :doc:`LLM Weight Compression <weight_compression>`).
  It improves inference latency and throughput of LLMs, though it may cause an insignificant deviation in generation accuracy. Quantization is performed in a
  group-wise manner, with a configurable group size, which means that values in a group share quantization parameters. Larger group sizes lead to faster inference but lower accuracy. Recommended group size values are ``32``, ``64``, or ``128``. To enable Dynamic quantization, use the corresponding
  inference property as follows:


.. code-block:: python

   model = OVModelForCausalLM.from_pretrained(
       model_path,
       ov_config={"DYNAMIC_QUANTIZATION_GROUP_SIZE": "32", "PERFORMANCE_HINT": "LATENCY"},
   )

* **KV-cache quantization** allows lowering the precision of the Key and Value cache in LLMs. This helps reduce memory consumption during inference, improving latency and throughput. KV-cache can be quantized into the following precisions:
  ``u8``, ``bf16``, ``f16``. If ``u8`` is used, KV-cache quantization is also applied in a group-wise manner and uses the ``DYNAMIC_QUANTIZATION_GROUP_SIZE`` value if it is defined.
  Otherwise, a group size of ``32`` is used by default. KV-cache quantization can be enabled as follows:


.. code-block:: python

   model = OVModelForCausalLM.from_pretrained(
       model_path,
       ov_config={"KV_CACHE_PRECISION": "u8", "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32", "PERFORMANCE_HINT": "LATENCY"},
   )

.. note::

   Currently, both Dynamic quantization and KV-cache quantization are available for the CPU device.


Working with Models Tuned with LoRA
++++++++++++++++++++++++++++++++++++
@@ -54,6 +54,10 @@ Now, the model is ready for compilation and inference. It can also be saved into

* ``nncf.SensitivityMetric.MEAN_ACTIVATION_MAGNITUDE`` - requires dataset. The mean magnitude of the layers' inputs multiplied by inverted 8-bit quantization noise.

* ``all_layers`` - boolean parameter that enables INT4 weight quantization of all Fully-Connected and Embedding layers, including the first and last layers in the model.

* ``awq`` - boolean parameter that enables the AWQ method for more accurate INT4 weight quantization. Especially helpful when the weights of all the layers are quantized to 4 bits. The method can sometimes result in reduced accuracy when used with Dynamic Quantization of activations. Requires dataset.
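
For orientation, the following is a hedged sketch of how these parameters can be combined in a data-aware ``nncf.compress_weights()`` call; the model path and the calibration samples are placeholders only:

.. code-block:: python

   import nncf
   import openvino as ov

   core = ov.Core()
   ov_model = core.read_model("model.xml")  # placeholder path to an OpenVINO IR

   # Placeholder calibration set: a few samples in the model's input format.
   # Replace with real tokenized prompts for meaningful results.
   raw_samples = [{"input_ids": [[1, 2, 3]], "attention_mask": [[1, 1, 1]]}]
   calibration_dataset = nncf.Dataset(raw_samples, lambda sample: sample)

   compressed_model = nncf.compress_weights(
       ov_model,
       mode=nncf.CompressWeightsMode.INT4_SYM,
       ratio=0.8,
       group_size=128,
       dataset=calibration_dataset,
       awq=True,                 # AWQ method described above; requires the dataset
       all_layers=True,          # also quantize the first/last layers and Embeddings
       sensitivity_metric=nncf.SensitivityMetric.MEAN_ACTIVATION_MAGNITUDE,
   )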


The example below shows data-free 4-bit weight quantization applied on top of OpenVINO IR:

@@ -66,23 +70,24 @@ The example below shows data-free 4-bit weight quantization applied on top of OpenVINO IR:
:language: python
:fragment: [compression_4bit]

For data-aware weight compression, refer to the following `example <https://github.com/openvinotoolkit/nncf/tree/develop/examples/llm_compression/openvino/tiny_llama>`__.

.. note::

   OpenVINO also supports 4-bit models from the Hugging Face `Transformers <https://github.com/huggingface/transformers>`__ library optimized
   with `GPTQ <https://github.com/PanQiWei/AutoGPTQ>`__. In this case, there is no need for an additional model optimization step because model conversion will automatically preserve the INT4 optimization results, allowing model inference to benefit from it.
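
As an illustration, such a model is loaded through Optimum Intel in the same way as any other model; the GPTQ model ID below is a hypothetical example:

.. code-block:: python

   from optimum.intel import OVModelForCausalLM

   # Hypothetical GPTQ-quantized checkpoint; its INT4 weights are preserved
   # during conversion to OpenVINO IR, so no extra optimization step is needed.
   model = OVModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-Chat-GPTQ", export=True)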


The table below shows examples of text-generation Language Models with different optimization settings in a data-free setup, where no dataset is used at the optimization step.
The Perplexity metric is measured on the `Lambada OpenAI dataset <https://github.com/openai/gpt-2/issues/131#issuecomment-497136199>`__.

.. list-table::
:widths: 40 55 25 25
:header-rows: 1

* - Model
- Optimization
- Perplexity\*
- Model Size (Gb)
* - databricks/dolly-v2-3b
- FP32
@@ -144,13 +149,64 @@ The table below shows examples of Text Generation models with different optimization settings:
- INT4_SYM,group_size=64,ratio=0.8
- 2.98
- 8.0


The following table shows the accuracy metric in a data-aware 4-bit weight quantization setup, measured on the `Wikitext dataset <https://arxiv.org/pdf/1609.07843.pdf>`__.

.. list-table::
:widths: 40 55 25 25
:header-rows: 1

* - Model
- Optimization
- Word perplexity\*
- Model Size (Gb)
* - meta-llama/llama-7b-chat-hf
- FP32
- 11.57
- 12.61
* - meta-llama/llama-7b-chat-hf
- INT4_SYM,group_size=128,ratio=1.0,awq=True
- 12.34
- 2.6
* - stabilityai_stablelm-3b-4e1t
- FP32
- 10.17
- 10.41
* - stabilityai_stablelm-3b-4e1t
- INT4_SYM,group_size=64,ratio=1.0,awq=True
- 10.89
- 2.6
* - HuggingFaceH4/zephyr-7b-beta
- FP32
- 9.82
- 13.99
* - HuggingFaceH4/zephyr-7b-beta
- INT4_SYM,group_size=128,ratio=1.0
- 10.32
- 2.6


\*Perplexity metric in both tables was measured without the Dynamic Quantization feature enabled in the OpenVINO runtime.



Auto-tuning of Weight Compression Parameters
############################################

To find the optimal weight compression parameters for a particular model, refer to the `example <https://github.com/openvinotoolkit/nncf/tree/develop/examples/llm_compression/openvino/tiny_llama_find_hyperparams>`__, where weight compression parameters are searched over a subset of possible values. To speed up the search, a purpose-built
validation pipeline called `WhoWhatBench <https://github.com/openvinotoolkit/openvino.genai/tree/master/llm_bench/python/who_what_benchmark>`__ is used.
The pipeline can quickly evaluate the changes in the accuracy of the optimized model compared to the baseline.


Additional Resources
####################

- `Data-aware Weight Compression Example <https://github.com/openvinotoolkit/nncf/tree/develop/examples/llm_compression/openvino/tiny_llama>`__
- `Tune Weight Compression Parameters Example <https://github.com/openvinotoolkit/nncf/tree/develop/examples/llm_compression/openvino/tiny_llama_find_hyperparams>`__
- `WhoWhatBench <https://github.com/openvinotoolkit/openvino.genai/tree/master/llm_bench/python/who_what_benchmark>`__
- `NNCF GitHub <https://github.com/openvinotoolkit/nncf>`__
- :doc:`Post-training Quantization <ptq_introduction>`
- :doc:`Training-time Optimization <tmo_introduction>`

