Updated Gen.AI and NNCF documentation #22793

Merged
32 commits
a242895
Updated Gen.AI and NNCF documentation
AlexKoff88 Feb 12, 2024
acbeb48
Fixed style
AlexKoff88 Feb 12, 2024
25eaf8b
Added a section about compression parameters tuning
AlexKoff88 Feb 13, 2024
46689f8
Added extra info.
AlexKoff88 Feb 13, 2024
59e7e66
Update docs/articles_en/openvino_workflow/model_optimization_guide/we…
AlexKoff88 Feb 13, 2024
d9bd351
Update docs/articles_en/openvino_workflow/model_optimization_guide/we…
AlexKoff88 Feb 13, 2024
03be22c
Update docs/articles_en/openvino_workflow/model_optimization_guide/we…
AlexKoff88 Feb 13, 2024
7a96dbe
Update docs/articles_en/openvino_workflow/model_optimization_guide/we…
AlexKoff88 Feb 13, 2024
19d0b35
Applied comments
AlexKoff88 Feb 13, 2024
41e2f98
Updates set of supported features in Gen.AI
AlexKoff88 Feb 14, 2024
f206610
Update docs/articles_en/openvino_workflow/model_optimization_guide/we…
AlexKoff88 Feb 14, 2024
2a1e6dc
Update docs/articles_en/openvino_workflow/model_optimization_guide/we…
AlexKoff88 Feb 14, 2024
ca85eb5
Update docs/articles_en/openvino_workflow/model_optimization_guide/we…
AlexKoff88 Feb 14, 2024
d9f301a
Added information about KV-cache and Dynamic quantization.
AlexKoff88 Feb 15, 2024
b373a33
Fixed style
AlexKoff88 Feb 15, 2024
620b0ac
Updated Gen.AI text
AlexKoff88 Feb 16, 2024
ee7158e
Merge remote-tracking branch 'upstream/master' into ak/docs_4bit_data…
AlexKoff88 Feb 19, 2024
472bacc
Updated numbers
AlexKoff88 Feb 19, 2024
e7ba1f0
Fixed issues with ov_config doc
AlexKoff88 Feb 19, 2024
30bf4dc
Fixed gen.ai hints
AlexKoff88 Feb 19, 2024
eab47fb
Updated to the latest optimum options
AlexKoff88 Feb 19, 2024
b0ed56f
Fixed typo
AlexKoff88 Feb 19, 2024
20b8480
Update docs/articles_en/openvino-workflow/generative-ai-models-guide.rst
AlexKoff88 Feb 20, 2024
ecda063
Update docs/articles_en/openvino-workflow/generative-ai-models-guide.rst
AlexKoff88 Feb 20, 2024
0d56557
Update docs/articles_en/openvino-workflow/generative-ai-models-guide.rst
AlexKoff88 Feb 20, 2024
b838c34
Update docs/articles_en/openvino-workflow/generative-ai-models-guide.rst
AlexKoff88 Feb 20, 2024
edcf13c
Update docs/articles_en/openvino-workflow/model-optimization-guide/we…
AlexKoff88 Feb 20, 2024
f1705aa
Update docs/articles_en/openvino-workflow/model-optimization-guide/we…
AlexKoff88 Feb 20, 2024
4cd49fb
Update docs/articles_en/openvino-workflow/model-optimization-guide/we…
AlexKoff88 Feb 20, 2024
da41789
Update docs/articles_en/openvino-workflow/model-optimization-guide/we…
AlexKoff88 Feb 20, 2024
2fb61f8
Update docs/articles_en/openvino-workflow/model-optimization-guide/we…
AlexKoff88 Feb 20, 2024
6989d9e
Update docs/articles_en/openvino-workflow/model-optimization-guide/we…
AlexKoff88 Feb 21, 2024
@@ -11,18 +11,22 @@ comes to generative models, OpenVINO supports:

* Conversion, optimization and inference for text, image and audio generative models, for
example, Llama 2, MPT, OPT, Stable Diffusion, Stable Diffusion XL, etc.
* Int8 weight compression for text generation models.
* Storage format reduction (fp16 precision for non-compressed models and int8 for compressed
models).
* 8-bit and 4-bit weight compression including compression of Embedding layers.
* Storage format reduction (fp16 precision for non-compressed models and int8/int4 for compressed
models), including GPTQ models from Hugging Face.
* Inference on CPU and GPU platforms, including integrated Intel® Processor Graphics,
discrete Intel® Arc™ A-Series Graphics, and discrete Intel® Data Center GPU Flex Series.
* Fused inference primitives, for example, Scaled Dot Product Attention, Rotary Positional Embedding,
Group Query Attention, Mixture of Experts, etc.
* In-place KV-cache, Dynamic quantization, KV-cache quantization and encapsulation.
* Dynamic beam size configuration, Speculative sampling.


OpenVINO offers two main paths for Generative AI use cases:

* Using OpenVINO as a backend for Hugging Face frameworks (transformers, diffusers) through
the `Optimum Intel <https://huggingface.co/docs/optimum/intel/inference>`__ extension.
* Using OpenVINO native APIs (Python and C++) with custom pipeline code.
* Using OpenVINO native APIs (Python and C++) with `custom pipeline code <https://github.com/openvinotoolkit/openvino.genai>`__.


In both cases, OpenVINO runtime and tools are used, the difference is mostly in the preferred
@@ -144,15 +148,18 @@ also available for CLI interface as the ``--int8`` option.

8-bit weight compression is enabled by default for models larger than 1 billion parameters.

`NNCF <https://github.com/openvinotoolkit/nncf>`__ also provides 4-bit weight compression,
which is supported by OpenVINO. It can be applied to Optimum objects as follows:
`Optimum Intel <https://huggingface.co/docs/optimum/intel/inference>`__ also provides 4-bit weight compression through the ``OVWeightQuantizationConfig`` class, which controls the weight quantization parameters.

.. code-block:: python

from nncf import compress_weights, CompressWeightsMode
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
import nncf

model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=False)
model.model = compress_weights(model.model, mode=CompressWeightsMode.INT4_SYM, group_size=128, ratio=0.8)
model = OVModelForCausalLM.from_pretrained(
model_id,
export=True,
quantization_config=OVWeightQuantizationConfig(bits=4, asym=True, ratio=0.8, dataset="ptb"),
)


The optimized model can be saved as usual with a call to ``save_pretrained()``.
@@ -192,6 +199,42 @@ The model's form matters when an OpenVINO IR model is exported from Optimum-Inte
This is because stateful and stateless models have a different number of inputs and outputs.
Learn more about the `native OpenVINO API <Running-Generative-AI-Models-using-Native-OpenVINO-APIs>`__.

Enabling OpenVINO Runtime Optimizations
+++++++++++++++++++++++++++++++++++++++
The OpenVINO runtime provides a set of optimizations for more efficient LLM inference. These include **Dynamic quantization** of activations of 4/8-bit quantized MatMuls and **KV-cache quantization**.

* **Dynamic quantization** quantizes the activations of MatMul operations that have 4- or 8-bit quantized weights (see :doc:`LLM Weight Compression <weight_compression>`).
  It improves the inference latency and throughput of LLMs, though it may cause a slight deviation in generation accuracy. Quantization is performed
  group-wise with a configurable group size, meaning that the values in a group share quantization parameters. Larger group sizes lead to faster inference but lower accuracy. Recommended group size values are ``32``, ``64``, or ``128``. To enable Dynamic quantization, use the corresponding
  inference property as follows:


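.. code-block:: python

   from optimum.intel import OVModelForCausalLM

   # model_path: directory of an exported OpenVINO model (assumed to be defined earlier)
   model = OVModelForCausalLM.from_pretrained(
       model_path,
       ov_config={"DYNAMIC_QUANTIZATION_GROUP_SIZE": "32", "PERFORMANCE_HINT": "LATENCY"}
   )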



* **KV-cache quantization** lowers the precision of the Key and Value cache in LLMs. This reduces memory consumption during inference, improving latency and throughput. The KV-cache can be quantized to the following precisions:
  ``u8``, ``bf16``, ``f16``. If ``u8`` is used, KV-cache quantization is also applied group-wise and uses the ``DYNAMIC_QUANTIZATION_GROUP_SIZE`` value if it is defined.
  Otherwise, a default group size of ``32`` is used. KV-cache quantization can be enabled as follows:


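.. code-block:: python

   from optimum.intel import OVModelForCausalLM

   # model_path: directory of an exported OpenVINO model (assumed to be defined earlier)
   model = OVModelForCausalLM.from_pretrained(
       model_path,
       ov_config={"KV_CACHE_PRECISION": "u8", "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32", "PERFORMANCE_HINT": "LATENCY"}
   )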


.. note::

   Currently, both Dynamic quantization and KV-cache quantization are available for the CPU device.
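
To make the group-wise scheme concrete, the sketch below quantizes a list of activations to signed 8-bit values with one shared scale per group. It is an illustration in plain Python under simplifying assumptions (symmetric quantization, hypothetical helper names ``quantize_groupwise``, ``dequantize_groupwise`` and ``mean_abs_error``, made-up input values); it does not reproduce the OpenVINO runtime implementation.

```python
def quantize_groupwise(values, group_size):
    """Symmetric int8 quantization with one scale per group (illustrative)."""
    if len(values) % group_size != 0:
        raise ValueError("len(values) must be divisible by group_size")
    quantized, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        # All values in the group share this scale; guard against all-zero groups.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        scales.append(scale)
        quantized.extend(max(-127, min(127, round(v / scale))) for v in group)
    return quantized, scales


def dequantize_groupwise(quantized, scales, group_size):
    """Map int8 values back to floats using the scale of their group."""
    return [q * scales[i // group_size] for i, q in enumerate(quantized)]


def mean_abs_error(values, group_size):
    """Average reconstruction error after a quantize/dequantize round trip."""
    q, s = quantize_groupwise(values, group_size)
    restored = dequantize_groupwise(q, s, group_size)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


# One outlier (100.0) next to small activations.
activations = [0.01, -0.02, 0.03, 0.04, 0.05, -0.06, 0.07, 100.0]
error_small_groups = mean_abs_error(activations, group_size=4)
error_large_group = mean_abs_error(activations, group_size=8)
```

With a group size of 4, the outlier only degrades the scale of its own group, so the mean error is lower than with a single group of 8. This mirrors the accuracy side of the trade-off described above; the speed side comes from having fewer scales to compute and apply.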


Working with Models Tuned with LoRA
++++++++++++++++++++++++++++++++++++
@@ -54,6 +54,10 @@ Now, the model is ready for compilation and inference. It can be also saved into

* ``nncf.SensitivityMetric.MEAN_ACTIVATION_MAGNITUDE`` - requires dataset. The mean magnitude of the layers' inputs multiplied by inverted 8-bit quantization noise.

* ``all_layers`` - boolean parameter that enables INT4 weight quantization of all Fully-Connected and Embedding layers, including the first and last layers in the model.

* ``awq`` - boolean parameter that enables the AWQ method for more accurate INT4 weight quantization. It is especially helpful when the weights of all the layers are quantized to 4 bits. The method can sometimes reduce accuracy when used with Dynamic Quantization of activations. Requires a dataset.


The example below shows data-free 4-bit weight quantization applied on top of OpenVINO IR:

@@ -66,23 +70,24 @@ The example below shows data-free 4-bit weight quantization applied on top of Op
:language: python
:fragment: [compression_4bit]

For data-aware weight compression refer to the following `example <https://github.com/openvinotoolkit/nncf/tree/develop/examples/llm_compression/openvino>`__.
For data-aware weight compression refer to the following `example <https://github.com/openvinotoolkit/nncf/tree/develop/examples/llm_compression/openvino/tiny_llama>`__.

.. note::

OpenVINO also supports 4-bit models from Hugging Face `Transformers <https://github.com/huggingface/transformers>`__ library optimized
with `GPTQ <https://github.com/PanQiWei/AutoGPTQ>`__. In this case, there is no need for an additional model optimization step because model conversion will automatically preserve the INT4 optimization results, allowing model inference to benefit from it.


The table below shows examples of Text Generation models with different optimization settings:
The table below shows examples of text-generation Language Models with different optimization settings in a data-free setup, where no dataset is used at the optimization step.
The Perplexity metric is measured on the `Lambada OpenAI dataset <https://github.com/openai/gpt-2/issues/131#issuecomment-497136199>`__.

.. list-table::
:widths: 40 55 25 25
:header-rows: 1

* - Model
- Optimization
- Perplexity
- Perplexity\*
- Model Size (GB)
* - databricks/dolly-v2-3b
- FP32
@@ -144,13 +149,64 @@ The table below shows examples of Text Generation models with different optimiza
- INT4_SYM,group_size=64,ratio=0.8
- 2.98
- 8.0


The following table shows the accuracy metric in a data-aware 4-bit weight quantization setup, measured on the `Wikitext dataset <https://arxiv.org/pdf/1609.07843.pdf>`__.

.. list-table::
:widths: 40 55 25 25
:header-rows: 1

* - Model
- Optimization
- Word perplexity\*
- Model Size (GB)
* - meta-llama/llama-7b-chat-hf
- FP32
- 11.57
- 12.61
* - meta-llama/llama-7b-chat-hf
- INT4_SYM,group_size=128,ratio=1.0,awq=True
- 12.34
- 2.6
* - stabilityai/stablelm-3b-4e1t
- FP32
- 10.17
- 10.41
* - stabilityai/stablelm-3b-4e1t
- INT4_SYM,group_size=64,ratio=1.0,awq=True
- 10.89
- 2.6
* - HuggingFaceH4/zephyr-7b-beta
- FP32
- 9.82
- 13.99
* - HuggingFaceH4/zephyr-7b-beta
- INT4_SYM,group_size=128,ratio=1.0
- 10.32
- 2.6


\*The Perplexity metric in both tables was measured without the Dynamic Quantization feature enabled in the OpenVINO runtime.
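
For reference, perplexity is the exponential of the average negative log-likelihood that the model assigns to the evaluated tokens; word perplexity normalizes the same total by the number of words instead of tokens. A minimal sketch in plain Python, assuming the per-token natural-log probabilities are already available (the function names and sample numbers are illustrative and are not part of OpenVINO or NNCF; the exact evaluation script behind the tables is not specified here):

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def word_perplexity(token_logprobs, num_words):
    """Same total log-likelihood, normalized by the word count instead."""
    return math.exp(-sum(token_logprobs) / num_words)


# Illustrative values: four tokens that together form three words.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
token_ppl = perplexity(logprobs)
word_ppl = word_perplexity(logprobs, num_words=3)
```

Lower values are better in both cases, so a small increase after compression (for example, from the FP32 row to an INT4 row) indicates a small loss in accuracy.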



Auto-tuning of Weight Compression Parameters
############################################

To find the optimal weight compression parameters for a particular model, refer to the `example <https://github.com/openvinotoolkit/nncf/tree/develop/examples/llm_compression/openvino/tiny_llama_find_hyperparams>`__, where the weight compression parameters are searched over a subset of values. To speed up the search, a custom
validation pipeline called `WhoWhatBench <https://github.com/openvinotoolkit/openvino.genai/tree/master/llm_bench/python/who_what_benchmark>`__ is used.
The pipeline can quickly evaluate how the accuracy of the optimized model changes compared to the baseline.
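
Such a search can be pictured as a small grid search: iterate over candidate parameter values, score each compressed model against the baseline, and keep the first setting that stays within an accuracy budget. The sketch below is a hypothetical outline in plain Python; ``find_params``, ``fake_evaluate``, the candidate values, and the threshold are all assumptions for illustration and do not reproduce the linked example or the WhoWhatBench API.

```python
from itertools import product


def find_params(evaluate_similarity, ratios, group_sizes, min_similarity):
    """Return the first (ratio, group_size) meeting the similarity budget.

    Candidates are tried in a fixed order: higher ratio first,
    then larger group size.
    """
    candidates = product(sorted(ratios, reverse=True),
                         sorted(group_sizes, reverse=True))
    for ratio, group_size in candidates:
        score = evaluate_similarity(ratio=ratio, group_size=group_size)
        if score >= min_similarity:
            return {"ratio": ratio, "group_size": group_size, "score": score}
    return None  # no candidate met the budget


def fake_evaluate(ratio, group_size):
    # Stand-in for compressing the model and scoring it against the baseline.
    # This toy formula only mimics "more compression -> lower similarity".
    return 1.0 - 0.1 * ratio - 0.0005 * group_size


best = find_params(fake_evaluate, ratios=[0.6, 0.8, 1.0],
                   group_sizes=[32, 64, 128], min_similarity=0.9)
```

In a real search, the stand-in scoring function would be replaced by an actual compression step followed by an evaluation of the compressed model against the baseline.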


Additional Resources
####################

- `Data-aware weight compression <https://github.com/openvinotoolkit/nncf/tree/develop/examples/llm_compression/openvino>`__
- `Data-aware Weight Compression Example <https://github.com/openvinotoolkit/nncf/tree/develop/examples/llm_compression/openvino/tiny_llama>`__
- `Tune Weight Compression Parameters Example <https://github.com/openvinotoolkit/nncf/tree/develop/examples/llm_compression/openvino/tiny_llama_find_hyperparams>`__
- `WhoWhatBench <https://github.com/openvinotoolkit/openvino.genai/tree/master/llm_bench/python/who_what_benchmark>`__
- `NNCF GitHub <https://github.com/openvinotoolkit/nncf>`__
- :doc:`Post-training Quantization <ptq_introduction>`
- :doc:`Training-time Optimization <tmo_introduction>`
- `NNCF GitHub <https://github.com/openvinotoolkit/nncf>`__

