Updated Gen.AI and NNCF documentation #22793

Merged
32 commits
a242895
Updated Gen.AI and NNCF documentation
AlexKoff88 Feb 12, 2024
acbeb48
Fixed style
AlexKoff88 Feb 12, 2024
25eaf8b
Added a section about compression parameters tuning
AlexKoff88 Feb 13, 2024
46689f8
Added extra info.
AlexKoff88 Feb 13, 2024
59e7e66
Update docs/articles_en/openvino_workflow/model_optimization_guide/we…
AlexKoff88 Feb 13, 2024
d9bd351
Update docs/articles_en/openvino_workflow/model_optimization_guide/we…
AlexKoff88 Feb 13, 2024
03be22c
Update docs/articles_en/openvino_workflow/model_optimization_guide/we…
AlexKoff88 Feb 13, 2024
7a96dbe
Update docs/articles_en/openvino_workflow/model_optimization_guide/we…
AlexKoff88 Feb 13, 2024
19d0b35
Applied comments
AlexKoff88 Feb 13, 2024
41e2f98
Updates set of supported features in Gen.AI
AlexKoff88 Feb 14, 2024
f206610
Update docs/articles_en/openvino_workflow/model_optimization_guide/we…
AlexKoff88 Feb 14, 2024
2a1e6dc
Update docs/articles_en/openvino_workflow/model_optimization_guide/we…
AlexKoff88 Feb 14, 2024
ca85eb5
Update docs/articles_en/openvino_workflow/model_optimization_guide/we…
AlexKoff88 Feb 14, 2024
d9f301a
Added information about KV-cache and Dynamic quantization.
AlexKoff88 Feb 15, 2024
b373a33
Fixed style
AlexKoff88 Feb 15, 2024
620b0ac
Updated Gen.AI text
AlexKoff88 Feb 16, 2024
ee7158e
Merge remote-tracking branch 'upstream/master' into ak/docs_4bit_data…
AlexKoff88 Feb 19, 2024
472bacc
Updated numbers
AlexKoff88 Feb 19, 2024
e7ba1f0
Fixed issues with ov_config doc
AlexKoff88 Feb 19, 2024
30bf4dc
Fixed gen.ai hints
AlexKoff88 Feb 19, 2024
eab47fb
Updated to the latest optimum options
AlexKoff88 Feb 19, 2024
b0ed56f
Fixed typo
AlexKoff88 Feb 19, 2024
20b8480
Update docs/articles_en/openvino-workflow/generative-ai-models-guide.rst
AlexKoff88 Feb 20, 2024
ecda063
Update docs/articles_en/openvino-workflow/generative-ai-models-guide.rst
AlexKoff88 Feb 20, 2024
0d56557
Update docs/articles_en/openvino-workflow/generative-ai-models-guide.rst
AlexKoff88 Feb 20, 2024
b838c34
Update docs/articles_en/openvino-workflow/generative-ai-models-guide.rst
AlexKoff88 Feb 20, 2024
edcf13c
Update docs/articles_en/openvino-workflow/model-optimization-guide/we…
AlexKoff88 Feb 20, 2024
f1705aa
Update docs/articles_en/openvino-workflow/model-optimization-guide/we…
AlexKoff88 Feb 20, 2024
4cd49fb
Update docs/articles_en/openvino-workflow/model-optimization-guide/we…
AlexKoff88 Feb 20, 2024
da41789
Update docs/articles_en/openvino-workflow/model-optimization-guide/we…
AlexKoff88 Feb 20, 2024
2fb61f8
Update docs/articles_en/openvino-workflow/model-optimization-guide/we…
AlexKoff88 Feb 20, 2024
6989d9e
Update docs/articles_en/openvino-workflow/model-optimization-guide/we…
AlexKoff88 Feb 21, 2024
19 changes: 12 additions & 7 deletions docs/articles_en/openvino_workflow/gen_ai.rst
@@ -11,8 +11,8 @@ comes to generative models, OpenVINO supports:

* Conversion, optimization and inference for text, image and audio generative models, for
example, Llama 2, MPT, OPT, Stable Diffusion, Stable Diffusion XL, etc.
* Int8 weight compression for text generation models.
* Storage format reduction (fp16 precision for non-compressed models and int8 for compressed
* 8-bit and 4-bit weight compression for text generation models.
* Storage format reduction (fp16 precision for non-compressed models and int8/int4 for compressed
models).
* Inference on CPU and GPU platforms, including integrated Intel® Processor Graphics,
discrete Intel® Arc™ A-Series Graphics, and discrete Intel® Data Center GPU Flex Series.
@@ -144,15 +144,20 @@ also available for CLI interface as the ``--int8`` option.

8-bit weight compression is enabled by default for models larger than 1 billion parameters.

`NNCF <https://github.com/openvinotoolkit/nncf>`__ also provides 4-bit weight compression,
which is supported by OpenVINO. It can be applied to Optimum objects as follows:
`Optimum Intel <https://huggingface.co/docs/optimum/intel/inference>`__ also provides 4-bit weight compression through the ``load_in_4bit``
option and the ``OVWeightQuantizationConfig`` class, which controls the weight quantization parameters.

.. code-block:: python

from nncf import compress_weights, CompressWeightsMode
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
import nncf

model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=False)
model.model = compress_weights(model.model, mode=CompressWeightsMode.INT4_SYM, group_size=128, ratio=0.8)
model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    load_in_4bit=True,
    quantization_config=OVWeightQuantizationConfig(mode=nncf.CompressWeightsMode.INT4_ASYM, ratio=0.8, dataset="ptb"),
)


The optimized model can be saved as usual with a call to ``save_pretrained()``.
@@ -54,6 +54,10 @@ Now, the model is ready for compilation and inference. It can be also saved into

* ``nncf.SensitivityMetric.MEAN_ACTIVATION_MAGNITUDE`` - requires dataset. The mean magnitude of the layers' inputs multiplied by inverted 8-bit quantization noise.

* ``all_layers`` - boolean parameter that enables INT4 weight quantization of all Fully-Connected and Embedding layers, including the first and last layers in the model.

* ``awq`` - boolean parameter that enables the AWQ method for more accurate INT4 weight quantization. It is especially helpful when the weights of all layers are quantized to 4 bits. The method is not well suited to Dynamic Quantization of activations. Requires a dataset.
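
To make the role of group-wise INT4 quantization more concrete, here is a minimal pure-Python sketch of symmetric 4-bit quantization with one scale per group. It is an illustration under stated assumptions, not NNCF's implementation: the helper names (``quantize_group_sym``, ``quantize_row``, ``dequantize_row``, ``mean_abs_error``) and the toy weights are hypothetical, and the integer range [-8, 7] with max-abs scaling is just one common way to define a symmetric scheme.

```python
def quantize_group_sym(group):
    """Quantize one group of floats to signed 4-bit integers in [-8, 7].

    Returns the integer codes and the single scale shared by the group.
    """
    max_abs = max(abs(w) for w in group)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in group]
    return codes, scale


def quantize_row(weights, group_size):
    """Split a row of weights into consecutive groups and quantize each one."""
    groups = []
    for start in range(0, len(weights), group_size):
        groups.append(quantize_group_sym(weights[start:start + group_size]))
    return groups


def dequantize_row(groups):
    """Reconstruct approximate float weights from (codes, scale) pairs."""
    return [code * scale for codes, scale in groups for code in codes]


def mean_abs_error(weights, group_size):
    """Average reconstruction error for a given group size."""
    restored = dequantize_row(quantize_row(weights, group_size))
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)


# A toy row (hypothetical values) whose second half is much larger than the first.
row = [0.12, -0.2, 0.15, 0.05, 4.0, -3.5, 2.2, 1.0]
error_one_group = mean_abs_error(row, group_size=8)
error_two_groups = mean_abs_error(row, group_size=4)
print(f"group_size=8: {error_one_group:.4f}, group_size=4: {error_two_groups:.4f}")
```

In this toy case the smaller group size gives a lower reconstruction error, because the small weights no longer share a scale with the large ones. The price is more scales to store, which is the trade-off behind the ``group_size`` setting.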


The example below shows data-free 4-bit weight quantization applied on top of OpenVINO IR:

@@ -66,23 +70,24 @@ The example below shows data-free 4-bit weight quantization applied on top of Op
:language: python
:fragment: [compression_4bit]

For data-aware weight compression refer to the following `example <https://github.com/openvinotoolkit/nncf/tree/develop/examples/llm_compression/openvino>`__.
For data-aware weight compression, refer to the following `example <https://github.com/openvinotoolkit/nncf/tree/develop/examples/llm_compression/openvino/tiny_llama>`__.

.. note::

OpenVINO also supports 4-bit models from Hugging Face `Transformers <https://github.com/huggingface/transformers>`__ library optimized
with `GPTQ <https://github.com/PanQiWei/AutoGPTQ>`__. In this case, there is no need for an additional model optimization step because model conversion will automatically preserve the INT4 optimization results, allowing model inference to benefit from it.


The table below shows examples of Text Generation models with different optimization settings:
The table below shows examples of text-generation language models with different optimization settings in a data-free setup, where no dataset is used at the optimization step.
The Perplexity metric is measured on the `Lambada OpenAI dataset <https://github.com/openai/gpt-2/issues/131#issuecomment-497136199>`__.

.. list-table::
:widths: 40 55 25 25
:header-rows: 1

* - Model
- Optimization
- Perplexity
- Perplexity\*
- Model Size (GB)
* - databricks/dolly-v2-3b
- FP32
@@ -144,13 +149,66 @@ The table below shows examples of Text Generation models with different optimiza
- INT4_SYM,group_size=64,ratio=0.8
- 2.98
- 8.0
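
As a rough aid for reading the size column, the sketch below estimates the storage of a mixed INT4/INT8 compressed model from its parameter count. It is a back-of-the-envelope calculation, not how the table values were produced. It assumes that ``ratio`` is the share of parameters stored in 4 bits with the remainder stored in 8 bits, and that each INT4 group carries one 16-bit scale. The function name ``estimate_size_gb`` and the 7-billion-parameter input are hypothetical, and zero points, INT8 scales and non-weight tensors are ignored.

```python
def estimate_size_gb(num_params, ratio, group_size, scale_bits=16):
    """Rough size estimate (in GB) for mixed INT4/INT8 weight compression.

    Assumes `ratio` of the parameters are stored in 4 bits with one scale per
    group of `group_size` weights, and the rest in 8 bits. INT8 scales,
    zero points and non-weight tensors are ignored.
    """
    int4_params = num_params * ratio
    int8_params = num_params - int4_params
    int4_bits = int4_params * (4 + scale_bits / group_size)
    int8_bits = int8_params * 8
    return (int4_bits + int8_bits) / 8 / 1e9


# Hypothetical 7-billion-parameter model, two group sizes at the same ratio.
size_g128 = estimate_size_gb(7e9, ratio=0.8, group_size=128)
size_g64 = estimate_size_gb(7e9, ratio=0.8, group_size=64)
print(f"group_size=128: {size_g128:.2f} GB, group_size=64: {size_g64:.2f} GB")
```

Under these assumptions, halving the group size adds only a small storage overhead, while raising ``ratio`` reduces size more noticeably.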


The following table shows the accuracy metric in a data-aware 4-bit weight quantization setup, measured on the `Wikitext dataset <https://arxiv.org/pdf/1609.07843.pdf>`__.

.. list-table::
:widths: 40 55 25 25
:header-rows: 1

* - Model
- Optimization
- Perplexity\*
- Model Size (GB)
* - meta-llama/llama-7b-chat-hf
- FP32
- 11.57
- 10.3
* - meta-llama/llama-7b-chat-hf
- INT4_SYM,group_size=128,ratio=1.0,awq=True
- 5.07
- 2.6
* - stabilityai_stablelm-3b-4e1t
- FP32
- 10.17
- 10.3
* - stabilityai_stablelm-3b-4e1t
- INT4_SYM,group_size=64,ratio=1.0,awq=True
- 5.07
- 2.6
* - HuggingFaceH4/zephyr-7b-beta
- FP32
- 9.82
- 10.3
* - HuggingFaceH4/zephyr-7b-beta
- INT4_SYM,group_size=128,ratio=0.8
- 5.07
- 2.6


\*The Perplexity metric in both tables was measured without the Dynamic Quantization feature enabled in the OpenVINO runtime.
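
For readers less familiar with the metric, perplexity is the exponential of the average negative log-likelihood per token, so lower values are better. The sketch below shows only this definition. It does not reproduce the evaluation pipeline behind the tables, and the ``perplexity`` helper and its input log-probabilities are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-mean(log p(token))) over natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("at least one token log-probability is required")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical case: the model assigns probability 0.25 to each of four
# reference tokens, as if choosing uniformly among four options.
uniform_over_four = [math.log(0.25)] * 4
print(perplexity(uniform_over_four))
```

A model that is always certain of the correct token has a perplexity of 1. One that spreads probability uniformly over four options has a perplexity of about 4.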



Auto-tuning of Weight Compression Parameters
############################################

An important question is how to find the weight compression parameters that are best suited to a particular model.
We provide an `example <https://github.com/openvinotoolkit/nncf/tree/develop/examples/llm_compression/openvino/tiny_llama_find_hyperparams>`__
that searches for these parameters over a subset of candidate values. To speed up the search, we use a self-designed
validation pipeline called `WhoWhatBench <https://github.com/openvinotoolkit/openvino.genai/tree/master/llm_bench/python/who_what_benchmark>`__,
which quickly evaluates how the accuracy of the optimized model changes compared to the baseline.
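
The general idea of searching over a subset of values can be sketched in a few lines of plain Python. This is not the code of the linked example and it calls neither NNCF nor WhoWhatBench. ``find_compression_params`` and ``toy_evaluate`` are hypothetical names, the scores are made up for illustration, and the selection policy (prefer a higher ``ratio``, then a larger ``group_size``) is only one possible choice.

```python
from itertools import product


def find_compression_params(evaluate, ratios, group_sizes, min_score):
    """Return the most aggressive (ratio, group_size) whose score is acceptable.

    `evaluate(ratio, group_size)` is expected to return a similarity score
    relative to the uncompressed baseline. Returns None if nothing passes.
    """
    passing = [
        (ratio, group_size)
        for ratio, group_size in product(ratios, group_sizes)
        if evaluate(ratio, group_size) >= min_score
    ]
    # Prefer a higher INT4 ratio, then a larger group size (fewer scales to store).
    return max(passing, default=None)


def toy_evaluate(ratio, group_size):
    """Hypothetical stand-in for a real accuracy check; not a measured result."""
    return 1.0 - 0.1 * ratio - 0.0005 * group_size


best = find_compression_params(
    toy_evaluate,
    ratios=[0.6, 0.8, 1.0],
    group_sizes=[32, 64, 128],
    min_score=0.9,
)
print(best)
```

In a real search, ``toy_evaluate`` would be replaced by a function that compresses the model with the given parameters and scores it against the baseline. That step dominates the cost, which is why a fast validation pipeline matters.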


Additional Resources
####################

- `Data-aware weight compression <https://github.com/openvinotoolkit/nncf/tree/develop/examples/llm_compression/openvino>`__
- `Data-aware Weight Compression Example <https://github.com/openvinotoolkit/nncf/tree/develop/examples/llm_compression/openvino/tiny_llama>`__
- `Tune Weight Compression Parameters Example <https://github.com/openvinotoolkit/nncf/tree/develop/examples/llm_compression/openvino/tiny_llama_find_hyperparams>`__
- `WhoWhatBench <https://github.com/openvinotoolkit/openvino.genai/tree/master/llm_bench/python/who_what_benchmark>`__
- `NNCF GitHub <https://github.com/openvinotoolkit/nncf>`__
- :doc:`Post-training Quantization <ptq_introduction>`
- :doc:`Training-time Optimization <tmo_introduction>`
- `NNCF GitHub <https://github.com/openvinotoolkit/nncf>`__

