From 1886fc6d31e6c8b614f9eb8281cc10d8d9cadb68 Mon Sep 17 00:00:00 2001 From: Karol Blaszczak Date: Thu, 8 Feb 2024 18:56:53 +0100 Subject: [PATCH] [DOCS] review for 22356 and 22410 (#22466) relates to: https://github.com/openvinotoolkit/openvino/pull/22410 https://github.com/openvinotoolkit/openvino/pull/22356 --------- Co-authored-by: Tatiana Savina --- .../infrastructure/Assign_6.rst | 21 +- .../infrastructure/ReadValue_6.rst | 25 +- docs/articles_en/openvino_workflow/gen_ai.rst | 100 ++++-- .../stateful_models_intro.rst | 209 +++++++------ .../ways_to_get_stateful_model.rst | 288 ++++++++++-------- 5 files changed, 370 insertions(+), 273 deletions(-) diff --git a/docs/articles_en/documentation/openvino_ir/operation_sets/operations_specifications/infrastructure/Assign_6.rst b/docs/articles_en/documentation/openvino_ir/operation_sets/operations_specifications/infrastructure/Assign_6.rst index e6e06f02bf0d1e..c580e1dc03297d 100644 --- a/docs/articles_en/documentation/openvino_ir/operation_sets/operations_specifications/infrastructure/Assign_6.rst +++ b/docs/articles_en/documentation/openvino_ir/operation_sets/operations_specifications/infrastructure/Assign_6.rst @@ -5,7 +5,7 @@ Assign .. meta:: - :description: Learn about Assign-6 - an infrastructure operation, which + :description: Learn about Assign-6 - an infrastructure operation, which can be performed on a single input tensor to set a value to variable_id. **Versioned name**: *Assign-6* @@ -16,15 +16,18 @@ Assign **Detailed description**: -ReadValue, Assign and Variable define a coherent mechanism for reading, writing and storing a memory buffer between inference calls. -More details can be found on :doc:`StateAPI` documentation page. +ReadValue, Assign, and Variable define a coherent mechanism for reading, writing, and +storing a memory buffer between inference calls. More details can be found on the +:doc:`StateAPI` documentation page. -*Assign* operation sets an input value to the ``variable_id`` variable. This value will be read by *ReadValue* operation on next inference call if variable was not reset. -The operation checks that the shape and type specified in ``variable_id`` variable extend (relax) -the shape and the type inferred from the 1st input and returns an error otherwise, e.g. if the type in the variable is specified -as dynamic, it means that any type for 1st input is allowed but if it is specified as f32, only f32 type is allowed. +*Assign* sets an input value to the ``variable_id`` variable. This value will be read +by the *ReadValue* operation on the next inference call if the variable has not been reset. +The operation checks if the shape and type specified in ``variable_id`` extend (relax) +the shape and type inferred from the 1st input. If not, it returns an error. For example, +if the type in the variable is specified as dynamic, it means that any type for the 1st +input is allowed, but if it is specified as f32, only the f32 type is allowed. -It is expected only one pair of ReadValue, Assign operations for each Variable in the model. +Only one pair of ReadValue and Assign operations is expected for each Variable in the model. **Attributes**: @@ -47,7 +50,7 @@ It is expected only one pair of ReadValue, Assign operations for each Variable i .. 
code-block:: xml :force: - + diff --git a/docs/articles_en/documentation/openvino_ir/operation_sets/operations_specifications/infrastructure/ReadValue_6.rst b/docs/articles_en/documentation/openvino_ir/operation_sets/operations_specifications/infrastructure/ReadValue_6.rst index 71933bdb516b32..dda2f1430414d1 100644 --- a/docs/articles_en/documentation/openvino_ir/operation_sets/operations_specifications/infrastructure/ReadValue_6.rst +++ b/docs/articles_en/documentation/openvino_ir/operation_sets/operations_specifications/infrastructure/ReadValue_6.rst @@ -5,7 +5,7 @@ ReadValue .. meta:: - :description: Learn about ReadValue-6 - an infrastructure operation, which + :description: Learn about ReadValue-6 - an infrastructure operation, which can be performed on a single input tensor or without input tensors to return the value of variable_id. **Versioned name**: *ReadValue-6* @@ -17,22 +17,25 @@ ReadValue **Detailed description**: -ReadValue, Assign and Variable define a coherent mechanism for reading, writing and storing some memory buffer between inference calls. -More details can be found on :doc:`StateAPI` documentation page. +*ReadValue*, *Assign*, and *Variable* define a coherent mechanism for reading, writing, +and storing some memory buffer between inference calls. More details can be found on the +:doc:`StateAPI` documentation page. -If 1st input is provided and this is the first inference or reset was called, +If the 1st input is provided and this is the first inference or reset has been called, *ReadValue* returns the value from the 1st input. -If 1st input is not provided and this is the first inference or reset was called, +If the 1st input is not provided and this is the first inference or reset has been called, *ReadValue* returns the tensor with the ``variable_shape`` and ``variable_type`` and zero values. -In all other cases *ReadValue* returns value from the corresponding ``variable_id`` Variable. +In all other cases, *ReadValue* returns the value from the corresponding ``variable_id`` variable. -If the 1st input was provided, the operation checks that ``variable_shape`` and ``variable_type`` extend (relax) -the shape and the type inferred from the 1st input and returns an error otherwise, e.g. if ``variable_type`` is specified -as dynamic, it means that any type for 1st input is allowed but if it is specified as f32, only f32 type is allowed. +If the 1st input has been provided, the operation checks if ``variable_shape`` and ``variable_type`` +extend (relax) the shape and type inferred from the 1st input. If not, it returns an error. +For example, if ``variable_type`` is specified as dynamic, it means that any type for the 1st input +is allowed, but if it is specified as f32, only the f32 type is allowed. + +Only one pair of ReadValue and Assign operations is expected for each Variable in the model. -It is expected only one pair of ReadValue, Assign operations for each Variable in the model. **Attributes**: @@ -90,4 +93,4 @@ It is expected only one pair of ReadValue, Assign operations for each Variable i - + diff --git a/docs/articles_en/openvino_workflow/gen_ai.rst b/docs/articles_en/openvino_workflow/gen_ai.rst index 38a19b6e181e52..ec32202bd247f5 100644 --- a/docs/articles_en/openvino_workflow/gen_ai.rst +++ b/docs/articles_en/openvino_workflow/gen_ai.rst @@ -4,25 +4,41 @@ Optimize and Deploy Generative AI Models ======================================== -Generative AI is an innovative technique that creates new data, such as text, images, video, or audio, using neural networks. 
OpenVINO accelerates Generative AI use cases as they mostly rely on model inference, allowing for faster development and better performance. When it comes to generative models, OpenVINO supports: +Generative AI is an innovative technique that creates new data, such as text, images, video, +or audio, using neural networks. OpenVINO accelerates Generative AI use cases as they mostly +rely on model inference, allowing for faster development and better performance. When it +comes to generative models, OpenVINO supports: -* Conversion, optimization and inference for text, image and audio generative models, for example, Llama 2, MPT, OPT, Stable Diffusion, Stable Diffusion XL, etc. +* Conversion, optimization, and inference for text, image, and audio generative models, for + example, Llama 2, MPT, OPT, Stable Diffusion, Stable Diffusion XL, etc. * Int8 weight compression for text generation models. -* Storage format reduction (fp16 precision for non-compressed models and int8 for compressed models). +* Storage format reduction (fp16 precision for non-compressed models and int8 for compressed + models). -* Inference on CPU and GPU platforms, including integrated Intel® Processor Graphics, discrete Intel® Arc™ A-Series Graphics, and discrete Intel® Data Center GPU Flex Series. +* Inference on CPU and GPU platforms, including integrated Intel® Processor Graphics, + discrete Intel® Arc™ A-Series Graphics, and discrete Intel® Data Center GPU Flex Series. OpenVINO offers two main paths for Generative AI use cases: -* Using OpenVINO as a backend for Hugging Face frameworks (transformers, diffusers) through the `Optimum Intel `__ extension. +* Using OpenVINO as a backend for Hugging Face frameworks (transformers, diffusers) through + the `Optimum Intel `__ extension. * Using OpenVINO native APIs (Python and C++) with custom pipeline code. -In both cases, OpenVINO runtime and tools are used, the difference is mostly in the preferred API and the final solution's footprint. Native APIs enable the use of generative models in C++ applications, ensure minimal runtime dependencies, and minimize application footprint. The Native APIs approach requires the implementation of glue code (generation loop, text tokenization, or scheduler functions), which is hidden within Hugging Face libraries for a better developer experience. +In both cases, the OpenVINO runtime and tools are used; the difference is mostly in the preferred +API and the final solution's footprint. Native APIs enable the use of generative models in +C++ applications, ensure minimal runtime dependencies, and minimize application footprint. +The Native APIs approach requires the implementation of glue code (generation loop, text +tokenization, or scheduler functions), which is hidden within Hugging Face libraries for a +better developer experience. -It is recommended to start with Hugging Face frameworks. Experiment with different models and scenarios to find your fit, and then consider converting to OpenVINO native APIs based on your specific requirements. +It is recommended to start with Hugging Face frameworks. Experiment with different models and +scenarios to find your fit, and then consider converting to OpenVINO native APIs based on your +specific requirements. -Optimum Intel provides interfaces that enable model optimization (weight compression) using `Neural Network Compression Framework (NNCF) `__, and export models to the OpenVINO model format for use in native API applications. 
+Optimum Intel provides interfaces that enable model optimization (weight compression) using +`Neural Network Compression Framework (NNCF) `__, +and export of models to the OpenVINO model format for use in native API applications. The table below summarizes the differences between Hugging Face and Native APIs approaches. @@ -85,14 +101,16 @@ To start using OpenVINO as a backend for Hugging Face, change the original Huggi +model = OVModelForCausalLM.from_pretrained(model_id, export=True) -After that, you can call ``save_pretrained()`` method to save model to the folder in the OpenVINO Intermediate Representation and use it further. +After that, you can call the ``save_pretrained()`` method to save the model to a folder in the OpenVINO +Intermediate Representation format and use it further. .. code-block:: python model.save_pretrained(model_dir) -Alternatively, you can download and convert the model using CLI interface: ``optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf llama_openvino``. +Alternatively, you can download and convert the model using the CLI interface: +``optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf llama_openvino``. In this case, you can load the converted model in OpenVINO representation directly from the disk: .. code-block:: python model = OVModelForCausalLM.from_pretrained(model_id) -By default, inference will run on CPU. To select a different inference device, for example, GPU, add ``device="GPU"`` to the ``from_pretrained()`` call. To switch to a different device after the model has been loaded, use the ``.to()`` method. The device naming convention is the same as in OpenVINO native API: +By default, inference will run on CPU. To select a different inference device, for example, GPU, +add ``device="GPU"`` to the ``from_pretrained()`` call. To switch to a different device after +the model has been loaded, use the ``.to()`` method. The device naming convention is the same +as in the OpenVINO native API: .. code-block:: python model.to("GPU") -Optimum-Intel API also provides out-of-the-box model optimization through weight compression using NNCF which substantially reduces the model footprint and inference latency: +The Optimum-Intel API also provides out-of-the-box model optimization through weight compression +using NNCF, which substantially reduces the model footprint and inference latency: .. code-block:: python model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=True) -Weight compression is applied by default to models larger than one billion parameters and is also available for CLI interface as the ``--int8`` option. +Weight compression is applied by default to models larger than one billion parameters and is +also available for the CLI interface as the ``--int8`` option. .. note:: 8-bit weight compression is enabled by default for models larger than 1 billion parameters. -`NNCF `__ also provides 4-bit weight compression, which is supported by OpenVINO. It can be applied to Optimum objects as follows: +`NNCF `__ also provides 4-bit weight compression, +which is supported by OpenVINO. It can be applied to Optimum objects as follows: .. 
code-block:: python @@ -131,21 +155,16 @@ Weight compression is applied by default to models larger than one billion param model.model = compress_weights(model.model, mode=CompressWeightsMode.INT4_SYM, group_size=128, ratio=0.8) -The optimized model can be saved as usual with a call to ``save_pretrained()``. For more details on compression options, refer to the :doc:`weight compression guide `. +The optimized model can be saved as usual with a call to ``save_pretrained()``. +For more details on compression options, refer to the :doc:`weight compression guide `. .. note:: - OpenVINO also supports 4-bit models from Hugging Face `Transformers `__ library optimized - with `GPTQ `__. In this case, there is no need for an additional model optimization step because model conversion will automatically preserve the INT4 optimization results, allowing model inference to benefit from it. - -Another optimization that is applied by default when using ``OVModelForCausalLM`` class is transformation of the model to a stateful form. -This transformation further improves inference performance and decreases amount of allocated runtime memory in long running text generation scenarios. -It is achieved by hiding inputs and outputs of the model that represent past KV-cache tensors and handling them inside the model in a more efficient way. -This feature is activated automatically for a wide range of supported text generation models, keeping not supported models in a regular, stateless form. - -Model usage are identical for stateful and stateless models as long as Optimum-Intel API is used because KV-cache handling is an internal detail of the text-generation API of Transformers library. -But a form of a model matterns in case when exported from Optimum-Intel OpenVINO model IR is used in an application implemented with native OpenVINO API, because stateful and stateless models have different number of inputs and outputs. -Please refer to a dedicated section of this document below for more information about using native OpenVINO API. + OpenVINO also supports 4-bit models from Hugging Face `Transformers `__ + library optimized with `GPTQ `__. In this case, + there is no need for an additional model optimization step because model conversion + will automatically preserve the INT4 optimization results, allowing model inference + to benefit from it. Below are some examples of using Optimum-Intel for model conversion and inference: @@ -154,6 +173,20 @@ Below are some examples of using Optimum-Intel for model conversion and inferenc * `Instruction following using Databricks Dolly 2.0 and OpenVINO `__ * `Create an LLM-powered Chatbot using OpenVINO `__ +Stateful Model Optimization ++++++++++++++++++++++++++++ + +When you use the ``OVModelForCausalLM`` class, the model is transformed into a stateful form by default for optimization. +This transformation improves inference performance and decreases runtime memory usage in long-running text generation tasks. +It is achieved by hiding the model's inputs and outputs that represent past KV-cache tensors, and handling them inside the model in a more efficient way. +This feature is activated automatically for many supported text generation models, while unsupported models remain in a regular, stateless form. + +Model usage remains the same for stateful and stateless models with the Optimum-Intel API, as the KV-cache is handled internally by the text-generation API of the Transformers library. 
+The model's form matters when an OpenVINO IR model is exported from Optimum-Intel and used in an application with the native OpenVINO API. +This is because stateful and stateless models have a different number of inputs and outputs. +Learn more about the `native OpenVINO API `__. + + Working with Models Tuned with LoRA ++++++++++++++++++++++++++++++++++++ @@ -175,19 +208,20 @@ Now the model can be converted to OpenVINO using Optimum Intel Python API or CLI Running Generative AI Models using Native OpenVINO APIs ######################################################## -To run Generative AI models using native OpenVINO APIs you need to follow regular **Сonvert -> Optimize -> Deploy** path with a few simplifications. +To run Generative AI models using native OpenVINO APIs, you need to follow the regular **Convert -> Optimize -> Deploy** path with a few simplifications. + +The recommended way for converting a Hugging Face model is to use the Optimum-Intel export feature. This feature enables model export in the OpenVINO format without directly invoking the conversion API and tools, as demonstrated above. +The conversion process is significantly simplified as Optimum-Intel provides the necessary conversion parameters. These parameters are often model-specific and require knowledge of various model input properties. -To convert the Hugging Face model, the recommended way is to use Optimum-Intel export feature that allows to export model in OpenVINO format without invoking conversion API and tools directly, as it is shown above. -In this case, the conversion process is significantly simplified because Optimum-Intel provides necessary conversion parameters which in many cases model-specific and require knowlege of a lot of model input properties. -Moreover, Optimum-Intel applies several model optimization like weight compression and using stateful form by default that further similifies model exporting flow. -You can still use a regular conversion path if model comes from outside of Hugging Face ecosystem, i.e., in source framework format (PyTorch, TensorFlow etc.) +Moreover, Optimum-Intel applies several model optimizations, such as weight compression and the use of a stateful form by default, that further simplify the model exporting flow. +You can still use the regular conversion path if the model comes from outside the Hugging Face ecosystem, such as in its source framework format (PyTorch, TensorFlow, etc.). Model optimization can be performed within Hugging Face or directly using NNCF as described in the :doc:`weight compression guide `. Inference code that uses native API cannot benefit from Hugging Face pipelines. You need to write your custom code or take it from the available examples. Below are some examples of popular Generative AI scenarios: -* In case of LLMs for text generation, you need to handle tokenization, inference and token sampling, and de-tokenization. 
If token sampling involves beam search, you need to implement it as well. This is covered in detail by `C++ Text Generation Samples `__. +* For image generation models, you need to build a pipeline that includes several model inferences: inference for the source (for example, text) encoder models, an inference loop for the diffusion process, and inference for the decoding part. Scheduler code is also required. `C++ Implementation of Stable Diffusion `__ is a good reference point. Additional Resources diff --git a/docs/articles_en/openvino_workflow/running_inference_with_openvino/stateful_models_intro.rst b/docs/articles_en/openvino_workflow/running_inference_with_openvino/stateful_models_intro.rst index 06cbf78981b02c..afa7b12f213c99 100644 --- a/docs/articles_en/openvino_workflow/running_inference_with_openvino/stateful_models_intro.rst +++ b/docs/articles_en/openvino_workflow/running_inference_with_openvino/stateful_models_intro.rst @@ -4,100 +4,133 @@ Stateful models and State API ============================== .. toctree:: - :maxdepth: 1 - :hidden: + :maxdepth: 1 + :hidden: - openvino_docs_OV_UG_ways_to_get_stateful_model + openvino_docs_OV_UG_ways_to_get_stateful_model -What is Stateful Model? -####################### +A "stateful model" is a model that implicitly preserves data between two consecutive inference +calls. The tensors saved from one run are kept in an internal memory buffer called a +"state" or a "variable" and may be passed to the next run, while never being exposed as model +output. In contrast, for a "stateless" model to pass data between runs, all produced data is +returned as output and needs to be handled by the application itself for reuse at the next +execution. -Stateful model is a model which implicitly keeps data from one inference call to the next inference call. Data is kept in internal runtime memory space usually called State or Variable. -In contrast to usual **stateless** model, which return all produced data as model outputs, **stateful** -model preserve part of the tensors saved in States without exposing them as model outputs. - -The purpose of stateful models is to natively address a sequence processing tasks, like in text generation when one model inference produce a single output token, -and it is required to perform multiple inference calls to generate a complete output sentence. -Hidden state data from previous inference should be passed to the next inference as a context. -Usually the contextual data is not required to be accessed in the user application and should be just passed through to the next inference call manually using model API. -Stateful models simplifies programming of this scenario and unlocks additional performance potential of OpenVINO runtime. - -.. _ov_ug_stateful_model_benefits: - -OpenVINO Stateful Model Benefits -################################# - -1. Speed up execution of Model - Data in State is stored in the optimized form for OpenVINO plugins, which helps to execute model effectively. - **Note:** Often requesting data from State might reduce the expected performance gains and even lead to losses, - so, it is expected that State mechanism will be used when data stored in State is not accessed frequently. - -2. Simplify user code - Typical scenarios as providing initializing values for the first inference call or copying data from model's outputs to inputs in user code - can be replaced with State. OpenVINO will manage these cases internally. .. 
image:: _static/images/stateful_model_example.svg + :alt: example comparison between stateless and stateful model implementations + :align: center + :scale: 90 % + +What is more, when a model includes TensorIterator or Loop operations, converting it to a stateful form +makes it possible to retrieve intermediate values from each execution iteration (thanks to the +LowLatency transformation). Otherwise, the whole set of their executions needs to finish +before the data becomes available. + +Text generation is a good usage example of stateful models, as it requires multiple inference +calls to output a complete sentence, each run producing a single output token. Information +from one run is passed to the next inference as a context, which may be handled by a stateful +model natively. Potential benefits for this, as well as other scenarios, may be: + +1. **model execution speedup** - data in states is stored in the optimized form for OpenVINO + plugins, which helps to execute the model more efficiently. Importantly, *requesting data + from the state too often may reduce the expected performance gains* or even lead to + losses. Use the state mechanism only if the state data is not accessed very frequently. + +2. **user code simplification** - states can replace code-based solutions for such scenarios + as giving initializing values for the first inference call or copying data from model + outputs to inputs. With states, OpenVINO will manage these cases internally, additionally + removing the potential overhead due to data representation conversion. + +3. **data processing** - some use cases require processing of data sequences. + When such a sequence is of known length and short enough, you can process it with RNN-like + models that contain a cycle inside. When the length is not known, as in the case of online + speech recognition or time series forecasting, you can divide the data into small portions and + process it step-by-step, which requires addressing the dependency between data portions. + States fulfill this purpose well: models save some data between inference runs and, when one + dependent sequence is over, the state may be reset to the initial value and a new sequence + can be started. -3. Specific scenarios - Several use cases require processing of data sequences. When length of a sequence is known and small enough, - we can process it with RNN like models that contain a cycle inside. But in some cases, like online speech recognition or time series - forecasting, length of data sequence is unknown. Then data can be divided in small portions and processed step-by-step. But dependency - between data portions should be addressed. For that, models save some data between inferences - state. When one dependent sequence is over, - state should be reset to initial value and new sequence can be started. OpenVINO Stateful Model Representation ###################################### -OpenVINO contains ReadValue/Assign operations to make a model Stateful. -Each pair of ReadValue/Assign operates with State, known also as Variable, -which is an internal memory buffer to store tensor data during and between model inference calls. -ReadValue reads data tensor from State and returns it as output, Assign accepts data tensor as input and writes data to State -to save data for the next inference call. +To make a model stateful, OpenVINO replaces looped pairs of ``Parameter`` and ``Result`` with two +operations of its own: -OpenVINO has a special API to simplify work with Stateful models. 
State is automatically saved between inferences, -and there is a way to reset state when needed. You can also read state or set it to some new value between inferences. +* ``ReadValue`` (:doc:`see specs `) + reads the data from the state and returns it as output. +* ``Assign`` (:doc:`see specs `) + accepts the data as input and saves it in the state for the next inference call. -.. image:: _static/images/stateful_model_example.svg - :align: center - -The left side of the picture shows the usual inputs and outputs to the model: Parameter/Result operations. -There is no direct connection from Result to Parameter and in order to copy data from output to input users need to put extra effort writing and maintaining additional code. -In addition, this may impose additional overhead due to data representation conversion. +Each pair of these operations works with **state**, which is automatically saved between +inference runs and can be reset when needed. This way, the burden of copying data is shifted +from the application code to OpenVINO and all related internal work is hidden from the user. -Having operations such as ReadValue and Assign allows users to replace the looped Parameter/Result pairs of operations and shift the work of copying data to OpenVINO. After the replacement, the OpenVINO model no longer contains inputs and outputs with such names, all internal work on data copying is hidden from the user, but data from the intermediate inference can always be retrieved using State API methods. +There are three methods of turning an OpenVINO model into a stateful one: + +* :doc:`Optimum-Intel` - the most user-friendly option. All necessary optimizations + are recognized and applied automatically. The drawback is that the tool does not work with all + models. -.. image:: _static/images/stateful_model_init_subgraph.svg - :align: center +* :ref:`MakeStateful transformation.` - enables the user to choose which + pairs of Parameter and Result to replace, as long as the paired operations are of the same + shape and element type. -In some cases, users need to set an initial value for State, or it may be necessary to reset the value of State at a certain inference to the initial value. For such situations, an initializing subgraph for the ReadValue operation and a special "reset" method are provided. +* :ref:`LowLatency2 transformation.` - automatically detects and replaces + Parameter and Result pairs connected to hidden and cell state inputs of LSTM/RNN/GRU operations + or Loop/TensorIterator operations. -You can find more details on these operations in :doc:`ReadValue ` and -:doc:`Assign ` specification. -How to get OpenVINO Model with States -######################################### -* :doc:`Optimum-Intel` - This is the most user-friendly way to get :ref:`the Benefits<_ov_ug_stateful_model_benefits>` - from using Stateful models in OpenVINO. - All necessary optimizations will be applied automatically inside Optimum-Intel tool. .. _ov_ug_stateful_model_inference: -* :ref:`Apply MakeStateful transformation.` - If after conversion from original model to OpenVINO representation, the resulting model contains Parameter and Result operations, - which pairwise have the same shape and element type, the MakeStateful transformation can be applied to get model with states. +Running Inference of Stateful Models +##################################### + +For the most basic applications, stateful models work out of the box. 
For additional control, +OpenVINO offers a dedicated API, whose methods enable you to both retrieve and change data +saved in states between inference runs. OpenVINO runtime uses ``ov::InferRequest::query_state`` +to get the list of states from a model and the ``ov::VariableState`` class to operate with +states. + +| **ov::InferRequest methods:** +| ``std::vector<VariableState> query_state();`` - gets all available states for the given + inference request +| ``void reset_state()`` - resets all states to their default values +| +| **ov::VariableState methods:** +| ``std::string get_name() const`` - returns the name (variable_id) of the corresponding + state (variable) +| ``void reset()`` - resets the state to the default value +| ``void set_state(const Tensor& state)`` - sets a new value for the state +| ``Tensor get_state() const`` - returns the current value of the state + + +| **Using multiple threads** +| Note that if multiple independent sequences are involved, several threads may be used to + process each sequence in its own infer request. However, using several infer requests + for one sequence is not recommended, as the state would not be passed automatically. Instead, + each run performed in a different infer request than the previous one would require the state + to be set "manually", using the ``ov::VariableState::set_state`` method. -* :ref:`Apply LowLatency2 transformation.` - If a model contains a loop that runs over some sequence of input data, - the LowLatency2 transformation can be applied to get model with states. .. image:: _static/images/stateful_model_init_subgraph.svg + :alt: diagram of how initial state value is set or reset + :align: center + :scale: 80 % -.. _ov_ug_stateful_model_inference: +| **Resetting states** +| Whenever it is necessary to set the initial value of a state or reset it, an initializing +| subgraph for the ReadValue operation and a special ``reset`` method are provided. +| Note that querying for states and retrieving their data right after a reset results in +| undefined values, so it should be avoided. -Stateful Model Inference -######################## +Stateful Model Application Example ################################### -The example below demonstrates inference of three independent sequences of data. State should be reset between these sequences. +Here is a code example demonstrating inference of three independent sequences of data. +One infer request and one thread are used. The state should be reset between consecutive +sequences. -One infer request and one thread will be used in this example. Using several threads is possible if you have several independent sequences. Then each sequence can be processed in its own infer request. Inference of one sequence in several infer requests is not recommended. In one infer request state will be saved automatically between inferences, but -if the first step is done in one infer request and the second in another, state should be set in new infer request manually (using `ov::VariableState::set_state` method). .. tab:: C++ .. doxygensnippet:: docs/snippets/ov_stateful_models_intro.cpp :language: cpp :fragment: [ov:state_api_usage] -You can find more powerful examples demonstrating how to work with models with states in speech sample and demo. -Descriptions can be found in :doc:`Samples Overview` .. 
_ov_ug_state_api: - -OpenVINO State API -################## - -OpenVINO runtime has the `ov::InferRequest::query_state` method to get the list of states from a model and `ov::VariableState` class to operate with states. -Below you can find brief description of methods and the example of how to use this interface. - -**`ov::InferRequest` methods:** - -* `std::vector query_state();` - allows to get all available stats for the given inference request. - -* `void reset_state()` - allows to reset all States to their default values. - -**`ov::VariableState` methods:** - -* `std::string get_name() const` - returns name(variable_id) of the according State(Variable) - -* `void reset()` - reset state to the default value -* `void set_state(const Tensor& state)` - set new value for State +You can find more examples demonstrating how to work with states in other articles: -* `Tensor get_state() const` - returns current value of State +* `LLM Chatbot notebook `__ +* :doc:`Speech Recognition sample ` +* :doc:`Serving Stateful Models with OpenVINO Model Server ` diff --git a/docs/articles_en/openvino_workflow/running_inference_with_openvino/stateful_models_intro/ways_to_get_stateful_model.rst b/docs/articles_en/openvino_workflow/running_inference_with_openvino/stateful_models_intro/ways_to_get_stateful_model.rst index 37078c85664208..243c64956591b0 100644 --- a/docs/articles_en/openvino_workflow/running_inference_with_openvino/stateful_models_intro/ways_to_get_stateful_model.rst +++ b/docs/articles_en/openvino_workflow/running_inference_with_openvino/stateful_models_intro/ways_to_get_stateful_model.rst @@ -1,201 +1,251 @@ .. {#openvino_docs_OV_UG_ways_to_get_stateful_model} -Ways to get stateful models in OpenVINO -======================================== +Obtaining a Stateful OpenVINO Model +==================================== + +If the original framework does not offer a dedicated API for working with states, the +resulting OpenVINO IR model will not be stateful by default. This means it will not contain +either a state or the :doc:`Assign ` and +:doc:`ReadValue ` operations. You can still +make such models stateful (:doc:`see benefits `), +and you have three ways to do it: + +* `Optimum-Intel `__ - an automated solution + applicable to a selection of models (not covered by this article; for a usage guide, + refer to the :doc:`Optimize and Deploy Generative AI Models ` article). +* :ref:`MakeStateful transformation ` - to choose which pairs of + Parameter and Result to replace. +* :ref:`LowLatency2 transformation ` - to detect and replace Parameter + and Result pairs connected to hidden and cell state inputs of LSTM/RNN/GRU operations + or Loop/TensorIterator operations. -State related Transformations -################################# - -If the original framework does not have a special API for working with States, after importing the model, OpenVINO representation will not contain State and -:doc:`Assign `/:doc:`ReadValue ` operations correspondingly, so OpenVINO Model is not Stateful by default in these cases. -This article describes the ways how to get Stateful model via OpenVINO. - -For example, if the original ONNX model contains RNN operations, OpenVINO IR will contain TensorIterator/Loop operations and the values will be obtained only after execution of the whole TensorIterator primitive. -Intermediate values from each iteration will not be available. 
MakeStateful and LowLatency2 transformations can be used to make such models stateful and work with these intermediate values from each iteration and receive them with a low latency after each infer request. .. _ov_ug_make_stateful: -MakeStateful -############ +MakeStateful Transformation +########################### -MakeStateful transformation changes the structure of the model by adding the ability to work with the state, -replacing provided by user Parameter/Results with Assign/ReadValue operations as it is shown at the picture below. +The MakeStateful transformation changes the structure of the model by replacing the +user-defined pairs of Parameter and Result operations with the Assign and ReadValue operations: .. image:: _static/images/make_stateful_simple.svg + :alt: diagram of MakeStateful Transformation + :scale: 90 % :align: center -State naming rule: in most cases, a name of a state is a concatenation of Parameter/Result tensor names. -If there are no tensor names, :doc:`friendly names` are used. +**Only strict syntax is supported**. As shown in the example below, the transformation call +must be enclosed in double quotes "MakeStateful[...]", and tensor names in single quotes +without spaces, as in 'tensor_name_1'. + +**State naming rule**: in most cases, the name of a state is a concatenation of the +Parameter/Result tensor names. If there are no tensor names, +:doc:`friendly names ` are used. -Examples: -Detailed illustration for all examples below: +**Examples:** .. image:: _static/images/make_stateful_detailed.png + :alt: detailed diagram of MakeStateful Transformation :align: center -1. C++ API -Using tensor names: +.. tab-set:: .. tab:: C++ + .. tab-item:: C++ + .. tab-set:: - .. doxygensnippet:: docs/snippets/ov_stateful_models_intro.cpp - :language: cpp - :fragment: [ov:make_stateful_tensor_names] + .. tab-item:: Using tensor names -Using Parameter/Result operations: + .. doxygensnippet:: docs/snippets/ov_stateful_models_intro.cpp + :language: cpp + :fragment: [ov:make_stateful_tensor_names] .. tab:: C++ + .. tab-item:: Using Parameter/Result operations - .. doxygensnippet:: docs/snippets/ov_stateful_models_intro.cpp - :language: cpp - :fragment: [ov:make_stateful_ov_nodes] + .. doxygensnippet:: docs/snippets/ov_stateful_models_intro.cpp + :language: cpp + :fragment: [ov:make_stateful_ov_nodes] -2. ModelOptimizer command line + .. tab-item:: command line -Using tensor names: + .. tab-set:: ``` --input_model --transform "MakeStateful[param_res_names={'tensor_name_1':'tensor_name_4','tensor_name_3':'tensor_name_6'}]" ``` + .. tab-item:: Using tensor names **Note:** Only strict syntax is supported, as in the example above, the transformation call must be in double quotes "MakeStateful[...]", the tensor names in single quotes 'tensor_name_1' and without spaces. + .. code-block:: sh + + --input_model --transform "MakeStateful[param_res_names={'tensor_name_1':'tensor_name_4','tensor_name_3':'tensor_name_6'}]" .. _ov_ug_low_latency: -LowLatencу2 -########### -LowLatency2 transformation changes the structure of the model containing :doc:`TensorIterator ` -and :doc:`Loop ` by adding the ability to work with the state, inserting the Assign/ReadValue -layers as it is shown in the picture below. -Example of applying LowLatency2 transformation: + +.. 
_ov_ug_low_latency: + +LowLatency2 Transformation +########################## + +The LowLatency2 transformation changes the structure of a model containing +:doc:`TensorIterator ` +and :doc:`Loop ` by automatically detecting +and replacing pairs of Parameter and Result operations with the Assign and ReadValue operations, +as illustrated by the following example: .. image:: _static/images/applying_low_latency_2.svg + :alt: diagram of LowLatency Transformation :align: center -After applying the transformation, ReadValue operations can receive other operations as an input, as shown in the picture above. -These inputs should set the initial value for initialization of ReadValue operations. -However, such initialization is not supported in the current State API implementation. -Input values are ignored and the initial values for the ReadValue operations are set to zeros unless otherwise specified -by the user via :ref:`State API`. +After applying the transformation, ReadValue operations can receive other operations as +input, as shown in the picture above. These inputs should set the initial value for the +initialization of ReadValue operations. However, such initialization is not supported in +the current State API implementation. Input values are ignored, and the initial values +for the ReadValue operations are set to zeros unless the user specifies otherwise via +:ref:`State API `. -**Steps to apply LowLatency2 Transformation** +Applying LowLatency2 Transformation +++++++++++++++++++++++++++++++++++++ -1. Get :doc:`ov::Model`, for example: +1. Get :doc:`ov::Model `, for example: .. tab-set:: .. tab:: C++ + .. tab-item:: C++ - .. doxygensnippet:: docs/snippets/ov_stateful_models_intro.cpp - :language: cpp - :fragment: [ov:get_ov_model] + .. doxygensnippet:: docs/snippets/ov_stateful_models_intro.cpp + :language: cpp + :fragment: [ov:get_ov_model] -2. Change the number of iterations inside TensorIterator/Loop nodes in the model using the :doc:`Reshape ` feature. +2. Change the number of iterations inside TensorIterator/Loop nodes in the model using the + :doc:`Reshape ` feature. -For example, the *sequence_lengths* dimension of input of the model > 1, it means the TensorIterator layer has number_of_iterations > 1. -You can reshape the inputs of the model to set *sequence_dimension* to exactly 1. + For example, if the *sequence_lengths* dimension of the model input is greater than 1, the + TensorIterator layer has number_of_iterations > 1. You can reshape the model + inputs to set the *sequence_dimension* to exactly 1. .. tab:: C++ .. tab-set:: - .. doxygensnippet:: docs/snippets/ov_stateful_models_intro.cpp - :language: cpp - :fragment: [ov:reshape_ov_model] + .. tab-item:: C++ + .. doxygensnippet:: docs/snippets/ov_stateful_models_intro.cpp + :language: cpp + :fragment: [ov:reshape_ov_model] -**Unrolling**: If the LowLatency2 transformation is applied to a model containing TensorIterator/Loop nodes with exactly one iteration inside, these nodes are unrolled; otherwise, the nodes remain as they are. Please see [the picture](#example-of-applying-lowlatency2-transformation) for more details. + **Unrolling**: If the LowLatency2 transformation is applied to a model containing + TensorIterator/Loop nodes with exactly one iteration inside, these nodes are unrolled. + Otherwise, the nodes remain as they are. See the picture above for more details. 3. Apply LowLatency2 transformation. .. tab-set:: .. tab-item:: C++ .. 
doxygensnippet:: docs/snippets/ov_stateful_models_intro.cpp - :language: cpp - :fragment: [ov:apply_low_latency_2] +3. Apply LowLatency2 transformation. -(Optional) Use Const Initializer argument: + .. tab-set:: -By default, the LowLatency2 transformation inserts a constant subgraph of the same shape as the previous input node, and with zero values as the initializing value for ReadValue nodes, please see the picture below. We can disable insertion of this subgraph by passing the `false` value for the `use_const_initializer` argument. + .. tab-item:: C++ -.. tab:: C++ + .. doxygensnippet:: docs/snippets/ov_stateful_models_intro.cpp + :language: cpp + :fragment: [ov:apply_low_latency_2] - .. doxygensnippet:: docs/snippets/ov_stateful_models_intro.cpp - :language: cpp - :fragment: [ov:low_latency_2_use_parameters] + (Optional) Use Const Initializer argument: -.. image:: _static/images/llt2_use_const_initializer.svg - :align: center + By default, the LowLatency2 transformation inserts a constant subgraph of the same shape + as the previous input node. The initializing value for ReadValue nodes is set to zero. + For more information, see the picture below. You can disable the insertion of this subgraph + by setting the ``use_const_initializer`` argument to ``false``. -**State naming rule:** a name of a state is a concatenation of names: original TensorIterator operation, Parameter of the body, and additional suffix "variable_" + id (0-base indexing, new indexing for each TensorIterator). You can use these rules to predict what the name of the inserted State will be after the transformation is applied. For example: + .. tab-set:: -.. tab:: C++ + .. tab-item:: C++ - .. doxygensnippet:: docs/snippets/ov_stateful_models_intro.cpp - :language: cpp - :fragment: [ov:low_latency_2] + .. doxygensnippet:: docs/snippets/ov_stateful_models_intro.cpp + :language: cpp + :fragment: [ov:low_latency_2_use_parameters] -4. Use state API. See sections :ref:`OpenVINO State API `, :ref:`Stateful Model Inference`. + .. image:: _static/images/llt2_use_const_initializer.svg + :alt: diagram of constant subgraph initialization + :align: center -**Known Limitations** + **State naming rule:** the name of a state is a concatenation of several names: the original + TensorIterator operation, the parameter of the body, and an additional suffix "variable_" + id + (zero-based indexing, new indexing for each TensorIterator). You can use these rules to predict + the name of the inserted state after applying the transformation. For example: -Unable to execute :doc:`Reshape ` to change the number iterations of TensorIterator/Loop layers to apply the transformation correctly due to hardcoded values of shapes somewhere in the model. + .. tab-set:: -The only way you can change the number iterations of TensorIterator/Loop layer is to use the Reshape feature, but models can be non-reshapable, -the most common reason is that the value of shapes is hardcoded in a constant somewhere in the model. + .. tab-item:: C++ + .. doxygensnippet:: docs/snippets/ov_stateful_models_intro.cpp + :language: cpp + :fragment: [ov:low_latency_2] -.. image:: _static/images/low_latency_limitation_2.svg - :scale: 70 % - :align: center -**Solution:** -Trim non-reshapable layers via :doc:`ModelOptimizer commandline ` arguments: - `--input`, `--output`. +4. Use state API. See sections :ref:`OpenVINO State API `, + :ref:`Stateful Model Inference `. 
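+   As a reference for this step, below is a minimal C++ sketch (it is not taken from the
+   snippet files; the model path, device name, and sequence/step counts are placeholder
+   assumptions). It runs three independent sequences in a single infer request, resetting
+   the states between them:
+
+   .. code-block:: cpp
+
+      #include <openvino/openvino.hpp>
+
+      int main() {
+          ov::Core core;
+          // "model.xml" and "CPU" are placeholders for a stateful model and a device
+          ov::CompiledModel compiled = core.compile_model("model.xml", "CPU");
+          ov::InferRequest request = compiled.create_infer_request();
+
+          for (size_t sequence = 0; sequence < 3; ++sequence) {
+              // states persist between infer() calls within one request,
+              // so reset them before starting a new independent sequence
+              request.reset_state();
+              for (size_t step = 0; step < 10; ++step) {
+                  // set the input tensors for this step here
+                  request.infer();
+              }
+              // states may also be inspected or overwritten explicitly
+              for (ov::VariableState& state : request.query_state()) {
+                  ov::Tensor value = state.get_state();
+                  // ... read or modify the tensor, then: state.set_state(value);
+              }
+          }
+          return 0;
+      }
+
+   **Known limitation:** the transformation cannot be applied correctly if the number of
+   iterations of a TensorIterator/Loop layer cannot be changed, as illustrated below: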
-For example, the parameter and the problematic constant in the picture above can be trimmed using the following command line option: -`--input Reshape_layer_name`. The problematic constant can be also replaced using OpenVINO, as shown in the example below. + .. image:: _static/images/low_latency_limitation_2.svg + :alt: diagram showing low latency limitation + :scale: 70 % + :align: center - .. tab:: C++ + The only way to change the number iterations of TensorIterator/Loop layer is to use the + :doc:`Reshape ` feature. However, some models may be + non-reshapable, typically because the value of shapes is hardcoded in a constant + somewhere in the model. - .. doxygensnippet:: docs/snippets/ov_stateful_models_intro.cpp - :language: cpp - :fragment: [ov:replace_const] + In such a case, trim non-reshapable layers via + :doc:`Model Optimizer command-line ` + arguments: ``--input`` and ``--output``. + For example, the parameter and the problematic constant in the picture above can be + trimmed using the ``--input Reshape_layer_name`` command-line option. The problematic + constant can be also replaced using OpenVINO, as shown in the following example: -How to get TensorIterator/Loop operations from different frameworks via ModelOptimizer. -####################################################################################### + .. tab-set:: -**ONNX and frameworks supported via ONNX format:** *LSTM, RNN, GRU* original layers are converted to the GRU/RNN/LSTM Sequence operations. -*ONNX Loop* layer is converted to the OpenVINO Loop operation. + .. tab-item:: C++ -**TensorFlow:** *BlockLSTM* is converted to TensorIterator operation, TensorIterator body contains LSTM Cell operation, Peepholes, InputForget modifications are not supported. -*While* layer is converted to TensorIterator, TensorIterator body can contain any supported operations, but dynamic cases, when count of iterations cannot be calculated in shape inference (ModelOptimizer conversion) time, are not supported. + .. doxygensnippet:: docs/snippets/ov_stateful_models_intro.cpp + :language: cpp + :fragment: [ov:replace_const] -**TensorFlow2:** *While* layer is converted to Loop operation. Loop body can contain any supported operations. -How to create a model with state using OpenVINO -############################################### -To get a model with states ready for inference, you can convert a model from another framework to IR with Model Optimizer -or create an OpenVINO Model (details can be found in :doc:`Build OpenVINO Model section`. -Let's build the following model using C++ OpenVINO API: +Obtaining TensorIterator/Loop Operations using Model Optimizer +############################################################### -.. image:: _static/images/stateful_model_example.svg - :align: center +**ONNX and frameworks supported via ONNX format:** *LSTM, RNN, GRU* original layers are +converted to the GRU/RNN/LSTM Sequence operations. *ONNX Loop* layer is converted to the +OpenVINO Loop operation. + +**TensorFlow:** *BlockLSTM* is converted to a TensorIterator operation. TensorIterator +body contains LSTM Cell operation. Modifications such as Peepholes and InputForget are +not supported. The *While* layer is converted to a TensorIterator. TensorIterator body +can contain any supported operations. However, dynamic cases where the count of iterations +cannot be calculated during shape inference (Model Optimizer conversion) are not supported. + +**TensorFlow2:** *While* layer is converted to a Loop operation. 
The Loop body can contain +any supported operations. + + + +Creating a Model via OpenVINO API +################################## + +The main approach to obtaining stateful OpenVINO IR models is converting from other +frameworks. Nonetheless, it is possible to create a model from scratch. Check how to +do so in the :doc:`Build OpenVINO Model section `. + +Here is also an example of how ``ov::SinkVector`` is used to create ``ov::Model``. For a +model with states, in addition to inputs and outputs, ``Assign`` nodes should also point to the ``Model`` +to prevent them from being deleted during graph transformations. You can do it with the constructor, as in +the example, or with the ``add_sinks(const SinkVector& sinks)`` method. Also, you can delete +a sink from ``ov::Model`` after deleting the node from the graph with the ``delete_sink()`` method. .. tab-set:: -.. tab:: C++ + .. tab-item:: C++ .. doxygensnippet:: docs/snippets/ov_stateful_models_intro.cpp :language: cpp :fragment: [ov:state_network] -In this example, `ov::SinkVector` is used to create `ov::Model`. For model with states, except inputs and outputs, `Assign` nodes should also point to `Model` -to avoid deleting it during graph transformations. You can do it with the constructor, as shown in the example, or with the special method `add_sinks(const SinkVector& sinks)`. Also, you can delete -sink from `ov::Model` after deleting the node from graph with the `delete_sink()` method.
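+For reference, here is also a minimal sketch of such a model built from scratch (the
+variable name, shapes, and the running-sum logic are illustrative assumptions, not the
+snippet referenced above):
+
+.. code-block:: cpp
+
+   #include <openvino/openvino.hpp>
+   #include <openvino/op/ops.hpp>
+
+   std::shared_ptr<ov::Model> make_stateful_sum() {
+       // regular model input; here it also initializes the state on the first inference
+       auto input = std::make_shared<ov::op::v0::Parameter>(ov::element::f32, ov::Shape{1, 1});
+
+       // the variable shared by the ReadValue/Assign pair
+       auto variable = std::make_shared<ov::op::util::Variable>(
+           ov::op::util::VariableInfo{ov::PartialShape{1, 1}, ov::element::f32, "accumulator"});
+
+       // read the previously stored state and add the current input to it
+       auto read = std::make_shared<ov::op::v6::ReadValue>(input, variable);
+       auto sum = std::make_shared<ov::op::v1::Add>(read, input);
+
+       // write the running sum back to the variable for the next inference call
+       auto assign = std::make_shared<ov::op::v6::Assign>(sum, variable);
+       auto result = std::make_shared<ov::op::v0::Result>(sum);
+
+       // the Assign node is registered as a sink so graph transformations do not remove it
+       return std::make_shared<ov::Model>(ov::ResultVector{result},
+                                          ov::SinkVector{assign},
+                                          ov::ParameterVector{input},
+                                          "stateful_sum");
+   }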