[DOCS] Latency highlight for OV devices + update of Optimize Inference for 2024 (#23604)

Port from #23575

Jira: 133389

* Added an indication on Latency being the default use for OV devices
* Streamlined the Optimize Inference article for better clarity.
msmykx-intel authored Mar 21, 2024
1 parent 03551b8 commit 1756c93
Showing 2 changed files with 39 additions and 40 deletions.
@@ -23,44 +23,43 @@ Optimize Inference
optimizations that can be done independently. Inference
speed depends on latency and throughput.


Runtime optimization, or deployment optimization, focuses on tuning inference parameters and execution means (e.g., the optimum number of requests executed simultaneously). Unlike model-level optimizations, runtime optimizations are highly specific to the hardware and the case they are used for, and often come at a cost.
``ov::hint::inference_precision`` is a "typical runtime configuration" which trades accuracy for performance, allowing ``fp16/bf16`` execution for the layers that remain in ``fp32`` after quantization of the original ``fp32`` model.
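
As a rough illustration, a minimal C++ sketch of setting this hint at compile time (the model path and device name are placeholders, not part of the original article):

.. code-block:: cpp

   #include <openvino/openvino.hpp>

   int main() {
       ov::Core core;
       auto model = core.read_model("model.xml");  // placeholder path

       // By default, a device may run the remaining fp32 layers in fp16/bf16.
       // Requesting fp32 inference precision disables that accuracy/performance
       // trade-off at the cost of speed.
       auto compiled = core.compile_model(model, "CPU",
           ov::hint::inference_precision(ov::element::f32));
       return 0;
   }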

Therefore, optimization should start with defining the use case. For example, if the goal is processing millions of samples in overnight data-center jobs, throughput can be prioritized over latency. On the other hand, real-time applications would likely trade off throughput to deliver results at minimal latency. A combined scenario is also possible: targeting the highest possible throughput while maintaining a specific latency threshold.

It is also important to understand how the full-stack application uses the inference component "end-to-end," for example, which stages need to be orchestrated to save the workload devoted to fetching and preparing input data.

For more information on this topic, see the following articles:

* :doc:`Supported Devices <../../about-openvino/compatibility-and-support/supported-devices>`
* :doc:`Inference Devices and Modes <inference-devices-and-modes>`
* :ref:`Inputs Pre-processing with OpenVINO <inputs_pre_processing>`
* :ref:`Async API <async_api>`
* :ref:`The 'get_tensor' Idiom <tensor_idiom>`
* For variably-sized inputs, consider :doc:`dynamic shapes <dynamic-shapes>`


See the :doc:`latency <optimize-inference/optimizing-latency>` and :doc:`throughput <optimize-inference/optimizing-throughput>` optimization guides for **use-case-specific optimizations**.

Writing Performance-Portable Inference Applications
###################################################

Although inference performed in OpenVINO Runtime can be configured with a multitude of low-level performance settings, it is not recommended in most cases. Firstly, achieving the best performance with such adjustments requires deep understanding of device architecture and the inference engine.


Secondly, such optimization may not translate well to other device-model combinations. In other words, one set of execution parameters is likely to result in different performance when used under different conditions. For example:

* Both the CPU and GPU support the notion of :doc:`streams <./optimize-inference/optimizing-throughput/advanced_throughput_options>`, yet they deduce their optimal number very differently (see the sketch after this list).
* Even among devices of the same type, different execution configurations can be considered optimal, as in the case of instruction sets or the number of cores for the CPU and the batch size for the GPU.
* Different models have different optimal parameter configurations, considering factors such as compute vs memory-bandwidth, inference precision, and possible model quantization.
* Execution "scheduling" impacts performance strongly and is highly device-specific. For example, GPU-oriented optimizations like batching (combining multiple inputs to achieve the optimal throughput) :doc:`do not always map well to the CPU <optimize-inference/optimizing-low-level-implementation>`.
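
To make the portability problem concrete, here is a minimal sketch of one such low-level setting, an explicit stream count (placeholder model path; the stream values are arbitrary, illustrative numbers):

.. code-block:: cpp

   #include <openvino/openvino.hpp>

   int main() {
       ov::Core core;
       auto model = core.read_model("model.xml");  // placeholder path

       // Pinning the number of streams is a low-level, device-specific choice:
       // a value that works well on one CPU may be suboptimal on another CPU
       // or on a GPU, which is what makes such settings hard to port.
       auto on_cpu = core.compile_model(model, "CPU", ov::num_streams(4));
       auto on_gpu = core.compile_model(model, "GPU", ov::num_streams(2));
       return 0;
   }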


To make the configuration process much easier and its performance optimization more portable, the option of :doc:`Performance Hints <optimize-inference/high-level-performance-hints>` has been introduced. It comprises two high-level "presets" focused on either **latency** or **throughput** and, essentially, makes execution specifics irrelevant.

The Performance Hints functionality makes configuration transparent to the application: for example, it anticipates the need for explicit (application-side) batching or streams, and facilitates parallel processing of separate infer requests for different input sources.

Runtime, or deployment, optimization focuses on tuning inference and execution parameters. Unlike
model-level optimization, it is highly specific to the hardware you use and the goal you want
to achieve. You need to plan whether to prioritize accuracy or performance,
:doc:`throughput <optimize-inference/optimizing-throughput>` or :doc:`latency <optimize-inference/optimizing-latency>`,
or aim at the golden mean. You should also predict how scalable your application needs to be
and how exactly it is going to work with the inference component. This way, you will be able
to achieve the best results for your product.

.. note::

For more information on this topic, see the following articles:

* :doc:`Inference Devices and Modes <inference-devices-and-modes>`
* :ref:`Inputs Pre-processing with OpenVINO <inputs_pre_processing>`
* :ref:`Async API <async_api>`
* :ref:`The 'get_tensor' Idiom <tensor_idiom>`
* For variably-sized inputs, consider :doc:`dynamic shapes <dynamic-shapes>`

Performance-Portable Inference
################################

To make configuration easier and performance optimization more portable, OpenVINO offers the
:doc:`Performance Hints <optimize-inference/high-level-performance-hints>` feature. It comprises
two high-level “presets” focused on latency **(default)** or throughput.
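
A compile-time hint is all that is needed to switch between the two presets. A minimal sketch, assuming a placeholder model path and the CPU device:

.. code-block:: cpp

   #include <openvino/openvino.hpp>

   int main() {
       ov::Core core;
       auto model = core.read_model("model.xml");  // placeholder path

       // LATENCY is the default preset; it is set explicitly here only for illustration.
       auto low_latency = core.compile_model(model, "CPU",
           ov::hint::performance_mode(ov::hint::PerformanceMode::LATENCY));

       // The same model and device, left to the runtime to tune for throughput instead.
       auto high_throughput = core.compile_model(model, "CPU",
           ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
       return 0;
   }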

Although inference with OpenVINO Runtime can be configured with a multitude
of low-level performance settings, it is not recommended, as:

* It requires deep understanding of device architecture and the inference engine.
* It may not translate well to other device-model combinations. For example:

* CPU and GPU deduce their optimal number of streams differently.
* Different devices of the same type favor different execution configurations.
* Different models favor different parameter configurations (e.g., compute vs memory-bandwidth,
inference precision, and possible model quantization).
* Execution “scheduling” impacts performance strongly and is highly device-specific. GPU-oriented
optimizations :doc:`do not always map well to the CPU <optimize-inference/optimizing-low-level-implementation>`.

Additional Resources
####################
@@ -21,9 +21,9 @@ The hints, in contrast, respect the actual model, so the parameters for optimal
Performance Hints: Latency and Throughput
#########################################

As discussed in the :doc:`Optimization Guide <../optimize-inference>` there are a few different metrics associated with inference speed. Throughput and latency are some of the most widely used metrics that measure the overall performance of an application.
As discussed in the :doc:`Optimization Guide <../optimize-inference>` there are a few different metrics associated with inference speed. Latency and throughput are some of the most widely used metrics that measure the overall performance of an application.

Therefore, in order to ease the configuration of the device, OpenVINO offers two dedicated hints, namely ``ov::hint::PerformanceMode::THROUGHPUT`` and ``ov::hint::PerformanceMode::LATENCY``.
Therefore, in order to ease the configuration of the device, OpenVINO offers two dedicated hints, namely ``ov::hint::PerformanceMode::LATENCY`` **(default)** and ``ov::hint::PerformanceMode::THROUGHPUT``.

For more information on conducting performance measurements with the ``benchmark_app``, refer to the last section in this document.
