diff --git a/docs/articles_en/about-openvino/compatibility-and-support/supported-devices.rst b/docs/articles_en/about-openvino/compatibility-and-support/supported-devices.rst index 3bb46116ee1748..e38bcb64d90530 100644 --- a/docs/articles_en/about-openvino/compatibility-and-support/supported-devices.rst +++ b/docs/articles_en/about-openvino/compatibility-and-support/supported-devices.rst @@ -90,16 +90,3 @@ topic (step 3 "Configure input and output"). | \* **Of the Linux systems, versions 22.04 and 24.04 include drivers for NPU.** | **For Windows, CPU inference on ARM64 is not supported.** - -.. note:: - - With the OpenVINO 2024.0 release, support for GNA has been discontinued. To keep using it - in your solutions, revert to the 2023.3 (LTS) version. - - With the OpenVINO™ 2023.0 release, support has been cancelled for: - - - Intel® Neural Compute Stick 2 powered by the Intel® Movidius™ Myriad™ X - - Intel® Vision Accelerator Design with Intel® Movidius™ - - To keep using the MYRIAD and HDDL plugins with your hardware, - revert to the OpenVINO 2022.3 (LTS) version. diff --git a/docs/articles_en/about-openvino/release-notes-openvino.rst b/docs/articles_en/about-openvino/release-notes-openvino.rst index bf475159380dff..7cd373d7c464da 100644 --- a/docs/articles_en/about-openvino/release-notes-openvino.rst +++ b/docs/articles_en/about-openvino/release-notes-openvino.rst @@ -106,1556 +106,22 @@ Previous 2024 releases * More GenAI coverage and framework integrations to minimize code changes. - * New models supported: Llama 3.2 (1B & 3B), Gemma 2 (2B & 9B), and YOLO11. - * LLM support on NPU: Llama 3 8B, Llama 2 7B, Mistral-v0.2-7B, Qwen2-7B-Instruct and Phi-3 - Mini-Instruct. - * Noteworthy notebooks added: Sam2, Llama3.2, Llama3.2 - Vision, Wav2Lip, Whisper, and Llava. - * Preview: support for Flax, a high-performance Python neural network library based on JAX. - Its modular design allows for easy customization and accelerated inference on GPUs. 
- - * Broader Large Language Model (LLM) support and more model compression techniques. - - * Optimizations for built-in GPUs on Intel® Core™ Ultra Processors (Series 1) and Intel® Arc™ - Graphics include KV Cache compression for memory reduction along with improved usability, - and model load time optimizations to improve first token latency for LLMs. - * Dynamic quantization was enabled to improve first token latency for LLMs on built-in - Intel® GPUs without impacting accuracy on Intel® Core™ Ultra Processors (Series 1). Second - token latency will also improve for large batch inference. - * A new method to generate synthetic text data is implemented in the Neural Network - Compression Framework (NNCF). This will allow LLMs to be compressed more accurately using - data-aware methods without datasets. Coming soon: This feature will soon be accessible via - Optimum Intel on Hugging Face. - - * More portability and performance to run AI at the edge, in the cloud, or locally. - - * Support for - `Intel® Xeon® 6 Processors with P-cores `__ - (formerly codenamed Granite Rapids) and - `Intel® Core™ Ultra 200V series processors `__ - (formerly codenamed Arrow Lake-S). - * Preview: GenAI API enables multimodal AI deployment with support for multimodal pipelines - for improved contextual awareness, transcription pipelines for easy audio-to-text - conversions, and image generation pipelines for streamlined text-to-visual conversions. - * Speculative decoding feature added to the GenAI API for improved performance and efficient - text generation using a small draft model that is periodically corrected by the full-size - model. - * Preview: LoRA adapters are now supported in the GenAI API for developers to quickly and - efficiently customize image and text generation models for specialized tasks. 
- * The GenAI API now also supports LLMs on NPU allowing developers to specify NPU as the - target device, specifically for WhisperPipeline (for whisper-base, whisper-medium, and - whisper-small) and LLMPipeline (for Llama 3 8B, Llama 2 7B, Mistral-v0.2-7B, - Qwen2-7B-Instruct and Phi-3 Mini-instruct). Use driver version 32.0.100.3104 or later for - best performance. - - *Now deprecated* - * Python 3.8 is no longer supported: - **OpenVINO™ Runtime** - - *Common* - * Numpy 2.x has been adopted for all currently supported components, including NNCF. - * A new constant constructor has been added, enabling constants to be created from data pointer - as shared memory. Additionally, it can take ownership of a shared, or other, object, avoiding - a two-step process to wrap memory into ``ov::Tensor``. - * Asynchronous file reading with mmap library has been implemented, reducing loading times for - model files, especially for LLMs. - * CPU implementation of SliceScatter operator is now available, used for models such as Gemma, - supporting increased LLM performance. - *CPU Device Plugin* - * Gold support of the Intel® Xeon® 6 platform with P-cores (formerly code name Granite Rapids) - has been reached. - * Support of Intel® Core™ Ultra 200V series processors (formerly codenamed Arrow Lake-S) has - been implemented. - * LLM performance has been further improved with Rotary Position Embedding optimization; Query, - Key, and Value; and multi-layer perceptron fusion optimization. - * FP16 support has been extended with SDPA and PagedAttention, improving performance of LLM via - both native APIs and the vLLM integration. - * Models with LoRA adapters are now supported. - - - *GPU Device Plugin* - - * The KV cache INT8 compression mechanism is now available for all supported GPUs. It enables a - significant reduction in memory consumption, increasing performance with a minimal impact to - accuracy (it affects systolic devices slightly more than non-systolic ones). 
The feature is - activated by default for non-systolic devices. - * LoRA adapters are now functionally supported on GPU. - * A new feature of GPU weightless blob caching enables caching model structure only and reusing - the weights from the original model file. Use the new OPTIMIZE_SIZE property to activate. - * Dynamic quantization with INT4 and INT8 precisions has been implemented and enabled by - default on Intel® Core™ Ultra platforms, improving LLM first token latency. - - - *NPU Device Plugin* - - * Models retrieved from the OpenVINO cache have a smaller memory footprint now. The plugin - releases the cached model (blob) after weights are loaded in NPU regions. Model export is not - available in this scenario. Memory consumption is reduced during inference execution with one - blob size. This optimization requires the latest NPU driver: 32.0.100.3104. - * A driver bug for ``ov::intel_npu::device_total_mem_size`` has been fixed. The plugin will now - report 2GB as the maximum allocatable memory for any driver that does not support graph - extension 1.8. Even if older drivers report a larger amount of memory to be available, memory - allocation would fail when 2GB are exceeded. Plugin reports the number that driver exposes - for any driver that supports graph extension 1.8 (or newer). - * A new API is used to initialize the model (available in graph extension 1.8). - * Inference request set_tensors is now supported. - * ``ov::device::LUID`` is now exposed on Windows. - * LLM-related improvements have been implemented in terms of both memory usage and performance. - * AvgPool and MaxPool operator support has been extended, adding support for more PyTorch models. - - * NOTE: for systems based on Intel® Core™ Ultra Processors Series 2, more than 16GB of RAM may - be required to use larger models, such as Llama-2-7B, Mistral-0.2-7B, and Qwen-2-7B - (exceeding 4B parameters) with prompt sizes over 1024 tokens. 
- - - *OpenVINO Python API* - - * Constant can now be created from openvino.Tensor. - * The “release_memory” method has been added for a compiled model, improving control over - memory consumption. - - - - *OpenVINO Node.js API* - - * Querying the best device to perform inference of a model with specific operations - is now available in the JavaScript API. - * Contribution guidelines have been improved to make it easier for developers to contribute. - * Testing scope has been extended by inference in end-to-end tests. - * JavaScript API samples have been improved for readability and ease of running. - - - - *TensorFlow Framework Support* - - * TensorFlow 2.18.0, Keras 3.6.0, NumPy 2.0.2 in Python 3.12, and NumPy 1.26.4 in other Python - versions have been added to validation. - * Out-of-the-box conversion with static ranks has been improved by devising a new shape for - Switch-Merge condition sub-graphs. - * Complex type for the following operations is now supported: ExpandDims, Pack, Prod, Rsqrt, - ScatterNd, Sub. - * The following issues have been fixed: - - * the corner case with one element in LinSpace to avoid division by zero, - * support FP16 and FP64 input types for LeakyRelu, - * support non-i32/i64 output index type for ArgMin/Max operations. - - - - *PyTorch Framework Support* - - * PyTorch version 2.5 is now supported. - * OpenVINO Model Converter (OVC) now supports TorchScript and ExportedProgram saved on a drive. - * The issue of aten.index.Tensor conversion for indices with “None” values has been fixed, - helping to support the HF Stable Diffusion model in ExportedProgram format. - - - - *ONNX Framework Support* - - * ONNX version 1.17.0 is now used. - * Customers' models with DequantizeLinear-21, com.microsoft.MatMulNBits, and - com.microsoft.QuickGelu operations are now supported. - - *JAX/Flax Framework Support* - - * JAX 0.4.35 and Flax 0.10.0 have been added to validation. - * jax._src.core.ClosedJaxpr object conversion is now supported. 
- * Vision Transformer from google-research/vision_transformer is now supported - (with support for 37 new operations). - - - **OpenVINO Model Server** - - * The OpenAI API text embedding endpoint has been added, enabling OVMS to be used as a building - block for AI applications like RAG. - `(read more) `__ - * The rerank endpoint has been added based on Cohere API, enabling easy similarity detection - between a query and a set of documents. It is one of the building blocks for AI applications - like RAG and makes integration with frameworks such as langchain easy. - `(read more) `__ - * The following improvements have been done to LLM text generation: - - * The ``echo`` sampling parameter together with ``logprobs`` in the ``completions`` endpoint - is now supported. - * Performance has been increased on both CPU and GPU. - * Throughput in high-concurrency scenarios has been increased with dynamic_split_fuse for GPU. - * Testing coverage and stability has been improved. - * The procedure for service deployment and model repository preparation has been simplified. - - * An experimental version of a Windows binary package - native model server for Windows OS - is - available. This release includes a set of limitations and has limited tests coverage. It is - intended for testing, while the production-ready release is expected with 2025.0. All feedback - is welcome. - - - **Neural Network Compression Framework** - - * A new nncf.data.generate_text_data() method has been added for generating a synthetic dataset - for LLM compression. This approach helps to compress LLMs more accurately in situations when - the dataset is not available or not sufficient. - `See our example `__ - for more information about the usage. - * Support of data-free and data-aware weight compression methods - nncf.compress_weights() - - has been extended with NF4 per-channel quantization, making compressed LLMs more accurate and - faster on NPU. 
- * Caching of computed statistics in nncf.compress_weights() is now available, significantly - reducing compression time when performing compression of the same LLM multiple times, with - different compression parameters. To enable it, set the advanced ``statistics_path`` parameter - of nncf.compress_weights() to the desired file path location. - * The ``backup_mode`` optional parameter has been added to nncf.compress_weights(), for - specifying the data type for embeddings, convolutions, and last linear layers during 4-bit - weight compression. Available options are INT8_ASYM (default), INT8_SYM, and NONE (retains - the original floating-point precision of the model weights). In certain situations, - non-default value might give better accuracy of compressed LLMs. - * Preview support is now available for optimizing models in Torch - `FX format `__, nncf.quantize(), and - nncf.compress_weights() methods. After optimization such models can be directly executed - via torch.compile(compressed_model, backend="openvino"). For more details, see - `INT8 quantization example `__. - * Memory consumption of data-aware weight compression methods - nncf.compress_weights() – has - been reduced significantly, with some variation depending on the model and method. - * Support for the following has changed: - - * NumPy 2 added - * PyTorch upgraded to 2.5.1 - * ONNX upgraded to 1.17 - * Python 3.8 discontinued - - - - **OpenVINO Tokenizers** - - * Several operations have been introduced and optimized. - * Conversion parameters and environment info have been added to ``rt_info``, improving - reproducibility and debugging. - - - - **OpenVINO.GenAI** - - * The following has been added: - - * LoRA adapter for the LLMPipeline. - * Text2ImagePipeline with LoRA adapter and text2image samples. - * VLMPipeline and visual_language_chat sample for text generation models with text and image - inputs. - * WhisperPipeline and whisper_speech_recognition sample. 
- - * speculative_decoding_lm has been moved to LLMPipeline based implementation and is now - installed as part of the package. - * On NPU, a set of pipelines has been enabled: WhisperPipeline (for whisper-base, - whisper-medium, and whisper-small), LLMPipeline (for Llama 3 8B, Llama 2 7B, Mistral-v0.2-7B, - Qwen2-7B-Instruct, and Phi-3 Mini-instruct). Use driver version 32.0.100.3104 or later for - best performance. - - - - - - **Other Changes and Known Issues** - - *Jupyter Notebooks* - - * `Text-to-Image generation using OpenVINO GenAI `__ - * `Multi LoRA Image Generation `__ - * `Virtual Try-on using OpenVINO and CatVTON `__ - * `Visual Language Assistant using OpenVINO GenAI `__ - * `Speech recognition using OpenVINO GenAI `__ - * `YoloV11 `__ - * `Llama-3.2-vision `__ - * `Pixtral `__ - * `Segment Anything 2 `__ - * `Video Lips-sync using Wav2Lip `__ - * `Convert JAX to OpenVINO tutorial `__ - - - *Known Issues* - - | **Component: CPU Plugin** - | ID: 155898 - | Description: - | When using a newer version of Transformers to convert some LLMs - (GPT-J/GPT-NeoX or falcon-7b), the inference accuracy may be impacted on 4th or 5th - generation Intel® Xeon® processors, due to a model structure update triggering an inference - precision difference in part of the model. The workaround is to use Transformers version - 4.44.2 or lower. - - | **Component: GPU Plugin** - | ID: 154583 - | Description: - | LLM accuracy can be low, especially on non-systolic platforms like Intel® Core™ Ultra. When - facing the low accuracy issue, the user needs to manually set the ACTIVATION_SCALING_FACTOR config - with a value of 8.0 in the compile_model() function. Starting with the next release, the scaling factor - value will be applied automatically through an updated IR. 
- - | **Component: GenAI** - | ID: 156437, 148933 - | Description: - | When using Python GenAI APIs, if ONNX 1.17.0 or later is installed, you may encounter the - error “DLL load failed while importing onnx_cpp2py_export: A dynamic link library (DLL) - initialization routine failed.” It is due to the ONNX dependency issue - `onnx/onnx#6267 `__. - Install the - `Microsoft Visual C++ Redistributable `__ - latest supported downloads to fix the issue. - - | **Component: GenAI** - | ID: 156944 - | Description: - | There were backward incompatible changes resulting in different text generated by LLMs like - Mistralai/Mistral-7B-Instruct-v0.2 and TinyLlama/TinyLlama-1.1B-Chat-v1.0 when using a - tokenizer converted by older openvino_tokenizers. A way to resolve the issue is to convert - the tokenizer and detokenizer models using the latest openvino_tokenizers. - - - - - - - - -.. dropdown:: 2024.4 - 19 September 2024 - :animate: fade-in-slide-down - :color: secondary - - **What's new** - - * More Gen AI coverage and framework integrations to minimize code changes. - - * Support for GLM-4-9B Chat, MiniCPM-1B, Llama 3 and 3.1, Phi-3-Mini, Phi-3-Medium and - YOLOX-s models. - * Noteworthy notebooks added: Florence-2, NuExtract-tiny Structure Extraction, Flux.1 Image - Generation, PixArt-α: Photorealistic Text-to-Image Synthesis, and Phi-3-Vision Visual - Language Assistant. - - * Broader Large Language Model (LLM) support and more model compression techniques. - - * OpenVINO™ runtime optimized for Intel® Xe Matrix Extensions (Intel® XMX) systolic arrays on - built-in GPUs for efficient matrix multiplication resulting in significant LLM performance - boost with improved 1st and 2nd token latency, as well as a smaller memory footprint on - Intel® Core™ Ultra Processors (Series 2). - * Memory sharing enabled for NPUs on Intel® Core™ Ultra Processors (Series 2) for efficient - pipeline integration without memory copy overhead. 
- * Addition of the PagedAttention feature for discrete GPUs* enables a significant boost in - throughput for parallel inferencing when serving LLMs on Intel® Arc™ Graphics or Intel® - Data Center GPU Flex Series. - - * More portability and performance to run AI at the edge, in the cloud, or locally. - - * Support for Intel® Core™ Ultra Processors Series 2 (formerly codenamed Lunar Lake) on Windows. - * OpenVINO™ Model Server now comes with production-quality support for OpenAI-compatible API - which enables significantly higher throughput for parallel inferencing on Intel® Xeon® - processors when serving LLMs to many concurrent users. - * Improved performance and memory consumption with prefix caching, KV cache compression, and - other optimizations for serving LLMs using OpenVINO™ Model Server. - * Support for Python 3.12. - * Support for Red Hat Enterprise Linux (RHEL) version 9.3 - 9.4. - - *Now deprecated* - - * The following will not be available beyond the 2024.4 OpenVINO version: - - * The macOS x86_64 debug bins - * Python 3.8 - * Discrete Keem Bay support - - * Intel® Streaming SIMD Extensions (Intel® SSE) will be supported in source code form, but not - enabled in the binary package by default, starting with OpenVINO 2025.0. - - Check the `deprecation section <#deprecation-and-support>`__ for more information. - - **OpenVINO™ Runtime** - - *Common* - - * Encryption and decryption of topology in model cache is now supported with callback functions - provided by the user (CPU only for now; ov::cache_encryption_callbacks). - * The Ubuntu20 and Ubuntu22 Docker images now include the tokenizers and GenAI CPP modules, - including pre-installed Python modules, in development versions of these images. - * Python 3.12 is now supported. 
- - *CPU Device Plugin* - - * The following is now supported: - - * Tensor parallel feature for multi-socket CPU inference, with performance improvement for - LLMs with 6B+ parameters (enabled through model_distribution_policy hint configurations). - * RMSNorm operator, optimized with JIT kernel to improve both the 1st and 2nd token - performance of LLMs. - - * The following has been improved: - - * vLLM support, with PagedAttention exposing attention score as the second output. It can now - be used in the cache eviction algorithm to improve LLM serving performance. - * 1st token performance with Llama series of models, with additional CPU operator optimization - (such as MLP, SDPA) on BF16 precision. - * Default oneTBB version on Linux is now 2021.13.0, improving overall performance on latest - Intel® Xeon® platforms. - * MXFP4 weight compression models (compressing weights to 4-bit with the e2m1 data type - without a zero point and with 8-bit e8m0 scales) have been optimized for Intel® Xeon® - platforms thanks to fullyconnected compressed weight LLM support. - - * The following has been fixed: - - * Memory leak when ov::num_streams value is 0. - * CPU affinity mask is changed after OpenVINO execution when OpenVINO is compiled - with -DTHREADING=SEQ. - - - *GPU Device Plugin* - - * Dynamic quantization for LLMs is now supported on discrete GPU platforms. - * Stable Diffusion 3 is now supported with good accuracy on Intel GPU platforms. - * Both first and second token latency for LLMs have been improved on Intel GPU platforms. - * The issue of model cache not regenerating with the value changes of - ``ov::hint::performance_mode`` or ``ov::hint::dynamic_quantization_group_size`` has been - fixed. - - - *NPU Device Plugin* - - * `Remote Tensor API `__ - is now supported. - * You can now query the available number of tiles (ov::intel_npu::max_tiles) and force a - specific number of tiles to be used by the model, per inference request - (ov::intel_npu::tiles). 
**Note:** ov::intel_npu::tiles overrides the default number of tiles - selected by the compiler based on performance hints (ov::hint::performance_mode). Any tile - number other than 1 may be a problem for cross platform compatibility, if not tested - explicitly versus the max_tiles value. - * You can now bypass the model caching mechanism in the driver - (ov::intel_npu::bypass_umd_caching). Read more about driver and OpenVINO caching. - * Memory footprint at model execution has been reduced by one blob (compiled model) size. - For execution, the plugin no longer retrieves the compiled model from the driver, it uses the - level zero graph handle directly, instead. The compiled model is now retrieved from the driver - only during the export method. - - - *OpenVINO Python API* - - * Openvino.Tensor, when created in the shared memory mode, now prevents “garbage collection” of - numpy memory. - * The ``openvino.experimental`` submodule is now available, providing access to experimental - functionalities under development. - * New python-exclusive openvino.Model constructors have been added. - * Image padding in PreProcessor is now available. - * OpenVINO Runtime is now compatible with numpy 2.0. - - - *OpenVINO Node.js API* - - * The following has been improved - - * Unit tests for increased efficiency and stability - * Security updates applied to dependencies - - * `Electron `__ - compatibility is now confirmed with new end-to-end tests. - * `New API methods `__ added. - - - *TensorFlow Framework Support* - - * TensorFlow 2.17.0 is now supported. - * JAX 0.4.31 is now supported via a path of jax2tf with native_serialization=False - * `8 NEW* operations `__ - have been added. - * Tensor lists with multiple undefined dimensions in element_shape are now supported, enabling - support for TF Hub lite0-detection/versions/1 model. - - - *PyTorch Framework Support* - - * Torch 2.4 is now supported. 
- * Inplace ops are now supported automatically if the regular version is supported. - * Symmetric GPTQ model from Hugging Face will now be automatically converted to the signed type - (INT4) and zero-points will be removed. - - - *ONNX Framework Support* - - * ONNX 1.16.0 is now supported - * models with constants/inputs of uINT4/INT4 types are now supported. - * 4 NEW operations have been added. - - - **OpenVINO Model Server** - - * OpenAI API for text generation is now officially supported and recommended for production - usage. It comes with the following new features: - - * Prefix caching feature, caching the prompt evaluation to speed up text generation. - * Ability to compress the KV Cache to a lower precision, reducing memory consumption without - a significant loss of accuracy. - * ``stop`` sampling parameters, to define a sequence that stops text generation. - * ``logprobs`` sampling parameter, returning the probabilities to returned tokens. - * Generic metrics related to execution of the MediaPipe graph that can be used for autoscaling - based on the current load and the level of concurrency. - * `Demo of text generation horizontal scalability `__ - using basic docker containers and Kubernetes. - * Automatic cancelling of text generation for disconnected clients. - * Non-UTF-8 responses from the model can be now automatically changed to Unicode replacement - characters, due to their configurable handling. - * Intel GPU with paged attention is now supported. - * Support for Llama3.1 models. - - * The following has been improved: - - * Handling of model templates without bos_token is now fixed. - * Performance of the multinomial sampling algorithm. - * ``finish_reason`` in the response correctly determines reaching max_tokens (length) and - completing the sequence (stop). - * Security and stability. 
- - - - **Neural Network Compression Framework** - - * The LoRA Correction algorithm is now included in the Weight Compression method, improving the - accuracy of INT4-compressed models on top of other data-aware algorithms, such as AWQ and - Scale Estimation. To enable it, set the lora_correction option to True in - nncf.compress_weights(). - * The GPTQ compression algorithm can now be combined with the Scale Estimation algorithm, - making it possible to run GPTQ, AWQ, and Scale Estimation together, for the optimum-accuracy - INT4-compressed models. - * INT8 quantization of LSTMSequence and Convolution operations for constant inputs is now - enabled, resulting in better performance and reduced model size. - - - **OpenVINO Tokenizers** - - * Split and BPE tokenization operations have been reimplemented, resulting in improved - tokenization accuracy and performance. - * New building options are now available, offering up to a 12x reduction in binary size. - * An operation is now available to validate and skip/replace model-generated non-Unicode - bytecode sequences during detokenization. - - **OpenVINO.GenAI** - - * New samples and pipelines are now available: - - * An example IterableStreamer implementation in - `multinomial_causal_lm/python sample `__ - - * GenAI compilation is now available as part of OpenVINO via the –DOPENVINO_EXTRA_MODULES CMake - option. 
- - - - **Other Changes and Known Issues** - - *Jupyter Notebooks* - - * `Florence-2 `__ - * `NuExtract: Structure Extraction `__ - * `Flux.1 Image Generation `__ - * `PixArt-α: Photorealistic Text-to-Image Synthesis `__ - * `Phi-3-Vision Visual Language Assistant `__ - * `MiniCPMV2.6 `__ - * `InternVL2 `__ - * The list of supported models in - `LLM chatbot `__ - now includes Phi3.5 and Gemma2. - - *Known Issues* - - | **Component: CPU** - | ID: CVS-150542, CVS-145996 - | Description: - | The upgrade of the default oneTBB on Linux platforms to 2021.13.0 improves overall - performance on the latest Intel® Xeon® platforms but causes regression in some cases. Limiting the - thread usage of postprocessing done by Torch can mitigate the regression (for example: - torch.set_num_threads(n), where n can be 1, the beam search number, the prompt batch size, or another - number). - - | **Component: OpenVINO.GenAI** - | ID: 149694 - | Description: - | Passing an openvino.Tensor instance to LLMPipeline triggers an incompatible arguments error if - OpenVINO and GenAI are installed from PyPI on Windows. - - | **Component: OpenVINO.GenAI** - | ID: 148308 - | Description: - | The OpenVINO.GenAI archive doesn't have debug libraries for OpenVINO Tokenizers and - OpenVINO.GenAI. - - | **Component: ONNX for ARM** - | ID: n/a - | Description: - | For ARM binaries, the `1.16 ONNX library `__ - is not yet available. The ONNX library for ARM, version 1.15, does not include the latest - functional and security updates. Users should update to the latest version as it becomes - available. - | Currently, if an unverified AI model is supplied to the ONNX frontend, it could lead to a - directory traversal issue. Ensure that the file name and file path that a model contains - are verified and correct. To learn more about the vulnerability, see: - `CVE-2024-27318 `__ and - `CVE-2024-27319 `__. 
- - | **Component: Kaldi** - | ID: n/a - | Description: - | There is a known issue with the Kaldi DL framework support on the Python version 3.12 due - to the numpy version incompatibilities. As Kaldi support in OpenVINO is currently deprecated - and will be discontinued with version 2025.0, the issue will not be addressed. - - - - - -.. dropdown:: 2024.3 - 31 July 2024 - :animate: fade-in-slide-down - :color: secondary - - **What's new** - - * More Gen AI coverage and framework integrations to minimize code changes. - - * OpenVINO pre-optimized models are now available in Hugging Face making it easier for developers - to get started with these models. - - * Broader Large Language Model (LLM) support and more model compression techniques. - - * Significant improvement in LLM performance on Intel discrete GPUs with the addition of - Multi-Head Attention (MHA) and OneDNN enhancements. - - * More portability and performance to run AI at the edge, in the cloud, or locally. - - * Improved CPU performance when serving LLMs with the inclusion of vLLM and continuous batching - in the OpenVINO Model Server (OVMS). vLLM is an easy-to-use open-source library that supports - efficient LLM inferencing and model serving. - * Ubuntu 24.04 is now officially supported. - - **OpenVINO™ Runtime** - - *Common* - - * OpenVINO may now be used as a backend for vLLM, offering better CPU performance due to - fully-connected layer optimization, fusing multiple fully-connected layers (MLP), U8 KV cache, - and dynamic split fuse. - * Ubuntu 24.04 is now officially supported, which means OpenVINO is now validated on this - system (preview support). - * The following have been improved: - - * Increasing support for models like YoloV10 or PixArt-XL-2, thanks to enabling Squeeze and - Concat layers. - * Performance of precision conversion FP16/BF16 -> FP32. 
- - *AUTO Inference Mode* - - * Model cache is now disabled for CPU acceleration even when cache_dir is set, because CPU - acceleration is skipped when the cached model is ready for the target device in the 2nd run. - - *Heterogeneous Inference Mode* - - * PIPELINE_PARALLEL policy is now available, to inference large models on multiple devices per - available memory size, being especially useful for large language models that don't fit into - one discrete GPU (a preview feature). - - *CPU Device Plugin* - - * Fully Connected layers have been optimized together with RoPE optimization with JIT kernel to - improve performance for LLM serving workloads on Intel AMX platforms. - * Dynamic quantization of Fully Connected layers is now enabled by default on Intel AVX2 and - AVX512 platforms, improving out-of-the-box performance for 8bit/4bit weight-compressed LLMs. - * Performance has been improved for: - - * ARM server configuration, due to migration to Intel® oneAPI Threading Building Blocks 2021.13. - * ARM for FP32 and FP16. - - *GPU Device Plugin* - - * Performance has been improved for: - - * LLMs and Stable Diffusion on discrete GPUs, due to latency decrease, through optimizations - such as Multi-Head Attention (MHA) and oneDNN improvements. - * Whisper models on discrete GPU. - - - *NPU Device Plugin* - - * NPU inference of LLMs is now supported with GenAI API (preview feature). To support LLMs on - NPU (requires the most recent version of the NPU driver), additional relevant features are - also part of the NPU plugin now. - * Models bigger than 2GB are now supported on both NPU driver - (Intel® NPU Driver - Windows* 32.0.100.2540) and NPU plugin side (both Linux and Windows). - * Memory optimizations have been implemented: - - * Weights are no longer copied from NPU compiler adapter. - * Improved memory and first-ever inference latency for inference on NPU. 
- - *OpenVINO Python API* - - * visit_attributes is now available for custom operations implemented in Python, enabling - serialization of operation attributes. - * The Python API is now extended with new methods for the Model class, e.g. Model.get_sink_index, and new - overloads for Model.get_result_index. - - *OpenVINO Node.js API* - - * Tokenizers and StringTensor are now supported for LLM inference. - * Compatibility with electron.js is now restored for desktop application developers. - * Async version of Core.import_model and enhancements for Core.read_model methods are now - available, for more efficient model reading, especially for LLMs. - - *TensorFlow Framework Support* - - * Models with keras.LSTM operations are now more performant in CPU inference. - * The tensor list initialized with an undefined element shape value is now supported. - - *TensorFlow Lite Framework Support* - - * Constants containing sparse tensors are now supported. - - *PyTorch Framework Support* - - * Setting types/shapes for nested structures (e.g., dictionaries and tuples) is now supported. - * The aten::layer_norm has been updated to support dynamic shape normalization. - * Dynamic shapes support in the FX graph has been improved, benefiting torch.compile and - torch.export based applications, improving performance for gemma and chatglm model - families. - - *ONNX Framework Support* - - * More models are now supported: - - * Models using the new version of the ReduceMean operation (introduced in ONNX opset 18). - * Models using the Multinomial operation (introduced in ONNX opset 7). - - - **OpenVINO Model Server** - - * The following has been improved in OpenAI API text generation: - - * Performance results, due to OpenVINO Runtime and sampling algorithms. - * Reporting generation engine metrics in the logs. - * Extra sampling parameters added. - * Request parameters affecting memory consumption now have value restrictions, within a - configurable range. 
- - * The following has been fixed in OpenAI API text generation: - - * Generating streamer responses impacting incomplete utf-8 sequences. - * A sporadic generation hang. - * Incompatibility of the last response from the ``completions`` endpoint stream with the vLLM - benchmarking script. - - **Neural Network Compression Framework** - - * The `MXFP4 `__ - data format is now supported in the Weight Compression method, compressing weights to 4-bit - with the e2m1 data type without a zero point and with 8-bit e8m0 scales. This feature - is enabled by setting ``mode=CompressWeightsMode.E2M1`` in nncf.compress_weights(). - * The AWQ algorithm in the Weight Compression method has been extended for patterns: - Act->MatMul and Act->Multiply->MatMul to cover the Phi family models. - * The representation of symmetrically quantized weights has been updated to a signed data type - with no zero point. This allows NPU to support compressed LLMs with the symmetric mode. - * BF16 models are now supported in Post-Training Quantization, nncf.quantize(). - * `Activation Sparsity `__ (Contextual Sparsity) algorithm in - the Weight Compression method is now supported (preview), speeding up LLM inference. - The algorithm is enabled by setting the ``target_sparsity_by_scope`` option in - nncf.compress_weights() and supports Torch models only. - - - **OpenVINO Tokenizers** - - * The following is now supported: - - * Full Regex syntax with the PCRE2 library for text normalization and splitting. - * Left padding side for all tokenizer types. - - * GLM-4 tokenizer support, as well as detokenization support for Phi-3 and Gemma have been - improved.
- - - - - - **Other Changes and Known Issues** - - *Jupyter Notebooks* - - * `Stable Diffusion V3 `__ - * `Depth Anything V2 `__ - * `RAG System with LLamaIndex `__ - * `Image Synthesis with Pixart `__ - * `Function calling LLM agent with Qwen-Agent `__ - * `Jina-CLIP `__ - * `MiniCPM -V2 Visual Language Assistant `__ - * `OpenVINO XAI: first steps `__ - * `OpenVINO XAI: deep dive `__ - * `LLM Agent with LLamaIndex `__ - * `Stable Audio `__ - * `Phi-3-vision `__ - - *OpenVINO.GenAI* - - * Performance counters have been added. - * Preview support for NPU is now available. - - *Hugging Face* - - OpenVINO pre-optimized models are now available on Hugging Face: - - * Phi-3-mini-128k-instruct ( - `INT4 `__, - `INT8 `__, - `FP16 `__) - * Mistral-7B-Instruct-v0.2 ( - `INT4 `__, - `INT8 `__, - `FP16 `__) - * Mixtral-8x7b-Instruct-v0.1 ( - `INT4 `__, - `INT8 `__) - * LCM_Dreamshaper_v7 ( - `INT8 `__, - `FP16 `__) - * starcoder2-7b ( - `INT4 `__, - `INT8 `__, - `FP16 `__) - * For all the models see `HuggingFace `__ - - - - - *Known Issues* - - | **Component: OpenVINO.GenAI** - | ID: 148308 - | Description: - | The OpenVINO.GenAI archive distribution doesn't include debug libraries for OpenVINO - Tokenizers and OpenVINO.GenAI. - - | **Component: GPU** - | ID: 146283 - | Description: - | For some LLM models, longer prompts, such as several thousand tokens, may result in - decreased accuracy on the GPU plugin. - | Workaround: - | It is recommended to run the model in the FP32 precision to avoid the issue. - - - - - -.. ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ -.. ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ - -.. dropdown:: 2024.2 - 17 June 2024 - :animate: fade-in-slide-down - :color: secondary - - **What's new** - - * More :doc:`Gen AI <../learn-openvino/llm_inference_guide/genai-guide>` coverage and framework - integrations to minimize code changes. 
- - * Llama 3 optimizations for CPUs, built-in GPUs, and discrete GPUs for improved performance - and efficient memory usage. - * Support for Phi-3-mini, a family of AI models that leverages the power of small language - models for faster, more accurate and cost-effective text processing. - * Python Custom Operation is now enabled in OpenVINO making it easier for Python developers - to code their custom operations instead of using C++ custom operations (also supported). - Python Custom Operation empowers users to implement their own specialized operations into - any model. - * Notebooks expansion to ensure better coverage for new models. Noteworthy notebooks added: - DynamiCrafter, YOLOv10, Chatbot notebook with Phi-3, and QWEN2. - - - * Broader Large Language Model (LLM) support and more model compression techniques. - - * GPTQ method for 4-bit weight compression added to NNCF for more efficient inference and - improved performance of compressed LLMs. - * Significant LLM performance improvements and reduced latency for both built-in GPUs and - discrete GPUs. - * Significant improvement in 2nd token latency and memory footprint of FP16 weight LLMs on - AVX2 (13th Gen Intel® Core™ processors) and AVX512 (3rd Gen Intel® Xeon® Scalable Processors) - based CPU platforms, particularly for small batch sizes. - - * More portability and performance to run AI at the edge, in the cloud, or locally. - - * Model Serving Enhancements: - - * Preview: OpenVINO Model Server (OVMS) now supports OpenAI-compatible API along with Continuous - Batching and PagedAttention, enabling significantly higher throughput for parallel - inferencing, especially on Intel® Xeon® processors, when serving LLMs to many concurrent - users. - * OpenVINO backend for Triton Server now supports dynamic input shapes. - * Integration of TorchServe through torch.compile OpenVINO backend for easy model deployment, - provisioning to multiple instances, model versioning, and maintenance. 
- - * Preview: addition of the :doc:`Generate API <../learn-openvino/llm_inference_guide/genai-guide>`, - a simplified API for text generation using large language models with only a few lines of - code. The API is available through the newly launched OpenVINO GenAI package. - * Support for Intel® Atom® Processor X Series. For more details, see :doc:`System Requirements <./release-notes-openvino/system-requirements>`. - * Preview: Support for Intel® Xeon® 6 processor. - - **OpenVINO™ Runtime** - - *Common* - - * Operations and data types using UINT2, UINT3, and UINT6 are now supported, to allow for a more - efficient LLM weight compression. - * Common OV headers have been optimized, improving binary compilation time and reducing binary - size. - - *AUTO Inference Mode* - - * AUTO takes model caching into account when choosing the device for fast first-inference latency. - If model cache is already in place, AUTO will directly use the selected device instead of - temporarily leveraging CPU as first-inference device. - * Dynamic models are now loaded to the selected device, instead of loading to CPU without - considering device priority. - * Fixed the exceptions when use AUTO with stateful models having dynamic input or output. - - *CPU Device Plugin* - - * Performance when using latency mode in FP32 precision has been improved on Intel client - platforms, including Intel® Core™ Ultra (formerly codenamed Meteor Lake) and 13th Gen Core - processors (formerly codenamed Raptor Lake). - * 2nd token latency and memory footprint for FP16 LLMs have been improved significantly on AVX2 - and AVX512 based CPU platforms, particularly for small batch sizes. - * PagedAttention has been optimized on AVX2, AVX512 and AMX platforms together with INT8 KV cache - support to improve the performance when serving LLM workloads on Intel CPUs. - * LLMs with shared embeddings have been optimized to improve performance and memory consumption - on several models including Gemma. 
- * Performance on ARM-based servers is significantly improved with the upgrade to TBB 2021.2.5. - * Improved FP32 and FP16 performance on ARM CPU. - - *GPU Device Plugin* - - * Both first token and average token latency of LLMs have been improved on all GPU platforms, most - significantly on discrete GPUs. Memory usage of LLMs has been reduced as well. - * Stable Diffusion FP16 performance improved on Intel® Core™ Ultra platforms, with significant - pipeline improvement for models with dynamic-shaped input. Memory usage of the pipeline - has been reduced, as well. - * Optimized permute_f_y kernel performance has been improved. - - *NPU Device Plugin* - - * A new set of configuration options is now available. - * Performance increase has been unlocked, with the new `2408 NPU driver `__. - - *OpenVINO Python API* - - * Writing custom Python operators is now supported for basic scenarios (alignment with OpenVINO - C++ API). This empowers users to implement their own specialized operations into any model. - Full support with more advanced features is within the scope of upcoming releases. - - *OpenVINO C API* - - * More element types are now supported to align with the OpenVINO C++ API. - - *OpenVINO Node.js API* - - * OpenVINO node.js packages now support the electron.js framework. - * Extended and improved JS API documentation for more complete usage guidelines. - * Better JS API alignment with OpenVINO C++ API, delivering more advanced features to JS users. - - *TensorFlow Framework Support* - - * 3 new operations are now supported. See operations marked as `NEW here `__. - * LookupTableImport has received better support, required for 2 models from TF Hub: - - * mil-nce - * openimages-v4-ssd-mobilenet-v2 - - *TensorFlow Lite Framework Support* - - * The GELU operation required for a customer model is now supported. - - *PyTorch Framework Support* - - * 9 new operations are now supported. - * aten::set_item now supports negative indices.
- * Issue with adaptive pool when shape is list has been fixed (PR `#24586 `__). - - *ONNX Support* - - * The InputModel interface should be used from now on, instead of a number of deprecated APIs - and class symbols. - * Translation for ReduceMin-18 and ReduceSumSquare-18 operators has been added, to address - customer model requests. - * Behavior of the Gelu-20 operator has been fixed for the case when “none” is set as the - default value. - - **OpenVINO Model Server** - - * OpenVINO Model Server can now be used for text generation use cases using OpenAI compatible API. - * Continuous batching and PagedAttention algorithms are now supported for fast and efficient - text generation under high-concurrency load, especially on Intel® Xeon® processors. - `Learn more about it `__. - - **Neural Network Compression Framework** - - * GPTQ method is now supported in nncf.compress_weights() for data-aware 4-bit weight - compression of LLMs. Enabled by ``gptq=True`` in nncf.compress_weights(). - * Scale Estimation algorithm is now available for more accurate 4-bit compressed LLMs. Enabled by - ``scale_estimation=True`` in nncf.compress_weights(). - * Added support for models with BF16 weights in nncf.compress_weights(). - * nncf.quantize() method is now the recommended path for quantization initialization of - PyTorch models in Quantization-Aware Training. See example for more details. - * compressed_model.nncf.get_config() and nncf.torch.load_from_config() API have been added to - save and restore quantized PyTorch models. See example for more details. - * Automatic support for int8 quantization of PyTorch models with custom modules has been added. - Registering such modules before quantization is no longer needed.
- - **Other Changes and Known Issues** - - *Jupyter Notebooks* - - * Latest notebooks along with the GitHub validation status can be found in the - `OpenVINO notebook section `__ - * The following notebooks have been updated or newly added: - - * `Image to Video Generation with Stable Video Diffusion `__ - * `Image generation with Stable Cascade `__ - * `One Step Sketch to Image translation with pix2pix-turbo and OpenVINO `__ - * `Animating Open-domain Images with DynamiCrafter and OpenVINO `__ - * `Text-to-Video retrieval with S3D MIL-NCE and OpenVINO `__ - * `Convert and Optimize YOLOv10 with OpenVINO `__ - * `Visual-language assistant with nanoLLaVA and OpenVINO `__ - * `Person Counting System using YOLOV8 and OpenVINO™ `__ - * `Quantization-Sparsity Aware Training with NNCF, using PyTorch framework `__ - * `Create an LLM-powered Chatbot using OpenVINO `__ - - *Known Issues* - - | **Component: TBB** - | ID: TBB-1400/ TBB-1401 - | Description: - | In 2024.2, oneTBB 2021.2.x is used for Intel Distribution of OpenVINO Ubuntu and Red Hat - archives, instead of system TBB/oneTBB. This improves performance on the new generation of - Intel® Xeon® platforms but may increase latency of some models on the previous generation. - You can build OpenVINO with **-DSYSTEM_TBB=ON** to get better latency performance for - these models. - - | **Component: python API** - | ID: CVS-141744 - | Description: - | During post commit tests we found problem related with custom operations. Fix is ready and - will be delivered with 2024.3 release. - | - Initial problem: test_custom_op hanged on destruction because it was waiting for a - thread which tried to acquire GIL. - | - The second problem is that pybind11 doesn't allow to work with GIL besides of current - scope and it's impossible to release GIL for destructors. Blocking destructors and the - GIL pybind/pybind11#1446 - | - Current solution allows to release GIL for InferRequest and all called by chain destructors. 
- - | **Component: CPU runtime** - | *ID:* MFDNN-11428 - | *Description:* - | Due to adopting a new OneDNN library, improving performance for most use cases, - particularly for AVX2 BRGEMM kernels with the latency hint, the following regressions may - be noticed: - | a. latency regression on certain models, such as unet-camvid-onnx-0001 and mask_rcnn_resnet50_atrous_coco on MTL Windows latency mode - | b. performance regression on Intel client platforms if the throughput hint is used - | The issue is being investigated and is planned to be resolved in the following releases. - - | **Component: Hardware Configuration** - | *ID:* N/A - | *Description:* - | Reduced performance for LLMs may be observed on newer CPUs. To mitigate, modify the default settings in BIOS to change the system into a 2 NUMA node system: - | 1. Enter the BIOS configuration menu. - | 2. Select EDKII Menu -> Socket Configuration -> Uncore Configuration -> Uncore General Configuration -> SNC. - | 3. The SNC setting is set to *AUTO* by default. Change the SNC setting to *disabled* to configure one NUMA node per processor socket upon boot. - | 4. After system reboot, confirm the NUMA node setting using: ``numactl -H``. Expect to see only nodes 0 and 1 on a 2-socket system with the following mapping: - | Node - 0 - 1 - | 0 - 10 - 21 - | 1 - 21 - 10 - - - - - - - - - -.. ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ -.. ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ - -.. dropdown:: 2024.1 - 24 April 2024 - :animate: fade-in-slide-down - :color: secondary - - **What's new** - - * More Gen AI coverage and framework integrations to minimize code changes. - - * Mixtral and URLNet models optimized for performance improvements on Intel® Xeon® processors. - * Stable Diffusion 1.5, ChatGLM3-6B, and Qwen-7B models optimized for improved inference speed - on Intel® Core™ Ultra processors with integrated GPU.
- * Support for Falcon-7B-Instruct, a GenAI Large Language Model (LLM) ready-to-use chat/instruct - model with superior performance metrics. - * New Jupyter Notebooks added: YOLO V9, YOLO V8 Oriented Bounding Boxes Detection (OOB), Stable - Diffusion in Keras, MobileCLIP, RMBG-v1.4 Background Removal, Magika, TripoSR, AnimateAnyone, - LLaVA-Next, and RAG system with OpenVINO and LangChain. - - * Broader LLM model support and more model compression techniques. - - * LLM compilation time reduced through additional optimizations with compressed embedding. - Improved 1st token performance of LLMs on 4th and 5th generations of Intel® Xeon® processors - with Intel® Advanced Matrix Extensions (Intel® AMX). - * Better LLM compression and improved performance with oneDNN, INT4, and INT8 support for - Intel® Arc™ GPUs. - * Significant memory reduction for select smaller GenAI models on Intel® Core™ Ultra processors - with integrated GPU. - - * More portability and performance to run AI at the edge, in the cloud, or locally. - - * The preview NPU plugin for Intel® Core™ Ultra processors is now available in the OpenVINO - open-source GitHub repository, in addition to the main OpenVINO package on PyPI. - * The JavaScript API is now more easily accessible through the npm repository, enabling - JavaScript developers' seamless access to the OpenVINO API. - * FP16 inference on ARM processors now enabled for the Convolutional Neural Network (CNN) by - default. - - **OpenVINO™ Runtime** - - *Common* - - * Unicode file paths for cached models are now supported on Windows. - * Pad pre-processing API to extend input tensor on edges with constants. - * A fix for inference failures of certain image generation models has been implemented - (fused I/O port names after transformation). - * Compiler's warnings-as-errors option is now on, improving the coding criteria and quality. - Build warnings will not be allowed for new OpenVINO code and the existing warnings have been - fixed. 
- - *AUTO Inference Mode* - - * Returning the ov::enable_profiling value from ov::CompiledModel is now supported. - - *CPU Device Plugin* - - * 1st token performance of LLMs has been improved on the 4th and 5th generations of Intel® Xeon® - processors with Intel® Advanced Matrix Extensions (Intel® AMX). - * LLM compilation time and memory footprint have been improved through additional optimizations - with compressed embeddings. - * Performance of MoE (e.g. Mixtral), Gemma, and GPT-J has been improved further. - * Performance has been improved significantly for a wide set of models on ARM devices. - * FP16 inference precision is now the default for all types of models on ARM devices. - * CPU architecture-agnostic build has been implemented, to enable unified binary distribution - on different ARM devices. - - *GPU Device Plugin* - - * LLM first token latency has been improved on both integrated and discrete GPU platforms. - * For the ChatGLM3-6B model, average token latency has been improved on integrated GPU platforms. - * For Stable Diffusion 1.5 FP16 precision, performance has been improved on Intel® Core™ Ultra - processors. - - *NPU Device Plugin* - - * NPU Plugin is now part of the OpenVINO GitHub repository. All the most recent plugin changes - will be immediately available in the repo. Note that NPU is part of Intel® Core™ Ultra - processors. - * New OpenVINO™ notebook “Hello, NPU!” introducing NPU usage with OpenVINO has been added. - * Version 22H2 or later is required for Microsoft Windows® 11 64-bit to run inference on NPU. - - *OpenVINO Python API* - - * GIL-free creation of RemoteTensors is now used - holding GIL means that the process is not suited - for multithreading and removing the GIL lock will increase performance which is critical for - the concept of Remote Tensors. - * Packed data type BF16 on the Python API level has been added, opening a new way of supporting - data types not handled by numpy. 
- * 'pad' operator support for ov::preprocess::PrePostProcessorItem has been added. - * ov.PartialShape.dynamic(int) definition has been provided. - - *OpenVINO C API* - - * Two new pre-processing APIs for scale and mean have been added. - - *OpenVINO Node.js API* - - * New methods to align JavaScript API with CPP API have been added, such as - CompiledModel.exportModel(), core.import_model(), Core set/get property and Tensor.get_size(), - and Model.is_dynamic(). - * Documentation has been extended to help developers start integrating JavaScript applications - with OpenVINO™. - - *TensorFlow Framework Support* - - * `tf.keras.layers.TextVectorization tokenizer `__ - is now supported. - * Conversion of models with Variable and HashTable (dictionary) resources has been improved. - * 8 NEW operations have been added - (`see the list here, marked as NEW `__). - * 10 operations have received complex tensor support. - * Input tensor names for TF1 models have been adjusted to have a single name per input. - * Hugging Face model support coverage has increased significantly, due to: - - * extraction of input signature of a model in memory has been fixed, - * reading of variable values for a model in memory has been fixed. - - *PyTorch Framework Support* - - * ModuleExtension, a new type of extension for PyTorch models is now supported - (`PR #23536 `__). - * 22 NEW operations have been added. - * Experimental support for models produced by torch.export (FX graph) has been added - (`PR #23815 `__). - - *ONNX Framework Support* - - * 8 new operations have been added. - - **OpenVINO Model Server** - - * OpenVINO™ Runtime backend used is now 2024.1 - * OpenVINO™ models with String data type on output are supported. Now, OpenVINO™ Model Server - can support models with input and output of the String type, so developers can take advantage - of the tokenization built into the model as the first layer. 
Developers can also rely on any - postprocessing embedded into the model which returns text only. Check the - `demo on string input data with the universal-sentence-encoder model `__ - and the - `String output model demo `__. - * MediaPipe Python calculators have been updated to support relative paths for all related - configuration and Python code files. Now, the complete graph configuration folder can be - deployed in an arbitrary path without any code changes. - * KServe REST API support has been extended to properly handle the string format in JSON body, - just like the binary format compatible with NVIDIA Triton™. - * `A demo showcasing a full RAG algorithm `__ - fully delegated to the model server has been added. - - **Neural Network Compression Framework** - - * Model subgraphs can now be defined in the ignored scope for INT8 Post-training Quantization, - nncf.quantize(), which simplifies excluding accuracy-sensitive layers from quantization. - * A batch size of more than 1 is now partially supported for INT8 Post-training Quantization, - speeding up the process. Note that it is not recommended for transformer-based models as it - may impact accuracy. Here is an - `example demo `__. - * Now it is possible to apply fine-tuning on INT8 models after Post-training Quantization to - improve model accuracy and make it easier to move from post-training to training-aware - quantization. Here is an - `example demo `__. - - **OpenVINO Tokenizers** - - * TensorFlow support has been extended - TextVectorization layer translation: - - * Aligned existing ops with TF ops and added a translator for them. - * Added new ragged tensor ops and string ops. - - * A new tokenizer type, RWKV is now supported: - - * Added Trie tokenizer and Fuse op for ragged tensors. - * A new way to get OV Tokenizers: build a vocab from file. - - * Tokenizer caching has been redesigned to work with the OpenVINO™ model caching mechanism. 
- - **Other Changes and Known Issues** - - *Jupyter Notebooks* - - The default branch for the OpenVINO™ Notebooks repository has been changed from 'main' to - 'latest'. The 'main' branch of the notebooks repository is now deprecated and will be maintained - until September 30, 2024. - - The new branch, 'latest', offers a better user experience and simplifies maintenance due to - significant refactoring and an improved directory naming structure. - - Use the local - `README.md `__ - file and OpenVINO™ Notebooks at - `GitHub Pages `__ - to navigate through the content. - - The following notebooks have been updated or newly added: - - * `Grounded Segment Anything `__ - * `Visual Content Search with MobileCLIP `__ - * `YOLO V8 Oriented Bounding Box Detection Optimization `__ - * `Magika: AI-powered fast and efficient file type identification `__ - * `Keras Stable Diffusion `__ - * `RMBG background removal `__ - * `AnimateAnyone: pose guided image to video generation `__ - * `LLaVA-Next visual-language assistant `__ - * `TripoSR: single image 3d reconstruction `__ - * `RAG system with OpenVINO and LangChain `__ - - *Known Issues* - - | **Component: CPU Plugin** - | *ID:* N/A - | *Description:* - | Default CPU pinning policy on Windows has been changed to follow Windows' policy - instead of controlling the CPU pinning in the OpenVINO plugin. This brings certain dynamic or - performance variance on Windows. Developers can use ov::hint::enable_cpu_pinning to enable - or disable CPU pinning explicitly. - - | **Component: Hardware Configuration** - | *ID:* N/A - | *Description:* - | Reduced performance for LLMs may be observed on newer CPUs. To mitigate, modify the default settings in BIOS to - | change the system into 2 NUMA node system: - | 1. Enter the BIOS configuration menu. - | 2. Select EDKII Menu -> Socket Configuration -> Uncore Configuration -> Uncore General Configuration -> SNC. - | 3. The SNC setting is set to *AUTO* by default. 
Change the SNC setting to *disabled* to configure one NUMA node per processor socket upon boot. - | 4. After system reboot, confirm the NUMA node setting using: ``numactl -H``. Expect to see only nodes 0 and 1 on a - | 2-socket system with the following mapping: - | Node - 0 - 1 - | 0 - 10 - 21 - | 1 - 21 - 10 - - - - - - - - - - -.. dropdown:: 2024.0 - 06 March 2024 - :animate: fade-in-slide-down - :color: secondary - - **What's new** - - * More Generative AI coverage and framework integrations to minimize code changes. - - * Improved out-of-the-box experience for TensorFlow sentence encoding models through the - installation of OpenVINO™ toolkit Tokenizers. - * New and noteworthy models validated: - Mistral, StableLM-tuned-alpha-3b, and StableLM-Epoch-3B. - * OpenVINO™ toolkit now supports Mixture of Experts (MoE), a new architecture that helps - process more efficient generative models through the pipeline. - * JavaScript developers now have seamless access to OpenVINO API. This new binding enables a - smooth integration with JavaScript API. - - * Broader Large Language Model (LLM) support and more model compression techniques. - - * Improved quality on INT4 weight compression for LLMs by adding the popular technique, - Activation-aware Weight Quantization, to the Neural Network Compression Framework (NNCF). - This addition reduces memory requirements and helps speed up token generation. - * Experience enhanced LLM performance on Intel® CPUs, with internal memory state enhancement, - and INT8 precision for KV-cache. Specifically tailored for multi-query LLMs like ChatGLM. - * The OpenVINO™ 2024.0 release makes it easier for developers, by integrating more OpenVINO™ - features with the Hugging Face ecosystem. Store quantization configurations for popular - models directly in Hugging Face to compress models into INT4 format while preserving - accuracy and performance.
- - * More portability and performance to run AI at the edge, in the cloud, or locally. - - * A preview plugin architecture of the integrated Neural Processor Unit (NPU) as part of - Intel® Core™ Ultra processor (formerly codenamed Meteor Lake) is now included in the - main OpenVINO™ package on PyPI. - * Improved performance on ARM by enabling the ARM threading library. In addition, we now - support multi-core ARM processors and enabled FP16 precision by default on MacOS. - * New and improved LLM serving samples from OpenVINO Model Server for multi-batch inputs and - Retrieval Augmented Generation (RAG). - - **OpenVINO™ Runtime** - - *Common* - - * The legacy API for CPP and Python bindings has been removed. - * StringTensor support has been extended by operators such as ``Gather``, ``Reshape``, and - ``Concat``, as a foundation to improve support for tokenizer operators and compliance with - the TensorFlow Hub. - * oneDNN has been updated to v3.3. - (`see oneDNN release notes `__). - - *CPU Device Plugin* - - * LLM performance on Intel® CPU platforms has been improved for systems based on AVX2 and - AVX512, using dynamic quantization and internal memory state optimization, such as INT8 - precision for KV-cache. 13th and 14th generations of Intel® Core™ processors and Intel® Core™ - Ultra processors use AVX2 for CPU execution, and these platforms will benefit from speedup. - Enable these features by setting ``"DYNAMIC_QUANTIZATION_GROUP_SIZE":"32"`` and - ``"KV_CACHE_PRECISION":"u8"`` in the configuration file. - * The ``ov::affinity`` API configuration is now deprecated and will be removed in release - 2025.0. - * The following have been improved and optimized: - - * Multi-query structure LLMs (such as ChatGLM 2/3) for BF16 on the 4th and 5th generation - Intel® Xeon® Scalable processors. - * `Mixtral `__ model performance. - * 8-bit compressed LLM compilation time and memory usage, valuable for models with large - embeddings like `Qwen `__. 
- * Convolutional networks in FP16 precision on ARM processors. - - *GPU Device Plugin* - - * The following have been improved and optimized: - - * Average token latency for LLMs on integrated GPU (iGPU) platforms, using INT4-compressed - models with large context size on Intel® Core™ Ultra processors. - * LLM beam search performance on iGPU. Both average and first-token latency decrease may be - expected for larger context sizes. - * Multi-batch performance of YOLOv5 on iGPU platforms. - - * Memory usage for LLMs has been optimized, enabling '7B' models with larger context on - 16Gb platforms. - - *NPU Device Plugin (preview feature)* - - * The NPU plugin for OpenVINO™ is now available through PyPI (run “pip install openvino”). - - *OpenVINO Python API* - - * ``.add_extension`` method signatures have been aligned, improving API behavior for better - user experience. - - *OpenVINO C API* - - * ov_property_key_cache_mode (C++ ov::cache_mode) now enables the ``optimize_size`` and - ``optimize_speed`` modes to set/get model cache. - * The VA surface on Windows exception has been fixed. - - *OpenVINO Node.js API* - - * OpenVINO - `JS bindings `__ - are consistent with the OpenVINO C++ API. - * A new distribution channel is now available: Node Package Manager (npm) software registry - (:doc:`check the installation guide <../get-started/install-openvino/install-openvino-npm>`). - * JavaScript API is now available for Windows users, as some limitations for platforms other - than Linux have been removed. - - *TensorFlow Framework Support* - - * String tensors are now natively supported, handled on input, output, and intermediate layers - (`PR #22024 `__). 
- - * TensorFlow Hub universal-sentence-encoder-multilingual inferred out of the box - * string tensors supported for ``Gather``, ``Concat``, and ``Reshape`` operations - * integration with openvino-tokenizers module - importing openvino-tokenizers automatically - patches TensorFlow FE with the required translators for models with tokenization - - * Fallback for Model Optimizer by operation to the legacy Frontend is no longer available. - Fallback by .json config will remain until Model Optimizer is discontinued - (`PR #21523 `__). - * Support for the following has been added: - - * Mutable variables and resources such as HashTable*, Variable, VariableV2 - (`PR #22270 `__). - * New tensor types: tf.u16, tf.u32, and tf.u64 - (`PR #21864 `__). - * 14 NEW Ops*. - `Check the list here (marked as NEW) `__. - * TensorFlow 2.15 - (`PR #22180 `__). - - * The following issues have been fixed: - - * UpSampling2D conversion crashed when input type as int16 - (`PR #20838 `__). - * IndexError list index for Squeeze - (`PR #22326 `__). - * Correct FloorDiv computation for signed integers - (`PR #22684 `__). - * Fixed bad cast error for tf.TensorShape to ov.PartialShape - (`PR #22813 `__). - * Fixed reading tf.string attributes for models in memory - (`PR #22752 `__). - - *ONNX Framework Support* - - * ONNX Frontend now uses the OpenVINO API 2.0. - - *PyTorch Framework Support* - - * Names for outputs unpacked from dict or tuple are now clearer - (`PR #22821 `__). - * FX Graph (torch.compile) now supports kwarg inputs, improving data type coverage. - (`PR #22397 `__). - - **OpenVINO Model Server** - - * OpenVINO™ Runtime backend used is now 2024.0. - * Text generation demo now supports multi batch size, with streaming and unary clients. - * The REST client now supports servables based on mediapipe graphs, including python pipeline - nodes. - * Included dependencies have received security-related updates. 
- * Reshaping a model in runtime based on the incoming requests (auto shape and auto batch size) - is deprecated and will be removed in the future. Using OpenVINO's dynamic shape models is - recommended instead. - - **Neural Network Compression Framework (NNCF)** - - * The `Activation-aware Weight Quantization (AWQ) `__ - algorithm for data-aware 4-bit weights compression is now available. It facilitates better - accuracy for compressed LLMs with high ratio of 4-bit weights. To enable it, use the - dedicated ``awq`` optional parameter of ``the nncf.compress_weights()`` API. - * ONNX models are now supported in Post-training Quantization with Accuracy Control, through - the ``nncf.quantize_with_accuracy_control()``, method. It may be used for models in the - OpenVINO IR and ONNX formats. - * A `weight compression example tutorial `__ - is now available, demonstrating how to find the appropriate hyperparameters for the TinyLLama - model from the Hugging Face Transformers, as well as other LLMs, with some modifications. - - **OpenVINO Tokenizer** - - * Regex support has been improved. - * Model coverage has been improved. - * Tokenizer metadata has been added to rt_info. - * Limited support for Tensorflow Text models has been added: convert MUSE for TF Hub with - string inputs. 
- * OpenVINO Tokenizers have their own repository now: - `/openvino_tokenizers `__ - - **Other Changes and Known Issues** - - *Jupyter Notebooks* - - The following notebooks have been updated or newly added: - - * `Mobile language assistant with MobileVLM `__ - * `Depth estimation with DepthAnything `__ - * `Kosmos-2 `__ - * `Zero-shot Image Classification with SigLIP `__ - * `Personalized image generation with PhotoMaker `__ - * `Voice tone cloning with OpenVoice `__ - * `Line-level text detection with Surya `__ - * `InstantID: Zero-shot Identity-Preserving Generation using OpenVINO `__ - * `Tutorial for Big Image Transfer (BIT) model quantization using NNCF `__ - * `Tutorial for OpenVINO Tokenizers integration into inference pipelines `__ - * `LLM chatbot `__ and - `LLM RAG pipeline `__ - have received integration with new models: minicpm-2b-dpo, gemma-7b-it, qwen1.5-7b-chat, baichuan2-7b-chat - - *Known issues* - - | **Component: CPU Plugin** - | *ID:* N/A - | *Description:* - | Starting with 24.0, model inputs and outputs will no longer have tensor names, unless - explicitly set to align with the PyTorch framework behavior. - - | **Component: GPU runtime** - | *ID:* 132376 - | *Description:* - | First-inference latency slow down for LLMs on Intel® Core™ Ultra processors. Up to 10-20% - drop may occur due to radical memory optimization for processing long sequences - (about 1.5-2 GB reduced memory usage). - - | **Component: CPU runtime** - | *ID:* N/A - | *Description:* - | Performance results (first token latency) may vary from those offered by the previous - OpenVINO version, for “latency” hint inference of LLMs with long prompts on Intel® Xeon® - platforms with 2 or more sockets. The reason is that all CPU cores of just the single - socket running the application are employed, lowering the memory overhead for LLMs when - numa control is not used. 
- | *Workaround:* - | The behavior is expected but stream and thread configuration may be used to include cores - from all sockets. +Deprecation And Support ++++++++++++++++++++++++++++++ +Using deprecated features and components is not advised. They are available to enable a smooth +transition to new solutions and will be discontinued in the future. To keep using discontinued +features, you will have to revert to the last LTS OpenVINO version supporting them. +For more details, refer to the `OpenVINO Legacy Features and Components __` +page. @@ -1664,13 +130,6 @@ Previous 2024 releases -Deprecation And Support -+++++++++++++++++++++++++++++ -Using deprecated features and components is not advised. They are available to enable a smooth -transition to new solutions and will be discontinued in the future. To keep using discontinued -features, you will have to revert to the last LTS OpenVINO version supporting them. -For more details, refer to the `OpenVINO Legacy Features and Components __` -page. Discontinued in 2024 ----------------------------- @@ -1730,97 +189,19 @@ Deprecated and to be removed in the future `model conversion transition guide `__. * OpenVINO property Affinity API will be discontinued with OpenVINO 2025.0. It will be replaced with CPU binding configurations (``ov::hint::enable_cpu_pinning``). -* OpenVINO Model Server components: - - * “auto shape” and “auto batch size” (reshaping a model in runtime) will be removed in the - future. OpenVINO's dynamic shape models are recommended instead. - -* Starting with 2025.0 MacOS x86 will no longer be recommended for use due to the discontinuation - of validation. Full support will be removed later in 2025. - -* A number of notebooks have been deprecated. For an up-to-date listing of available notebooks, - refer to the `OpenVINO™ Notebook index (openvinotoolkit.github.io) `__. - .. 
dropdown:: See the deprecated notebook list - :animate: fade-in-slide-down - :color: muted - * `Handwritten OCR with OpenVINO™ `__ - * See alternative: `Optical Character Recognition (OCR) with OpenVINO™ `__, - * See alternative: `PaddleOCR with OpenVINO™ `__, - * See alternative: `Handwritten Text Recognition Demo `__ - * `Image In-painting with OpenVINO™ `__ - - * See alternative: `Image Inpainting Python Demo `__ - - * `Interactive Machine Translation with OpenVINO `__ - - * See alternative: `Machine Translation Python* Demo `__ - - * `Super Resolution with OpenVINO™ `__ - - * See alternative: `Super Resolution with PaddleGAN and OpenVINO `__ - * See alternative: `Image Processing C++ Demo `__ - - * `Image Colorization with OpenVINO Tutorial `__ - * `Interactive Question Answering with OpenVINO™ `__ - - * See alternative: `BERT Question Answering Embedding Python* Demo `__ - * See alternative: `BERT Question Answering Python* Demo `__ - - * `Vehicle Detection And Recognition with OpenVINO™ `__ - - * See alternative: `Security Barrier Camera C++ Demo `__ - - * `The attention center model with OpenVINO™ `_ - * `Image Generation with DeciDiffusion `_ - * `Image generation with DeepFloyd IF and OpenVINO™ `_ - * `Depth estimation using VI-depth with OpenVINO™ `_ - * `Instruction following using Databricks Dolly 2.0 and OpenVINO™ `_ - - * See alternative: `LLM Instruction-following pipeline with OpenVINO `__ - - * `Image generation with FastComposer and OpenVINO™ `__ - * `Video Subtitle Generation with OpenAI Whisper `__ - - * See alternative: `Automatic speech recognition using Distil-Whisper and OpenVINO `__ - - * `Introduction to Performance Tricks in OpenVINO™ `__ - * `Speaker Diarization with OpenVINO™ `__ - * `Subject-driven image generation and editing using BLIP Diffusion and OpenVINO `__ - * `Text Prediction with OpenVINO™ `__ - * `Training to Deployment with TensorFlow and OpenVINO™ `__ - * `Speech to Text with OpenVINO™ `__ - * `Convert and Optimize YOLOv7 
with OpenVINO™ `__ - * `Quantize Data2Vec Speech Recognition Model using NNCF PTQ API `__ - - * See alternative: `Quantize Speech Recognition Models with accuracy control using NNCF PTQ API `__ - - * `Semantic segmentation with LRASPP MobileNet v3 and OpenVINO `__ - * `Video Recognition using SlowFast and OpenVINO™ `__ - - * See alternative: `Live Action Recognition with OpenVINO™ `__ - - * `Semantic Segmentation with OpenVINO™ using Segmenter `__ - * `Programming Language Classification with OpenVINO `__ - * `Stable Diffusion Text-to-Image Demo `__ - - * See alternative: `Stable Diffusion v2.1 using Optimum-Intel OpenVINO and multiple Intel Hardware `__ - - * `Text-to-Image Generation with Stable Diffusion v2 and OpenVINO™ `__ +* OpenVINO Model Server components: - * See alternative: `Stable Diffusion v2.1 using Optimum-Intel OpenVINO and multiple Intel Hardware `__ + * “auto shape” and “auto batch size” (reshaping a model in runtime) will be removed in the + future. OpenVINO's dynamic shape models are recommended instead. - * `Image generation with Segmind Stable Diffusion 1B (SSD-1B) model and OpenVINO `__ - * `Data Preparation for 2D Medical Imaging `__ - * `Train a Kidney Segmentation Model with MONAI and PyTorch Lightning `__ - * `Live Inference and Benchmark CT-scan Data with OpenVINO™ `__ +* Starting with 2025.0 MacOS x86 is no longer recommended for use due to the discontinuation + of validation. Full support will be removed later in 2025. - * See alternative: `Quantize a Segmentation Model and Show Live Inference `__ - * `Live Style Transfer with OpenVINO™ `__ @@ -1855,7 +236,7 @@ of Intel Corporation in the U.S. and/or other countries. Other names and brands may be claimed as the property of others. -Copyright © 2024, Intel Corporation. All rights reserved. +Copyright © 2025, Intel Corporation. All rights reserved. For more complete information about compiler optimizations, see our Optimization Notice. 
diff --git a/docs/articles_en/documentation.rst b/docs/articles_en/documentation.rst index c1dd34f5373429..8222a870c91a3b 100644 --- a/docs/articles_en/documentation.rst +++ b/docs/articles_en/documentation.rst @@ -16,6 +16,7 @@ Documentation Tool Ecosystem OpenVINO Extensibility OpenVINO™ Security + Legacy Features This section provides reference documents that guide you through the OpenVINO toolkit workflow, from preparing models, optimizing them, to deploying them in your own deep learning applications. diff --git a/docs/articles_en/documentation/legacy-features.rst b/docs/articles_en/documentation/legacy-features.rst new file mode 100644 index 00000000000000..0b09b23c081134 --- /dev/null +++ b/docs/articles_en/documentation/legacy-features.rst @@ -0,0 +1,112 @@ +Legacy Features and Components +============================== + +.. meta:: + :description: A list of deprecated OpenVINO™ components. + +Since OpenVINO has grown very rapidly in recent years, a number of its features +and components have been replaced by other solutions. Some of them are still +supported to ensure OpenVINO users are given enough time to adjust their projects +before the features are fully discontinued. + +This section will give you an overview of these major changes and tell you how +you can proceed to get the best experience and results with the current OpenVINO +offering. + + +Discontinued: +############# + +.. dropdown:: OpenVINO Development Tools Package + + | *New solution:* OpenVINO Runtime includes all supported components + | *Old solution:* `See how to install Development Tools `__ + | + | OpenVINO Development Tools used to be the OpenVINO package with tools for + advanced operations on models, such as Model conversion API, Benchmark Tool, + Accuracy Checker, Annotation Converter, Post-Training Optimization Tool, + and Open Model Zoo tools. Most of these tools have been either removed, + replaced by other solutions, or moved to the OpenVINO Runtime package. +.. 
dropdown:: Model Optimizer / Conversion API + + | *New solution:* :doc:`Direct model support and OpenVINO Converter (OVC) <../openvino-workflow/model-preparation>` + | *Old solution:* `Legacy Conversion API `__ + | + | The role of Model Optimizer and later the Conversion API was largely reduced + when all major model frameworks became supported directly. For converting model + files explicitly, it has been replaced with a more lightweight and efficient + solution, the OpenVINO Converter (launched with OpenVINO 2023.1). + +.. dropdown:: Open Model Zoo + + | *New solution:* users are encouraged to use public model repositories such as `Hugging Face `__ + | *Old solution:* `Open Model Zoo `__ + | + | Open Model Zoo provided a collection of models prepared for use with OpenVINO, + and a small set of tools enabling a level of automation for the process. + Since the tools have been mostly replaced by other solutions and several + other model repositories have recently grown in size and popularity, + Open Model Zoo will no longer be maintained. You may still use its resources + until they are fully removed. `Check the OMZ GitHub project `__ + +.. dropdown:: Multi-Device Execution + + | *New solution:* :doc:`Automatic Device Selection <../openvino-workflow/running-inference/inference-devices-and-modes/auto-device-selection>` + | *Old solution:* `Check the legacy solution `__ + | + | The behavior and results of the Multi-Device Execution mode are covered by the ``CUMULATIVE_THROUGHPUT`` + option of Automatic Device Selection. The only difference is that ``CUMULATIVE_THROUGHPUT`` uses + the devices specified by AUTO, which means that adding devices manually is not mandatory, + while with MULTI, the devices had to be specified before the inference. + +.. 
dropdown:: Caffe, and Kaldi model formats + + | *New solution:* conversion to ONNX via external tools + | *Old solution:* model support discontinued with OpenVINO 2024.0 + | `The last version supporting Apache MXNet, Caffe, and Kaldi model formats `__ + | :doc:`See the currently supported frameworks <../openvino-workflow/model-preparation>` + +.. dropdown:: Post-training Optimization Tool (POT) + + | *New solution:* Neural Network Compression Framework (NNCF) now offers the same functionality + | *Old solution:* POT discontinued with OpenVINO 2024.0 + | :doc:`See how to use NNCF for model optimization <../openvino-workflow/model-optimization>` + | `Check the NNCF GitHub project, including documentation `__ + +.. dropdown:: Inference API 1.0 + + | *New solution:* API 2.0 launched in OpenVINO 2022.1 + | *Old solution:* discontinued with OpenVINO 2024.0 + | `2023.2 is the last version supporting API 1.0 `__ + +.. dropdown:: Compile tool + + | *New solution:* the tool is no longer needed + | *Old solution:* discontinued with OpenVINO 2023.0 + | If you need to compile a model for inference on a specific device, use the following script: + + .. tab-set:: + + .. tab-item:: Python + :sync: py + + .. doxygensnippet:: docs/articles_en/assets/snippets/export_compiled_model.py + :language: python + :fragment: [export_compiled_model] + + .. tab-item:: C++ + :sync: cpp + + .. doxygensnippet:: docs/articles_en/assets/snippets/export_compiled_model.cpp + :language: cpp + :fragment: [export_compiled_model] + +.. dropdown:: TensorFlow integration (OVTF) + + | *New solution:* Direct model support and OpenVINO Converter (OVC) + | *Old solution:* discontinued in OpenVINO 2023.0 + | + | OpenVINO now features native TensorFlow support, with no need for explicit model + conversion. 
+ diff --git a/docs/articles_en/learn-openvino/llm_inference_guide.rst b/docs/articles_en/learn-openvino/llm_inference_guide.rst index 372c3b6d652bfc..8401923b8c7ac6 100644 --- a/docs/articles_en/learn-openvino/llm_inference_guide.rst +++ b/docs/articles_en/learn-openvino/llm_inference_guide.rst @@ -55,7 +55,10 @@ options: as well as conversion on the fly. For integration with the final product it may offer lower performance, though. - +Note that the base version of OpenVINO may also be used to run generative AI. Although it may +offer a simpler environment, with fewer dependencies, it has significant limitations and a more +demanding implementation process. For reference, see +`the article on generative AI usage of OpenVINO 2024.6 `__. The advantages of using OpenVINO for generative model deployment: diff --git a/docs/articles_en/learn-openvino/openvino-samples/benchmark-tool.rst b/docs/articles_en/learn-openvino/openvino-samples/benchmark-tool.rst index 5a706061777594..cde0eef055d5cb 100644 --- a/docs/articles_en/learn-openvino/openvino-samples/benchmark-tool.rst +++ b/docs/articles_en/learn-openvino/openvino-samples/benchmark-tool.rst @@ -8,7 +8,10 @@ Benchmark Tool devices. -This page demonstrates how to use the Benchmark Tool to estimate deep learning inference performance on supported devices. +This page demonstrates how to use the Benchmark Tool to estimate deep learning inference +performance on supported devices. Note that the MULTI plugin mentioned here is considered +a legacy tool and currently is just a mapping of the +:doc:`AUTO plugin <../../openvino-workflow/running-inference/inference-devices-and-modes/auto-device-selection>`. .. 
note:: diff --git a/docs/sphinx_setup/_static/download/GenAI_Quick_Start_Guide.pdf b/docs/sphinx_setup/_static/download/GenAI_Quick_Start_Guide.pdf index 5f24d28643598e..c5632a7e3f9627 100644 Binary files a/docs/sphinx_setup/_static/download/GenAI_Quick_Start_Guide.pdf and b/docs/sphinx_setup/_static/download/GenAI_Quick_Start_Guide.pdf differ