Commit 390ef35

mryzhov committed Jun 23, 2021
2 parents 6b5d513 + afe033a

Showing 1,541 changed files with 8,729 additions and 57,025 deletions.
24 changes: 12 additions & 12 deletions .gitmodules
@@ -1,19 +1,7 @@
[submodule "inference-engine/thirdparty/ade"]
path = inference-engine/thirdparty/ade
url = https://github.com/opencv/ade.git
ignore = dirty
[submodule "inference-engine/thirdparty/mkl-dnn"]
path = inference-engine/thirdparty/mkl-dnn
url = https://github.com/openvinotoolkit/oneDNN.git
ignore = dirty
[submodule "inference-engine/tests/ie_test_utils/common_test_utils/gtest"]
path = inference-engine/tests/ie_test_utils/common_test_utils/gtest
url = https://github.com/openvinotoolkit/googletest.git
ignore = dirty
[submodule "inference-engine/samples/thirdparty/gflags"]
path = inference-engine/samples/thirdparty/gflags
url = https://github.com/gflags/gflags.git
ignore = dirty
[submodule "thirdparty/xbyak"]
path = thirdparty/xbyak
url = https://github.com/herumi/xbyak.git
@@ -22,3 +10,15 @@
path = thirdparty/zlib/zlib
url = https://github.com/madler/zlib.git
ignore = dirty
[submodule "thirdparty/ade"]
path = thirdparty/ade
url = https://github.com/opencv/ade.git
ignore = dirty
[submodule "thirdparty/gflags"]
path = thirdparty/gflags
url = https://github.com/gflags/gflags.git
ignore = dirty
[submodule "thirdparty/gtest"]
path = thirdparty/gtest
url = https://github.com/openvinotoolkit/googletest.git
ignore = dirty
2 changes: 1 addition & 1 deletion README.md
@@ -42,7 +42,7 @@ Please report questions, issues and suggestions using:
---
\* Other names and brands may be claimed as the property of others.

[Open Model Zoo]:https://github.com/opencv/open_model_zoo
[Open Model Zoo]:https://github.com/openvinotoolkit/open_model_zoo
[Inference Engine]:https://software.intel.com/en-us/articles/OpenVINO-InferEngine
[Model Optimizer]:https://software.intel.com/en-us/articles/OpenVINO-ModelOptimizer
[nGraph]:https://docs.openvinotoolkit.org/latest/openvino_docs_nGraph_DG_DevGuide.html
36 changes: 36 additions & 0 deletions docs/IE_DG/API_Changes.md
@@ -10,10 +10,14 @@ The sections below contain detailed list of changes made to the Inference Engine

### Deprecated API

**InferenceEngine::Parameter**

* InferenceEngine::Parameter(const std::shared_ptr<ngraph::Variant>&)
* InferenceEngine::Parameter(std::shared_ptr<ngraph::Variant>& var)
* std::shared_ptr<ngraph::Variant> InferenceEngine::Parameter::asVariant() const
* InferenceEngine::Parameter::operator std::shared_ptr<ngraph::Variant>() const

**GPU plugin configuration keys**
* KEY_CLDNN_NV12_TWO_INPUTS GPU plugin option. Use KEY_GPU_NV12_TWO_INPUTS instead
* KEY_CLDNN_PLUGIN_PRIORITY GPU plugin option. Use KEY_GPU_PLUGIN_PRIORITY instead
* KEY_CLDNN_PLUGIN_THROTTLE GPU plugin option. Use KEY_GPU_PLUGIN_THROTTLE instead
@@ -24,6 +28,38 @@ The sections below contain detailed list of changes made to the Inference Engine
* KEY_TUNING_MODE GPU plugin option
* KEY_TUNING_FILE GPU plugin option

**InferenceEngine::IInferRequest**
* The IInferRequest interface is deprecated, use the InferRequest wrapper instead:
  * Constructor for InferRequest from IInferRequest::Ptr is deprecated
  * Cast operator for InferRequest to an IInferRequest shared pointer is deprecated

**InferenceEngine::ICNNNetwork**
* The ICNNNetwork interface is deprecated (all of its methods are deprecated), use the CNNNetwork wrapper instead
* CNNNetwork methods working with ICNNNetwork are deprecated:
  * Cast to an ICNNNetwork shared pointer
  * Cast to a reference to the ICNNNetwork interface
  * Constructor from an ICNNNetwork shared pointer

**InferenceEngine::IExecutableNetwork**
* The IExecutableNetwork interface is deprecated, use the ExecutableNetwork wrapper instead:
  * Constructor of ExecutableNetwork from an IExecutableNetwork shared pointer is deprecated
  * The following ExecutableNetwork methods are deprecated:
    * ExecutableNetwork::reset
    * Cast operator to an IExecutableNetwork shared pointer
    * ExecutableNetwork::CreateInferRequestPtr - use ExecutableNetwork::CreateInferRequest instead (see the sketch below)
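
A minimal sketch of the wrapper-based flow recommended above (illustrative only; the model path `model.xml` and the `CPU` device name are placeholders):

```cpp
#include <inference_engine.hpp>

int main() {
    InferenceEngine::Core ie;
    // Read and compile a model; the file name and device are placeholders.
    auto network = ie.ReadNetwork("model.xml");
    auto execNetwork = ie.LoadNetwork(network, "CPU");

    // Work with the InferRequest wrapper directly instead of the deprecated
    // IInferRequest interface and ExecutableNetwork::CreateInferRequestPtr.
    InferenceEngine::InferRequest request = execNetwork.CreateInferRequest();
    request.Infer();
    return 0;
}
```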

**Extensions API**
* InferenceEngine::make_so_pointer, which is used to create an Extension library, is replaced by std::make_shared<Extension>(..)
* InferenceEngine::IExtension::Release is deprecated with no replacement
* Use the IE_DEFINE_EXTENSION_CREATE_FUNCTION helper macro instead of an explicit declaration of the CreateExtension function that creates the extension (see the sketch below)
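
A minimal sketch of registering an extension without the deprecated InferenceEngine::make_so_pointer call (illustrative only; the library file name is a placeholder):

```cpp
#include <inference_engine.hpp>
#include <ie_extension.h>

int main() {
    InferenceEngine::Core ie;
    // Create the extension wrapper from a library path; the file name is a placeholder.
    auto extension = std::make_shared<InferenceEngine::Extension>("libcustom_extension.so");
    ie.AddExtension(extension);
    return 0;
}
```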

**Other changes**
* The Version::ApiVersion structure is deprecated; the Inference Engine no longer has an API version
* LowLatency - use lowLatency2 instead
* CONFIG_KEY(DUMP_EXEC_GRAPH_AS_DOT) - use InferenceEngine::ExecutableNetwork::GetExecGraphInfo::serialize() instead
* Core::ImportNetwork with no device - pass the device name explicitly
* details::InferenceEngineException - use InferenceEngine::Exception and its derivatives instead (see the sketch below)
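
A minimal sketch of catching the new exception type (illustrative only; the model path is a placeholder used to trigger an error):

```cpp
#include <inference_engine.hpp>
#include <iostream>

int main() {
    InferenceEngine::Core ie;
    try {
        // The model path is a placeholder.
        auto network = ie.ReadNetwork("missing_model.xml");
    } catch (const InferenceEngine::Exception& ex) {
        // Catch InferenceEngine::Exception instead of the removed
        // details::InferenceEngineException.
        std::cerr << "Inference Engine error: " << ex.what() << std::endl;
    }
    return 0;
}
```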

## 2021.3

### New API
47 changes: 23 additions & 24 deletions docs/IE_DG/Int8Inference.md
@@ -1,6 +1,13 @@
# Low-Precision 8-bit Integer Inference {#openvino_docs_IE_DG_Int8Inference}

## Disclaimer
## Table of Contents
1. [Supported devices](#supported-devices)
2. [Low-Precision 8-bit Integer Inference Workflow](#low-precision-8-bit-integer-inference-workflow)
3. [Prerequisites](#prerequisites)
4. [Inference](#inference)
5. [Results analysis](#results-analysis)

## Supported devices

Low-precision 8-bit inference is optimized for:
- Intel® architecture processors with the following instruction set architecture extensions:
@@ -12,16 +19,22 @@ Low-precision 8-bit inference is optimized for:
- Intel® Iris® Xe Graphics
- Intel® Iris® Xe MAX Graphics
- A model must be quantized. You can use a quantized model from [OpenVINO™ Toolkit Intel's Pre-Trained Models](@ref omz_models_group_intel) or quantize a model yourself. For quantization, you can use the:
- [Post-Training Optimization Tool](@ref pot_README) delivered with the Intel® Distribution of OpenVINO™ toolkit release package.
- [Post-Training Optimization Tool](@ref pot_docs_LowPrecisionOptimizationGuide) delivered with the Intel® Distribution of OpenVINO™ toolkit release package.
- [Neural Network Compression Framework](https://www.intel.com/content/www/us/en/artificial-intelligence/posts/openvino-nncf.html) available on GitHub: https://github.com/openvinotoolkit/nncf

## Introduction

A lot of investigation was made in the field of deep learning with the idea of using low precision computations during inference in order to boost deep learning pipelines and gather higher performance. For example, one of the popular approaches is to shrink the precision of activations and weights values from `fp32` precision to smaller ones, for example, to `fp11` or `int8`. For more information about this approach, refer to
**Brief History of Lower Precision in Deep Learning** section in [this whitepaper](https://software.intel.com/en-us/articles/lower-numerical-precision-deep-learning-inference-and-training).
## Low-Precision 8-bit Integer Inference Workflow

8-bit computations (referred to as `int8`) offer better performance than inference in higher precision (for example, `fp32`), because they allow loading more data into a single processor instruction. The usual cost of this significant boost is reduced accuracy. However, the accuracy drop is often negligible and depends on task requirements, so the application engineer can set the maximum accuracy drop that is acceptable.

For 8-bit integer computations, a model must be quantized. Quantized models can be downloaded from [Overview of OpenVINO™ Toolkit Intel's Pre-Trained Models](@ref omz_models_group_intel). If the model is not quantized, you can use the [Post-Training Optimization Tool](@ref pot_README) to quantize the model. The quantization process adds [FakeQuantize](../ops/quantization/FakeQuantize_1.md) layers on activations and weights for most layers. Read more about the mathematical computations in [Uniform Quantization with Fine-Tuning](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md).

When you pass the quantized IR to the OpenVINO™ plugin, the plugin automatically recognizes it as a quantized model and performs 8-bit inference. Note that if you pass a quantized model to another plugin that does not support 8-bit inference but supports all operations from the model, the model is inferred in the precision that this plugin supports.

At the *runtime stage*, the quantized model is loaded to the plugin. The plugin uses the `Low Precision Transformation` component to update the model so that it can be inferred in low precision:
- `FakeQuantize` layers are updated to produce quantized output tensors in the low precision range, and dequantization layers are added to compensate for the update. Dequantization layers are pushed through as many layers as possible so that more layers run in low precision. After that, most layers have quantized input tensors in the low precision range and can be inferred in low precision. Ideally, dequantization layers should be fused into the next `FakeQuantize` layer.
- Weights are quantized and stored in `Constant` layers.

## Prerequisites

Let's explore the quantized [TensorFlow* implementation of ResNet-50](https://github.com/openvinotoolkit/open_model_zoo/tree/master/models/public/resnet-50-tf) model. Use the [Model Downloader](@ref omz_tools_downloader) tool to download the `fp16` model from the [OpenVINO™ Toolkit - Open Model Zoo repository](https://github.com/openvinotoolkit/open_model_zoo):
```sh
@@ -31,28 +44,16 @@ After that you should quantize model by the [Model Quantizer](@ref omz_tools_dow
```sh
./quantizer.py --model_dir public/resnet-50-tf --dataset_dir <DATASET_DIR> --precisions=FP16-INT8
```

## Inference

The simplest way to infer the model and collect performance counters is to use the [C++ Benchmark Application](../../inference-engine/samples/benchmark_app/README.md):
```sh
./benchmark_app -m resnet-50-tf.xml -d CPU -niter 1 -api sync -report_type average_counters -report_folder pc_report_dir
```
If you infer the model with the OpenVINO™ CPU plugin and collect performance counters, all operations (except the last, non-quantized SoftMax) are executed in INT8 precision.

## Low-Precision 8-bit Integer Inference Workflow

For 8-bit integer computations, a model must be quantized. Quantized models can be downloaded from [Overview of OpenVINO™ Toolkit Intel's Pre-Trained Models](@ref omz_models_group_intel). If the model is not quantized, you can use the [Post-Training Optimization Tool](@ref pot_README) to quantize the model. The quantization process adds [FakeQuantize](../ops/quantization/FakeQuantize_1.md) layers on activations and weights for most layers. Read more about mathematical computations in the [Uniform Quantization with Fine-Tuning](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md).

8-bit inference pipeline includes two stages (also refer to the figure below):
1. *Offline stage*, or *model quantization*. During this stage, [FakeQuantize](../ops/quantization/FakeQuantize_1.md) layers are added before most layers to have quantized tensors before layers in a way that low-precision accuracy drop for 8-bit integer inference satisfies the specified threshold. The output of this stage is a quantized model. Quantized model precision is not changed, quantized tensors are in original precision range (`fp32`). `FakeQuantize` layer has `levels` attribute which defines quants count. Quants count defines precision which is used during inference. For `int8` range `levels` attribute value has to be 255 or 256. To quantize the model, you can use the [Post-Training Optimization Tool](@ref pot_README) delivered with the Intel® Distribution of OpenVINO™ toolkit release package.

When you pass the quantized IR to the OpenVINO™ plugin, the plugin automatically recognizes it as a quantized model and performs 8-bit inference. Note, if you pass a quantized model to another plugin that does not support 8-bit inference but supports all operations from the model, the model is inferred in precision that this plugin supports.

2. *Runtime stage*. This stage is an internal procedure of the OpenVINO™ plugin. During this stage, the quantized model is loaded to the plugin. The plugin uses `Low Precision Transformation` component to update the model to infer it in low precision:
- Update `FakeQuantize` layers to have quantized output tensors in low precision range and add dequantization layers to compensate the update. Dequantization layers are pushed through as many layers as possible to have more layers in low precision. After that, most layers have quantized input tensors in low precision range and can be inferred in low precision. Ideally, dequantization layers should be fused in the next `FakeQuantize` layer.
- Weights are quantized and stored in `Constant` layers.

![int8_flow]

## Performance Counters
## Results analysis

Information about layer precision is stored in the performance counters that are
available from the Inference Engine API. For example, the part of the performance counters table for the quantized [TensorFlow* implementation of ResNet-50](https://github.com/openvinotoolkit/open_model_zoo/tree/master/models/public/resnet-50-tf) model inference on the [CPU Plugin](supported_plugins/CPU.md) looks as follows:
@@ -79,5 +80,3 @@
> * Suffix `FP32` for layers computed in 32-bit precision
All `Convolution` layers are executed in int8 precision. The remaining layers are fused into Convolutions using the post-operations optimization technique described in [Internal CPU Plugin Optimizations](supported_plugins/CPU.md).
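
A rough sketch of reading these counters through the Inference Engine API (illustrative, not part of this change; the model path and device name are placeholders):

```cpp
#include <inference_engine.hpp>
#include <iostream>

int main() {
    InferenceEngine::Core ie;
    // Enable performance counters when loading the network; the model path is a placeholder.
    auto execNetwork = ie.LoadNetwork(ie.ReadNetwork("resnet-50-tf.xml"), "CPU",
                                      {{CONFIG_KEY(PERF_COUNT), CONFIG_VALUE(YES)}});
    auto request = execNetwork.CreateInferRequest();
    request.Infer();

    // Each entry maps a layer name to profiling info; the execType suffix
    // (for example I8 or FP32) shows the precision the layer was executed in.
    for (const auto& counter : request.GetPerformanceCounts()) {
        std::cout << counter.first << ": " << counter.second.exec_type
                  << " (" << counter.second.realTime_uSec << " us)" << std::endl;
    }
    return 0;
}
```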

[int8_flow]: img/cpu_int8_flow.png
6 changes: 6 additions & 0 deletions docs/IE_DG/Intro_to_Performance.md
@@ -31,6 +31,12 @@ input images to achieve optimal throughput. However, high batch size also comes
latency penalty. So, for more real-time oriented usages, lower batch sizes (as low as a single input) are used.
Refer to the [Benchmark App](../../inference-engine/samples/benchmark_app/README.md) sample, which allows latency vs. throughput measuring.

## Using Caching API for first inference latency optimization
Starting with the 2021.4 release, the Inference Engine provides the ability to enable internal caching of loaded networks.
This can significantly reduce the network load latency on some devices at application startup.
Internally, caching uses the plugin's Export/ImportNetwork flow, as is done for the [Compile tool](../../inference-engine/tools/compile_tool/README.md), while the application keeps using the regular ReadNetwork/LoadNetwork API.
Refer to the [Model Caching Overview](Model_caching_overview.md) for a more detailed explanation.
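
As a rough illustration of the flow described above (not part of this change; the cache folder, model path, and device name are placeholders):

```cpp
#include <inference_engine.hpp>

int main() {
    InferenceEngine::Core ie;
    // Store compiled network blobs in a local folder; the path is a placeholder.
    ie.SetConfig({{CONFIG_KEY(CACHE_DIR), "myCacheFolder"}});
    // The first LoadNetwork call compiles the network and exports the blob to the cache;
    // later runs import the cached blob, reducing first inference latency.
    auto execNetwork = ie.LoadNetwork("model.xml", "CPU");
    return 0;
}
```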

## Using Async API
To gain better performance on accelerators, such as VPU, the Inference Engine uses the asynchronous approach (see
[Integrating Inference Engine in Your Application (current API)](Integrate_with_customer_application_new_API.md)).
65 changes: 65 additions & 0 deletions docs/IE_DG/Model_caching_overview.md
@@ -0,0 +1,65 @@
# Model Caching Overview {#openvino_docs_IE_DG_Model_caching_overview}

## Introduction

As described in [Inference Engine Introduction](inference_engine_intro.md), a common application flow consists of the following steps:

1. **Create Inference Engine Core object**

2. **Read the Intermediate Representation** - Read an Intermediate Representation file into an `InferenceEngine::CNNNetwork` object

3. **Prepare inputs and outputs**

4. **Set configuration** - Pass device-specific loading configurations to the device

5. **Compile and Load Network to device** - Use the `InferenceEngine::Core::LoadNetwork()` method with a specific device

6. **Set input data**

7. **Execute**

Step #5 can potentially perform several time-consuming device-specific optimizations and network compilations,
and such delays can lead to a bad user experience on application startup. To avoid this, some devices offer an
Import/Export network capability: it is possible to either use the [Compile tool](../../inference-engine/tools/compile_tool/README.md)
or enable model caching to export the compiled network automatically. Reusing cached networks can significantly reduce the network load time.


## Set "CACHE_DIR" config option to enable model caching

To enable model caching, the application must specify the folder in which to store cached blobs. This can be done as follows:


@snippet snippets/InferenceEngine_Caching0.cpp part0

With this code, if the device supports the Import/Export network capability, a cached blob is automatically created inside the `myCacheFolder` folder
when the CACHE_DIR config is set on the Core object. If the device does not support the Import/Export capability, the cache is simply not created and no error is thrown.

Depending on your device, the total time for loading the network on application startup can be significantly reduced.
Note that the very first LoadNetwork call (when the cache is not yet created) takes slightly longer, in order to 'export' the compiled blob into the cache file.
![caching_enabled]

## Even faster: use LoadNetwork(modelPath)

In some cases, applications do not need to customize inputs and outputs every time. Such applications always
call `cnnNet = ie.ReadNetwork(...)`, then `ie.LoadNetwork(cnnNet, ..)`, and this flow can be further optimized.
For such cases, the 2021.4 release introduces a more convenient API that loads the network in a single call.

@snippet snippets/InferenceEngine_Caching1.cpp part1

With model caching enabled, the total load time is even smaller, because ReadNetwork is optimized as well.

@snippet snippets/InferenceEngine_Caching2.cpp part2

![caching_times]


## Advanced examples

Not every device supports the network import/export capability; enabling caching for such devices has no effect.
To check in advance whether a particular device supports model caching, your application can use the following code:

@snippet snippets/InferenceEngine_Caching3.cpp part3


[caching_enabled]: ../img/caching_enabled.png
[caching_times]: ../img/caching_times.png
3 changes: 0 additions & 3 deletions docs/IE_DG/img/cpu_int8_flow.png

This file was deleted.

2 changes: 1 addition & 1 deletion docs/IE_PLUGIN_DG/PluginTesting.md
@@ -21,7 +21,7 @@ Engine concepts: plugin creation, multiple executable networks support, multiple

@snippet single_layer_tests/convolution.cpp test_convolution:declare_parameters

- Instantiate the test itself using standard GoogleTest macro `INSTANTIATE_TEST_CASE_P`:
- Instantiate the test itself using standard GoogleTest macro `INSTANTIATE_TEST_SUITE_P`:

@snippet single_layer_tests/convolution.cpp test_convolution:instantiate

3 changes: 3 additions & 0 deletions docs/MO_DG/img/DeepSpeech-0.8.2.png
3 changes: 0 additions & 3 deletions docs/MO_DG/img/DeepSpeech.png

This file was deleted.

@@ -25,6 +25,8 @@ It is not a full list of models that can be converted to ONNX\* and to IR.
* F3Net topology can be converted using [Convert PyTorch\* F3Net to the IR](pytorch_specific/Convert_F3Net.md) instruction.
* QuartzNet topologies from [NeMo project](https://github.com/NVIDIA/NeMo) can be converted using [Convert PyTorch\* QuartzNet to the IR](pytorch_specific/Convert_QuartzNet.md) instruction.
* YOLACT topology can be converted using [Convert PyTorch\* YOLACT to the IR](pytorch_specific/Convert_YOLACT.md) instruction.
* [RCAN](https://github.com/yulunzhang/RCAN) topologies can be converted using [Convert PyTorch\* RCAN to the IR](pytorch_specific/Convert_RCAN.md) instruction.
* [BERT_NER](https://github.com/kamalkraj/BERT-NER) can be converted using [Convert PyTorch* BERT-NER to the IR](pytorch_specific/Convert_Bert_ner.md) instruction.

## Export PyTorch\* Model to ONNX\* Format <a name="export-to-onnx"></a>

