OV Performance Hints (CPU and GPU logic for selecting the actual configs, while AUTO/MULTI are passing them through) (#6993)

* rebasing the perf-modes-2021.3 to the 2021.4

Caveats:
explicitly setting #streams is not disabled (as it was before for experiments with the DLBenchmark), and the logic slightly differs (streamsSet)

(cherry picked from commit 1ae1edc)

* overriding streams (to force the TPUT mode for the DLBenchmark)

(cherry picked from commit 7f506cd)

* disabling the reduction of #streams to fully mimic the 2021.3 baseline c4df94d (before experiments)

(cherry picked from commit 85073dd)

* clang/indentation

(cherry picked from commit 050a415)

* splitting the Transformation into general and CPU-specific parts.

Now, hopefully, this fully mimics the 2021.3 baseline c4df94d (before experiments), as the reduction of the number of streams (as well as the early exit on GRU/LSTM/TensorIterator) is disabled

(cherry picked from commit e98b2c1)

* disabling GRU/LSTM/TI + reducing of streams + 5D considered compute-limited only for int8

(cherry picked from commit 32b8d80)

* refactored to avoid compute_limited_ratio, reverted the reducing #streams, removed LSTM from limitations

(cherry picked from commit f2b9721)

* ISA-based threshold logic

(cherry picked from commit b218457)

* mode->hint

(cherry picked from commit ec20aa8)

* optional PERFORMANCE_HINT_NUM_REQUESTS

(cherry picked from commit 5a3883e)

* moving the perfHints to the common OV config class + initial tests (CPU only, as the actual AUTO/MULTI should be accommodated on the master)

(cherry picked from commit 45bafe7d527f466507dea0693aeed51be4ebf776, then fixed)

* AUTO support for PerfHints

* MULTI support for PerfHints

* Enabling Perf hints for the GPU plugin

* brushing settings output a bit

* disabling "throughput" perf hint being default (until OV 2.0)

* uncommenting the logic which was disabled to force the DLBenchmark to use the throughput mode by default

* removing dead and experimental code, and debug printfs

* clang/code-style

* code-review remarks

* Moved the output of the actual params that the hint produced to the right place

* aligning MULTI's GetConfig behavior to HETERO's, as captured in the preso (CVS-59960) ratified with the ArchForum

* clang

* benchmark_app brushing

* Update inference-engine/samples/benchmark_app/README.md

* propagating the perf hints thru one more scenario in the merged AUTO-MULTI

* fixed misprint

* Python benchmark_app update for perf hints

* addressing reviewers' comments on the Python benchmark_app

* simplifying/brushing logic a bit

* refactor the heuristic into a separate file (to be shared with iGPU soon)

* refactor conversion of modes to the specific GPU config per feedback from Vladimir
myshevts authored Sep 13, 2021
1 parent 2793963 commit 3bec324
Showing 25 changed files with 647 additions and 104 deletions.
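For orientation before the per-file diff: a minimal, illustrative sketch of the user-facing API this commit enables (not part of the changeset; the model path and the request count of 4 are assumed for illustration). The application passes only the hint and, optionally, the number of requests it intends to run in parallel, and the device derives the concrete settings.

```cpp
#include <inference_engine.hpp>
#include <map>
#include <string>

int main() {
    InferenceEngine::Core ie;
    // Hypothetical model path, used only for illustration.
    auto network = ie.ReadNetwork("model.xml");

    // Ask the device to derive its own throughput-oriented settings
    // instead of passing explicit streams/threads from the command line.
    std::map<std::string, std::string> config = {
        {CONFIG_KEY(PERFORMANCE_HINT), CONFIG_VALUE(THROUGHPUT)},
        // Optional: how many infer requests the app intends to run in parallel.
        {CONFIG_KEY(PERFORMANCE_HINT_NUM_REQUESTS), "4"}};

    auto exeNetwork = ie.LoadNetwork(network, "GPU", config);
    return 0;
}
```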
38 changes: 23 additions & 15 deletions inference-engine/samples/benchmark_app/README.md
@@ -1,6 +1,7 @@
# Benchmark C++ Tool {#openvino_inference_engine_samples_benchmark_app_README}

This topic demonstrates how to use the Benchmark C++ Tool to estimate deep learning inference performance on supported devices. Performance can be measured for two inference modes: synchronous (latency-oriented) and asynchronous (throughput-oriented).
This topic demonstrates how to use the Benchmark C++ Tool to estimate deep learning inference performance on supported devices.
Performance can be measured for two inference modes: latency- and throughput-oriented.

> **NOTE:** This topic describes usage of C++ implementation of the Benchmark Tool. For the Python* implementation, refer to [Benchmark Python* Tool](../../../tools/benchmark_tool/README.md).
@@ -12,27 +13,29 @@ This topic demonstrates how to use the Benchmark C++ Tool to estimate deep learn

## How It Works

Upon start-up, the application reads command-line parameters and loads a network and images/binary files to the Inference Engine plugin, which is chosen depending on a specified device. The number of infer requests and execution approach depend on the mode defined with the `-api` command-line parameter.
Upon start-up, the application reads command-line parameters and loads a network and inputs (images/binary files) to the specified device.

> **NOTE**: By default, Inference Engine samples, tools and demos expect input with BGR channels order. If you trained your model to work with RGB order, you need to manually rearrange the default channels order in the sample or demo application or reconvert your model using the Model Optimizer tool with `--reverse_input_channels` argument specified. For more information about the argument, refer to **When to Reverse Input Channels** section of [Converting a Model Using General Conversion Parameters](../../../docs/MO_DG/prepare_model/convert_model/Converting_Model_General.md).
**NOTE**: By default, Inference Engine samples, tools and demos expect input with BGR channels order.
If you trained your model to work with RGB order, you need to manually rearrange the default channels order in the sample or demo application
or reconvert your model using the Model Optimizer tool with `--reverse_input_channels` argument specified.
For more information about the argument, refer to **When to Reverse Input Channels** section of
[Converting a Model Using General Conversion Parameters](../../../docs/MO_DG/prepare_model/convert_model/Converting_Model_General.md).

If you run the application in the synchronous mode, it creates one infer request and executes the `Infer` method.
If you run the application in the asynchronous mode, it creates as many infer requests as specified in the `-nireq` command-line parameter and executes the `StartAsync` method for each of them. If `-nireq` is not set, the application will use the default value for specified device.
Device-specific execution parameters (number of streams, threads, and so on) can be either explicitly specified through the command line
or left default. In the latter case, the sample logic will select the values for optimal throughput.
While experimenting with individual parameters helps find the performance sweet spot, such parameters are usually not very performance-portable,
so the values from one machine or device are not necessarily optimal for another.
From this perspective, the most portable approach is to experiment only with the performance hints. To learn more, refer to the section on the command-line parameters below.

A number of execution steps is defined by one of the following parameters:
* Number of iterations specified with the `-niter` command-line argument
* Time duration specified with the `-t` command-line argument
* Both of them (execution will continue until both conditions are met)
* Predefined duration if `-niter` and `-t` are not specified. Predefined duration value depends on a device.

During the execution, the application collects latency for each executed infer request.

Reported latency value is calculated as a median value of all collected latencies. Reported throughput value is reported
in frames per second (FPS) and calculated as a derivative from:
* Reported latency in the Sync mode
* The total execution time in the Async mode

Throughput value also depends on batch size.
During the execution, the application calculates latency (if applicable) and overall throughput:
* By default, the median latency value is reported
* Throughput is calculated as number_of_processed_requests/overall_inference_time and reported in frames per second (FPS). Note that the throughput value also depends on batch size.

The application also collects per-layer Performance Measurement (PM) counters for each executed infer request if you
enable statistics dumping by setting the `-report_type` parameter to one of the possible values:
@@ -56,7 +59,7 @@ Note that the benchmark_app usually produces optimal performance for any device
./benchmark_app -m <model> -i <input> -d CPU
```

But it is still may be non-optimal for some cases, especially for very small networks. More details can read in [Introduction to Performance Topics](../../../docs/IE_DG/Intro_to_Performance.md).
But it still may be sub-optimal in some cases, especially for very small networks. More details can be found in [Introduction to Performance Topics](../../../docs/IE_DG/Intro_to_Performance.md).

As explained in the [Introduction to Performance Topics](../../../docs/IE_DG/Intro_to_Performance.md) section, for all devices, including new [MULTI device](../../../docs/IE_DG/supported_plugins/MULTI.md) it is preferable to use the FP16 IR for the model.
Also if latency of the CPU inference on the multi-socket machines is of concern, please refer to the same
@@ -83,7 +86,12 @@ Options:
-l "<absolute_path>" Required for CPU custom layers. Absolute path to a shared library with the kernels implementations.
Or
-c "<absolute_path>" Required for GPU custom kernels. Absolute path to an .xml file with the kernels description.
-api "<sync/async>" Optional. Enable Sync/Async API. Default value is "async".
-hint "<throughput(or just 'tput')/latency">
Optional. Performance hint (optimize for latency or throughput).
The hint allows the OpenVINO device to select the right network-specific settings,
as opposite to just accepting specific values from the sample command line.
So you can specify only the hint without setting explicit 'nstreams' or other device-specific options.
-api "<sync/async>" Optional (deprecated). Enable Sync/Async API. Default value is "async".
-niter "<integer>" Optional. Number of iterations. If not specified, the number of iterations is calculated depending on a device.
-nireq "<integer>" Optional. Number of infer requests. Default value is determined automatically for a device.
-b "<integer>" Optional. Batch size value. If not specified, the batch size value is determined from Intermediate Representation.
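As a companion to the README wording above, a small illustrative sketch of the reported metrics (function names are not from the sample; the batch handling reflects the note that throughput depends on batch size):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Median of all collected latencies, as reported by default.
double medianLatencyMs(std::vector<double> latenciesMs) {
    std::sort(latenciesMs.begin(), latenciesMs.end());
    return latenciesMs[latenciesMs.size() / 2];
}

// Throughput in FPS: processed frames (requests * batch) over the total execution time.
double throughputFps(std::size_t processedRequests, std::size_t batchSize, double totalTimeSec) {
    return static_cast<double>(processedRequests * batchSize) / totalTimeSec;
}
```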
12 changes: 11 additions & 1 deletion inference-engine/samples/benchmark_app/benchmark_app.hpp
@@ -22,8 +22,15 @@ static const char model_message[] =
"Required. Path to an .xml/.onnx file with a trained model or to a .blob files with "
"a trained compiled model.";

/// @brief message for performance hint
static const char hint_message[] =
"Optional. Performance hint (optimize for latency or throughput). "
"The hint allows the OpenVINO device to select the right network-specific settings,"
"as opposite to just accepting specific values from the sample command line."
"So you can specify only the hint without setting explicit 'nstreams' or other device-specific options";

/// @brief message for execution mode
static const char api_message[] = "Optional. Enable Sync/Async API. Default value is \"async\".";
static const char api_message[] = "Optional (deprecated). Enable Sync/Async API. Default value is \"async\".";

/// @brief message for assigning cnn calculation to device
static const char target_device_message[] =
@@ -193,6 +200,9 @@ DEFINE_string(i, "", input_message);
/// It is a required parameter
DEFINE_string(m, "", model_message);

/// @brief Define performance hint
DEFINE_string(hint, "", hint_message);

/// @brief Define execution mode
DEFINE_string(api, "async", api_message);

37 changes: 34 additions & 3 deletions inference-engine/samples/benchmark_app/main.cpp
@@ -59,7 +59,10 @@ bool ParseAndCheckCommandLine(int argc, char* argv[]) {
if (FLAGS_api != "async" && FLAGS_api != "sync") {
throw std::logic_error("Incorrect API. Please set -api option to `sync` or `async` value.");
}

if (!FLAGS_hint.empty() && FLAGS_hint != "throughput" && FLAGS_hint != "tput" && FLAGS_hint != "latency") {
throw std::logic_error("Incorrect performance hint. Please set -hint option to"
"either `throughput`(tput) or `latency' value.");
}
if (!FLAGS_report_type.empty() && FLAGS_report_type != noCntReport && FLAGS_report_type != averageCntReport &&
FLAGS_report_type != detailedCntReport) {
std::string err = "only " + std::string(noCntReport) + "/" + std::string(averageCntReport) + "/" +
@@ -208,6 +211,11 @@ int main(int argc, char* argv[]) {
// ----------------- 3. Setting device configuration
// -----------------------------------------------------------
next_step();
std::string ov_perf_hint;
if (FLAGS_hint == "throughput" || FLAGS_hint == "tput")
ov_perf_hint = CONFIG_VALUE(THROUGHPUT);
else if (FLAGS_hint == "latency")
ov_perf_hint = CONFIG_VALUE(LATENCY);

bool perf_counts = false;
// Update config per device according to command line parameters
@@ -219,6 +227,13 @@
config[device] = {};
std::map<std::string, std::string>& device_config = config.at(device);

// high-level performance modes
if (!ov_perf_hint.empty()) {
device_config[CONFIG_KEY(PERFORMANCE_HINT)] = ov_perf_hint;
if (FLAGS_nireq != 0)
device_config[CONFIG_KEY(PERFORMANCE_HINT_NUM_REQUESTS)] = std::to_string(FLAGS_nireq);
}

// Set performance counter
if (isFlagSetInCommandLine("pc")) {
// set to user defined value
@@ -241,6 +256,7 @@ }
}
perf_counts = (device_config.at(CONFIG_KEY(PERF_COUNT)) == CONFIG_VALUE(YES)) ? true : perf_counts;

// the rest are individual per-device settings (overriding the values set with perf modes)
auto setThroughputStreams = [&]() {
const std::string key = device + "_THROUGHPUT_STREAMS";
if (device_nstreams.count(device)) {
@@ -255,7 +271,7 @@
" or via configuration file.");
}
device_config[key] = device_nstreams.at(device);
} else if (!device_config.count(key) && (FLAGS_api == "async")) {
} else if (ov_perf_hint.empty() && !device_config.count(key) && (FLAGS_api == "async")) {
slog::warn << "-nstreams default value is determined automatically for " << device
<< " device. "
"Although the automatic selection usually provides a "
@@ -484,9 +500,24 @@
batchSize = 1;
}
}
// ----------------- 8. Setting optimal runtime parameters
// ----------------- 8. Querying optimal runtime parameters
// -----------------------------------------------------
next_step();
// output of the actual settings that the device selected based on the hint
if (!ov_perf_hint.empty()) {
for (const auto& device : devices) {
std::vector<std::string> supported_config_keys =
ie.GetMetric(device, METRIC_KEY(SUPPORTED_CONFIG_KEYS));
slog::info << "Device: " << device << slog::endl;
for (const auto& cfg : supported_config_keys) {
try {
slog::info << " {" << cfg << " , " << exeNetwork.GetConfig(cfg).as<std::string>();
} catch (...) {
};
slog::info << " }" << slog::endl;
}
}
}

// Update number of streams
for (auto&& ds : device_nstreams) {
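The reporting hunk above prints, per device, the settings that the plugin actually derived from the hint; the same idea as a standalone helper might look like the following sketch (assuming a Core and an already-loaded ExecutableNetwork are available):

```cpp
#include <inference_engine.hpp>
#include <iostream>
#include <string>
#include <vector>

// Print the configuration the device ended up with once the hint was applied.
void printResolvedConfig(InferenceEngine::Core& ie,
                         InferenceEngine::ExecutableNetwork& exeNetwork,
                         const std::string& device) {
    std::vector<std::string> keys = ie.GetMetric(device, METRIC_KEY(SUPPORTED_CONFIG_KEYS));
    for (const auto& key : keys) {
        try {
            std::cout << "  { " << key << " , "
                      << exeNetwork.GetConfig(key).as<std::string>() << " }" << std::endl;
        } catch (...) {
            // not every supported key is queryable on the executable network
        }
    }
}
```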
9 changes: 7 additions & 2 deletions inference-engine/src/cldnn_engine/cldnn_config.cpp
@@ -46,8 +46,10 @@ void Config::UpdateFromMap(const std::map<std::string, std::string>& configMap)
for (auto& kvp : configMap) {
std::string key = kvp.first;
std::string val = kvp.second;

if (key.compare(PluginConfigParams::KEY_PERF_COUNT) == 0) {
const auto hints = perfHintsConfig.SupportedKeys();
if (hints.end() != std::find(hints.begin(), hints.end(), key)) {
perfHintsConfig.SetConfig(key, val);
} else if (key.compare(PluginConfigParams::KEY_PERF_COUNT) == 0) {
if (val.compare(PluginConfigParams::YES) == 0) {
useProfiling = true;
} else if (val.compare(PluginConfigParams::NO) == 0) {
@@ -341,6 +343,9 @@ void Config::adjustKeyMapValues() {
key_config_map[GPUConfigParams::KEY_GPU_ENABLE_LOOP_UNROLLING] = PluginConfigParams::YES;
else
key_config_map[GPUConfigParams::KEY_GPU_ENABLE_LOOP_UNROLLING] = PluginConfigParams::NO;
key_config_map.insert({ PluginConfigParams::KEY_PERFORMANCE_HINT, perfHintsConfig.ovPerfHint });
key_config_map.insert({ PluginConfigParams::KEY_PERFORMANCE_HINT_NUM_REQUESTS,
std::to_string(perfHintsConfig.ovPerfHintNumRequests) });
}
IE_SUPPRESS_DEPRECATED_END

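The cldnn_config.cpp change above routes hint-related keys to the shared PerfHintsConfig helper before falling through to the plugin-specific branches; a condensed sketch of that dispatch pattern, with the plugin-specific handling elided:

```cpp
#include <algorithm>
#include <ie_performance_hints.hpp>
#include <map>
#include <string>

// Route hint-related keys to the shared helper; everything else stays plugin-specific.
void updateFromMapSketch(InferenceEngine::PerfHintsConfig& perfHintsConfig,
                         const std::map<std::string, std::string>& configMap) {
    const auto hints = perfHintsConfig.SupportedKeys();
    for (const auto& kvp : configMap) {
        if (std::find(hints.begin(), hints.end(), kvp.first) != hints.end()) {
            perfHintsConfig.SetConfig(kvp.first, kvp.second);  // stores (and is expected to validate) the hint value
        } else {
            // ... plugin-specific keys handled here, as in the original UpdateFromMap
        }
    }
}
```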
3 changes: 2 additions & 1 deletion inference-engine/src/cldnn_engine/cldnn_config.h
@@ -8,7 +8,7 @@
#include <string>

#include "cldnn_custom_layer.h"

#include <ie_performance_hints.hpp>
#include <cldnn/graph/network.hpp>

namespace CLDNNPlugin {
@@ -62,6 +62,7 @@ struct Config {
bool enable_loop_unrolling;

std::map<std::string, std::string> key_config_map;
InferenceEngine::PerfHintsConfig perfHintsConfig;
};

} // namespace CLDNNPlugin
32 changes: 30 additions & 2 deletions inference-engine/src/cldnn_engine/cldnn_engine.cpp
@@ -553,14 +553,40 @@ void clDNNEngine::UpdateConfig(CLDNNPlugin::Config& conf, const InferenceEngine:
}
}

std::map<std::string, std::string> clDNNEngine::ConvertPerfHintsToConfig(
const std::map<std::string, std::string>& network_config,
const CLDNNPlugin::Config& plugin_config) const {
// deduces the actual settings from the performance hints and returns a fully-defined config
auto config = network_config;
const auto &mode = config.find(PluginConfigParams::KEY_PERFORMANCE_HINT);
// the mode may have just arrived with the LoadNetwork call, or may have been set earlier via the plugin's SetConfig
if (mode != config.end() || !plugin_config.perfHintsConfig.ovPerfHint.empty()) {
const auto mode_name = (mode != config.end())
? PerfHintsConfig::CheckPerformanceHintValue(mode->second)
: plugin_config.perfHintsConfig.ovPerfHint;
// checking streams (to avoid overriding what the user might have explicitly set in the incoming config or previously via SetConfig)
const auto streams = config.find(PluginConfigParams::KEY_GPU_THROUGHPUT_STREAMS);
if (streams == config.end() && !streamsSet) {
if (mode_name == CONFIG_VALUE(LATENCY)) {
config[PluginConfigParams::KEY_GPU_THROUGHPUT_STREAMS] = std::to_string(1);
} else if (mode_name == CONFIG_VALUE(THROUGHPUT)) {
config[PluginConfigParams::KEY_GPU_THROUGHPUT_STREAMS] = CONFIG_VALUE(GPU_THROUGHPUT_AUTO);
config[GPUConfigParams::KEY_GPU_PLUGIN_THROTTLE] = std::to_string(1);
}
}
}
return config;
}

IExecutableNetworkInternal::Ptr clDNNEngine::LoadExeNetworkImpl(const InferenceEngine::CNNNetwork &network,
const std::map<std::string, std::string> &config) {
const std::map<std::string, std::string> &orig_config) {
OV_ITT_SCOPED_TASK(itt::domains::CLDNNPlugin, "clDNNEngine::LoadExeNetworkImpl");
// verification of supported input
InferenceEngine::InputsDataMap _networkInputs = network.getInputsInfo();
check_inputs(_networkInputs);

CLDNNPlugin::Config conf = _impl->m_config;
auto config = ConvertPerfHintsToConfig(orig_config, conf);
UpdateConfig(conf, network, config);

CLDNNRemoteCLContext::Ptr context;
@@ -606,7 +632,7 @@ IExecutableNetworkInternal::Ptr clDNNEngine::LoadExeNetworkImpl(const InferenceE

IExecutableNetworkInternal::Ptr clDNNEngine::LoadExeNetworkImpl(const InferenceEngine::CNNNetwork &network,
const IRemoteContext::Ptr &context,
const std::map<std::string, std::string> &config) {
const std::map<std::string, std::string> &orig_config) {
InferenceEngine::InputsDataMap _networkInputs = network.getInputsInfo();
check_inputs(_networkInputs);

@@ -616,6 +642,7 @@ IExecutableNetworkInternal::Ptr clDNNEngine::LoadExeNetworkImpl(const InferenceE
}

CLDNNPlugin::Config conf = getContextImpl(casted)->GetConfig();
auto config = ConvertPerfHintsToConfig(orig_config, conf);
UpdateConfig(conf, network, config);

auto transformedNetwork = CloneAndTransformNetwork(network, conf);
Expand Down Expand Up @@ -647,6 +674,7 @@ IRemoteContext::Ptr clDNNEngine::GetDefaultContext(const ParamMap& params) {
}

void clDNNEngine::SetConfig(const std::map<std::string, std::string> &config) {
streamsSet = (config.find(PluginConfigParams::KEY_GPU_THROUGHPUT_STREAMS) != config.end());
_impl->m_config.UpdateFromMap(config);
}

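ConvertPerfHintsToConfig only fills in the GPU streams and throttling when the user has not set them explicitly, so the hint can arrive either with the LoadNetwork call or beforehand via the plugin's SetConfig. A minimal sketch of both paths (assuming an existing Core and CNNNetwork):

```cpp
#include <inference_engine.hpp>

void loadWithHints(InferenceEngine::Core& ie, InferenceEngine::CNNNetwork& network) {
    // (1) Hint passed together with the LoadNetwork call.
    auto exeNetTput = ie.LoadNetwork(network, "GPU",
                                     {{CONFIG_KEY(PERFORMANCE_HINT), CONFIG_VALUE(THROUGHPUT)}});

    // (2) Hint set on the plugin beforehand; the next LoadNetwork picks it up.
    ie.SetConfig({{CONFIG_KEY(PERFORMANCE_HINT), CONFIG_VALUE(LATENCY)}}, "GPU");
    auto exeNetLatency = ie.LoadNetwork(network, "GPU");
}
```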
4 changes: 4 additions & 0 deletions inference-engine/src/cldnn_engine/cldnn_engine.h
@@ -20,6 +20,7 @@ class clDNNEngine : public InferenceEngine::IInferencePlugin,
public InferenceEngine::gpu::details::param_map_obj_getter {
struct impl;
std::shared_ptr<impl> _impl;
bool streamsSet = false;

// key: device_id, value: cldnn device
std::map<std::string, cldnn::device::ptr> device_map;
@@ -31,6 +32,9 @@ class clDNNEngine : public InferenceEngine::IInferencePlugin,
InferenceEngine::CNNNetwork CloneAndTransformNetwork(const InferenceEngine::CNNNetwork& network,
const CLDNNPlugin::Config& config) const;

std::map<std::string, std::string> ConvertPerfHintsToConfig(const std::map<std::string, std::string>& network_config,
const CLDNNPlugin::Config& plugin_config) const;

void RegisterPrimitives();
void UpdateConfig(Config& conf, const InferenceEngine::CNNNetwork &network, const std::map<std::string, std::string> &params) const;
public:
@@ -34,11 +34,12 @@ namespace CLDNNPlugin {

CLDNNExecNetwork::CLDNNExecNetwork(InferenceEngine::CNNNetwork &network, std::shared_ptr<IRemoteContext> context, Config config) :
InferenceEngine::ExecutableNetworkThreadSafeDefault{[&]()->InferenceEngine::ITaskExecutor::Ptr {
if (config.throughput_streams > 1) {
if (config.exclusiveAsyncRequests) {
//exclusiveAsyncRequests essentially disables the streams (and hence should be checked first) => aligned with the CPU behavior
return ExecutorManager::getInstance()->getExecutor("GPU");
} else if (config.throughput_streams > 1) {
return std::make_shared<InferenceEngine::CPUStreamsExecutor>(
IStreamsExecutor::Config{"CLDNNPlugin executor", config.throughput_streams});
} else if (config.exclusiveAsyncRequests) {
return ExecutorManager::getInstance()->getExecutor("GPU");
} else {
return std::make_shared<InferenceEngine::CPUStreamsExecutor>(
IStreamsExecutor::Config{"CLDNNPlugin executor", 1});
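The CLDNNExecNetwork constructor change above boils down to a three-way choice of task executor, with exclusive async requests taking precedence over streams (aligned with the CPU plugin). An equivalent standalone sketch, under the in-tree assumptions of the surrounding file:

```cpp
// Sketch only: depends on the plugin-internal Config type and IE threading headers.
#include "cldnn_config.h"
#include <threading/ie_cpu_streams_executor.hpp>
#include <threading/ie_executor_manager.hpp>
#include <threading/ie_itask_executor.hpp>

InferenceEngine::ITaskExecutor::Ptr pickExecutor(const CLDNNPlugin::Config& config) {
    using namespace InferenceEngine;
    if (config.exclusiveAsyncRequests) {
        // Exclusivity effectively disables streams, so it is checked first.
        return ExecutorManager::getInstance()->getExecutor("GPU");
    }
    const int streams = config.throughput_streams > 1 ? config.throughput_streams : 1;
    return std::make_shared<CPUStreamsExecutor>(
        IStreamsExecutor::Config{"CLDNNPlugin executor", streams});
}
```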