[CPU BF16] Bfloat16 inference optimizations (openvinotoolkit#2633)
* [CPU BF16] Greedy mode was added

* [IE TESTS][BF16] Added support for operations with bf16 precision in the single layer tests.

* Added cpu specific bfloat16 single layer tests for the jit_eltwise primitive.

* [CPU TESTS] Activation and logical single layer tests fixes.

* [IE TESTS] Fix activation single layer tests run.

* [IE TESTS][CPU] CPUTestBase further refactoring.

* [CPU BF16] Support for Bfloat16 type was added to the MVN layer. (#3)

* [CPU BF16] MVN layer bfloat16 compatibility.

* [CPU BF16] MVN bfloat16 minor fixes.

* [CPU BF16] MVN node exception about BF16 support replaced with precision redefinition.

* [CPU BF16] MVN layer bfloat16 support fixed for quantization operations and blocking layout.

* [CPU] Input and output precision checks were added to MVN layer.

* [IE TESTS][CPU BF16] Most of the bfloat16 tests have been fixed.

* Bf16 crop layer (#4)

* [IE TESTS][CPU] Cpu specific test for the Crop layer has been created.

* [IE TESTS][CPU] Deprecated Crop single layer test removed.

* [CPU BF16] Bfloat16 precision was added to the Crop layer.

* [CPU BF16] Crop layer minor code improvements.

* [IE TESTS][CPU] Crop layer test: added 2D tensor tests.

* [IE TESTS][CPU] Crop layer test, obsolete comment removed.

* [IE TESTS][CPU] Fixed CropIE include path.

* Crop test fix for older gcc compiler.

* [CPU BF16] Reduce layer extended with bfloat16 support.

* [IE TESTS][CPU] CPU specific single layer test for Reduce operation.

* BF16 optimized layers

* [CPU BF16] Bfloat16 custom type added to the MKLDNN plugin.

* [CPU BF16] Mem alignment to 16 bytes added to bfloat16 class union.

* [IE TESTS][CPU] Permute cpu specific single layer test and minor cpu test fixes

* MVN cpu single layer tests extended with nhwc and ndhwc layouts.

* Mod mode removed from Eltwise cpu single layer test.

* Permute cpu specific single layer test.

* Smoke keyword was added to the CPU single layer tests.

* Normalize node was modified for BF16 support

* [CPU BF16] The RegionYolo layer has been extended with the bfloat16 type support.

* Resample node was extended with BF16

* Select layer was enabled with BF16

* psroi supports bf16 (#7)

* Reorders replace converts (#9)

* BF16 planar pooling was enabled

* [CPU BF16] Cpu_convert added to the RegionYOLO node.

* [IE TESTS][CPU] Crop single layer test has been rewritten using the StridedSlice operation.

* [IE TESTS][CPU] Convert layer test extended with bf16 precision.

* [CPU BF16] The bfloat16 class was renamed to bfloat16_t and some refactoring has been done.

* [CPU BF16] RegionYOLO and Softmax were aligned with the review.

* [IE TESTS CPU] CPU single layer tests refactored according to the review suggestions.

* [IE TESTS CPU] The Reduce CPU single layer test was extended with different mem orders.

* [IE TESTS CPU] Minor fixes after the review.

* [IE TESTS CPU] Common plugin configuration has been moved to PreparePluginConfiguration function.

* Minor changes after review

* StridedSlice, Select, and ScaleShift review notes were resolved

* Fixes to the Reduce operation cpu test and minor fixes related to the review.

* GPU eltwise tests fix.

* psroi rolled back to the primary state; code cleanup (#12)

* PSROIPooling layer with C++ optimizations

* Minor fix for compatibility with CPUTestsBase for fuse_permute_reorder test.

* Code cleanup & psroi rolled back

Co-authored-by: Maksim Kutakov <[email protected]>
Co-authored-by: Maksim Kutakov <[email protected]>
Co-authored-by: Yury Gaydaychuk <[email protected]>
4 people authored and mryzhov committed Dec 11, 2020
1 parent 97d0c49 commit 3b9fcd8
Showing 105 changed files with 3,129 additions and 900 deletions.
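
Background for the diffs below: a bfloat16 (BF16) value keeps the sign bit and the full 8-bit exponent of an IEEE-754 single-precision float and truncates the mantissa to 7 bits, so it is effectively the upper 16 bits of an FP32 value and covers the same dynamic range. The following standalone sketch illustrates the round-trip conversion (with round-to-nearest-even); it is an illustration of the format only, not the plugin's actual bfloat16_t class.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Convert FP32 -> BF16 by keeping the upper 16 bits, rounding to nearest even.
// NaN handling is omitted for brevity.
static uint16_t float_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    uint32_t rounding_bias = 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<uint16_t>((bits + rounding_bias) >> 16);
}

// Convert BF16 -> FP32 by restoring the upper half and zeroing the mantissa tail.
static float bf16_to_float(uint16_t h) {
    uint32_t bits = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

int main() {
    float x = 3.1415926f;
    float y = bf16_to_float(float_to_bf16(x));
    std::printf("%f -> %f\n", x, y);  // 3.141593 -> 3.140625 (about 3 decimal digits survive)
    return 0;
}
```

Because the exponent range matches FP32, activations can be switched to BF16 without rescaling, which is what makes the greedy BF16 mode added by this commit practical on hardware with native BF16 support.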
@@ -353,6 +353,9 @@ CNNLayer::Ptr NodeConverter<ngraph::op::Convert>::createLayer(const std::shared_
case Precision::FP16:
precision_str = "FP16";
break;
case Precision::BF16:
precision_str = "BF16";
break;
case Precision::FP32:
precision_str = "FP32";
break;
Empty file modified: inference-engine/src/legacy_api/src/ngraph_ops/interp.cpp (file mode changed 100644 → 100755, no content changes).
133 changes: 131 additions & 2 deletions inference-engine/src/mkldnn_plugin/bf16transformer.cpp
@@ -11,6 +11,7 @@
#include <chrono>
#include <legacy/details/ie_cnn_network_tools.h>
#include <legacy/ie_util_internal.hpp>
#include <legacy/graph_tools.hpp>
#include "ngraph/type/bfloat16.hpp"

using namespace MKLDNNPlugin;
@@ -23,7 +24,7 @@ void precisionColoringBF16(const CNNLayerPtr layer,
if (layer && !layer->insData.empty() && layer->input()) {
printed_properties.insert(printed_properties.begin(),
std::pair<std::string, std::string>("Precision",
layer->input()->getPrecision() == Precision::FP32 ? "FP32" : "BF16"));
layer->input()->getPrecision() == Precision::FP32 ? "FP32" : "BF16"));

if (layer->input()->getPrecision() == Precision::FP32) {
node_properties.emplace_back("fillcolor", "#5A5DF0");
@@ -55,20 +56,31 @@ void BF16Transformer::convertToBFloat16(InferenceEngine::CNNNetwork &network) {
InputsDataMap inputs = network.getInputsInfo();
OutputsDataMap outputs = network.getOutputsInfo();
for (auto iter : sortedLayers) {
if (CaselessEq<std::string>()(iter->type, "convolution")) {
auto dims = iter->insData[0].lock()->getDims();
if ((dims.size() == 4 || dims.size() == 5) && (dims[1] == 1 || dims[1] == 3))
continue;
}

// check, if memory output node needs to be transformed
if (iter->type == "Memory" && iter->outData.size() == 0 &&
iter->insData[0].lock()->getPrecision() == Precision::FP32) {
auto curPrec = iter->insData[0].lock()->getPrecision();
iter->insData[0].lock()->setPrecision(Precision::BF16);
}

for (size_t o = 0; o < iter->outData.size(); o++) {
if (inputs.find(iter->outData[o]->getName()) == inputs.end()
&& outputs.find(iter->outData[o]->getName()) == outputs.end()
&& !CaselessEq<std::string>()(iter->type, "const")
&& iter->outData[o]->getPrecision() == Precision::FP32) {
iter->outData[o]->setPrecision(Precision::BF16);
}
}
}

// insert convert after input if necessary
insertConvertAfterInput(network);

// convert all edges back to FP32 on demand
optimizeToFloat(network);
}
@@ -255,3 +267,120 @@ InferenceEngine::MemoryBlob::Ptr BF16Transformer::convertBF16ToFloat(InferenceEn
}
return weightsFP32;
}
void BF16Transformer::addLayerToCNNNetworkAfterData(
DataPtr parentOutData,
CNNLayer::Ptr layer,
const std::string& nextLayerName,
ICNNNetwork& net,
const int childInsDataIndex) {
CNNNetworkImpl* netImpl = dynamic_cast<CNNNetworkImpl*>(&net);
if (netImpl == nullptr) {
THROW_IE_EXCEPTION << "unexpected network type";
}

CNNLayerPtr nextLayer;
if (!nextLayerName.empty()) {
netImpl->getLayerByName(nextLayerName.c_str(), nextLayer, nullptr);
}

if (layer && (nextLayerName.empty() || (parentOutData == nullptr) || (childInsDataIndex != -1) ||
(getInputTo(parentOutData).find(nextLayerName) != getInputTo(parentOutData).end()))) {
auto getTensorDesc = [](CNNLayerPtr& nextLayer) {
const DataPtr insData = nextLayer->insData[0].lock();
return insData->getTensorDesc();
};

const TensorDesc& parentTensorDesc = parentOutData != nullptr ? parentOutData->getTensorDesc() : getTensorDesc(nextLayer);
DataPtr newEdgeAfterLayer(new Data(layer->name, parentTensorDesc));
newEdgeAfterLayer->setName(layer->name);
getCreatorLayer(newEdgeAfterLayer) = layer;
getInputTo(newEdgeAfterLayer).clear();


if (netImpl == nullptr) {
THROW_IE_EXCEPTION << "unexpected network type";
}
netImpl->addData(layer->name.c_str(), newEdgeAfterLayer);
IE_SUPPRESS_DEPRECATED_START
netImpl->addLayer(layer);
IE_SUPPRESS_DEPRECATED_END

if (parentOutData != nullptr) {
getInputTo(parentOutData)[layer->name] = layer;
layer->insData.push_back(parentOutData);
}
layer->outData.push_back(newEdgeAfterLayer);

if (!nextLayerName.empty()) {
// CNNLayerPtr nextLayer = getInputTo(parentOutData)[nextLayerName];
getInputTo(newEdgeAfterLayer)[nextLayerName] = nextLayer;

if (parentOutData != nullptr) {
getInputTo(parentOutData).erase(nextLayerName);

if (childInsDataIndex == -1) {
for (size_t i = 0; i < nextLayer->insData.size(); i++) {
if (nextLayer->insData[i].lock() == parentOutData) {
nextLayer->insData[i] = newEdgeAfterLayer;
}
}
} else {
nextLayer->insData[childInsDataIndex] = newEdgeAfterLayer;
}
} else {
nextLayer->insData.push_back(newEdgeAfterLayer);
}
} else {
CNNLayerPtr parent = getCreatorLayer(parentOutData).lock();
if (parent == nullptr) {
THROW_IE_EXCEPTION << "parent data is absent";
}
netImpl->removeOutput(parent->name);
netImpl->addData(layer->name.c_str(), newEdgeAfterLayer);
netImpl->addOutput(layer->name);
}
} else {
THROW_IE_EXCEPTION << "Invalid argument";
}
}

void BF16Transformer::insertConvertAfterInput(InferenceEngine::CNNNetwork &network) {
auto inputLayers = InferenceEngine::CNNNetGetAllInputLayers(network);
for (auto inputIter : inputLayers) {
for (size_t o = 0; o < inputIter->outData.size(); o++) {
for (auto bfInitIter : getInputTo(inputIter->outData[o])) {
if (inputIter->outData[o]->getPrecision() == Precision::BF16) {
// we don't need to enforce bf16-mode for the next layer
break;
}
auto bfInitLayer = bfInitIter.second;
if (_initbf16.find(bfInitLayer->type) != _initbf16.end()) {
if (CaselessEq<std::string>()(bfInitLayer->type, "convolution")) {
// TODO: have to be removed after adding suitable implementation for convolution
break;
}
// insert convert
std::string layerName = inputIter->outData[o]->getName();
LayerParams cnnLayerParams{layerName, "Convert", Precision::FP32};
auto lay = std::make_shared<InferenceEngine::CNNLayer>(cnnLayerParams);
std::map<std::string, std::string> par = {{"name", layerName},
{"type", "Convert"},
{"precision", "FP32"}};
lay->params = par;
CNNLayerPtr convertLayer(lay);
BF16Transformer::addLayerToCNNNetworkAfterData(inputIter->outData[o], convertLayer, bfInitLayer->name,
network);
// compute input port id for bfInitLayer
for (size_t i = 0; i < bfInitLayer->insData.size(); i++) {
if (bfInitLayer->insData[i].lock()->getName() == inputIter->outData[o]->getName()) {
// set conv input as bf
bfInitLayer->insData[i].lock()->setPrecision(Precision::BF16);
break;
}
}
break;
}
}
}
}
}
30 changes: 27 additions & 3 deletions inference-engine/src/mkldnn_plugin/bf16transformer.h
@@ -8,15 +8,22 @@
#include <caseless.hpp>
#include <string>
#include <set>
#include <legacy/details/ie_cnn_network_tools.h>

namespace MKLDNNPlugin {

class BF16Transformer {
const InferenceEngine::details::caseless_set<std::string> _initbf16 =
{ "convolution", "fullyconnected", "innerproduct", "gemm" };
{ "convolution", "fullyconnected", "innerproduct", "gemm", "RegionYolo" };
const InferenceEngine::details::caseless_set<std::string> _complementbf16 =
{ "relu", "tanh", "elu", "square", "abs", "sqrt", "linear", "bounded_relu", "soft_relu", "logistic",
"exp", "gelu", "clamp", "swish", "prelu", "pooling", "norm", "gather", "memory" };
{ "relu", "tanh", "elu", "square", "abs", "sqrt", "linear", "bounded_relu", "soft_relu", "normalize",
"sigmoid", "ReLU6", "not", "activation", "HSwish", "mish", "logistic", "mod", "resample",
"exp", "gelu", "clamp", "swish", "prelu", "pooling", "norm", "gather", "memory", "mvn", "crop", "activation",
"broadcast", "convert", "BatchToSpace", "DepthToSpace", "ExtractImagePatches", "concat", "power", "lrn",
"permute", "ScatterUpdate", "ScatterElementsUpdate", "ScatterNDUpdate", "depthwise",
"select", "ShuffleChannels", "SpaceToBatch", "SpaceToDepth", "squeeze", "StridedSlice", "unsqueeze", "eltwise",
"ReduceAnd", "ReduceOr", "ReduceMax", "ReduceMin" };

const InferenceEngine::details::caseless_set<std::string> _multiinput =
{ "concat", "eltwise" };
// prevent fallback to fp32 without considering both input and output nodes
@@ -33,6 +40,13 @@ class BF16Transformer {
*/
bool tryToMarkFP32(InferenceEngine::DataPtr data, const std::set<InferenceEngine::DataPtr> &immutable);

/**
* Because the Input node is special, the layer that directly follows the network input cannot start in bf16 by itself.
* We fix this by inserting a Convert layer after the input, which is later replaced with a Reorder in the graph optimizer.
*
*/
void insertConvertAfterInput(InferenceEngine::CNNNetwork &network);

public:
/**
* Restores Float point data types on edges which goes to non supported layers
@@ -61,6 +75,16 @@
*/
void convertToBFloat16(InferenceEngine::CNNNetwork &network);

/**
* Inserts the given layer after the specified data node (tensor)
*/
static void addLayerToCNNNetworkAfterData(
InferenceEngine::DataPtr parentOutData,
InferenceEngine::CNNLayerPtr layer,
const std::string& nextLayerName,
InferenceEngine::ICNNNetwork& net,
const int childInsDataIndex = -1);

InferenceEngine::MemoryBlob::Ptr convertBF16ToFloat(InferenceEngine::MemoryBlob::Ptr);
};

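To make the effect of the extended _initbf16 and _complementbf16 sets easier to see, here is a distilled, standalone sketch of the greedy marking rule (an assumption-labelled illustration, not the transformer's actual code): every internal FP32 edge whose producer belongs to a BF16-capable set is switched to BF16, and edges that later turn out to feed unsupported layers are returned to FP32 by optimizeToFloat().

```cpp
#include <iostream>
#include <set>
#include <string>
#include <vector>

// Hypothetical edge representation: producer layer type, current precision,
// and whether the edge is a network input/output (those keep their precision).
struct Edge {
    std::string producerType;
    std::string precision;
    bool isNetworkIO;
};

// Greedy pass: mark every eligible FP32 edge as BF16.
void greedyMarkBF16(std::vector<Edge>& edges, const std::set<std::string>& bf16Capable) {
    for (auto& e : edges) {
        if (e.isNetworkIO) continue;          // network inputs/outputs stay as declared
        if (e.precision != "FP32") continue;  // only FP32 edges are re-marked
        if (bf16Capable.count(e.producerType))
            e.precision = "BF16";             // assume BF16 until a later pass proves otherwise
    }
}

int main() {
    const std::set<std::string> capable = {"Convolution", "ReLU", "Pooling", "MVN"};
    std::vector<Edge> edges = {
        {"Input", "FP32", true},
        {"Convolution", "FP32", false},
        {"ReLU", "FP32", false},
        {"DetectionOutput", "FP32", false},
    };
    greedyMarkBF16(edges, capable);
    for (const auto& e : edges)
        std::cout << e.producerType << " -> " << e.precision << "\n";
    return 0;
}
```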
52 changes: 52 additions & 0 deletions inference-engine/src/mkldnn_plugin/mkldnn_graph_optimizer.cpp
@@ -145,6 +145,9 @@ void MKLDNNGraphOptimizer::ApplyImplSpecificGraphOptimizations(MKLDNNGraph &grap
graph.RemoveDroppedNodes();

#if defined (COMPILED_CPU_MKLDNN_REORDER_NODE)
ChangeConvertToReorder(graph);
graph.RemoveDroppedNodes();

DropDoubleReorders(graph);
graph.RemoveDroppedNodes();

@@ -1918,6 +1921,55 @@ void MKLDNNGraphOptimizer::DropConvertReorder(MKLDNNGraph& graph) {
}
}
}

void MKLDNNGraphOptimizer::ChangeConvertToReorder(MKLDNNGraph& graph) {
std::vector<Precision> continuousPrecisions{
Precision::BF16,
Precision::FP32
};
for (int ind = 0; ind < graph.GetNodes().size(); ind++) {
auto convertCandidate = graph.GetNodes().at(ind);
std::string nodeType = convertCandidate->getTypeStr();
if (!InferenceEngine::details::CaselessEq<std::string>()(nodeType, "convert")) {
continue;
}
auto inputPrecision = convertCandidate->getCnnLayer()->insData[0].lock()->getPrecision();
auto outputPrecision = convertCandidate->getCnnLayer()->outData[0]->getPrecision();
if (std::find(continuousPrecisions.begin(), continuousPrecisions.end(), inputPrecision) == continuousPrecisions.end() ||
std::find(continuousPrecisions.begin(), continuousPrecisions.end(), outputPrecision) == continuousPrecisions.end()) {
continue;
}
std::unordered_set<std::string> uniqueLayerNames;
for (auto node : graph.GetNodes()) {
uniqueLayerNames.insert(node->getCnnLayer()->name);
}
auto parentEdge = convertCandidate->getParentEdges()[0].lock();
auto parentNode = parentEdge->getParent();
auto &childEdge = convertCandidate->getChildEdgeAt(0);
auto childNode = childEdge->getChild();
std::string basicLayerName = childEdge->getParent()->getName() + "_" +
MKLDNNExtensionUtils::getReorderArgs(convertCandidate->getCnnLayer()->insData[0].lock()->getTensorDesc(),
convertCandidate->getCnnLayer()->outData[0]->getTensorDesc()) +
"_" + childEdge->getChild()->getName();
std::string layerName = basicLayerName;
int idx = 0;
while (uniqueLayerNames.find(layerName) != uniqueLayerNames.end()) {
idx++;
layerName = basicLayerName + "_" + std::to_string(idx);
}
// create temporary edge
auto oldParentOutputPort = parentEdge->getInputNum();
auto oldChildInputPort = childEdge->getOutputNum();
MKLDNNEdgePtr tempEdge(new MKLDNNEdge(parentNode, childNode, oldParentOutputPort, oldChildInputPort));

graph.InsertReorder(tempEdge, layerName, convertCandidate->getCnnLayer()->insData[0].lock()->getTensorDesc(),
convertCandidate->getCnnLayer()->outData[0]->getTensorDesc(), false);
parentNode->removeEdge(parentEdge);
parentEdge->drop();
childEdge->drop();
graph.DropNode(convertCandidate);
}
}
#endif

void MKLDNNGraphOptimizer::RemoveIOScaleShifts(MKLDNNGraph &graph) {
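The ChangeConvertToReorder() pass above removes explicit Convert nodes that sit between the two "continuous" float precisions (FP32 and BF16) and reconnects the producer to the consumer through a Reorder, which mkldnn executes natively. The toy sketch below shows the shape of that rewrite on a hypothetical node structure; it is an illustration only, not the MKLDNN graph classes or their API.

```cpp
#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Hypothetical graph node: a type name and edges to child nodes.
struct Node {
    std::string type;
    std::vector<std::shared_ptr<Node>> children;
};

// Replace any direct Convert child of `parent` with a Reorder that keeps the
// Convert's single consumer, mirroring the parent -> Reorder -> child reconnection.
void convertToReorder(const std::shared_ptr<Node>& parent) {
    for (auto& child : parent->children) {
        if (child->type == "Convert" && child->children.size() == 1) {
            auto consumer = child->children[0];
            child = std::make_shared<Node>(Node{"Reorder", {consumer}});
        }
    }
}

int main() {
    auto output  = std::make_shared<Node>(Node{"Output", {}});
    auto convert = std::make_shared<Node>(Node{"Convert", {output}});
    auto input   = std::make_shared<Node>(Node{"Input", {convert}});
    convertToReorder(input);
    std::cout << input->children[0]->type << "\n";  // prints "Reorder"
    return 0;
}
```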
@@ -46,6 +46,7 @@ class MKLDNNGraphOptimizer {
#if defined (COMPILED_CPU_MKLDNN_REORDER_NODE)
void DropDoubleReorders(MKLDNNGraph& graph);
void DropConvertReorder(MKLDNNGraph& graph);
void ChangeConvertToReorder(MKLDNNGraph &graph);
#endif
void FuseConvolutionAndZeroPoints(MKLDNNGraph &graph);
void FuseBroadcastAndEltwise(MKLDNNGraph &graph);
@@ -105,6 +105,7 @@ void MKLDNNPlugin::MKLDNNInferRequest::PushInputData() {
// these precisions are supported by mkldnn, so we push the blob directly
case InferenceEngine::Precision::I8:
case InferenceEngine::Precision::I32:
case InferenceEngine::Precision::BF16:
case InferenceEngine::Precision::FP32: {
break;
}
1 change: 1 addition & 0 deletions inference-engine/src/mkldnn_plugin/mkldnn_plugin.cpp
@@ -278,6 +278,7 @@ Engine::LoadExeNetworkImpl(const InferenceEngine::ICNNNetwork &network, const st
input_precision != InferenceEngine::Precision::I16 &&
input_precision != InferenceEngine::Precision::I8 &&
input_precision != InferenceEngine::Precision::U8 &&
input_precision != InferenceEngine::Precision::BF16 &&
input_precision != InferenceEngine::Precision::BOOL &&
input_precision != InferenceEngine::Precision::I64 &&
input_precision != InferenceEngine::Precision::U64) {
2 changes: 1 addition & 1 deletion inference-engine/src/mkldnn_plugin/nodes/argmax.cpp
@@ -27,7 +27,7 @@ class ArgMaxImpl: public ExtLayerBase {
conf.axis_index_ = conf.has_axis_ ?
std::stoi(layer->params.at("axis")) :0;

addConfig(layer, {DataConfigurator(ConfLayout::PLN)}, {DataConfigurator(ConfLayout::PLN)});
addConfig(layer, {DataConfigurator(ConfLayout::PLN, Precision::FP32)}, {DataConfigurator(ConfLayout::PLN, Precision::FP32)});
} catch (InferenceEngine::details::InferenceEngineException &ex) {
errorMsg = ex.what();
}
11 changes: 2 additions & 9 deletions inference-engine/src/mkldnn_plugin/nodes/base.hpp
@@ -60,8 +60,8 @@ class ExtLayerBase: public ILayerExecImpl {
explicit DataConfigurator(ConfLayout l):
layout(l) {}

DataConfigurator(ConfLayout l, bool constant, int inplace = -1):
layout(l), constant(constant), inplace(inplace) {}
DataConfigurator(ConfLayout l, bool constant, int inplace = -1, Precision::ePrecision prc = Precision::UNSPECIFIED):
layout(l), constant(constant), inplace(inplace), prc(prc) {}

DataConfigurator(ConfLayout l, Precision::ePrecision prc):
layout(l), prc(prc) {}
@@ -128,14 +128,7 @@ class ExtLayerBase: public ILayerExecImpl {
conf.layout = ConfLayout::PLN;
}

// All extension layers support only FP32 precision!
// fixing of BF16 precisions where they are - layers naturally support only FP32
// if we see BF16, that means another floating point format which will be converted by reorder
// added by current mkl-dnn cpu plugin when it figure out diff in data types on input and output of edges
InferenceEngine::Precision precision = (conf.prc == Precision::UNSPECIFIED) ? data_desc.getPrecision() : Precision(conf.prc);
if (precision == Precision::BF16) {
precision = Precision::FP32;
}
if (conf.layout == ConfLayout::ANY) {
dataConfig.desc = TensorDesc(precision, data_dims, InferenceEngine::Layout::ANY);
} else {
2 changes: 1 addition & 1 deletion inference-engine/src/mkldnn_plugin/nodes/broadcast.cpp
@@ -31,7 +31,7 @@ class BroadcastImpl: public ExtLayerBase {

LayerConfig config;
DataConfig dataConfig, shapeConfig;
Precision dataPrecision = layer->outData[0]->getTensorDesc().getPrecision();
Precision dataPrecision = layer->insData[BROADCAST_INPUT].lock()->getTensorDesc().getPrecision();
const SizeVector& data_dims = layer->insData[BROADCAST_INPUT].lock()->getTensorDesc().getDims();
dataConfig.desc = TensorDesc(dataPrecision, data_dims,
layer->insData[BROADCAST_INPUT].lock()->getTensorDesc().getLayout());