[CPU BF16] Bfloat16 inference optimizations (openvinotoolkit#2633)
* [CPU BF16] Greedy mode was added

* [IE TESTS][BF16] Added support for operations with bf16 precision in the single layer tests.

* Added cpu specific bfloat16 single layer tests for the jit_eltwise primitive.

* [CPU TESTS] Activation and logical single layer test fixes.

* [IE TESTS] Fix activation single layer tests run.

* [IE TESTS][CPU] CPUTestBase further refactoring.

* [CPU BF16] Support for Bfloat16 type was added to the MVN layer. (#3)

* [CPU BF16] MVN layer bfloat16 compatibility.

* [CPU BF16] MVN bfloat16 minor fixes.

* [CPU BF16] MVN node exception about BF16 support replaced with precision redefinition.

* [CPU BF16] MVN layer bfloat16 support fixed for quantization operations and blocking layout.

* [CPU] Input and output precision checks were added to MVN layer.

* [IE TESTS][CPU BF16] Most of the bfloat16 tests have been fixed.

* Bf16 crop layer (#4)

* [IE TESTS][CPU] Cpu specific test for the Crop layer has been created.

* [IE TESTS][CPU] Deprecated Crop single layer test removed.

* [CPU BF16] Bfloat16 precision was added to the Crop layer.

* [CPU BF16] Crop layer minor code improvements.

* [IE TESTS][CPU] 2D tensor tests added to the Crop layer test.

* [IE TESTS][CPU] Crop layer test, obsolete comment removed.

* [IE TESTS][CPU] Fixed CropIE include path.

* Crop test fix for older gcc compiler.

* [CPU BF16] Reduce layer extended with bfloat16 support.

* [IE TESTS][CPU] CPU specific single layer test for Reduce operation.

* BF16 optimized layers

* [CPU BF16] Bfloat16 custom type added to the MKLDNN plugin.

* [CPU BF16] Mem alignment to 16 bytes added to bfloat16 class union.

* [IE TESTS][CPU] Permute cpu specific single layer test and minor cpu test fixes

* MVN cpu single layer tests extended with nhwc and ndhwc layouts.

* Mod mode removed from Eltwise cpu single layer test.

* Permute cpu specific single layer test.

* Smoke keyword was added to the CPU single layer tests.

* Normalize node was modified for BF16 support

* [CPU BF16] The RegionYolo layer has been extended with the bfloat16 type support.

* Resample node was extended with BF16

* Select layer was enabled with BF16

* psroi supports bf16 (#7)

* reorders replace converts (#9)

* BF16 planar pooling was enabled

* [CPU BF16] Cpu_convert added to the RegionYOLO node.

* [IE TESTS][CPU] Crop single layer test has been rewritten using the StridedSlice operation.

* [IE TESTS][CPU] Convert layer test extended with bf16 precision.

* [CPU BF16] The bfloat16 class was renamed to bfloat16_t and some refactoring has been done.

* [CPU BF16] RegionYOLO and Softmax were aligned with the review.

* [IE TESTS CPU] CPU single layer tests refactored according to the review suggestions.

* [IE TESTS CPU] The Reduce CPU single layer test was extended with different mem orders.

* [IE TESTS CPU] Minor fixes after the review.

* [IE TESTS CPU] Common plugin configuration has been moved to PreparePluginConfiguration function.

* Minor changes after review

* StridedSlice, Select and ScaleShift review notes were resolved

* Fixes to the Reduce operation cpu test and minor fixes related to the review.

* GPU eltwise tests fix.

* psroi rolled back to the primary state; code cleanup (#12)

* PSROIPooling layer with C++ optimizations

* Minor fix for compatibility with CPUTestsBase for fuse_permute_reorder test.

* Code cleanup & psroi rolled back

Co-authored-by: Maksim Kutakov <[email protected]>
Co-authored-by: Maksim Kutakov <[email protected]>
Co-authored-by: Yury Gaydaychuk <[email protected]>
4 people authored Nov 27, 2020
1 parent b7d5590 commit 2667bff
Showing 105 changed files with 3,129 additions and 900 deletions.
@@ -353,6 +353,9 @@ CNNLayer::Ptr NodeConverter<ngraph::op::Convert>::createLayer(const std::shared_
case Precision::FP16:
precision_str = "FP16";
break;
case Precision::BF16:
precision_str = "BF16";
break;
case Precision::FP32:
precision_str = "FP32";
break;
File mode changed: inference-engine/src/legacy_api/src/ngraph_ops/interp.cpp (100644 → 100755, no content diff).
133 changes: 131 additions & 2 deletions inference-engine/src/mkldnn_plugin/bf16transformer.cpp
@@ -11,6 +11,7 @@
#include <chrono>
#include <legacy/details/ie_cnn_network_tools.h>
#include <legacy/ie_util_internal.hpp>
#include <legacy/graph_tools.hpp>
#include "ngraph/type/bfloat16.hpp"

using namespace MKLDNNPlugin;
@@ -23,7 +24,7 @@ void precisionColoringBF16(const CNNLayerPtr layer,
if (layer && !layer->insData.empty() && layer->input()) {
printed_properties.insert(printed_properties.begin(),
std::pair<std::string, std::string>("Precision",
layer->input()->getPrecision() == Precision::FP32 ? "FP32" : "BF16"));
layer->input()->getPrecision() == Precision::FP32 ? "FP32" : "BF16"));

if (layer->input()->getPrecision() == Precision::FP32) {
node_properties.emplace_back("fillcolor", "#5A5DF0");
@@ -55,20 +56,31 @@ void BF16Transformer::convertToBFloat16(InferenceEngine::CNNNetwork &network) {
InputsDataMap inputs = network.getInputsInfo();
OutputsDataMap outputs = network.getOutputsInfo();
for (auto iter : sortedLayers) {
if (CaselessEq<std::string>()(iter->type, "convolution")) {
auto dims = iter->insData[0].lock()->getDims();
if ((dims.size() == 4 || dims.size() == 5) && (dims[1] == 1 || dims[1] == 3))
continue;
}

// check if the memory output node needs to be transformed
if (iter->type == "Memory" && iter->outData.size() == 0 &&
iter->insData[0].lock()->getPrecision() == Precision::FP32) {
auto curPrec = iter->insData[0].lock()->getPrecision();
iter->insData[0].lock()->setPrecision(Precision::BF16);
}

for (size_t o = 0; o < iter->outData.size(); o++) {
if (inputs.find(iter->outData[o]->getName()) == inputs.end()
&& outputs.find(iter->outData[o]->getName()) == outputs.end()
&& !CaselessEq<std::string>()(iter->type, "const")
&& iter->outData[o]->getPrecision() == Precision::FP32) {
iter->outData[o]->setPrecision(Precision::BF16);
}
}
}

// insert convert after input if necessary
insertConvertAfterInput(network);

// convert all edges back to FP32 on demand
optimizeToFloat(network);
}
@@ -255,3 +267,120 @@ InferenceEngine::MemoryBlob::Ptr BF16Transformer::convertBF16ToFloat(InferenceEn
}
return weightsFP32;
}
void BF16Transformer::addLayerToCNNNetworkAfterData(
DataPtr parentOutData,
CNNLayer::Ptr layer,
const std::string& nextLayerName,
ICNNNetwork& net,
const int childInsDataIndex) {
CNNNetworkImpl* netImpl = dynamic_cast<CNNNetworkImpl*>(&net);
if (netImpl == nullptr) {
THROW_IE_EXCEPTION << "unexpected network type";
}

CNNLayerPtr nextLayer;
if (!nextLayerName.empty()) {
netImpl->getLayerByName(nextLayerName.c_str(), nextLayer, nullptr);
}

if (layer && (nextLayerName.empty() || (parentOutData == nullptr) || (childInsDataIndex != -1) ||
(getInputTo(parentOutData).find(nextLayerName) != getInputTo(parentOutData).end()))) {
auto getTensorDesc = [](CNNLayerPtr& nextLayer) {
const DataPtr insData = nextLayer->insData[0].lock();
return insData->getTensorDesc();
};

const TensorDesc& parentTensorDesc = parentOutData != nullptr ? parentOutData->getTensorDesc() : getTensorDesc(nextLayer);
DataPtr newEdgeAfterLayer(new Data(layer->name, parentTensorDesc));
newEdgeAfterLayer->setName(layer->name);
getCreatorLayer(newEdgeAfterLayer) = layer;
getInputTo(newEdgeAfterLayer).clear();


if (netImpl == nullptr) {
THROW_IE_EXCEPTION << "unexpected network type";
}
netImpl->addData(layer->name.c_str(), newEdgeAfterLayer);
IE_SUPPRESS_DEPRECATED_START
netImpl->addLayer(layer);
IE_SUPPRESS_DEPRECATED_END

if (parentOutData != nullptr) {
getInputTo(parentOutData)[layer->name] = layer;
layer->insData.push_back(parentOutData);
}
layer->outData.push_back(newEdgeAfterLayer);

if (!nextLayerName.empty()) {
// CNNLayerPtr nextLayer = getInputTo(parentOutData)[nextLayerName];
getInputTo(newEdgeAfterLayer)[nextLayerName] = nextLayer;

if (parentOutData != nullptr) {
getInputTo(parentOutData).erase(nextLayerName);

if (childInsDataIndex == -1) {
for (size_t i = 0; i < nextLayer->insData.size(); i++) {
if (nextLayer->insData[i].lock() == parentOutData) {
nextLayer->insData[i] = newEdgeAfterLayer;
}
}
} else {
nextLayer->insData[childInsDataIndex] = newEdgeAfterLayer;
}
} else {
nextLayer->insData.push_back(newEdgeAfterLayer);
}
} else {
CNNLayerPtr parent = getCreatorLayer(parentOutData).lock();
if (parent == nullptr) {
THROW_IE_EXCEPTION << "parent data is absent";
}
netImpl->removeOutput(parent->name);
netImpl->addData(layer->name.c_str(), newEdgeAfterLayer);
netImpl->addOutput(layer->name);
}
} else {
THROW_IE_EXCEPTION << "Invalid argument";
}
}

void BF16Transformer::insertConvertAfterInput(InferenceEngine::CNNNetwork &network) {
auto inputLayers = InferenceEngine::CNNNetGetAllInputLayers(network);
for (auto inputIter : inputLayers) {
for (size_t o = 0; o < inputIter->outData.size(); o++) {
for (auto bfInitIter : getInputTo(inputIter->outData[o])) {
if (inputIter->outData[o]->getPrecision() == Precision::BF16) {
// we don't need to enforce bf16-mode for the next layer
break;
}
auto bfInitLayer = bfInitIter.second;
if (_initbf16.find(bfInitLayer->type) != _initbf16.end()) {
if (CaselessEq<std::string>()(bfInitLayer->type, "convolution")) {
// TODO: to be removed after a suitable implementation is added for convolution
break;
}
// insert convert
std::string layerName = inputIter->outData[o]->getName();
LayerParams cnnLayerParams{layerName, "Convert", Precision::FP32};
auto lay = std::make_shared<InferenceEngine::CNNLayer>(cnnLayerParams);
std::map<std::string, std::string> par = {{"name", layerName},
{"type", "Convert"},
{"precision", "FP32"}};
lay->params = par;
CNNLayerPtr convertLayer(lay);
BF16Transformer::addLayerToCNNNetworkAfterData(inputIter->outData[o], convertLayer, bfInitLayer->name,
network);
// compute input port id for bfInitLayer
for (size_t i = 0; i < bfInitLayer->insData.size(); i++) {
if (bfInitLayer->insData[i].lock()->getName() == inputIter->outData[o]->getName()) {
// set conv input as bf
bfInitLayer->insData[i].lock()->setPrecision(Precision::BF16);
break;
}
}
break;
}
}
}
}
}
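For context, the whole transformation is driven through the public convertToBFloat16 entry point declared in bf16transformer.h below. A minimal usage sketch follows; the actual call site inside the CPU plugin is not part of this diff, so the wrapper function name here is hypothetical:

    #include "bf16transformer.h"

    // Hypothetical helper: re-tag a loaded network for bfloat16 execution.
    void enforceBF16OnCPU(InferenceEngine::CNNNetwork &network) {
        MKLDNNPlugin::BF16Transformer bf16Transformer;
        // Greedily marks FP32 edges as BF16, inserts Convert layers after the
        // network inputs (insertConvertAfterInput above) and finally rolls
        // edges feeding unsupported layers back to FP32 (optimizeToFloat).
        bf16Transformer.convertToBFloat16(network);
    }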
30 changes: 27 additions & 3 deletions inference-engine/src/mkldnn_plugin/bf16transformer.h
@@ -8,15 +8,22 @@
#include <caseless.hpp>
#include <string>
#include <set>
#include <legacy/details/ie_cnn_network_tools.h>

namespace MKLDNNPlugin {

class BF16Transformer {
const InferenceEngine::details::caseless_set<std::string> _initbf16 =
{ "convolution", "fullyconnected", "innerproduct", "gemm" };
{ "convolution", "fullyconnected", "innerproduct", "gemm", "RegionYolo" };
const InferenceEngine::details::caseless_set<std::string> _complementbf16 =
{ "relu", "tanh", "elu", "square", "abs", "sqrt", "linear", "bounded_relu", "soft_relu", "logistic",
"exp", "gelu", "clamp", "swish", "prelu", "pooling", "norm", "gather", "memory" };
{ "relu", "tanh", "elu", "square", "abs", "sqrt", "linear", "bounded_relu", "soft_relu", "normalize",
"sigmoid", "ReLU6", "not", "activation", "HSwish", "mish", "logistic", "mod", "resample",
"exp", "gelu", "clamp", "swish", "prelu", "pooling", "norm", "gather", "memory", "mvn", "crop", "activation",
"broadcast", "convert", "BatchToSpace", "DepthToSpace", "ExtractImagePatches", "concat", "power", "lrn",
"permute", "ScatterUpdate", "ScatterElementsUpdate", "ScatterNDUpdate", "depthwise",
"select", "ShuffleChannels", "SpaceToBatch", "SpaceToDepth", "squeeze", "StridedSlice", "unsqueeze", "eltwise",
"ReduceAnd", "ReduceOr", "ReduceMax", "ReduceMin" };

const InferenceEngine::details::caseless_set<std::string> _multiinput =
{ "concat", "eltwise" };
// prevent fallback to fp32 without considering both input and output nodes
@@ -33,6 +40,13 @@ class BF16Transformer {
*/
bool tryToMarkFP32(InferenceEngine::DataPtr data, const std::set<InferenceEngine::DataPtr> &immutable);

/**
* Because the network input always stays in FP32, the layer that immediately follows the input cannot initiate bf16 execution on its own.
* We fix this by inserting a Convert layer, which is later replaced with a Reorder in the graph optimizer.
*
*/
void insertConvertAfterInput(InferenceEngine::CNNNetwork &network);

public:
/**
* Restores floating point data types on edges which go to unsupported layers
@@ -61,6 +75,16 @@ class BF16Transformer {
*/
void convertToBFloat16(InferenceEngine::CNNNetwork &network);

/**
* Inserts the given layer after the specified data tensor
*/
static void addLayerToCNNNetworkAfterData(
InferenceEngine::DataPtr parentOutData,
InferenceEngine::CNNLayerPtr layer,
const std::string& nextLayerName,
InferenceEngine::ICNNNetwork& net,
const int childInsDataIndex = -1);

InferenceEngine::MemoryBlob::Ptr convertBF16ToFloat(InferenceEngine::MemoryBlob::Ptr);
};

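The split between the two precision sets above is what bounds the greedy mode: a layer listed in _initbf16 starts a BF16 region, while a layer listed in _complementbf16 only keeps an already-BF16 tensor in BF16 and never initiates the conversion itself. A simplified decision sketch (illustrative only, not the transformer's actual control flow):

    // Sketch: can this layer's output edge stay in BF16?
    bool mayStayBF16(const std::string &layerType, bool producerIsBF16,
                     const InferenceEngine::details::caseless_set<std::string> &initbf16,
                     const InferenceEngine::details::caseless_set<std::string> &complementbf16) {
        if (initbf16.find(layerType) != initbf16.end())
            return true;                  // starts a BF16 subgraph
        if (producerIsBF16 && complementbf16.find(layerType) != complementbf16.end())
            return true;                  // propagates BF16 further down the graph
        return false;                     // edge will be rolled back to FP32
    }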
52 changes: 52 additions & 0 deletions inference-engine/src/mkldnn_plugin/mkldnn_graph_optimizer.cpp
@@ -145,6 +145,9 @@ void MKLDNNGraphOptimizer::ApplyImplSpecificGraphOptimizations(MKLDNNGraph &grap
graph.RemoveDroppedNodes();

#if defined (COMPILED_CPU_MKLDNN_REORDER_NODE)
ChangeConvertToReorder(graph);
graph.RemoveDroppedNodes();

DropDoubleReorders(graph);
graph.RemoveDroppedNodes();

Expand Down Expand Up @@ -1918,6 +1921,55 @@ void MKLDNNGraphOptimizer::DropConvertReorder(MKLDNNGraph& graph) {
}
}
}

void MKLDNNGraphOptimizer::ChangeConvertToReorder(MKLDNNGraph& graph) {
std::vector<Precision> continuousPrecisions{
Precision::BF16,
Precision::FP32
};
for (int ind = 0; ind < graph.GetNodes().size(); ind++) {
auto convertCandidate = graph.GetNodes().at(ind);
std::string nodeType = convertCandidate->getTypeStr();
if (!InferenceEngine::details::CaselessEq<std::string>()(nodeType, "convert")) {
continue;
}
auto inputPrecision = convertCandidate->getCnnLayer()->insData[0].lock()->getPrecision();
auto outputPrecision = convertCandidate->getCnnLayer()->outData[0]->getPrecision();
if (std::find(continuousPrecisions.begin(), continuousPrecisions.end(), inputPrecision) == continuousPrecisions.end() ||
std::find(continuousPrecisions.begin(), continuousPrecisions.end(), outputPrecision) == continuousPrecisions.end()) {
continue;
}
std::unordered_set<std::string> uniqueLayerNames;
for (auto node : graph.GetNodes()) {
uniqueLayerNames.insert(node->getCnnLayer()->name);
}
auto parentEdge = convertCandidate->getParentEdges()[0].lock();
auto parentNode = parentEdge->getParent();
auto &childEdge = convertCandidate->getChildEdgeAt(0);
auto childNode = childEdge->getChild();
std::string basicLayerName = childEdge->getParent()->getName() + "_" +
MKLDNNExtensionUtils::getReorderArgs(convertCandidate->getCnnLayer()->insData[0].lock()->getTensorDesc(),
convertCandidate->getCnnLayer()->outData[0]->getTensorDesc()) +
"_" + childEdge->getChild()->getName();
std::string layerName = basicLayerName;
int idx = 0;
while (uniqueLayerNames.find(layerName) != uniqueLayerNames.end()) {
idx++;
layerName = basicLayerName + "_" + std::to_string(idx);
}
// create temporary edge
auto oldParentOutputPort = parentEdge->getInputNum();
auto oldChildInputPort = childEdge->getOutputNum();
MKLDNNEdgePtr tempEdge(new MKLDNNEdge(parentNode, childNode, oldParentOutputPort, oldChildInputPort));

graph.InsertReorder(tempEdge, layerName, convertCandidate->getCnnLayer()->insData[0].lock()->getTensorDesc(),
convertCandidate->getCnnLayer()->outData[0]->getTensorDesc(), false);
parentNode->removeEdge(parentEdge);
parentEdge->drop();
childEdge->drop();
graph.DropNode(convertCandidate);
}
}
#endif

void MKLDNNGraphOptimizer::RemoveIOScaleShifts(MKLDNNGraph &graph) {
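The Reorder inserted by ChangeConvertToReorder performs the same numeric conversion as the bfloat16_t helper type this PR adds to the plugin: a bfloat16 value is simply the upper 16 bits of an IEEE-754 binary32 value (1 sign, 8 exponent, 7 mantissa bits). A minimal stand-alone sketch of that conversion, assuming round-to-nearest-even and ignoring NaN handling (the plugin's actual bfloat16_t class, with its 16-byte aligned union, is not shown in this excerpt):

    #include <cstdint>
    #include <cstring>

    // Truncate an FP32 value to bfloat16 with round-to-nearest-even.
    inline uint16_t fp32_to_bf16(float x) {
        uint32_t bits;
        std::memcpy(&bits, &x, sizeof(bits));
        uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1u);  // RNE bias
        return static_cast<uint16_t>((bits + rounding) >> 16);
    }

    // Expand a bfloat16 value back to FP32 by zero-filling the low 16 bits.
    inline float bf16_to_fp32(uint16_t x) {
        uint32_t bits = static_cast<uint32_t>(x) << 16;
        float result;
        std::memcpy(&result, &bits, sizeof(result));
        return result;
    }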
@@ -46,6 +46,7 @@ class MKLDNNGraphOptimizer {
#if defined (COMPILED_CPU_MKLDNN_REORDER_NODE)
void DropDoubleReorders(MKLDNNGraph& graph);
void DropConvertReorder(MKLDNNGraph& graph);
void ChangeConvertToReorder(MKLDNNGraph &graph);
#endif
void FuseConvolutionAndZeroPoints(MKLDNNGraph &graph);
void FuseBroadcastAndEltwise(MKLDNNGraph &graph);
@@ -105,6 +105,7 @@ void MKLDNNPlugin::MKLDNNInferRequest::PushInputData() {
// these precisions are supported by mkldnn, so we push the blob directly
case InferenceEngine::Precision::I8:
case InferenceEngine::Precision::I32:
case InferenceEngine::Precision::BF16:
case InferenceEngine::Precision::FP32: {
break;
}
1 change: 1 addition & 0 deletions inference-engine/src/mkldnn_plugin/mkldnn_plugin.cpp
@@ -278,6 +278,7 @@ Engine::LoadExeNetworkImpl(const InferenceEngine::ICNNNetwork &network, const st
input_precision != InferenceEngine::Precision::I16 &&
input_precision != InferenceEngine::Precision::I8 &&
input_precision != InferenceEngine::Precision::U8 &&
input_precision != InferenceEngine::Precision::BF16 &&
input_precision != InferenceEngine::Precision::BOOL &&
input_precision != InferenceEngine::Precision::I64 &&
input_precision != InferenceEngine::Precision::U64) {
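With BF16 now accepted as a network input precision, bfloat16 execution can be requested from user code through the plugin configuration. A usage sketch, assuming this OpenVINO version exposes the ENFORCE_BF16 config key that drives the greedy mode added in this PR:

    #include <inference_engine.hpp>

    int main() {
        InferenceEngine::Core ie;
        auto network = ie.ReadNetwork("model.xml");  // hypothetical model path
        // Ask the CPU plugin to execute supported layers in bfloat16.
        auto exeNetwork = ie.LoadNetwork(network, "CPU",
            {{InferenceEngine::PluginConfigParams::KEY_ENFORCE_BF16,
              InferenceEngine::PluginConfigParams::YES}});
        return 0;
    }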
2 changes: 1 addition & 1 deletion inference-engine/src/mkldnn_plugin/nodes/argmax.cpp
@@ -27,7 +27,7 @@ class ArgMaxImpl: public ExtLayerBase {
conf.axis_index_ = conf.has_axis_ ?
std::stoi(layer->params.at("axis")) :0;

addConfig(layer, {DataConfigurator(ConfLayout::PLN)}, {DataConfigurator(ConfLayout::PLN)});
addConfig(layer, {DataConfigurator(ConfLayout::PLN, Precision::FP32)}, {DataConfigurator(ConfLayout::PLN, Precision::FP32)});
} catch (InferenceEngine::details::InferenceEngineException &ex) {
errorMsg = ex.what();
}
11 changes: 2 additions & 9 deletions inference-engine/src/mkldnn_plugin/nodes/base.hpp
@@ -60,8 +60,8 @@ class ExtLayerBase: public ILayerExecImpl {
explicit DataConfigurator(ConfLayout l):
layout(l) {}

DataConfigurator(ConfLayout l, bool constant, int inplace = -1):
layout(l), constant(constant), inplace(inplace) {}
DataConfigurator(ConfLayout l, bool constant, int inplace = -1, Precision::ePrecision prc = Precision::UNSPECIFIED):
layout(l), constant(constant), inplace(inplace), prc(prc) {}

DataConfigurator(ConfLayout l, Precision::ePrecision prc):
layout(l), prc(prc) {}
@@ -128,14 +128,7 @@
conf.layout = ConfLayout::PLN;
}

// All extension layers support only FP32 precision!
// fixing of BF16 precisions where they are - layers naturally support only FP32
// if we see BF16, that means another floating point format which will be converted by reorder
// added by current mkl-dnn cpu plugin when it figure out diff in data types on input and output of edges
InferenceEngine::Precision precision = (conf.prc == Precision::UNSPECIFIED) ? data_desc.getPrecision() : Precision(conf.prc);
if (precision == Precision::BF16) {
precision = Precision::FP32;
}
if (conf.layout == ConfLayout::ANY) {
dataConfig.desc = TensorDesc(precision, data_dims, InferenceEngine::Layout::ANY);
} else {
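The effect of this base.hpp change is that the blanket BF16-to-FP32 override for extension layers is gone; each layer now pins the precision of its ports explicitly through DataConfigurator, as the argmax.cpp hunk above already does. Inside a layer constructor, an FP32-only kernel would be configured like this (sketch mirroring the argmax.cpp change):

    // Pin input and output ports of an FP32-only extension layer to planar FP32;
    // the plugin then inserts reorders around it when the rest of the graph runs in BF16.
    addConfig(layer,
              {DataConfigurator(ConfLayout::PLN, Precision::FP32)},
              {DataConfigurator(ConfLayout::PLN, Precision::FP32)});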
2 changes: 1 addition & 1 deletion inference-engine/src/mkldnn_plugin/nodes/broadcast.cpp
@@ -31,7 +31,7 @@ class BroadcastImpl: public ExtLayerBase {

LayerConfig config;
DataConfig dataConfig, shapeConfig;
Precision dataPrecision = layer->outData[0]->getTensorDesc().getPrecision();
Precision dataPrecision = layer->insData[BROADCAST_INPUT].lock()->getTensorDesc().getPrecision();
const SizeVector& data_dims = layer->insData[BROADCAST_INPUT].lock()->getTensorDesc().getDims();
dataConfig.desc = TensorDesc(dataPrecision, data_dims,
layer->insData[BROADCAST_INPUT].lock()->getTensorDesc().getLayout());