diff --git a/docs/MO_DG/img/compressed_int8_Convolution_weights.png b/docs/MO_DG/img/compressed_int8_Convolution_weights.png
index ea3c831b1cc2cb..f4333b5e1a7999 100644
--- a/docs/MO_DG/img/compressed_int8_Convolution_weights.png
+++ b/docs/MO_DG/img/compressed_int8_Convolution_weights.png
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:6c9ddc759bc419268f4c23089b91a9e3373114a1d36b01d6fe62a5e87b5c0ad4
-size 59827
+oid sha256:4b14b03ebb6a00b5f52a8404282f83d4ad214c8d04aea74738027a775c4ef545
+size 100581
diff --git a/docs/MO_DG/img/expanded_int8_Convolution_weights.png b/docs/MO_DG/img/expanded_int8_Convolution_weights.png
index 918e2376a482fe..f250f509191eec 100644
--- a/docs/MO_DG/img/expanded_int8_Convolution_weights.png
+++ b/docs/MO_DG/img/expanded_int8_Convolution_weights.png
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:59890c0c4a6d1c721dfaca22f0c1d0b305401f75dcd30418f858382830be2d31
-size 49598
+oid sha256:cbfadd457b4d943ffb46906a7daf03516e971fe49d2806cd32c84c5015178f03
+size 92819
diff --git a/docs/MO_DG/prepare_model/convert_model/IR_suitable_for_INT8_inference.md b/docs/MO_DG/prepare_model/convert_model/IR_suitable_for_INT8_inference.md
index eda5d768c47fed..fa4bdb50554913 100644
--- a/docs/MO_DG/prepare_model/convert_model/IR_suitable_for_INT8_inference.md
+++ b/docs/MO_DG/prepare_model/convert_model/IR_suitable_for_INT8_inference.md
@@ -2,36 +2,36 @@
 
 ## Introduction
 
-Inference Engine CPU plugin can infer models in the 8-bit integer (INT8) precision.
-For details, refer to [INT8 inference on the CPU](../../../IE_DG/Int8Inference.md).
+Inference Engine CPU and GPU plugins can infer models in low precision.
+For details, refer to [Low Precision Inference](../../../IE_DG/Int8Inference.md).
 
-Intermediate Representation (IR) should be specifically formed to be suitable for INT8 inference.
-Such an IR is called an INT8 IR and you can generate it in two ways:
-- [Quantize model with the Post-Training Optimization tool](@ref pot_README)
-- Use the Model Optimizer for TensorFlow\* pre-TFLite models (`.pb` model file with `FakeQuantize*` operations)
+Intermediate Representation (IR) should be specifically formed to be suitable for low precision inference.
+Such an IR is called a Low Precision IR and you can generate it in two ways:
+- [Quantize a regular IR with the Post-Training Optimization tool](@ref pot_README)
+- Use the Model Optimizer for a model pretrained for low precision inference: TensorFlow\* pre-TFLite models (a `.pb` model file with `FakeQuantize*` operations) and ONNX\* quantized models.
+Both TensorFlow and ONNX quantized models can be prepared with the [Neural Network Compression Framework](https://github.com/openvinotoolkit/nncf/blob/develop/README.md).
 
-For an operation to be executed in INT8, it must have `FakeQuantize` operations as inputs with the `levels` attribute set to `255` or `256`.
+For an operation to be executed in INT8, it must have `FakeQuantize` operations as inputs.
 See the [specification of `FakeQuantize` operation](../../../ops/quantization/FakeQuantize_1.md) for details.
-To see the list of supported INT8 layers, refer to [INT8 inference on the CPU](../../../IE_DG/Int8Inference.md).
 
 To execute the `Convolution` operation in INT8 on CPU, both data and weight inputs should have `FakeQuantize` as an input operation:
 ![](../../img/expanded_int8_Convolution_weights.png)
 
-INT8 IR is also suitable for FP32 and FP16 inference if a chosen plugin supports all operations of the IR, because the only difference between an INT8 IR and FP16 or FP32 IR is the existence of `FakeQuantize` in the INT8 IR.
-Plugins with INT8 inference support recognize these sub-graphs and quantize them during the inference time.
-Plugins without INT8 support execute all operations, including `FakeQuantize`, as is in the FP32 or FP16 precision.
+A Low Precision IR is also suitable for FP32 and FP16 inference if a chosen plugin supports all operations of the IR, because the only difference between a Low Precision IR and an FP16 or FP32 IR is the presence of `FakeQuantize` operations in the Low Precision IR.
+Plugins with low precision inference support recognize these sub-graphs and quantize them at inference time.
+Plugins without low precision support execute all operations, including `FakeQuantize`, as is in FP32 or FP16 precision.
 Accordingly, the presence of FakeQuantize operations in the IR is a recommendation for a plugin on how to quantize particular operations in the model.
-If capable, a plugin accepts the recommendation and performs INT8 inference, otherwise the plugin ignores the recommendation and executes a model in the floating-point precision.
+If capable, a plugin accepts the recommendation and performs low precision inference; otherwise, it ignores the recommendation and executes the model in floating-point precision.
 
-## Compressed INT8 Weights
+## Compressed Low Precision Weights
 
 Weighted operations, like `Convolution`, `MatMul`, and others, store weights as floating-point `Constant` in the graph followed by the `FakeQuantize` operation.
 `Constant` followed by the `FakeQuantize` operation could be optimized memory-wise due to the `FakeQuantize` operation semantics.
-The resulting weights sub-graph stores weights in INT8 `Constant`, which gets unpacked back to floating point with the `Convert` operation.
-Weights compression leaves `FakeQuantize` output arithmetically the same and weights storing takes four times less memory.
+The resulting weights sub-graph stores weights in a low precision `Constant`, which gets unpacked back to floating point with the `Convert` operation.
+Weights compression replaces `FakeQuantize` with optional `Subtract` and `Multiply` operations, leaving the output arithmetically the same while the stored weights take four times less memory.
 
 See the visualization of `Convolution` with the compressed weights:
 ![](../../img/compressed_int8_Convolution_weights.png)
 
-Both Model Optimizer and Post-Training Optimization tool generate a compressed IR by default. To generate an expanded INT8 IR, use `--disable_weights_compression`.
\ No newline at end of file
+Both the Model Optimizer and the Post-Training Optimization tool generate a compressed IR by default.
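
The updated "Compressed Low Precision Weights" section describes the decompression sub-graph only in words. Below is a minimal NumPy sketch, not part of the patch above, illustrating why an 8-bit `Constant` followed by `Convert`, `Subtract`, and `Multiply` can reproduce the `FakeQuantize` output while storing each weight in one byte instead of four. The tensor shape, quantization range, and per-tensor scheme are hypothetical choices made for the example.

```python
# Illustrative sketch only: 8-bit weight compression vs. FakeQuantize on FP32 weights.
import numpy as np

levels = 256
w_fp32 = np.random.uniform(-1.0, 1.0, size=(8, 3, 3, 3)).astype(np.float32)

# FakeQuantize range (per-tensor here for simplicity; real IRs often quantize per output channel)
in_low, in_high = float(w_fp32.min()), float(w_fp32.max())
scale = (in_high - in_low) / (levels - 1)

# What FakeQuantize produces for the FP32 weights (output range equals input range)
q = np.round((np.clip(w_fp32, in_low, in_high) - in_low) / scale)  # integer levels in [0, 255]
fq_output = q * scale + in_low

# Compressed form: a signed 8-bit Constant plus a floating-point decompression sub-graph
w_int8 = (q - 128).astype(np.int8)                          # stored Constant, 1 byte per weight
shift = -128.0 - in_low / scale                             # constant consumed by Subtract
decompressed = (w_int8.astype(np.float32) - shift) * scale  # Convert -> Subtract -> Multiply

assert np.allclose(fq_output, decompressed, atol=1e-5)
```

When the decompression shift happens to be zero, the `Subtract` contributes nothing and can be dropped, which is consistent with the section describing `Subtract` and `Multiply` as optional.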