
Low Precision IR documentation (openvinotoolkit#5791)
* Low Precision IR documentation

* Apply suggestions from code review

Co-authored-by: Anastasiya Ageeva <[email protected]>

2 people authored and rnugmanx committed Aug 26, 2021
1 parent fb96aa0 commit 25a5e64
Showing 3 changed files with 20 additions and 20 deletions.
4 changes: 2 additions & 2 deletions docs/MO_DG/img/compressed_int8_Convolution_weights.png
4 changes: 2 additions & 2 deletions docs/MO_DG/img/expanded_int8_Convolution_weights.png

## Introduction

Inference Engine CPU and GPU plugins can infer models in low precision.
For details, refer to [Low Precision Inference on the CPU](../../../IE_DG/Int8Inference.md).

The Intermediate Representation (IR) should be specifically formed to be suitable for low precision inference.
Such an IR is called a Low Precision IR and you can generate it in two ways:
- [Quantize a regular IR with the Post-Training Optimization tool](@ref pot_README)
- Use the Model Optimizer for a model pretrained for low precision inference: TensorFlow\* pre-TFLite models (`.pb` model files with `FakeQuantize*` operations) and ONNX\* quantized models. Both TensorFlow and ONNX quantized models can be prepared with the [Neural Network Compression Framework](https://github.com/openvinotoolkit/nncf/blob/develop/README.md). A minimal invocation sketch is shown after this list.
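The following Python sketch drives both generation paths via `subprocess`; it is an illustration only, assuming the `mo` and `pot` command-line entry points from the `openvino-dev` package are installed, that `quantized_model.onnx` already contains quantization operations, and that `quantization_config.json` is a valid Post-Training Optimization tool configuration (exact option names may differ between releases):

```python
# Illustration only: drive the two Low Precision IR generation paths from Python.
# Assumes the `mo` and `pot` CLIs (openvino-dev package) are on PATH and that the
# referenced model and configuration files exist; option names may vary by release.
import subprocess

# Path 1: convert a model that was already quantized during or after training
# (e.g. an ONNX model with FakeQuantize operations produced by NNCF).
subprocess.run(
    ["mo", "--input_model", "quantized_model.onnx", "--output_dir", "low_precision_ir"],
    check=True,
)

# Path 2: quantize a regular FP32 IR with the Post-Training Optimization tool,
# driven by a JSON configuration that points to the IR and a calibration dataset.
subprocess.run(["pot", "-c", "quantization_config.json"], check=True)
```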

For an operation to be executed in INT8, it must have `FakeQuantize` operations as inputs.
See the [specification of `FakeQuantize` operation](../../../ops/quantization/FakeQuantize_1.md) for details.
To see the list of supported INT8 layers, refer to [Low Precision Inference on the CPU](../../../IE_DG/Int8Inference.md).
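
For intuition, the following NumPy sketch implements the per-element `FakeQuantize` formula for the per-tensor case only; the linked specification remains the authoritative definition (including broadcasting and per-channel ranges):

```python
# NumPy sketch of the per-element FakeQuantize formula (per-tensor case only).
import numpy as np

def fake_quantize(x, input_low, input_high, output_low, output_high, levels=256):
    """Snap x to one of `levels` evenly spaced values inside [input_low, input_high],
    then rescale the result to the [output_low, output_high] range."""
    x = np.asarray(x, dtype=np.float32)
    below = x <= min(input_low, input_high)    # saturates to output_low
    above = x > max(input_low, input_high)     # saturates to output_high
    q = np.round((x - input_low) / (input_high - input_low) * (levels - 1))
    y = q / (levels - 1) * (output_high - output_low) + output_low
    return np.where(below, output_low, np.where(above, output_high, y))

# 255 or 256 levels correspond to 8-bit quantization of the value range.
print(fake_quantize([-1.5, -0.2, 0.7, 3.0], -1.0, 1.0, -1.0, 1.0))
```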

To execute the `Convolution` operation in INT8 on CPU, both data and weight inputs should have `FakeQuantize` as an input operation:
![](../../img/expanded_int8_Convolution_weights.png)

A Low Precision IR is also suitable for FP32 and FP16 inference if a chosen plugin supports all operations of the IR, because the only difference between a Low Precision IR and an FP16 or FP32 IR is the presence of `FakeQuantize` operations in the Low Precision IR.
Plugins with low precision inference support recognize these sub-graphs and quantize them at inference time.
Plugins without low precision support execute all operations, including `FakeQuantize`, as is, in FP32 or FP16 precision.

Accordingly, the presence of `FakeQuantize` operations in the IR is a recommendation for a plugin on how to quantize particular operations in the model.
If capable, a plugin accepts the recommendation and performs low precision inference; otherwise, it ignores the recommendation and executes the model in floating-point precision.
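
As an illustration of this behavior, the sketch below loads the same IR through the Inference Engine Python API on different devices. It assumes a Low Precision IR stored as `model.xml`/`model.bin`; whether the `FakeQuantize` sub-graphs are actually executed in INT8 is decided by each plugin, not by the calling code:

```python
# Illustration only: the same Low Precision IR is loaded unchanged on several devices.
# Assumes model.xml/model.bin exist; available devices depend on the machine.
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")

input_name = next(iter(net.input_info))
input_shape = net.input_info[input_name].input_data.shape

for device in ("CPU", "GPU"):
    try:
        exec_net = ie.load_network(network=net, device_name=device)
    except RuntimeError:
        continue  # the device or its plugin is not available in this environment
    outputs = exec_net.infer({input_name: np.zeros(input_shape, dtype=np.float32)})
    print(device, {name: out.shape for name, out in outputs.items()})
```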

## Compressed Low Precision Weights

Weighted operations, like `Convolution`, `MatMul`, and others, store weights as a floating-point `Constant` in the graph, followed by the `FakeQuantize` operation.
A `Constant` followed by the `FakeQuantize` operation can be optimized memory-wise due to the `FakeQuantize` operation semantics.
The resulting weights sub-graph stores weights in a Low Precision `Constant`, which gets unpacked back to floating point with the `Convert` operation.
Weights compression replaces `FakeQuantize` with optional `Subtract` and `Multiply` operations that leave the output arithmetically the same, while the stored weights take four times less memory.
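The equivalence can be illustrated with NumPy under simplified assumptions (per-tensor quantization, 256 levels, identical input and output ranges); this is not the tool's actual compression code:

```python
# Illustration of why the compressed weights sub-graph is arithmetically equivalent:
# int8 Constant -> Convert -> Subtract -> Multiply reproduces the values that the
# original FP32 Constant followed by FakeQuantize would produce.
import numpy as np

levels = 256
w_fp32 = np.random.uniform(-0.5, 0.5, size=(64, 3, 3, 3)).astype(np.float32)
low, high = float(w_fp32.min()), float(w_fp32.max())
scale = (high - low) / (levels - 1)

# Expanded IR: FakeQuantize applied to the FP32 weight Constant.
codes = np.round((w_fp32 - low) / scale)            # integer codes in [0, 255]
fq_out = codes * scale + low

# Compressed IR: only int8 codes are stored (1 byte per value instead of 4).
w_int8 = (codes - 128).astype(np.int8)              # shift into the signed int8 range
zero_point = -low / scale - 128                     # the `Subtract` constant
decompressed = (w_int8.astype(np.float32) - zero_point) * scale  # Convert -> Subtract -> Multiply

print(np.allclose(decompressed, fq_out, atol=1e-5))  # True: same arithmetic result
print(w_fp32.nbytes // w_int8.nbytes)                # 4: four times less memory
```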

See the visualization of `Convolution` with the compressed weights:
![](../../img/compressed_int8_Convolution_weights.png)

Both Model Optimizer and Post-Training Optimization tool generate a compressed IR by default.
