
Low Precision IR documentation (openvinotoolkit#5791)
* Low Precision IR documentation

* Apply suggestions from code review

Co-authored-by: Anastasiya Ageeva <[email protected]>

2 people authored and rnugmanx committed Aug 26, 2021
1 parent fb96aa0 commit 25a5e64
Showing 3 changed files with 20 additions and 20 deletions.
4 changes: 2 additions & 2 deletions docs/MO_DG/img/compressed_int8_Convolution_weights.png
4 changes: 2 additions & 2 deletions docs/MO_DG/img/expanded_int8_Convolution_weights.png

## Introduction

Inference Engine CPU and GPU plugins can infer models in low precision.
For details, refer to [Low Precision Inference on the CPU](../../../IE_DG/Int8Inference.md).

The Intermediate Representation (IR) should be specifically formed to be suitable for low precision inference.
Such an IR is called a Low Precision IR and you can generate it in two ways:
- [Quantize a regular IR with the Post-Training Optimization tool](@ref pot_README)
- Use the Model Optimizer for a model pretrained for low precision inference: TensorFlow\* pre-TFLite models (`.pb` model files with `FakeQuantize*` operations) and ONNX\* quantized models. Both TensorFlow and ONNX quantized models can be prepared with the [Neural Network Compression Framework](https://github.com/openvinotoolkit/nncf/blob/develop/README.md). A minimal invocation sketch is shown after this list.
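The following Python sketch drives both generation paths via `subprocess`; it is an illustration only, assuming the `mo` and `pot` command-line entry points from the `openvino-dev` package are installed, that `quantized_model.onnx` already contains quantization operations, and that `quantization_config.json` is a valid Post-Training Optimization tool configuration (exact option names may differ between releases):

```python
# Illustration only: drive the two Low Precision IR generation paths from Python.
# Assumes the `mo` and `pot` CLIs (openvino-dev package) are on PATH and that the
# referenced model and configuration files exist; option names may vary by release.
import subprocess

# Path 1: convert a model that was already quantized during or after training
# (e.g. an ONNX model with FakeQuantize operations produced by NNCF).
subprocess.run(
    ["mo", "--input_model", "quantized_model.onnx", "--output_dir", "low_precision_ir"],
    check=True,
)

# Path 2: quantize a regular FP32 IR with the Post-Training Optimization tool,
# driven by a JSON configuration that points to the IR and a calibration dataset.
subprocess.run(["pot", "-c", "quantization_config.json"], check=True)
```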

For an operation to be executed in INT8, it must have `FakeQuantize` operations as inputs.
See the [specification of `FakeQuantize` operation](../../../ops/quantization/FakeQuantize_1.md) for details.
To see the list of supported INT8 layers, refer to [Low Precision Inference on the CPU](../../../IE_DG/Int8Inference.md).
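
For intuition, the following NumPy sketch implements the per-element `FakeQuantize` formula for the per-tensor case only; the linked specification remains the authoritative definition (including broadcasting and per-channel ranges):

```python
# NumPy sketch of the per-element FakeQuantize formula (per-tensor case only).
import numpy as np

def fake_quantize(x, input_low, input_high, output_low, output_high, levels=256):
    """Snap x to one of `levels` evenly spaced values inside [input_low, input_high],
    then rescale the result to the [output_low, output_high] range."""
    x = np.asarray(x, dtype=np.float32)
    below = x <= min(input_low, input_high)    # saturates to output_low
    above = x > max(input_low, input_high)     # saturates to output_high
    q = np.round((x - input_low) / (input_high - input_low) * (levels - 1))
    y = q / (levels - 1) * (output_high - output_low) + output_low
    return np.where(below, output_low, np.where(above, output_high, y))

# 255 or 256 levels correspond to 8-bit quantization of the value range.
print(fake_quantize([-1.5, -0.2, 0.7, 3.0], -1.0, 1.0, -1.0, 1.0))
```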

To execute the `Convolution` operation in INT8 on CPU, both data and weight inputs should have `FakeQuantize` as an input operation:
![](../../img/expanded_int8_Convolution_weights.png)

A Low Precision IR is also suitable for FP32 and FP16 inference if a chosen plugin supports all operations of the IR, because the only difference between a Low Precision IR and an FP16 or FP32 IR is the presence of `FakeQuantize` operations in the Low Precision IR.
Plugins with low precision inference support recognize these sub-graphs and quantize them at inference time.
Plugins without low precision support execute all operations, including `FakeQuantize`, as is, in FP32 or FP16 precision.

Accordingly, the presence of `FakeQuantize` operations in the IR is a recommendation for a plugin on how to quantize particular operations in the model.
If capable, a plugin accepts the recommendation and performs low precision inference; otherwise, it ignores the recommendation and executes the model in floating-point precision.
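
As an illustration of this behavior, the sketch below loads the same IR through the Inference Engine Python API on different devices. It assumes a Low Precision IR stored as `model.xml`/`model.bin`; whether the `FakeQuantize` sub-graphs are actually executed in INT8 is decided by each plugin, not by the calling code:

```python
# Illustration only: the same Low Precision IR is loaded unchanged on several devices.
# Assumes model.xml/model.bin exist; available devices depend on the machine.
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")

input_name = next(iter(net.input_info))
input_shape = net.input_info[input_name].input_data.shape

for device in ("CPU", "GPU"):
    try:
        exec_net = ie.load_network(network=net, device_name=device)
    except RuntimeError:
        continue  # the device or its plugin is not available in this environment
    outputs = exec_net.infer({input_name: np.zeros(input_shape, dtype=np.float32)})
    print(device, {name: out.shape for name, out in outputs.items()})
```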

## Compressed Low Precision Weights

Weighted operations, like `Convolution`, `MatMul`, and others, store weights as a floating-point `Constant` in the graph, followed by the `FakeQuantize` operation.
A `Constant` followed by the `FakeQuantize` operation can be optimized memory-wise due to the `FakeQuantize` operation semantics.
The resulting weights sub-graph stores weights in a Low Precision `Constant`, which gets unpacked back to floating point with the `Convert` operation.
Weights compression replaces `FakeQuantize` with optional `Subtract` and `Multiply` operations that leave the output arithmetically the same, while the stored weights take four times less memory.
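The equivalence can be illustrated with NumPy under simplified assumptions (per-tensor quantization, 256 levels, identical input and output ranges); this is not the tool's actual compression code:

```python
# Illustration of why the compressed weights sub-graph is arithmetically equivalent:
# int8 Constant -> Convert -> Subtract -> Multiply reproduces the values that the
# original FP32 Constant followed by FakeQuantize would produce.
import numpy as np

levels = 256
w_fp32 = np.random.uniform(-0.5, 0.5, size=(64, 3, 3, 3)).astype(np.float32)
low, high = float(w_fp32.min()), float(w_fp32.max())
scale = (high - low) / (levels - 1)

# Expanded IR: FakeQuantize applied to the FP32 weight Constant.
codes = np.round((w_fp32 - low) / scale)            # integer codes in [0, 255]
fq_out = codes * scale + low

# Compressed IR: only int8 codes are stored (1 byte per value instead of 4).
w_int8 = (codes - 128).astype(np.int8)              # shift into the signed int8 range
zero_point = -low / scale - 128                     # the `Subtract` constant
decompressed = (w_int8.astype(np.float32) - zero_point) * scale  # Convert -> Subtract -> Multiply

print(np.allclose(decompressed, fq_out, atol=1e-5))  # True: same arithmetic result
print(w_fp32.nbytes // w_int8.nbytes)                # 4: four times less memory
```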

See the visualization of `Convolution` with the compressed weights:
![](../../img/compressed_int8_Convolution_weights.png)

Both Model Optimizer and Post-Training Optimization tool generate a compressed IR by default.
