diff --git a/docs/_static/images/DEVELOPMENT_FLOW_V3_crunch.svg b/docs/_static/images/DEVELOPMENT_FLOW_V3_crunch.svg index 8d7cb0cde475b2..06260ca93d50cb 100644 --- a/docs/_static/images/DEVELOPMENT_FLOW_V3_crunch.svg +++ b/docs/_static/images/DEVELOPMENT_FLOW_V3_crunch.svg @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:40dfcea78a329e99d4bf2ead79678df96f1d392dcc4f278cf0b30ad3e8c7c795 -size 207081 +oid sha256:33de308a6476f054ae4d0b1ca356659003c8ba36cf9583f08963663259c0c1d4 +size 263357 diff --git a/docs/_static/images/WHAT_TO_USE.svg b/docs/_static/images/WHAT_TO_USE.svg index 17656ed944fbb1..5a87c4558221db 100644 --- a/docs/_static/images/WHAT_TO_USE.svg +++ b/docs/_static/images/WHAT_TO_USE.svg @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:285a60ebb9d6b7b4c5a8394e7c6eeed66ecf009f36b36a74c51bb28e6ee699ca -size 271583 +oid sha256:b71a90fd9ec78356eef5ef0c9d80831c1439fbfc05d42fc0ad648f4b5aa151aa +size 286982 diff --git a/docs/_static/images/workflow_simple.svg b/docs/_static/images/workflow_simple.svg index ddf10534e20799..93714e17321539 100644 --- a/docs/_static/images/workflow_simple.svg +++ b/docs/_static/images/workflow_simple.svg @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:12af46d0d211361d9f195720329ee76f9d7c50d4ec8378cd33ef47595345795b -size 32625 +oid sha256:84fc7114eef9ad310d72abc5d8f59b076d30031e0a42f18d518acc02e19bcc8d +size 59755 diff --git a/docs/home.rst b/docs/home.rst index 6ccdff8252b51b..3cbff831e0d91a 100644 --- a/docs/home.rst +++ b/docs/home.rst @@ -69,7 +69,7 @@ You can integrate and offload to accelerators additional operations for pre- and Model Quantization and Compression ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Boost your model’s speed even further with quantization and other state-of-the-art compression techniques available in OpenVINO’s Post-Training Optimization Tool and Neural Network Compression Framework. 
These techniques also reduce your model size and memory requirements, allowing it to be deployed on resource-constrained edge hardware. +Boost your model’s speed even further with quantization and other state-of-the-art compression techniques available in OpenVINO’s Neural Network Compression Framework. These techniques also reduce your model size and memory requirements, allowing it to be deployed on resource-constrained edge hardware. .. panels:: :card: homepage-panels diff --git a/docs/optimization_guide/model_optimization_guide.md b/docs/optimization_guide/model_optimization_guide.md index 5936cf9c9b7563..5f7d501e09b845 100644 --- a/docs/optimization_guide/model_optimization_guide.md +++ b/docs/optimization_guide/model_optimization_guide.md @@ -8,40 +8,30 @@ ptq_introduction tmo_introduction - (Experimental) Protecting Model -Model optimization is an optional offline step of improving final model performance by applying special optimization methods, such as quantization, pruning, preprocessing optimization, etc. OpenVINO provides several tools to optimize models at different steps of model development: +Model optimization is an optional offline step of improving the final model performance and reducing the model size by applying special optimization methods, such as 8-bit quantization, pruning, etc. OpenVINO offers two optimization paths implemented in `Neural Network Compression Framework (NNCF) `__: -- :doc:`Model Optimizer ` implements most of the optimization parameters to a model by default. Yet, you are free to configure mean/scale values, batch size, RGB vs BGR input channels, and other parameters to speed up preprocess of a model (:doc:`Embedding Preprocessing Computation `). +- :doc:`Post-training Quantization ` is designed to optimize the inference of deep learning models by applying the post-training 8-bit integer quantization that does not require model retraining or fine-tuning. 
-- :doc:`Post-training Quantization ` is designed to optimize inference of deep learning models by applying post-training methods that do not require model retraining or fine-tuning, for example, post-training 8-bit integer quantization. +- :doc:`Training-time Optimization `, a suite of advanced methods for training-time model optimization within the DL framework, such as PyTorch and TensorFlow 2.x. It supports methods like Quantization-aware Training, Structured and Unstructured Pruning, etc. -- :doc:`Training-time Optimization `, a suite of advanced methods for training-time model optimization within the DL framework, such as PyTorch and TensorFlow 2.x. It supports methods, like Quantization-aware Training and Filter Pruning. NNCF-optimized models can be inferred with OpenVINO using all the available workflows. +.. note:: OpenVINO also supports optimized models (for example, quantized) from source frameworks such as PyTorch, TensorFlow, and ONNX (in Q/DQ format). No special steps are required in this case and optimized models can be converted to the OpenVINO Intermediate Representation format (IR) right away. +Post-training Quantization is the fastest way to optimize a model and should be applied first, but it is limited in terms of achievable accuracy-performance trade-off. In case of poor accuracy or performance after Post-training Quantization, Training-time Optimization can be used as an option. -Detailed workflow: -################## - -To understand which development optimization tool you need, refer to the diagram: +Once the model is optimized using the aforementioned methods, it can be used for inference using the regular OpenVINO inference workflow. No changes to the inference code are required. .. image:: _static/images/DEVELOPMENT_FLOW_V3_crunch.svg -Post-training methods are limited in terms of achievable accuracy-performance trade-off for optimizing models. In this case, training-time optimization with NNCF is an option. 
- -Once the model is optimized using the aforementioned tools it can be used for inference using the regular OpenVINO inference workflow. No changes to the inference code are required. - .. image:: _static/images/WHAT_TO_USE.svg -Post-training methods are limited in terms of achievable accuracy, which may degrade for certain scenarios. In such cases, training-time optimization with NNCF may give better results. - -Once the model has been optimized using the aforementioned tools, it can be used for inference using the regular OpenVINO inference workflow. No changes to the code are required. - -If you are not familiar with model optimization methods, refer to :doc:`post-training methods `. - Additional Resources #################### +- :doc:`Post-training Quantization ` +- :doc:`Training-time Optimization ` - :doc:`Deployment optimization ` +- `HuggingFace Optimum Intel `__ @endsphinxdirective diff --git a/docs/optimization_guide/nncf/filter_pruning.md b/docs/optimization_guide/nncf/filter_pruning.md index 7633d2e2400751..68ee00c406d1a9 100644 --- a/docs/optimization_guide/nncf/filter_pruning.md +++ b/docs/optimization_guide/nncf/filter_pruning.md @@ -5,15 +5,15 @@ Introduction #################### -Filter pruning is an advanced optimization method which allows reducing computational complexity of the model by removing -redundant or unimportant filters from convolutional operations of the model. This removal is done in two steps: +Filter pruning is an advanced optimization method that allows reducing the computational complexity of the model by removing +redundant or unimportant filters from the convolutional operations of the model. This removal is done in two steps: 1. Unimportant filters are zeroed out by the NNCF optimization with fine-tuning. 2. Zero filters are removed from the model during the export to OpenVINO Intermediate Representation (IR). 
-Filter Pruning method from the NNCF can be used stand-alone but we usually recommend to stack it with 8-bit quantization for +Filter Pruning method from the NNCF can be used stand-alone but we usually recommend stacking it with 8-bit quantization for two reasons. First, 8-bit quantization is the best method in terms of achieving the highest accuracy-performance trade-offs so stacking it with filter pruning can give even better performance results. Second, applying quantization along with filter pruning does not hurt accuracy a lot since filter pruning removes noisy filters from the model which narrows down values @@ -37,17 +37,21 @@ Here, we show the basic steps to modify the training script for the model and us In this step, NNCF-related imports are added in the beginning of the training script: -.. tab:: PyTorch +.. tab-set:: - .. doxygensnippet:: docs/optimization_guide/nncf/code/pruning_torch.py - :language: python - :fragment: [imports] + .. tab-item:: PyTorch + :sync: pytorch + + .. doxygensnippet:: docs/optimization_guide/nncf/code/pruning_torch.py + :language: python + :fragment: [imports] + + .. tab-item:: TensorFlow 2 + :sync: tensorflow -.. tab:: TensorFlow 2 - - .. doxygensnippet:: docs/optimization_guide/nncf/code/pruning_tf.py - :language: python - :fragment: [imports] + .. doxygensnippet:: docs/optimization_guide/nncf/code/pruning_tf.py + :language: python + :fragment: [imports] 2. Create NNCF configuration ++++++++++++++++++++++++++++ @@ -55,26 +59,30 @@ In this step, NNCF-related imports are added in the beginning of the training sc Here, you should define NNCF configuration which consists of model-related parameters (`"input_info"` section) and parameters of optimization methods (`"compression"` section). -.. tab:: PyTorch - - .. doxygensnippet:: docs/optimization_guide/nncf/code/pruning_torch.py - :language: python - :fragment: [nncf_congig] - -.. tab:: TensorFlow 2 - - .. 
doxygensnippet:: docs/optimization_guide/nncf/code/pruning_tf.py - :language: python - :fragment: [nncf_congig] - -Here is a brief description of the required parameters of the Filter Pruning method. For full description refer to the +.. tab-set:: + + .. tab-item:: PyTorch + :sync: pytorch + + .. doxygensnippet:: docs/optimization_guide/nncf/code/pruning_torch.py + :language: python + :fragment: [nncf_congig] + + .. tab-item:: TensorFlow 2 + :sync: tensorflow + + .. doxygensnippet:: docs/optimization_guide/nncf/code/pruning_tf.py + :language: python + :fragment: [nncf_congig] + +Here is a brief description of the required parameters of the Filter Pruning method. For a full description refer to the `GitHub `__ page. * ``pruning_init`` - initial pruning rate target. For example, value ``0.1`` means that at the begging of training, convolutions that can be pruned will have 10% of their filters set to zero. * ``pruning_target`` - pruning rate target at the end of the schedule. For example, the value ``0.5`` means that at the epoch with the number of ``num_init_steps + pruning_steps``, convolutions that can be pruned will have 50% of their filters set to zero. -* ``pruning_steps` - the number of epochs during which the pruning rate target is increased from ``pruning_init` to ``pruning_target`` value. We recommend to keep the highest learning rate during this period. +* ``pruning_steps`` - the number of epochs during which the pruning rate target is increased from ``pruning_init`` to ``pruning_target`` value. We recommend keeping the highest learning rate during this period. 3. Apply optimization methods @@ -86,39 +94,44 @@ that can be used the same way as the original model. It is worth noting that opt so that the model undergoes a set of corresponding transformations and can contain additional operations required for the optimization. +.. tab-set:: -.. tab:: PyTorch - - .. 
doxygensnippet:: docs/optimization_guide/nncf/code/pruning_torch.py - :language: python - :fragment: [wrap_model] - -.. tab:: TensorFlow 2 - - .. doxygensnippet:: docs/optimization_guide/nncf/code/pruning_tf.py - :language: python - :fragment: [wrap_model] + .. tab-item:: PyTorch + :sync: pytorch + + .. doxygensnippet:: docs/optimization_guide/nncf/code/pruning_torch.py + :language: python + :fragment: [wrap_model] + + .. tab-item:: TensorFlow 2 + :sync: tensorflow + .. doxygensnippet:: docs/optimization_guide/nncf/code/pruning_tf.py + :language: python + :fragment: [wrap_model] 4. Fine-tune the model ++++++++++++++++++++++ This step assumes that you will apply fine-tuning to the model the same way as it is done for the baseline model. In the case of Filter Pruning method we recommend using the training schedule and learning rate similar to what was used for the training -of original model. - +of the original model. -.. tab:: PyTorch +.. tab-set:: - .. doxygensnippet:: docs/optimization_guide/nncf/code/pruning_torch.py - :language: python - :fragment: [tune_model] + .. tab-item:: PyTorch + :sync: pytorch + + .. doxygensnippet:: docs/optimization_guide/nncf/code/pruning_torch.py + :language: python + :fragment: [tune_model] + + .. tab-item:: TensorFlow 2 + :sync: tensorflow -.. tab:: TensorFlow 2 - - .. doxygensnippet:: docs/optimization_guide/nncf/code/pruning_tf.py - :language: python - :fragment: [tune_model] + .. doxygensnippet:: docs/optimization_guide/nncf/code/pruning_tf.py + :language: python + :fragment: [tune_model] 5. Multi-GPU distributed training @@ -127,38 +140,43 @@ of original model. In the case of distributed multi-GPU training (not DataParallel), you should call ``compression_ctrl.distributed()`` before the fine-tuning that will inform optimization methods to do some adjustments to function in the distributed mode. - -.. tab:: PyTorch - - .. 
doxygensnippet:: docs/optimization_guide/nncf/code/pruning_torch.py - :language: python - :fragment: [distributed] - -.. tab:: TensorFlow 2 - - .. doxygensnippet:: docs/optimization_guide/nncf/code/pruning_tf.py - :language: python - :fragment: [distributed] - - +.. tab-set:: + + .. tab-item:: PyTorch + :sync: pytorch + + .. doxygensnippet:: docs/optimization_guide/nncf/code/pruning_torch.py + :language: python + :fragment: [distributed] + + .. tab-item:: TensorFlow 2 + :sync: tensorflow + + .. doxygensnippet:: docs/optimization_guide/nncf/code/pruning_tf.py + :language: python + :fragment: [distributed] + 6. Export quantized model +++++++++++++++++++++++++ When fine-tuning finishes, the quantized model can be exported to the corresponding format for further inference: ONNX in the case of PyTorch and frozen graph - for TensorFlow 2. +.. tab-set:: -.. tab:: PyTorch - - .. doxygensnippet:: docs/optimization_guide/nncf/code/pruning_torch.py - :language: python - :fragment: [export] - -.. tab:: TensorFlow 2 + .. tab-item:: PyTorch + :sync: pytorch + + .. doxygensnippet:: docs/optimization_guide/nncf/code/pruning_torch.py + :language: python + :fragment: [export] + + .. tab-item:: TensorFlow 2 + :sync: tensorflow - .. doxygensnippet:: docs/optimization_guide/nncf/code/pruning_tf.py - :language: python - :fragment: [export] + .. doxygensnippet:: docs/optimization_guide/nncf/code/pruning_tf.py + :language: python + :fragment: [export] These were the basic steps to applying the QAT method from the NNCF. However, it is required in some cases to save/load model @@ -170,37 +188,43 @@ checkpoints during the training. Since NNCF wraps the original model with its ow To save model checkpoint use the following API: +.. tab-set:: -.. tab:: PyTorch - - .. doxygensnippet:: docs/optimization_guide/nncf/code/pruning_torch.py - :language: python - :fragment: [save_checkpoint] - -.. tab:: TensorFlow 2 - - .. 
doxygensnippet:: docs/optimization_guide/nncf/code/pruning_tf.py - :language: python - :fragment: [save_checkpoint] + .. tab-item:: PyTorch + :sync: pytorch + + .. doxygensnippet:: docs/optimization_guide/nncf/code/pruning_torch.py + :language: python + :fragment: [save_checkpoint] + + .. tab-item:: TensorFlow 2 + :sync: tensorflow + .. doxygensnippet:: docs/optimization_guide/nncf/code/pruning_tf.py + :language: python + :fragment: [save_checkpoint] + 8. (Optional) Restore from checkpoint +++++++++++++++++++++++++++++++++++++ To restore the model from checkpoint you should use the following API: -.. tab:: PyTorch - - .. doxygensnippet:: docs/optimization_guide/nncf/code/pruning_torch.py - :language: python - :fragment: [load_checkpoint] - -.. tab:: TensorFlow 2 +.. tab-set:: - .. doxygensnippet:: docs/optimization_guide/nncf/code/pruning_tf.py - :language: python - :fragment: [load_checkpoint] + .. tab-item:: PyTorch + :sync: pytorch + + .. doxygensnippet:: docs/optimization_guide/nncf/code/pruning_torch.py + :language: python + :fragment: [load_checkpoint] + + .. tab-item:: TensorFlow 2 + :sync: tensorflow + .. doxygensnippet:: docs/optimization_guide/nncf/code/pruning_tf.py + :language: python + :fragment: [load_checkpoint] For more details on saving/loading checkpoints in the NNCF, see the following `documentation `__. @@ -208,19 +232,19 @@ For more details on saving/loading checkpoints in the NNCF, see the following Deploying pruned model ###################### -The pruned model requres an extra step that should be done to get performance improvement. This step involves removal of the -zero filters from the model. This is done at the model conversion step using :doc:`Model Optimizer ` tool when model is converted from the framework representation (ONNX, TensorFlow, etc.) to OpenVINO Intermediate Representation. +The pruned model requires an extra step that should be done to get a performance improvement. 
This step involves the removal of the +zero filters from the model. This is done at the model conversion step using :doc:`Model Optimizer ` tool when the model is converted from the framework representation (ONNX, TensorFlow, etc.) to OpenVINO Intermediate Representation. -* To remove zero filters from the pruned model add the following parameter to the model convertion command: ``--transform=Pruning`` +* To remove zero filters from the pruned model add the following parameter to the model conversion command: ``--transform=Pruning`` -After that the model can be deployed with OpenVINO in the same way as the baseline model. +After that, the model can be deployed with OpenVINO in the same way as the baseline model. For more details about model deployment with OpenVINO, see the corresponding :doc:`documentation `. Examples #################### -* `PyTorch Image Classiication example `__ +* `PyTorch Image Classification example `__ * `TensorFlow Image Classification example `__ diff --git a/docs/optimization_guide/nncf/introduction.md b/docs/optimization_guide/nncf/introduction.md index a4fcbbead198b4..98ee061a3bcb01 100644 --- a/docs/optimization_guide/nncf/introduction.md +++ b/docs/optimization_guide/nncf/introduction.md @@ -13,7 +13,7 @@ Introduction #################### -Training-time model compression improves model performance by applying optimizations (such as quantization) during the training. The training process minimizes the loss associated with the lower-precision optimizations, so it is able to maintain the model’s accuracy while reducing its latency and memory footprint. Generally, training-time model optimization results in better model performance and accuracy than :doc:`post-training optimization `, but it can require more effort to set up. +Training-time model compression improves model performance by applying optimizations (such as quantization) during the training. 
The training process minimizes the loss associated with the lower-precision optimizations, so it is able to maintain the model’s accuracy while reducing its latency and memory footprint. Generally, training-time model optimization results in better model performance and accuracy than :doc:`post-training optimization `, but it can require more effort to set up. OpenVINO provides the Neural Network Compression Framework (NNCF) tool for implementing compression algorithms on models to improve their performance. NNCF is a Python library that integrates into PyTorch and TensorFlow training pipelines to add training-time compression methods to the pipeline. To apply training-time compression methods with NNCF, you need: @@ -51,7 +51,7 @@ To install the latest released version via pip manager run the following command To install with specific frameworks, use the `pip install nncf[extras]` command, where extras is a list of possible extras, for example, `torch`, `tf`, `onnx`. -To install the latest NNCF version from source follow the instruction on `GitHub `__. +To install the latest NNCF version from source, follow the instruction on `GitHub `__. .. note:: @@ -86,7 +86,7 @@ Filter pruning algorithms compress models by zeroing out the output filters of c Experimental methods -------------------- -NNCF also provides state-of-the-art compression techniques that are still in experimental stages of development and are only recommended for expert developers. These include: +NNCF also provides state-of-the-art compression techniques that are still in the experimental stages of development and are only recommended for expert developers. These include: - Mixed-precision quantization - Sparsity @@ -99,14 +99,14 @@ Recommended Workflow Using compression-aware training requires a training pipeline, an annotated dataset, and compute resources (such as CPUs or GPUs). 
If you don't already have these set up and available, it can be easier to start post-training quantization to quickly see quantized results. Then you can use compression-aware training if the model isn't accurate enough. We recommend the following workflow for compressing models with NNCF: -1. :doc:`Perform post-training quantization ` on your model and then compare performance to the original model. +1. :doc:`Perform post-training quantization ` on your model and then compare performance to the original model. 2. If the accuracy is too degraded, use :doc:`Quantization-aware Training ` to increase accuracy while still achieving faster inference time. 3. If the quantized model is still too slow, use :doc:`Filter Pruning ` to further improve the model’s inference speed. Additional Resources #################### -- :doc:`Quantizing Models Post-training ` +- :doc:`Quantizing Models Post-training ` - `NNCF GitHub repository `__ - `NNCF FAQ `__ - `Quantization Aware Training with NNCF and PyTorch `__ diff --git a/docs/optimization_guide/nncf/ptq/basic_quantization_flow.md b/docs/optimization_guide/nncf/ptq/basic_quantization_flow.md index fb5ab52aa0211d..7f2d807421e082 100644 --- a/docs/optimization_guide/nncf/ptq/basic_quantization_flow.md +++ b/docs/optimization_guide/nncf/ptq/basic_quantization_flow.md @@ -5,10 +5,10 @@ Introduction #################### -The basic quantization flow is the simplest way to apply 8-bit quantization to the model. It is available for models in the following frameworks: PyTorch, TensorFlow 2.x, ONNX, and OpenVINO. The basic quantization flow is based on the following steps: +The basic quantization flow is the simplest way to apply 8-bit quantization to the model. It is available for models in the following frameworks: OpenVINO, PyTorch, TensorFlow 2.x, and ONNX. The basic quantization flow is based on the following steps: * Set up an environment and install dependencies. 
-* Prepare the **calibration dataset** that is used to estimate quantization parameters of the activations within the model. +* Prepare a representative **calibration dataset** (for example, 300 samples) that is used to estimate the quantization parameters of the activations within the model. * Call the quantization API to apply 8-bit quantization to the model. Set up an Environment @@ -29,78 +29,117 @@ Install all the packages required to instantiate the model object, for example, Prepare a Calibration Dataset ############################# -At this step, create an instance of the ``nncf.Dataset`` class that represents the calibration dataset. The ``nncf.Dataset`` class can be a wrapper over the framework dataset object that is used for model training or validation. The class constructor receives the dataset object and the transformation function. For example, if you use PyTorch, you can pass an instance of the ``torch.utils.data.DataLoader`` object. +At this step, create an instance of the ``nncf.Dataset`` class that represents the calibration dataset. The ``nncf.Dataset`` class can be a wrapper over the framework dataset object that is used for model training or validation. The class constructor receives the dataset object and an optional transformation function. The transformation function takes a sample from the dataset and returns data that can be passed to the model for inference. For example, this function can take a tuple of a data tensor and a labels tensor, and return the former while ignoring the latter. The transformation function is used to avoid modifying the dataset code to make it compatible with the quantization API. The function is applied to each sample from the dataset before passing it to the model for inference. The following code snippet shows how to create an instance of the ``nncf.Dataset`` class: -.. tab:: PyTorch +.. tab-set:: - .. 
doxygensnippet:: docs/optimization_guide/nncf/ptq/code/ptq_torch.py - :language: python - :fragment: [dataset] + .. tab-item:: OpenVINO + :sync: ov + + .. doxygensnippet:: docs/optimization_guide/nncf/ptq/code/ptq_openvino.py + :language: python + :fragment: [dataset] -.. tab:: ONNX + .. tab-item:: PyTorch + :sync: pytorch + + .. doxygensnippet:: docs/optimization_guide/nncf/ptq/code/ptq_torch.py + :language: python + :fragment: [dataset] - .. doxygensnippet:: docs/optimization_guide/nncf/ptq/code/ptq_onnx.py - :language: python - :fragment: [dataset] + .. tab-item:: ONNX + :sync: onnx -.. tab:: OpenVINO + .. doxygensnippet:: docs/optimization_guide/nncf/ptq/code/ptq_onnx.py + :language: python + :fragment: [dataset] - .. doxygensnippet:: docs/optimization_guide/nncf/ptq/code/ptq_openvino.py - :language: python - :fragment: [dataset] + .. tab-item:: TensorFlow + :sync: tensorflow -.. tab:: TensorFlow + .. doxygensnippet:: docs/optimization_guide/nncf/ptq/code/ptq_tensorflow.py + :language: python + :fragment: [dataset] - .. doxygensnippet:: docs/optimization_guide/nncf/ptq/code/ptq_tensorflow.py - :language: python - :fragment: [dataset] +If there is no framework dataset object, you can create your own entity that implements the ``Iterable`` interface in Python, for example, a list of images, and returns data samples suitable for inference. In this case, a transformation function is not required. -If there is no framework dataset object, you can create your own entity that implements the ``Iterable`` interface in Python and returns data samples feasible for inference. In this case, a transformation function is not required. - - -Run a Quantized Model +Quantize a Model ##################### Once the dataset is ready and the model object is instantiated, you can apply 8-bit quantization to it: -.. tab:: PyTorch - - .. 
doxygensnippet:: docs/optimization_guide/nncf/ptq/code/ptq_onnx.py - :language: python - :fragment: [quantization] - -.. tab:: OpenVINO - - .. doxygensnippet:: docs/optimization_guide/nncf/ptq/code/ptq_openvino.py - :language: python - :fragment: [quantization] - -.. tab:: TensorFlow - - .. doxygensnippet:: docs/optimization_guide/nncf/ptq/code/ptq_tensorflow.py - :language: python - :fragment: [quantization] - - -.. note:: The ``model`` is an instance of the ``torch.nn.Module`` class for PyTorch, ``onnx.ModelProto`` for ONNX, and ``openvino.runtime.Model`` for OpenVINO. - -After that the model can be exported into th OpenVINO Intermediate Representation if needed and run faster with OpenVINO. - +.. tab-set:: + + .. tab-item:: OpenVINO + :sync: ov + + .. doxygensnippet:: docs/optimization_guide/nncf/ptq/code/ptq_openvino.py + :language: python + :fragment: [quantization] + + .. tab-item:: PyTorch + :sync: pytorch + + .. doxygensnippet:: docs/optimization_guide/nncf/ptq/code/ptq_torch.py + :language: python + :fragment: [quantization] + + .. tab-item:: ONNX + :sync: onnx + + .. doxygensnippet:: docs/optimization_guide/nncf/ptq/code/ptq_onnx.py + :language: python + :fragment: [quantization] + + .. tab-item:: TensorFlow + :sync: tensorflow + + .. doxygensnippet:: docs/optimization_guide/nncf/ptq/code/ptq_tensorflow.py + :language: python + :fragment: [quantization] + + +After that the model can be converted into the OpenVINO Intermediate Representation (IR) if needed, compiled and run with OpenVINO: + +.. tab-set:: + + .. tab-item:: OpenVINO + :sync: ov + + .. doxygensnippet:: docs/optimization_guide/nncf/ptq/code/ptq_openvino.py + :language: python + :fragment: [inference] + + .. tab-item:: PyTorch + :sync: pytorch + + .. doxygensnippet:: docs/optimization_guide/nncf/ptq/code/ptq_torch.py + :language: python + :fragment: [inference] + + .. tab-item:: ONNX + :sync: onnx + + .. 
doxygensnippet:: docs/optimization_guide/nncf/ptq/code/ptq_onnx.py + :language: python + :fragment: [inference] + + .. tab-item:: TensorFlow + :sync: tensorflow + + .. doxygensnippet:: docs/optimization_guide/nncf/ptq/code/ptq_tensorflow.py + :language: python + :fragment: [inference] + Tune quantization parameters ############################ -``nncf.quantize()`` function has several parameters that allow to tune quantization process to get more accurate model. Below is the list of parameters and their description: +The ``nncf.quantize()`` function has several optional parameters that allow tuning the quantization process to get a more accurate model. Below is the list of parameters and their descriptions: -* ``model_type`` - used to specify quantization scheme required for specific type of the model. For example, **Transformer** models (BERT, distillBERT, etc.) require a special quantization scheme to preserve accuracy after quantization. +* ``model_type`` - used to specify the quantization scheme required for a specific type of model. ``Transformer`` is the only supported special quantization scheme; it preserves accuracy after quantization of Transformer models (BERT, DistilBERT, etc.). ``None`` is the default, i.e. no specific scheme is defined. .. code-block:: sh @@ -115,7 +154,7 @@ Tune quantization parameters nncf.quantize(model, dataset, preset=nncf.Preset.MIXED) -* ``fast_bias_correction`` - enables more accurate bias (error) correction algorithm that can be used to improve accuracy of the model. This parameter is available only for OpenVINO representation. ``True`` is used by default. +* ``fast_bias_correction`` - when set to ``False``, enables a more accurate bias (error) correction algorithm that can be used to improve the accuracy of the model. This parameter is available only for OpenVINO and ONNX representations. ``True`` is used by default to minimize quantization time. .. 
code-block:: sh @@ -127,7 +166,7 @@ Tune quantization parameters nncf.quantize(model, dataset, subset_size=1000) -* ``ignored_scope`` - this parameter can be used to exclude some layers from quantization process. For example, if you want to exclude the last layer of the model from quantization. Below are some examples of how to use this parameter: +* ``ignored_scope`` - this parameter can be used to exclude some layers from the quantization process to preserve the model accuracy. For example, when you want to exclude the last layer of the model from quantization. Below are some examples of how to use this parameter: * Exclude by layer name: @@ -150,12 +189,24 @@ Tune quantization parameters regex = '.*layer_.*' nncf.quantize(model, dataset, ignored_scope=nncf.IgnoredScope(patterns=regex)) +* ``target_device`` - defines the target device, the specificity of which will be taken into account during optimization. The following values are supported: ``ANY`` (default), ``CPU``, ``CPU_SPR``, ``GPU``, and ``VPU``. + + .. code-block:: sh + + nncf.quantize(model, dataset, target_device=nncf.TargetDevice.CPU) + +* ``advanced_parameters`` - used to specify advanced quantization parameters for fine-tuning the quantization algorithm. Defined by `nncf.quantization.advanced_parameters `__ NNCF submodule. ``None`` is default. If the accuracy of the quantized model is not satisfactory, you can try to use the :doc:`Quantization with accuracy control ` flow. 
-See also -#################### +Examples of how to apply NNCF post-training quantization: +############################################################ -* `Example of basic quantization flow in PyTorch `__ +* `Post-Training Quantization of MobileNet v2 OpenVINO Model `__ +* `Post-Training Quantization of YOLOv8 OpenVINO Model `__ +* `Post-Training Quantization of MobileNet v2 PyTorch Model `__ +* `Post-Training Quantization of SSD PyTorch Model `__ +* `Post-Training Quantization of MobileNet v2 ONNX Model `__ +* `Post-Training Quantization of MobileNet v2 TensorFlow Model `__ @endsphinxdirective diff --git a/docs/optimization_guide/nncf/ptq/code/ptq_aa_openvino.py b/docs/optimization_guide/nncf/ptq/code/ptq_aa_openvino.py index f23a8583d606b1..d759695ea8a4ba 100644 --- a/docs/optimization_guide/nncf/ptq/code/ptq_aa_openvino.py +++ b/docs/optimization_guide/nncf/ptq/code/ptq_aa_openvino.py @@ -45,5 +45,19 @@ def validate(model: openvino.runtime.CompiledModel, calibration_dataset=calibration_dataset, validation_dataset=validation_dataset, validation_fn=validate, - max_drop=0.01) + max_drop=0.01, + drop_type=nncf.DropType.ABSOLUTE) #! [quantization] + +#! [inference] +import openvino.runtime as ov + +# compile the model to transform quantized operations to int8 +model_int8 = ov.compile_model(quantized_model) + +input_fp32 = ... # FP32 model input +res = model_int8(input_fp32) + +# save the model +ov.serialize(quantized_model, "quantized_model.xml") +#! [inference] diff --git a/docs/optimization_guide/nncf/ptq/code/ptq_onnx.py b/docs/optimization_guide/nncf/ptq/code/ptq_onnx.py index 7f44932c5672fc..fa9d1e35734d14 100644 --- a/docs/optimization_guide/nncf/ptq/code/ptq_onnx.py +++ b/docs/optimization_guide/nncf/ptq/code/ptq_onnx.py @@ -20,3 +20,20 @@ def transform_fn(data_item): quantized_model = nncf.quantize(model, calibration_dataset) #! [quantization] + +#! 
[inference] +import openvino.runtime as ov +from openvino.tools.mo import convert_model + +# convert ONNX model to OpenVINO model +ov_quantized_model = convert_model(quantized_model) + +# compile the model to transform quantized operations to int8 +model_int8 = ov.compile_model(ov_quantized_model) + +input_fp32 = ... # FP32 model input +res = model_int8(input_fp32) + +# save the model +ov.serialize(ov_quantized_model, "quantized_model.xml") +#! [inference] diff --git a/docs/optimization_guide/nncf/ptq/code/ptq_openvino.py b/docs/optimization_guide/nncf/ptq/code/ptq_openvino.py index c65309cfb245f2..eb2f89edf36098 100644 --- a/docs/optimization_guide/nncf/ptq/code/ptq_openvino.py +++ b/docs/optimization_guide/nncf/ptq/code/ptq_openvino.py @@ -19,3 +19,16 @@ def transform_fn(data_item): quantized_model = nncf.quantize(model, calibration_dataset) #! [quantization] + +#! [inference] +import openvino.runtime as ov + +# compile the model to transform quantized operations to int8 +model_int8 = ov.compile_model(quantized_model) + +input_fp32 = ... # FP32 model input +res = model_int8(input_fp32) + +# save the model +ov.serialize(quantized_model, "quantized_model.xml") +#! [inference] diff --git a/docs/optimization_guide/nncf/ptq/code/ptq_tensorflow.py b/docs/optimization_guide/nncf/ptq/code/ptq_tensorflow.py index 55433dda14277a..fcb29c8741e5da 100644 --- a/docs/optimization_guide/nncf/ptq/code/ptq_tensorflow.py +++ b/docs/optimization_guide/nncf/ptq/code/ptq_tensorflow.py @@ -19,3 +19,20 @@ def transform_fn(data_item): quantized_model = nncf.quantize(model, calibration_dataset) #! [quantization] + +#! [inference] +import openvino.runtime as ov +from openvino.tools.mo import convert_model + +# convert TensorFlow model to OpenVINO model +ov_quantized_model = convert_model(quantized_model) + +# compile the model to transform quantized operations to int8 +model_int8 = ov.compile_model(ov_quantized_model) + +input_fp32 = ... 
# FP32 model input +res = model_int8(input_fp32) + +# save the model +ov.serialize(ov_quantized_model, "quantized_model.xml") +#! [inference] diff --git a/docs/optimization_guide/nncf/ptq/code/ptq_torch.py b/docs/optimization_guide/nncf/ptq/code/ptq_torch.py index 3305fc4d4f02b2..a65f0998622042 100644 --- a/docs/optimization_guide/nncf/ptq/code/ptq_torch.py +++ b/docs/optimization_guide/nncf/ptq/code/ptq_torch.py @@ -19,3 +19,25 @@ def transform_fn(data_item): quantized_model = nncf.quantize(model, calibration_dataset) #! [quantization] + +#! [inference] +import openvino.runtime as ov +from openvino.tools.mo import convert_model + +input_fp32 = ... # FP32 model input + +# export PyTorch model to ONNX model +onnx_model_path = "model.onnx" +torch.onnx.export(quantized_model, input_fp32, onnx_model_path) + +# convert ONNX model to OpenVINO model +ov_quantized_model = convert_model(onnx_model_path) + +# compile the model to transform quantized operations to int8 +model_int8 = ov.compile_model(ov_quantized_model) + +res = model_int8(input_fp32) + +# save the model +ov.serialize(ov_quantized_model, "quantized_model.xml") +#! [inference] diff --git a/docs/optimization_guide/nncf/ptq/ptq_introduction.md b/docs/optimization_guide/nncf/ptq/ptq_introduction.md deleted file mode 100644 index 2cd880b50602f8..00000000000000 --- a/docs/optimization_guide/nncf/ptq/ptq_introduction.md +++ /dev/null @@ -1,26 +0,0 @@ -# Post-training Quantization with NNCF (new) {#nncf_ptq_introduction} - -@sphinxdirective - -.. toctree:: - :maxdepth: 1 - :hidden: - - basic_quantization_flow - quantization_w_accuracy_control - - -Neural Network Compression Framework (NNCF) provides a new post-training quantization API available in Python that is aimed at reusing the code for model training or validation that is usually available with the model in the source framework, for example, PyTorch or TensroFlow. 
The API is cross-framework and currently supports models representing in the following frameworks: PyTorch, TensorFlow 2.x, ONNX, and OpenVINO. - -This API has two main capabilities to apply 8-bit post-training quantization: - -* :doc:`Basic quantization ` - the simplest quantization flow that allows to apply 8-bit integer quantization to the model. -* :doc:`Quantization with accuracy control ` - the most advanced quantization flow that allows to apply 8-bit quantization to the model with accuracy control. - -Additional Resources -#################### - -* `NNCF GitHub `__ -* :doc:`Optimizing Models at Training Time ` - -@endsphinxdirective diff --git a/docs/optimization_guide/nncf/ptq/quantization_w_accuracy_control.md b/docs/optimization_guide/nncf/ptq/quantization_w_accuracy_control.md index fec080c0b0aafc..466d0af431e71d 100644 --- a/docs/optimization_guide/nncf/ptq/quantization_w_accuracy_control.md +++ b/docs/optimization_guide/nncf/ptq/quantization_w_accuracy_control.md @@ -1,4 +1,4 @@ -# Quantizing with accuracy control {#quantization_w_accuracy_control} +# Quantizing with Accuracy Control {#quantization_w_accuracy_control} @sphinxdirective @@ -7,58 +7,86 @@ Introduction This is the advanced quantization flow that allows to apply 8-bit quantization to the model with control of accuracy metric. This is achieved by keeping the most impactful operations within the model in the original precision. The flow is based on the :doc:`Basic 8-bit quantization ` and has the following differences: -* Beside the calibration dataset, a **validation dataset** is required to compute accuracy metric. They can refer to the same data in the simplest case. +* Besides the calibration dataset, a **validation dataset** is required to compute the accuracy metric. Both datasets can refer to the same data in the simplest case. * **Validation function**, used to compute accuracy metric is required. 
It can be a function that is already available in the source framework or a custom function. -* Since accuracy validation is run several times during the quantization process, quantization with accuracy control can take more time than the [Basic 8-bit quantization](@ref basic_quantization_flow) flow. +* Since accuracy validation is run several times during the quantization process, quantization with accuracy control can take more time than the :doc:`Basic 8-bit quantization ` flow. * The resulted model can provide smaller performance improvement than the :doc:`Basic 8-bit quantization ` flow because some of the operations are kept in the original precision. -.. note:: Currently, this flow is available only for models in OpenVINO representation. +.. note:: Currently, 8-bit quantization with accuracy control is available only for models in OpenVINO representation. The steps for the quantization with accuracy control are described below. -Prepare datasets -#################### +Prepare calibration and validation datasets +############################################ This step is similar to the :doc:`Basic 8-bit quantization ` flow. The only difference is that two datasets, calibration and validation, are required. -.. tab:: OpenVINO +.. tab-set:: - .. doxygensnippet:: docs/optimization_guide/nncf/ptq/code/ptq_aa_openvino.py - :language: python - :fragment: [dataset] + .. tab-item:: OpenVINO + :sync: ov + .. doxygensnippet:: docs/optimization_guide/nncf/ptq/code/ptq_aa_openvino.py + :language: python + :fragment: [dataset] Prepare validation function -########################### +############################ Validation function receives ``openvino.runtime.CompiledModel`` object and validation dataset and returns accuracy metric value. The following code snippet shows an example of validation function for OpenVINO model: -.. tab:: OpenVINO +.. tab-set:: - .. 
doxygensnippet:: docs/optimization_guide/nncf/ptq/code/ptq_aa_openvino.py - :language: python - :fragment: [validation] + .. tab-item:: OpenVINO + :sync: ov + .. doxygensnippet:: docs/optimization_guide/nncf/ptq/code/ptq_aa_openvino.py + :language: python + :fragment: [validation] Run quantization with accuracy control +####################################### + +``nncf.quantize_with_accuracy_control()`` function is used to run the quantization with accuracy control. The following code snippet shows an example of quantization with accuracy control for OpenVINO model: + +.. tab-set:: + + .. tab-item:: OpenVINO + :sync: ov + + .. doxygensnippet:: docs/optimization_guide/nncf/ptq/code/ptq_aa_openvino.py + :language: python + :fragment: [quantization] + +* ``max_drop`` defines the accuracy drop threshold. The quantization process stops when the degradation of accuracy metric on the validation dataset is less than the ``max_drop``. The default value is 0.01. NNCF will stop the quantization and report an error if the ``max_drop`` value can't be reached. + +* ``drop_type`` defines how the accuracy drop will be calculated: ``ABSOLUTE`` (used by default) or ``RELATIVE``. + +After that the model can be compiled and run with OpenVINO: + +.. tab-set:: -Now, you can run quantization with accuracy control. The following code snippet shows an example of quantization with accuracy control for OpenVINO model: + .. tab-item:: OpenVINO + :sync: ov -.. tab:: OpenVINO + .. doxygensnippet:: docs/optimization_guide/nncf/ptq/code/ptq_aa_openvino.py + :language: python + :fragment: [inference] - .. doxygensnippet:: docs/optimization_guide/nncf/ptq/code/ptq_aa_openvino.py - :language: python - :fragment: [quantization] +``nncf.quantize_with_accuracy_control()`` API supports all the parameters from :doc:`Basic 8-bit quantization ` API, to quantize a model with accuracy control and a custom configuration. 
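To make the difference between the ``ABSOLUTE`` and ``RELATIVE`` values of ``drop_type`` concrete, here is a small sketch in plain Python. It is not NNCF code: the helper names are hypothetical, it assumes a metric where higher is better, and it assumes that the relative drop is the absolute drop divided by the baseline metric value.

```python
def accuracy_drop(baseline, quantized, drop_type="ABSOLUTE"):
    """Return the accuracy drop of the quantized model (hypothetical helper).

    Assumes a higher metric value means a better model.
    """
    if drop_type == "ABSOLUTE":
        return baseline - quantized
    if drop_type == "RELATIVE":
        # Assumption: the drop is expressed as a fraction of the baseline.
        return (baseline - quantized) / baseline
    raise ValueError(f"Unknown drop_type: {drop_type}")


def within_max_drop(baseline, quantized, max_drop=0.01, drop_type="ABSOLUTE"):
    """Check whether the drop is less than the max_drop threshold."""
    return accuracy_drop(baseline, quantized, drop_type) < max_drop


# The same pair of metric values can pass one criterion and fail the other.
print(within_max_drop(0.5, 0.494, drop_type="ABSOLUTE"))
print(within_max_drop(0.5, 0.494, drop_type="RELATIVE"))
```

With a low baseline metric, the relative criterion is stricter than the absolute one for the same ``max_drop`` value, which is worth keeping in mind when choosing ``drop_type``.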
+If the accuracy or performance of the quantized model is not satisfactory, you can try :doc:`Training-time Optimization ` as the next step. -``max_drop`` defines the accuracy drop threshold. The quantization process stops when the degradation of accuracy metric on the validation dataset is less than the ``max_drop``. +Examples of NNCF post-training quantization with control of accuracy metric: +############################################################################# -``nncf.quantize_with_accuracy_control()`` API supports all the parameters of ``nncf.quantize()`` API. For example, you can use ``nncf.quantize_with_accuracy_control()`` to quantize a model with a custom configuration. +* `Post-Training Quantization of Anomaly Classification OpenVINO model with control of accuracy metric `__ +* `Post-Training Quantization of YOLOv8 OpenVINO Model with control of accuracy metric `__ See also #################### -* :doc:`Optimizing Models at Training Time ` +* :doc:`Optimizing Models at Training Time ` @endsphinxdirective diff --git a/docs/optimization_guide/nncf/qat.md b/docs/optimization_guide/nncf/qat.md index 0ddf086921002c..1f0cb8ede9ff77 100644 --- a/docs/optimization_guide/nncf/qat.md +++ b/docs/optimization_guide/nncf/qat.md @@ -26,18 +26,21 @@ PyTorch or TensorFlow 2: In this step, you add NNCF-related imports in the beginning of the training script: -.. tab:: PyTorch - - .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_torch.py - :language: python - :fragment: [imports] - -.. tab:: TensorFlow 2 - - .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_tf.py - :language: python - :fragment: [imports] - +.. tab-set:: + + .. tab-item:: PyTorch + :sync: pytorch + + .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_torch.py + :language: python + :fragment: [imports] + + .. tab-item:: TensorFlow 2 + :sync: tensorflow + + .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_tf.py + :language: python + :fragment: [imports] 2. 
Create NNCF configuration ++++++++++++++++++++++++++++ @@ -46,18 +49,22 @@ Here, you should define NNCF configuration which consists of model-related param of optimization methods (``"compression"`` section). For faster convergence, it is also recommended to register a dataset object specific to the DL framework. It will be used at the model creation step to initialize quantization parameters. -.. tab:: PyTorch - - .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_torch.py - :language: python - :fragment: [nncf_congig] +.. tab-set:: -.. tab:: TensorFlow 2 - - .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_tf.py - :language: python - :fragment: [nncf_congig] + .. tab-item:: PyTorch + :sync: pytorch + + .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_torch.py + :language: python + :fragment: [nncf_congig] + + .. tab-item:: TensorFlow 2 + :sync: tensorflow + .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_tf.py + :language: python + :fragment: [nncf_congig] + 3. Apply optimization methods +++++++++++++++++++++++++++++ @@ -69,18 +76,22 @@ undergoes a set of corresponding transformations and can contain additional oper the case of QAT, the compression controller object is used for model export and, optionally, in distributed training as it will be shown below. -.. tab:: PyTorch - - .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_torch.py - :language: python - :fragment: [wrap_model] - -.. tab:: TensorFlow 2 +.. tab-set:: - .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_tf.py - :language: python - :fragment: [wrap_model] + .. tab-item:: PyTorch + :sync: pytorch + + .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_torch.py + :language: python + :fragment: [wrap_model] + + .. tab-item:: TensorFlow 2 + :sync: tensorflow + .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_tf.py + :language: python + :fragment: [wrap_model] + 4. 
Fine-tune the model ++++++++++++++++++++++ @@ -89,17 +100,22 @@ This step assumes that you will apply fine-tuning to the model the same way as i case of QAT, it is required to train the model for a few epochs with a small learning rate, for example, 10e-5. In principle, you can skip this step which means that the post-training optimization will be applied to the model. -.. tab:: PyTorch +.. tab-set:: - .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_torch.py - :language: python - :fragment: [tune_model] + .. tab-item:: PyTorch + :sync: pytorch + + .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_torch.py + :language: python + :fragment: [tune_model] + + .. tab-item:: TensorFlow 2 + :sync: tensorflow -.. tab:: TensorFlow 2 - - .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_tf.py - :language: python - :fragment: [tune_model] + .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_tf.py + :language: python + :fragment: [tune_model] + 5. Multi-GPU distributed training @@ -108,39 +124,47 @@ you can skip this step which means that the post-training optimization will be a In the case of distributed multi-GPU training (not DataParallel), you should call ``compression_ctrl.distributed()`` before the fine-tuning that will inform optimization methods to do some adjustments to function in the distributed mode. -.. tab:: PyTorch - - .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_torch.py - :language: python - :fragment: [distributed] - -.. tab:: TensorFlow 2 - - .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_tf.py - :language: python - :fragment: [distributed] - +.. tab-set:: + + .. tab-item:: PyTorch + :sync: pytorch + + .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_torch.py + :language: python + :fragment: [distributed] + + .. tab-item:: TensorFlow 2 + :sync: tensorflow + + .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_tf.py + :language: python + :fragment: [distributed] + 6. 
Export quantized model +++++++++++++++++++++++++ When fine-tuning finishes, the quantized model can be exported to the corresponding format for further inference: ONNX in the case of PyTorch and frozen graph - for TensorFlow 2. -.. tab:: PyTorch - - .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_torch.py - :language: python - :fragment: [export] - -.. tab:: TensorFlow 2 +.. tab-set:: - .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_tf.py - :language: python - :fragment: [export] + .. tab-item:: PyTorch + :sync: pytorch + + .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_torch.py + :language: python + :fragment: [export] + + .. tab-item:: TensorFlow 2 + :sync: tensorflow + .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_tf.py + :language: python + :fragment: [export] + .. note:: - The precision of weigths gets INT8 only after the step of model conversion to OpenVINO Intermediate Representation. + The precision of weights gets INT8 only after the step of model conversion to OpenVINO Intermediate Representation. You can expect the model footprint reduction only for that format. @@ -152,17 +176,21 @@ checkpoints during the training. Since NNCF wraps the original model with its ow To save model checkpoint use the following API: -.. tab:: PyTorch +.. tab-set:: - .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_torch.py - :language: python - :fragment: [save_checkpoint] + .. tab-item:: PyTorch + :sync: pytorch + + .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_torch.py + :language: python + :fragment: [save_checkpoint] + + .. tab-item:: TensorFlow 2 + :sync: tensorflow -.. tab:: TensorFlow 2 - - .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_tf.py - :language: python - :fragment: [save_checkpoint] + .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_tf.py + :language: python + :fragment: [save_checkpoint] 8. 
(Optional) Restore from checkpoint @@ -170,18 +198,22 @@ To save model checkpoint use the following API: To restore the model from checkpoint you should use the following API: -.. tab:: PyTorch - - .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_torch.py - :language: python - :fragment: [load_checkpoint] - -.. tab:: TensorFlow 2 - - .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_tf.py - :language: python - :fragment: [load_checkpoint] - +.. tab-set:: + + .. tab-item:: PyTorch + :sync: pytorch + + .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_torch.py + :language: python + :fragment: [load_checkpoint] + + .. tab-item:: TensorFlow 2 + :sync: tensorflow + + .. doxygensnippet:: docs/optimization_guide/nncf/code/qat_tf.py + :language: python + :fragment: [load_checkpoint] + For more details on saving/loading checkpoints in the NNCF, see the following `documentation `__. diff --git a/docs/optimization_guide/ptq_introduction.md b/docs/optimization_guide/ptq_introduction.md index 1f218cb780c004..91a789a9b5174c 100644 --- a/docs/optimization_guide/ptq_introduction.md +++ b/docs/optimization_guide/ptq_introduction.md @@ -6,9 +6,10 @@ :maxdepth: 1 :hidden: + basic_quantization_flow + quantization_w_accuracy_control pot_introduction - nncf_ptq_introduction - + Post-training model optimization is the process of applying special methods that transform the model into a more hardware-friendly representation without retraining or fine-tuning. The most popular and widely-spread method here is 8-bit post-training quantization because it is: @@ -21,15 +22,18 @@ Post-training model optimization is the process of applying special methods that .. 
image:: _static/images/quantization_picture.svg -To apply post-training methods in OpenVINO, you need: +`Neural Network Compression Framework (NNCF) `__ provides a post-training quantization API available in Python that is aimed at reusing the code for model training or validation that is usually available with the model in the source framework, for example, PyTorch or TensorFlow. The NNCF API is cross-framework and currently supports models in the following frameworks: OpenVINO, PyTorch, TensorFlow 2.x, and ONNX. Currently, post-training quantization for models in OpenVINO Intermediate Representation is the most mature in terms of supported methods and models coverage. + +NNCF API has two main capabilities to apply 8-bit post-training quantization: -* A floating-point precision model, FP32 or FP16, converted into the OpenVINO Intermediate Representation (IR) format that can be run on CPU. -* A representative calibration dataset, representing a use case scenario, for example, of 300 samples. -* In case of accuracy constraints, a validation dataset and accuracy metrics should be available. +* :doc:`Basic quantization ` - the simplest quantization flow that allows applying 8-bit integer quantization to the model. A representative calibration dataset is only needed in this case. +* :doc:`Quantization with accuracy control ` - the most advanced quantization flow that allows applying 8-bit quantization to the model with accuracy control. Calibration and validation datasets, and a validation function to calculate the accuracy metric are needed in this case. -Currently, OpenVINO provides two workflows with post-training quantization capabilities: +Additional Resources +#################### -* :doc:`Post-training Quantization with POT ` - works with models in OpenVINO Intermediate Representation (IR) only. -* :doc:`Post-training Quantization with NNCF ` - cross-framework solution for model optimization that provides a new simple API for post-training quantization. 
+* :doc:`Optimizing Models at Training Time ` +* `NNCF GitHub `__ +* `Tutorial: Migrate quantization from POT API to NNCF API `__ @endsphinxdirective diff --git a/tools/pot/configs/README.md b/tools/pot/configs/README.md index c900db6ea38651..917fec539a2b48 100644 --- a/tools/pot/configs/README.md +++ b/tools/pot/configs/README.md @@ -41,7 +41,7 @@ Engine Parameters The main parameter is ``"type"`` which can take two possible options: ``"accuracy_checher"`` (default) or ``"simplified"``. It specifies the engine used for model inference and validation (if supported): -- **Simplified mode** engines. These engines can be used only with ``DefaultQuantization`` algorithm to get a fully quantized model. They do not use the Accuracy Checker tool and annotation. In this case, the following parameters are applicable: +- **Simplified mode** engines. These engines can be used only with the ``DefaultQuantization`` algorithm to get a fully quantized model. They do not use the Accuracy Checker tool and annotation. In this case, the following parameters are applicable: - ``"data_source"`` specifies the path to the directory​ where the calibration data is stored. - ``"layout"`` - (Optional) Layout of input data. Supported values: [``"NCHW"``, ``"NHWC"``, ``"CHW"``, ``"CWH"``]​. @@ -57,13 +57,13 @@ There are two options to define engine parameters in this mode: Compression Parameters ###################### -For more details about parameters of the concrete optimization algorithm, see descriptions of :doc:`Default Quantization ` and :doc:`Accuracy-aware Quantizatoin ` methods. +For more details on the parameters of a particular optimization algorithm, see descriptions of :doc:`Default Quantization ` and :doc:`Accuracy-aware Quantizatoin ` methods. Examples of the Configuration File ################################## For a quick start, many examples of configuration files are provided `here `__. 
-There, you can find ready-to-use configurations for the models from various domains: Computer Vision (Image Classification, Object Detection, Segmentation), Natural Language Processing, Recommendation Systems. We put configuration files for the models which require non-default configuration settings to get accurate results. +There, you can find ready-to-use configurations for the models from various domains: Computer Vision (Image Classification, Object Detection, Segmentation), Natural Language Processing, and Recommendation Systems. We put configuration files for the models which require non-default configuration settings to get accurate results. For details on how to run the Post-Training Optimization Tool with a sample configuration file, see the :doc:`example `. diff --git a/tools/pot/docs/AccuracyAwareQuantizationUsage.md b/tools/pot/docs/AccuracyAwareQuantizationUsage.md index 10551be7ee60fc..75964aea8cc8af 100644 --- a/tools/pot/docs/AccuracyAwareQuantizationUsage.md +++ b/tools/pot/docs/AccuracyAwareQuantizationUsage.md @@ -9,20 +9,20 @@ AccuracyAwareQuantization Method -The Accuracy-aware Quantization algorithm allows to perform quantization while maintaining accuracy within a pre-defined range. Note that it should be used only if the :doc:`Default Quantization ` introduces a significant accuracy degradation. The reason for it not being the primary choice is its potential for performance degradation, due to some layers getting reverted to the original precision. +The Accuracy-aware Quantization algorithm allows performing quantization while maintaining accuracy within a pre-defined range. Note that it should be used only if the :doc:`Default Quantization ` introduces a significant accuracy degradation. The reason for it not being the primary choice is its potential for performance degradation, due to some layers getting reverted to the original precision. To proceed with this article, make sure you have read how to use :doc:`Default Quantization `. .. 
note:: - The Accuracy-aware Quantization algorithm's behavior is different for the GNA ``target_device``. In this case it searches for the best configuration and selects between INT8 and INT16 precisions for weights of each layer. The algorithm works for the ``performance`` preset only. It is not useful for the ``accuracy`` preset, since the whole model is already in INT16 precision. + The Accuracy-aware Quantization algorithm's behavior is different for the GNA ``target_device``. In this case, it searches for the best configuration and selects between INT8 and INT16 precisions for the weights of each layer. The algorithm works for the ``performance`` preset only. It is not useful for the ``accuracy`` preset, since the whole model is already in INT16 precision. A script for Accuracy-aware Quantization includes four steps: 1. Prepare data and dataset interface. 2. Define accuracy metric. 3. Select quantization parameters. -4. Define and run quantization process. +4. Define and run the quantization process. Prepare data and dataset interface ################################## @@ -52,7 +52,7 @@ To control accuracy during optimization, the ``openvino.tools.pot.Metric`` inter Required attributes: - - ``direction`` - (``higher-better`` or ``higher-worse``) a string parameter defining whether metric value should be increased in accuracy-aware algorithms. + - ``direction`` - (``higher-better`` or ``higher-worse``) a string parameter defining whether the metric value should be increased in accuracy-aware algorithms. - ``type`` - a string representation of a metric type. For example, "accuracy" or "mean_iou". @endsphinxdirective @@ -177,7 +177,7 @@ The example code below shows a basic quantization workflow with accuracy control compressed_model = pipeline.run(model=model) # Step 6 (Optional): Compress model weights to quantized precision - # in order to reduce the size of the final .bin file. + # to reduce the size of the final .bin file. 
compress_model_weights(compressed_model) # Step 7: Save the compressed model to the desired path. diff --git a/tools/pot/docs/BestPractices.md b/tools/pot/docs/BestPractices.md index 2b56905efa2226..79f176ae7d3d4e 100644 --- a/tools/pot/docs/BestPractices.md +++ b/tools/pot/docs/BestPractices.md @@ -14,7 +14,7 @@ the fastest and easiest way to get a quantized model. It requires only some unan .. note:: - POT uses inference on the CPU during model optimization. It means that ability to infer the original floating-point model is essential for model optimization. It is also worth mentioning that in case of the 8-bit quantization, it is recommended to run POT on the same CPU architecture when optimizing for CPU or VNNI-based CPU when quantizing for a non-CPU device, such as GPU, VPU, or GNA. It should help to avoid the impact of the :doc:`saturation issue ` that occurs on AVX and SSE based CPU devices. + POT uses inference on the CPU during model optimization. It means that ability to infer the original floating-point model is essential for model optimization. In case of the 8-bit quantization, it is recommended to run POT on the same CPU architecture when optimizing for CPU or VNNI-based CPU when quantizing for a non-CPU device, such as GPU, VPU, or GNA. It should help to avoid the impact of the :doc:`saturation issue ` that occurs on AVX and SSE-based CPU devices. Improving accuracy after the Default Quantization @@ -29,10 +29,10 @@ Parameters of the Default Quantization algorithm with basic settings are present "params": { "preset": "performance", # Preset [performance, mixed] which controls # the quantization scheme. For the CPU: - # performance - symmetric quantization of weights and activations. + # performance - symmetric quantization of weights and activations. # mixed - symmetric weights and asymmetric activations. # accuracy - the same as "mixed" for CPU, GPU, and GNA devices; asymmetric weights and activations for VPU device. 
- "stat_subset_size": 300 # Size of subset to calculate activations statistics that can be used + "stat_subset_size": 300 # Size of the subset to calculate activations statistics that can be used # for quantization parameters calculation. } } @@ -46,7 +46,7 @@ There are two alternatives in case of substantial accuracy degradation after app Tuning Hyperparameters of the Default Quantization ++++++++++++++++++++++++++++++++++++++++++++++++++ -The Default Quantization algorithm provides multiple hyperparameters which can be used in order to improve accuracy results for the fully-quantized model. +The Default Quantization algorithm provides multiple hyperparameters which can be used to improve accuracy results for the fully-quantized model. Below is a list of best practices that can be applied to improve accuracy without a substantial performance reduction with respect to default settings: 1. The first recommended option is to change the ``preset`` from ``performance`` to ``mixed``. This enables asymmetric quantization of activations and can be helpful for models with non-ReLU activation functions, for example, YOLO, EfficientNet, etc. @@ -55,7 +55,7 @@ Below is a list of best practices that can be applied to improve accuracy withou .. note:: Changing this option can substantially increase quantization time in the POT tool. 3. Some model architectures require a special approach when being quantized. For example, Transformer-based models need to keep some operations in the original precision to preserve accuracy. That is why POT provides a ``model_type`` option to specify the model architecture. Now, only ``"transformer"`` type is available. Use it to quantize Transformer-based models, e.g. BERT. -4. Another important option is a `range_estimator`. It defines how to calculate the minimum and maximum of quantization range for weights and activations. 
For example, the following ``range_estimator`` for activations can improve the accuracy for Faster R-CNN based networks: +4. Another important option is a `range_estimator`. It defines how to calculate the minimum and maximum of quantization range for weights and activations. For example, the following ``range_estimator`` for activations can improve the accuracy for Faster R-CNN-based networks: .. code-block:: python @@ -66,7 +66,7 @@ Below is a list of best practices that can be applied to improve accuracy withou "stat_subset_size": 300 "activations": { # defines activation "range_estimator": { # defines how to estimate statistics - "max": { # right border of the quantizating floating-point range + "max": { # right border of the quantizing floating-point range "aggregator": "max", # use max(x) to aggregate statistics over calibration dataset "type": "abs_max" # use abs(max(x)) to get per-sample statistics } @@ -77,14 +77,14 @@ Below is a list of best practices that can be applied to improve accuracy withou 5. The next option is ``stat_subset_size``. It controls the size of the calibration dataset used by POT to collect statistics for quantization parameters initialization. It is assumed that this dataset should contain a sufficient number of representative samples. Thus, varying this parameter may affect accuracy (higher is better). However, we empirically found that 300 samples are sufficient to get representative statistics in most cases. -6. The last option is ``ignored_scope``. It allows excluding some layers from the quantization process, i.e. their inputs will not be quantized. It may be helpful for some patterns for which it is known in advance that they drop accuracy when executing in low-precision. For example, ``DetectionOutput`` layer of SSD model expressed as a subgraph should not be quantized to preserve the accuracy of Object Detection models. 
One of the sources for the ignored scope can be the Accuracy-aware algorithm which can revert layers back to the original precision (see details below). +6. The last option is ``ignored_scope``. It allows excluding some layers from the quantization process, i.e. their inputs will not be quantized. It may be helpful for some patterns for which it is known in advance that they drop accuracy when executing in low precision. For example, the ``DetectionOutput`` layer of the SSD model expressed as a subgraph should not be quantized to preserve the accuracy of Object Detection models. One of the sources for the ignored scope can be the Accuracy-aware algorithm which can revert layers to the original precision (see details below). Find all the possible options and their description in the configuration `specification file `__ in the POT directory. Accuracy-aware Quantization ########################### -When the steps above do not lead to the accurate quantized model, you may use the so-called :doc:`Accuracy-aware Quantization ` algorithm which leads to mixed-precision models. A fragment of Accuracy-aware Quantization configuration with default settings is shown below: +When the steps above do not lead to the accurate quantized model, you may use the :doc:`Accuracy-aware Quantization ` algorithm which leads to mixed-precision models. A fragment of Accuracy-aware Quantization configuration with default settings is shown below: .. code-block:: python @@ -102,7 +102,7 @@ Since the Accuracy-aware Quantization calls the Default Quantization at the firs .. note:: - In general, the potential increase in speed with the Accuracy-aware Quantization algorithm is not as high as with the Default Quantization, when the model gets fully quantized. + In general, the potential increase in speed with the Accuracy-aware Quantization algorithm is not as high as with the Default Quantization, when the model gets fully quantized. 
Reducing the performance gap of Accuracy-aware Quantization diff --git a/tools/pot/docs/CLI.md b/tools/pot/docs/CLI.md index f7d5c06b975c84..3bc5db068af39a 100644 --- a/tools/pot/docs/CLI.md +++ b/tools/pot/docs/CLI.md @@ -14,19 +14,19 @@ Introduction #################### -POT command-line interface (CLI) is aimed at optimizing models that are similar to the models from OpenVINO `Model Zoo `__ or if there is a valid :doc:`AccuracyChecker Tool ` configuration file for the model. Examples of AccuracyChecker configuration files can be found on `GitHub `__. Each model folder contains YAML configuration file that can be used with POT as is. +POT command-line interface (CLI) is aimed at optimizing models that are similar to the models from OpenVINO `Model Zoo `__ or if there is a valid :doc:`AccuracyChecker Tool ` configuration file for the model. Examples of AccuracyChecker configuration files can be found on `GitHub `__. Each model folder contains a YAML configuration file that can be used with POT as is. .. note:: - There is also the so-called :doc:`Simplified mode ` aimed at optimization of models from the Computer Vision domain and has a simple dataset preprocessing, like image resize and crop. In this case, you can also use POT CLI for optimization. However, the accuracy results are not guaranteed in this case. Moreover, you are also limited in the optimization methods choice since the accuracy measurement is not available. + There is also a :doc:`Simplified mode ` aimed at the optimization of models from the Computer Vision domain and has a simple dataset preprocessing like image resize and crop. In this case, you can also use POT CLI for optimization. However, the accuracy results are not guaranteed in this case. Moreover, you are also limited in the optimization methods choice since the accuracy measurement is not available. 
Run POT CLI #################### -There are two ways how to run POT via command line: +There are two ways how to run POT via the command line: -- **Basic usage for DefaultQuantization**. In this case you can run POT with basic setting just specifying all the options via command line. ``-q default`` stands for :doc:`DefaultQuantization ` algorithm: +- **Basic usage for DefaultQuantization**. In this case, you can run POT with basic settings just specifying all the options via the command line. ``-q default`` stands for :doc:`DefaultQuantization ` algorithm: .. code-block:: sh @@ -39,7 +39,7 @@ There are two ways how to run POT via command line: pot -q accuracy_aware -m -w --ac-config --max-drop 0.01 -- **Advanced usage**. In this case you should prepare a configuration file for the POT where you can specify advanced options for the optimization methods available. See :doc:`POT configuration file description ` for more details. +- **Advanced usage**. In this case, you should prepare a configuration file for the POT where you can specify advanced options for the optimization methods available. See :doc:`POT configuration file description ` for more details. To launch the command-line tool with the configuration file run: @@ -64,9 +64,9 @@ The following command-line options are available to run the tool: +-----------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Argument | Description | +=====================================================+=======================================================================================================================================================================================================+ -| ``-h``, ```--help``` | Optional. Show help message and exit. | +| ``-h``, ``--help`` | Optional. Show help message and exit. 
| +-----------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ -| ```-q``, ```--quantize``` | Quantize model to 8 bits with specified quantization method: ``default`` or ``accuracy_aware``. | +| ``-q``, ``--quantize`` | Quantize model to 8 bits with specified quantization method: ``default`` or ``accuracy_aware``. | +-----------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | ``--preset`` | Use ``performance`` for fully symmetric quantization or ``mixed`` preset for symmetric quantization of weight and asymmetric quantization of activations. Applicable only when ``-q`` option is used. | +-----------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ @@ -82,21 +82,21 @@ The following command-line options are available to run the tool: +-----------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | ``--ac-config`` | Path to the Accuracy Checker configuration file. Applicable only when ``-q`` option is used. | +-----------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ -| ``--max-drop`` | Optional. 
Maximum accuracy drop. Valid only for accuracy-aware quantization. Applicable only when ``-q`` option is used and ``accuracy_aware`` method is selected. | +| ``--max-drop`` | Optional. Maximum accuracy drop. Valid only for accuracy-aware quantization. Applicable only when ``-q`` option is used and the ``accuracy_aware`` method is selected. | +-----------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | ``-c CONFIG``, ``--config CONFIG`` | Path to a config file with task- or model-specific parameters. | +-----------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ -| ``-e``, ``--evaluate`` | Optional. Evaluate model on the whole dataset after optimization. | +| ``-e``, ``--evaluate`` | Optional. Evaluate the model on the whole dataset after optimization. | +-----------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | ``--output-dir OUTPUT_DIR`` | Optional. A directory where results are saved. Default: ``./results``. | +-----------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ -| ``-sm`, `--save-model`` | Optional. Save the original full-precision model. | +| ``-sm``, ``--save-model`` | Optional. Save the original full-precision model. 
| +-----------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ -| ``-d`, `--direct-dump`` | Optional. Save results to the "optimized" subfolder within the specified output directory with no additional subpaths added at the end. | +| ``-d``, ``--direct-dump`` | Optional. Save results to the "optimized" subfolder within the specified output directory with no additional subpaths added at the end. | +-----------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | ``--log-level {CRITICAL,ERROR,WARNING,INFO,DEBUG}`` | Optional. Log level to print. Default: INFO. | +-----------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ -| ``--progress-bar`` | Optional. Disable CL logging and enable progress bar. | +| ``--progress-bar`` | Optional. Disable CL logging and enable the progress bar. | +-----------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | ``--stream-output`` | Optional. Switch model quantization progress display to a multiline mode. Use with third-party components. 
| +-----------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ @@ -110,4 +110,4 @@ See Also * :doc:`Optimization with Simplified mode ` * :doc:`Post-Training Optimization Best Practices ` -@endsphinxdirective \ No newline at end of file +@endsphinxdirective diff --git a/tools/pot/docs/DefaultQuantizationUsage.md b/tools/pot/docs/DefaultQuantizationUsage.md index 028a49c6560cc0..f61ed4eb94391a 100644 --- a/tools/pot/docs/DefaultQuantizationUsage.md +++ b/tools/pot/docs/DefaultQuantizationUsage.md @@ -21,7 +21,7 @@ The script should include three basic steps: 1. Prepare data and dataset interface. 2. Select quantization parameters. -3. Define and run quantization process. +3. Define and run the quantization process. Prepare data and dataset interface ################################## @@ -29,9 +29,9 @@ Prepare data and dataset interface In most cases, it is required to implement only the ``openvino.tools.pot.DataLoader`` interface, which allows acquiring data from a dataset and applying model-specific pre-processing providing access by index. Any implementation should override the following methods: * The ``__len__()``, returns the size of the dataset. -* The ``__getitem__()``, provides access to the data by index in range of 0 to ``len(self)``. It can also encapsulate the logic of model-specific pre-processing. This method should return data in the ``(data, annotation)`` format, in which: +* The ``__getitem__()``, provides access to the data by index in the range of 0 to ``len(self)``. It can also encapsulate the logic of model-specific pre-processing. This method should return data in the ``(data, annotation)`` format, in which: - * The ``data`` is the input that is passed to the model at inference so that it should be properly preprocessed. 
It can be either the ``numpy.array`` object or a dictionary, where the key is the name of the model input and value is ``numpy.array`` which corresponds to this input. + * The ``data`` is the input that is passed to the model at inference so that it should be properly preprocessed. It can be either the ``numpy.array`` object or a dictionary, where the key is the name of the model input and the value is ``numpy.array`` which corresponds to this input. * The ``annotation`` is not used by the Default Quantization method. Therefore, this object can be ``None`` in this case. Framework data loading classes can be wrapped by the ``openvino.tools.pot.DataLoader`` interface which is usually straightforward. @@ -93,10 +93,10 @@ Default Quantization algorithm has mandatory and optional parameters which are d * ``"CPU_SPR"`` - to quantize models for CPU SPR (4th Generation Intel® Xeon® Scalable processor family) * ``"GNA"``, ``"GNA3"``, ``"GNA3.5"`` - to quantize models for GNA devices respectively. -* ``"stat_subset_size"`` - size of data subset to calculate activations statistics used for quantization. The whole dataset is used if no parameter is specified. It is recommended to use not less than 300 samples. -* ``"stat_batch_size"`` - size of batch to calculate activations statistics used for quantization. 1 if no parameter specified. +* ``"stat_subset_size"`` - size of the data subset to calculate activations statistics used for quantization. The whole dataset is used if no parameter is specified. It is recommended to use not less than 300 samples. +* ``"stat_batch_size"`` - size of the batch to calculate activations statistics used for quantization. 1 if no parameter is specified. -For full specification, see the the :doc:`Default Quantization method `. +For full specification, see the :doc:`Default Quantization method `. 
Run quantization #################### diff --git a/tools/pot/docs/E2eExample.md b/tools/pot/docs/E2eExample.md index 8170e8edbcc70f..d98f15948dec9c 100644 --- a/tools/pot/docs/E2eExample.md +++ b/tools/pot/docs/E2eExample.md @@ -2,7 +2,7 @@ @sphinxdirective -This tutorial describes an example of running post-training quantization for **MobileNet v2 model from PyTorch** framework, +This tutorial describes an example of running post-training quantization for the **MobileNet v2 model from PyTorch** framework, particularly by the DefaultQuantization algorithm. The example covers the following steps: @@ -32,7 +32,7 @@ Model Preparation omz_downloader --name mobilenet-v2-pytorch - After that the original full-precision model is located in ``/public/mobilenet-v2-pytorch/``. + After that, the original full-precision model is located in ``/public/mobilenet-v2-pytorch/``. 3. Convert the model to the OpenVINO™ Intermediate Representation (IR) format using :doc:`Model Converter ` tool: @@ -41,7 +41,7 @@ Model Preparation omz_converter --name mobilenet-v2-pytorch - After that the full-precision model in the IR format is located in ``/public/mobilenet-v2-pytorch/FP32/``. + After that, the full-precision model in the IR format is located in ``/public/mobilenet-v2-pytorch/FP32/``. For more information about the Model Optimizer, refer to its :doc:`documentation `. @@ -54,7 +54,7 @@ Check the performance of the full-precision model in the IR format using :doc:`D benchmark_app -m /public/mobilenet-v2-pytorch/FP32/mobilenet-v2-pytorch.xml -Note that the results might be different dependently on characteristics of your machine. On a machine with Intel® Core™ i9-10920X CPU @ 3.50GHz it is like: +Note that the results might be different depending on the characteristics of your machine. On a machine with Intel® Core™ i9-10920X CPU @ 3.50GHz it is like: .. code-block:: sh @@ -79,14 +79,14 @@ To download images: Note that the registration process might be quite long. 
-Note that the ImageNet size is 50 000 images and takes around 6.5 GB of the disk space. +Note that the ImageNet size is 50 000 images and takes around 6.5 GB of disk space. To download the annotation file: 1. Download `archive `__. 2. Unpack ``val.txt`` from the archive into ``/ImageNet/``. -After that the ``/ImageNet/`` dataset folder should have a lot of image files like ``ILSVRC2012_val_00000001.JPEG`` and the ``val.txt`` annotation file. +After that, the ``/ImageNet/`` dataset folder should have a lot of image files like ``ILSVRC2012_val_00000001.JPEG`` and the ``val.txt`` annotation file. Accuracy Validation of Full-Precision Model in IR Format ######################################################## @@ -134,7 +134,7 @@ Accuracy Validation of Full-Precision Model in IR Format where ``data_source: ./ImageNet`` is the dataset and ``annotation_file: ./ImageNet/val.txt`` - is the annotation file prepared on the previous step. For more information about + is the annotation file prepared in the previous step. For more information about the Accuracy Checker configuration file refer to :doc:`Accuracy Checker Tool documentation `. 3. Evaluate the accuracy of the full-precision model in the IR format by executing the following command in ```` : @@ -144,7 +144,7 @@ Accuracy Validation of Full-Precision Model in IR Format accuracy_check -c mobilenet_v2_pytorch.yaml -m ./public/mobilenet-v2-pytorch/FP32/ - The actual result should be like **71.81%** of the accuracy top-1 metric on VNNI based CPU. + The actual result should be like **71.81%** of the accuracy top-1 metric on VNNI-based CPU. Note that the results might be different on CPUs with different instruction sets. @@ -195,7 +195,7 @@ Model Quantization The quantized model is placed into the subfolder with your current date and time in the name under the ``./results/mobilenetv2_DefaultQuantization/`` directory. The accuracy validation of the quantized model is performed right after the quantization. 
- The actual result should be like **71.556%** of the accuracy top-1 metric on VNNI based CPU. + The actual result should be like **71.556%** of the accuracy top-1 metric on VNNI-based CPU. Note that the results might be different on CPUs with different instruction sets. @@ -210,7 +210,7 @@ Check the performance of the quantized model using :doc:`Deep Learning Benchmark where ```` is the path to the quantized model. -Note that the results might be different dependently on characteristics of your +Note that the results might be different depending on the characteristics of your machine. On a machine with Intel® Core™ i9-10920X CPU @ 3.50GHz it is like: .. code-block:: sh diff --git a/tools/pot/docs/FrequentlyAskedQuestions.md b/tools/pot/docs/FrequentlyAskedQuestions.md index 93b66083086613..e3f37db79037ab 100644 --- a/tools/pot/docs/FrequentlyAskedQuestions.md +++ b/tools/pot/docs/FrequentlyAskedQuestions.md @@ -2,9 +2,11 @@ @sphinxdirective +.. note:: Post-training Optimization Tool is deprecated since OpenVINO 2023.0. :doc:`Neural Network Compression Framework (NNCF) ` is recommended for the post-training quantization instead. + If your question is not covered below, use the `OpenVINO™ Community Forum page `__, where you can participate freely. -- :ref:`Is the Post-training Optimization Tool opensourced? ` +- :ref:`Is the Post-training Optimization Tool open-sourced? ` - :ref:`Can I quantize my model without a dataset? ` - :ref:`Can a model in any framework be quantized by the POT? ` - :ref:`What is a tradeoff when you go to low precision? ` @@ -16,14 +18,14 @@ If your question is not covered below, use the `OpenVINO™ Community Forum page - :ref:`When I execute POT CLI, I get "File "/workspace/venv/lib/python3.7/site-packages/nevergrad/optimization/base.py", line 35... SyntaxError: invalid syntax". What is wrong? ` - :ref:`What does a message "ModuleNotFoundError: No module named 'some\_module\_name'" mean? 
` - :ref:`Is there a way to collect an intermediate IR when the AccuracyAware mechanism fails? ` -- :ref:`What do the messages "Output name: result_operation_name not found" or "Output node with result_operation_name is not found in graph" mean? ` +- :ref:`What do the messages "Output name: result_operation_name not found" or "Output node with result_operation_name is not found in the graph" mean? ` .. _opensourced-pot-faq: -Is the Post-training Optimization Tool (POT) opensourced? -+++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +Is the Post-training Optimization Tool (POT) open-sourced? +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ -Yes, POT is developed on GitHub as a part of `openvinotoolkit/openvino ` under Apache-2.0 License. +Yes, POT is developed on GitHub as a part of `openvinotoolkit/openvino `__ under Apache-2.0 License. .. _dataset-pot-faq: @@ -38,7 +40,7 @@ If your dataset is not annotated, you can use :doc:`Default Quantization `. .. _noac-pot-faq: @@ -81,10 +83,10 @@ I get “RuntimeError: Cannot get memory” and “RuntimeError: Output data was These issues happen due to insufficient available amount of memory for statistics collection during the quantization process of a huge model or due to a very high resolution of input images in the quantization dataset. If you do not have a possibility to increase your RAM size, one of the following options can help: -- Set ``inplace_statistics`` parameters to ``True``. In that case the POT will change method collect statistics and use less memory. Note that such change might increase time required for quantization. -- Set ``eval_requests_number`` and ``stat_requests_number`` parameters to 1. In that case the POT will limit the number of infer requests by 1 and use less memory. -Note that such change might increase time required for quantization. -- Set ``use_fast_bias`` parameter to ``false``. 
In that case the POT will switch from the FastBiasCorrection algorithm to the full BiasCorrection algorithm +- Set ``inplace_statistics`` parameters to ``True``. In that case, the POT will change the method to collect statistics and use less memory. Note that such change might increase the time required for quantization. +- Set ``eval_requests_number`` and ``stat_requests_number`` parameters to 1. In that case, the POT will limit the number of infer requests by 1 and use less memory. +Note that such change might increase the time required for quantization. +- Set ``use_fast_bias`` parameter to ``false``. In that case, the POT will switch from the FastBiasCorrection algorithm to the full BiasCorrection algorithm which is usually more accurate and takes more time but requires less memory. See :doc:`Post-Training Optimization Best Practices ` for more details. - Reshape your model to a lower resolution and resize the size of images in the dataset. Note that such change might impact the accuracy. @@ -124,8 +126,8 @@ This error is reported when you have a Python version older than 3.7 in your env .. _nomodule-pot-faq: -What does a message "ModuleNotFoundError: No module named 'some\_module\_name'" mean? -+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +What does the message "ModuleNotFoundError: No module named 'some\_module\_name'" mean? +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ It means that some required python module is not installed in your environment. To install it, run ``pip install some_module_name``. @@ -141,6 +143,6 @@ You can add ``"dump_intermediate_model": true`` to the POT configuration file an What do the messages "Output name: result_operation_name not found" or "Output node with result_operation_name is not found in graph" mean? 
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ -Errors are caused by missing output nodes names in a graph when using the POT tool for model quantization. It might appear for some models only for IRs converted from ONNX models using new frontend (which is the default conversion path starting from 2022.1 release). To avoid such errors, use legacy MO frontend to convert a model to IR by passing the ``--use_legacy_frontend`` option. Then, use the produced IR for quantization. +Errors are caused by missing output nodes names in a graph when using the POT tool for model quantization. It might appear for some models only for IRs converted from ONNX models using the new frontend (which is the default conversion path starting from 2022.1 release). To avoid such errors, use the legacy MO frontend to convert a model to IR by passing the ``--use_legacy_frontend`` option. Then, use the produced IR for quantization. @endsphinxdirective diff --git a/tools/pot/docs/Introduction.md b/tools/pot/docs/Introduction.md index d36524bc090cae..0c6af74eca15d2 100644 --- a/tools/pot/docs/Introduction.md +++ b/tools/pot/docs/Introduction.md @@ -1,4 +1,4 @@ -# Post-training Quantization with POT {#pot_introduction} +# (Deprecated) Post-training Quantization with POT {#pot_introduction} @sphinxdirective @@ -13,14 +13,18 @@ Command-line Interface Examples pot_docs_FrequentlyAskedQuestions + (Experimental) Protecting Model + +.. note:: Post-training Optimization Tool is deprecated since OpenVINO 2023.0. :doc:`Neural Network Compression Framework (NNCF) ` is recommended for the post-training quantization instead. + For the needs of post-training optimization, OpenVINO™ provides a **Post-training Optimization Tool (POT)** which supports the **uniform integer quantization** method. 
This method allows moving from floating-point precision -to integer precision (for example, 8-bit) for weights and activations during the inference time. It helps to reduce +to integer precision (for example, 8-bit) for weights and activations during inference time. It helps to reduce the model size, memory footprint and latency, as well as improve the computational efficiency, using integer arithmetic. -During the quantization process the model undergoes the transformation process when additional operations, that contain +During the quantization process, the model undergoes the transformation process when additional operations, that contain quantization information, are inserted into the model. The actual transition to integer arithmetic happens at model inference. The post-training quantization algorithm takes samples from the representative dataset, inputs them into the network, @@ -88,6 +92,7 @@ Have more questions about post-training quantization or encountering errors usin Additional Resources ####################################### +* `Tutorial: Migrate quantization from POT API to NNCF API `__ * :doc:`Post-training Quantization Examples ` * :doc:`Quantization Best Practices ` * :doc:`Post-training Optimization Tool FAQ ` diff --git a/tools/pot/docs/SaturationIssue.md b/tools/pot/docs/SaturationIssue.md index e27debf1c51bd0..b776fe2eac043e 100644 --- a/tools/pot/docs/SaturationIssue.md +++ b/tools/pot/docs/SaturationIssue.md @@ -5,7 +5,7 @@ Introduction #################### -8-bit instructions of older Intel CPU generations (based on SSE, AVX-2, and AVX-512 instruction sets) are prone to so-called saturation (overflow) of the intermediate buffer when calculating the dot product, which is an essential part of Convolutional or MatMul operations. This saturation can lead to a drop in accuracy when running inference of 8-bit quantized models on the mentioned architectures. 
Additionally, it is impossible to predict if the issue occurs in a given setup, since most computations are executed in parallel during DL model inference, which makes this process non-deterministic. This is a common problem for models with non-ReLU activation functions and low level of redundancy (for example, optimized or efficient models). It can prevent deploying the model on legacy hardware or creating cross-platform applications. The problem does not occur on GPUs or CPUs with Intel Deep Learning Boost (VNNI) technology and further generations. +8-bit instructions of older Intel CPU generations (based on SSE, AVX-2, and AVX-512 instruction sets) are prone to so-called saturation (overflow) of the intermediate buffer when calculating the dot product, which is an essential part of Convolutional or MatMul operations. This saturation can lead to a drop in accuracy when running inference of 8-bit quantized models on the mentioned architectures. Additionally, it is impossible to predict if the issue occurs in a given setup since most computations are executed in parallel during DL model inference, which makes this process non-deterministic. This is a common problem for models with non-ReLU activation functions and low level of redundancy (for example, optimized or efficient models). It can prevent deploying the model on legacy hardware or creating cross-platform applications. The problem does not occur on GPUs or CPUs with Intel Deep Learning Boost (VNNI) technology and further generations. Saturation Problem Detection ############################ diff --git a/tools/pot/docs/SimplifiedMode.md b/tools/pot/docs/SimplifiedMode.md index ea543368c7c677..56f63bc971c175 100644 --- a/tools/pot/docs/SimplifiedMode.md +++ b/tools/pot/docs/SimplifiedMode.md @@ -5,7 +5,7 @@ Introduction #################### -Simplified mode is designed to make data preparation for the model optimization process easier. 
The mode is represented by an implementation of Engine interface from the POT API. It allows reading the data from an arbitrary folder specified by the user. For more details about POT API, refer to the corresponding :doc:`description `. Currently, Simplified mode is available only for image data in PNG or JPEG formats, stored in a single folder. It supports Computer Vision models with a single input or two inputs where the second is "image_info" (Faster R-CNN, Mask R-CNN, etc.). +Simplified mode is designed to make data preparation for the model optimization process easier. The mode is represented by an implementation of the Engine interface from the POT API. It allows reading the data from an arbitrary folder specified by the user. For more details about POT API, refer to the corresponding :doc:`description `. Currently, Simplified mode is available only for image data in PNG or JPEG formats, stored in a single folder. It supports Computer Vision models with a single input or two inputs where the second is "image_info" (Faster R-CNN, Mask R-CNN, etc.). .. note:: @@ -42,11 +42,11 @@ Example of generating 300 images with height = 224 and width = 256 and saving th datum generate -o ./dataset -k 300 --shape 224 256 -After that, ``OUTPUT_DIR`` can be provided to ``--data-source`` CLI option or to ``data_source`` config parameter. +After that, ``OUTPUT_DIR`` can be provided to the ``--data-source`` CLI option or to the ``data_source`` config parameter. There are two options to run POT in the Simplified mode: -* Using command-line options only. Here is an example for 8-bit quantization: +* Using command-line options only. Here is an example of 8-bit quantization: ``pot -q default -m -w --engine simplified --data-source `` @@ -58,13 +58,13 @@ There are two options to run POT in the Simplified mode: "type": "simplified", "layout": "NCHW", // Layout of input data. 
Supported ["NCHW", // "NHWC", "CHW", "CWH"] layout - "data_source": "PATH_TO_SOURCE" // You can specify path to the directory with images + "data_source": "PATH_TO_SOURCE" // You can specify a path to the directory with images // Also you can specify template for file names to filter images to load. // Templates are unix style (this option is valid only in Simplified mode) } -A template of configuration file for 8-bit quantization using Simplified mode can be found `at the following link `__. +A template of the configuration file for 8-bit quantization using Simplified mode can be found `at the following link `__. For more details about POT usage via CLI, refer to this :doc:`CLI document `. diff --git a/tools/pot/openvino/tools/pot/algorithms/quantization/accuracy_aware/README.md b/tools/pot/openvino/tools/pot/algorithms/quantization/accuracy_aware/README.md index 6973f7c0f7f543..3d7cc0f9310800 100644 --- a/tools/pot/openvino/tools/pot/algorithms/quantization/accuracy_aware/README.md +++ b/tools/pot/openvino/tools/pot/algorithms/quantization/accuracy_aware/README.md @@ -20,7 +20,7 @@ Here is an example of the definition of the Accuracy-aware Quantization method a .. code-block:: js { - "name": "AccuracyAwareQuantization", // the name of optimization algorithm + "name": "AccuracyAwareQuantization", // the name of the optimization algorithm "params": { ... } @@ -30,10 +30,10 @@ Here is an example of the definition of the Accuracy-aware Quantization method a Below are the descriptions of AccuracyAwareQuantization-specific parameters: - ``"ranking_subset_size"`` - size of a subset that is used to rank layers by their - contribution to the accuracy drop. Default value is ``300``, and more samples it + contribution to the accuracy drop. Default value is ``300``, and the more samples it has the better ranking, potentially. - ``"max_iter_num"`` - the maximum number of iterations of the algorithm. 
In other - words, the maximum number of layers that may be reverted back to floating-point + words, the maximum number of layers that may be reverted to floating-point precision. By default, it is limited by the overall number of quantized layers. - ``"maximal_drop"`` - the maximum accuracy drop which has to be achieved after the quantization. The default value is ``0.01`` (1%). @@ -47,7 +47,7 @@ Below are the descriptions of AccuracyAwareQuantization-specific parameters: - ``"base_algorithm"`` - name of the algorithm that is used to quantize a model at the beginning. The default value is "DefaultQuantization". - ``"convert_to_mixed_preset"`` - set to convert the model to "mixed" mode if the accuracy - criteria for the modelquantized with "performance" preset are not satisfied. + criteria for the model quantized with "performance" preset are not satisfied. This option can help to reduce number of layers that are reverted to floating-point precision.Keep in mind that this is an **experimental** feature. - ``"metrics"`` - an optional list of metrics that are taken into account during optimization. @@ -55,13 +55,13 @@ Below are the descriptions of AccuracyAwareQuantization-specific parameters: - ``"name"`` - name of the metric to optimize. - ``"baseline_value"`` - (optional parameter) a baseline metric value of the original - model. The validations onThe validation will be initiated entirely in the beginning if nothing specified. + model. The validation will be initiated entirely in the beginning if nothing is specified. - ``"metric_subset_ratio"`` - a part of the validation set that is used to compare original full-precision and fully quantized models when creating a ranking subset in case of predefined metric values of the original model. The default value is ``0.5``. - ``"tune_hyperparams"`` - enables tuning of quantization parameters as a preliminary - step before reverting layers back to the floating-point precision.
It can bring + step before reverting layers to the floating-point precision. It can bring an additional boost in performance and accuracy, at the cost of increased overall quantization time. The default value is ``False``. @@ -70,7 +70,7 @@ Additional Resources Example: -* `Quantization of Object Detection model with control of accuracy `__ +* `Quantization of Object Detection model with the control of accuracy `__ A template and full specification for AccuracyAwareQuantization algorithm for POT command-line interface: @@ -82,9 +82,9 @@ A template and full specification for AccuracyAwareQuantization algorithm for PO .. code-block:: javascript - /* This configuration file is the fastest way to get started with the accuracy aware + /* This configuration file is the fastest way to get started with the accuracy-aware quantization algorithm. It contains only mandatory options with commonly used - values. All other options can be considered as an advanced mode and requires + values. All other options can be considered as an advanced mode and require deep knowledge of the quantization process. An overall description of all possible parameters can be found in the accuracy_aware_quantization_spec.json */ diff --git a/tools/pot/openvino/tools/pot/algorithms/quantization/default/README.md b/tools/pot/openvino/tools/pot/algorithms/quantization/default/README.md index e685c503d001e4..d1f81a47200eaf 100644 --- a/tools/pot/openvino/tools/pot/algorithms/quantization/default/README.md +++ b/tools/pot/openvino/tools/pot/algorithms/quantization/default/README.md @@ -22,10 +22,10 @@ Default Quantization algorithm has mandatory and optional parameters. For more d Mandatory parameters ++++++++++++++++++++ -- ``"preset"`` - a preset which controls the quantization mode (symmetric and asymmetric). It can take two values: +- ``"preset"`` - a preset that controls the quantization mode (symmetric and asymmetric). 
It can take two values: - ``"performance"`` (default) - stands for symmetric quantization of weights and activations. This is the most efficient across all the HW. - - ``"mixed"`` - symmetric quantization of weights and asymmetric quantization of activations. This mode can be useful for quantization of NN, which has both negative and positive input values in quantizing operations, for example non-ReLU based CNN. + - ``"mixed"`` - symmetric quantization of weights and asymmetric quantization of activations. This mode can be useful for the quantization of NN, which has both negative and positive input values in quantizing operations, for example, non-ReLU based CNN. - ``"stat_subset_size"`` - size of a subset to calculate activations statistics used for quantization. The whole dataset is used if no parameter is specified. It is recommended to use not less than 300 samples. - ``"stat_batch_size"`` - size of a batch to calculate activations statistics used for quantization. It has a value of 1 if no parameter is specified. @@ -44,20 +44,20 @@ is an overall description of all possible parameters: - ``"operations"`` - list of operation types to exclude (expressed in OpenVINO IR notation). This list consists of the following tuples: - ``"type"`` - a type of ignored operation. - - ``"attributes"`` - if attributes are defined, they will be considered during the ignorance. They are defined bya dictionary of ``"": ""`` pairs. + - ``"attributes"`` - if attributes are defined, they will be considered when deciding which operations to ignore. They are defined by a dictionary of ``"": ""`` pairs. -- ``"weights"`` - this section describes quantization scheme for weights and the way to estimate the quantization range for that. It is worth noting that changing the quantization scheme may lead to inability to infer such mode on the existing HW. +- ``"weights"`` - this section describes the quantization scheme for weights and the way to estimate the quantization range for that.
It is worth noting that changing the quantization scheme may lead to the inability to infer such mode on the existing HW. - ``"bits"`` - bit-width, the default value is "8". - ``"mode"`` - a quantization mode (symmetric or asymmetric). - - ``"level_low"`` - the minimum level in the integer range to quantize. The default is "0" for an unsigned range, and "-2^(bit-1)" for a signed one . + - ``"level_low"`` - the minimum level in the integer range to quantize. The default is "0" for an unsigned range, and "-2^(bit-1)" for a signed one. - ``"level_high"`` - the maximum level in the integer range to quantize. The default is "2^bits-1" for an unsigned range, and "2^(bit-1)-1" for a signed one. - ``"granularity"`` - quantization scale granularity. It can take the following values: - ``"pertensor"`` (default) - per-tensor quantization with one scale factor and zero-point. - ``"perchannel"`` - per-channel quantization with per-channel scale factor and zero-point. - - ``"range_estimator"`` - this section describes parameters of range estimator that is used in MinMaxQuantization method to get the quantization ranges and filter outliers based on the collected statistics. Below are the parameters that can be modified to get better accuracy results: + - ``"range_estimator"`` - this section describes the parameters of the range estimator that is used in the MinMaxQuantization method to get the quantization ranges and filter outliers based on the collected statistics. Below are the parameters that can be modified to get better accuracy results: - ``"max"`` - parameters to estimate top border of quantizing floating-point range: @@ -68,7 +68,7 @@ is an overall description of all possible parameters: - ``"outlier_prob"`` - outlier probability used in the "quantile" estimator. 
- - ``"min"`` - parameters to estimate bottom border of quantizing floating-point range: + - ``"min"`` - parameters to estimate the bottom border of quantizing floating-point range: - ``"type"`` - a type of the estimator: @@ -77,7 +77,7 @@ is an overall description of all possible parameters: - ``"outlier_prob"`` - outlier probability used in the "quantile" estimator. -- ``"activations"`` - this section describes quantization scheme for activations and the way to estimate the quantization range for that. As before, changing the quantization scheme may lead to inability to infer such mode on the existing HW: +- ``"activations"`` - this section describes the quantization scheme for activations and the way to estimate the quantization range for that. As before, changing the quantization scheme may lead to the inability to infer such mode on the existing HW: - ``"bits"`` - bit-width, the default value is "8". - ``"mode"`` - a quantization mode (symmetric or asymmetric). @@ -88,12 +88,12 @@ is an overall description of all possible parameters: - ``"pertensor"`` (default) - per-tensor quantization with one scale factor and zero-point. - ``"perchannel"`` - per-channel quantization with per-channel scale factor and zero-point. - - ``"range_estimator"`` - this section describes parameters of range estimator that is used in MinMaxQuantization method to get the quantization ranges and filter outliers based on the collected statistics. These are the parameters that can be modified to get better accuracy results: + - ``"range_estimator"`` - this section describes the parameters of the range estimator that is used in the MinMaxQuantization method to get the quantization ranges and filter outliers based on the collected statistics. These are the parameters that can be modified to get better accuracy results: - ``"preset"`` - preset that defines the same estimator for both top and bottom borders of quantizing floating-point range. Possible value is ``"quantile"``. 
- ``"max"`` - parameters to estimate top border of quantizing floating-point range: - - ``"aggregator"`` - a type of the function used to aggregate statistics obtained with the estimator over the calibration dataset to get a value of the top border: + - ``"aggregator"`` - a type of function used to aggregate statistics obtained with the estimator over the calibration dataset to get a value of the top border: - ``"mean"`` (default) - aggregates mean value. - ``"max"`` - aggregates max value. @@ -110,7 +110,7 @@ is an overall description of all possible parameters: - ``"outlier_prob"`` - outlier probability used in the "quantile" estimator. - - ``"min"`` - parameters to estimate bottom border of quantizing floating-point range: + - ``"min"`` - parameters to estimate the bottom border of quantizing floating-point range: - ``"type"`` - a type of the estimator: @@ -119,7 +119,7 @@ is an overall description of all possible parameters: - ``"outlier_prob"`` - outlier probability used in the "quantile" estimator. -- ``"use_layerwise_tuning"`` - enables layer-wise fine-tuning of model parameters (biases, Convolution/MatMul weights and FakeQuantize scales) by minimizing the mean squared error between original and quantized layer outputs. Enabling this option may increase compressed model accuracy, but will result in increased execution time and memory consumption. +- ``"use_layerwise_tuning"`` - enables layer-wise fine-tuning of model parameters (biases, Convolution/MatMul weights, and FakeQuantize scales) by minimizing the mean squared error between original and quantized layer outputs. Enabling this option may increase compressed model accuracy, but will result in increased execution time and memory consumption. Additional Resources #################### @@ -153,7 +153,7 @@ A template and full specification for DefaultQuantization algorithm for POT comm /* This configuration file is the fastest way to get started with the default quantization algorithm. 
It contains only mandatory options with commonly used - values. All other options can be considered as an advanced mode and requires + values. All other options can be considered as an advanced mode and require deep knowledge of the quantization process. An overall description of all possible parameters can be found in the default_quantization_spec.json */ @@ -185,7 +185,7 @@ A template and full specification for DefaultQuantization algorithm for POT comm // mode (symmetric, mixed (weights symmetric and activations asymmetric) // and fully asymmetric respectively) - "stat_subset_size": 300 // Size of subset to calculate activations statistics that can be used + "stat_subset_size": 300 // Size of the subset to calculate activations statistics that can be used // for quantization parameters calculation } } diff --git a/tools/pot/openvino/tools/pot/api/README.md b/tools/pot/openvino/tools/pot/api/README.md index e7ebdc98ce83e5..12af76c6230342 100644 --- a/tools/pot/openvino/tools/pot/api/README.md +++ b/tools/pot/openvino/tools/pot/api/README.md @@ -17,7 +17,7 @@ The base class for all DataLoaders. ``DataLoader`` loads data from a dataset and applies pre-processing to them providing access to the pre-processed data by index. -All subclasses should override ``__len__()`` function, which should return the size of the dataset, and ``__getitem__()``, +All subclasses should override the ``__len__()`` function, which should return the size of the dataset, and ``__getitem__()``, which supports integer indexing in the range of 0 to ``len(self)``. ``__getitem__()`` method can return data in one of the possible formats: .. code-block:: sh @@ -32,7 +32,7 @@ or (data, annotation, metadata) -``data`` is the input that is passed to the model at inference so that it should be properly preprocessed. ``data`` can be either ``numpy.array`` object or dictionary where the key is the name of the model input and value is ``numpy.array`` which corresponds to this input. 
The format of ``annotation`` should correspond to the expectations of the ``Metric`` class. ``metadata`` is an optional field that can be used to store additional information required for post-processing. +``data`` is the input that is passed to the model at inference so that it should be properly preprocessed. ``data`` can be either ``numpy.array`` object or dictionary where the key is the name of the model input and the value is ``numpy.array`` which corresponds to this input. The format of ``annotation`` should correspond to the expectations of the ``Metric`` class. ``metadata`` is an optional field that can be used to store additional information required for post-processing. Metric ++++++++++++++++++++ @@ -63,7 +63,7 @@ and methods: Required attributes: - - ``direction`` - (``higher-better`` or ``higher-worse``) a string parameter defining whether metric value should be increased in accuracy-aware algorithms. + - ``direction`` - (``higher-better`` or ``higher-worse``) a string parameter defining whether the metric value should be increased in accuracy-aware algorithms. - ``type`` - a string representation of metric type. For example, 'accuracy' or 'mean_iou'. Engine @@ -202,8 +202,8 @@ The POT Python* API provides the utility function to create and configure the pi Helpers and Internal Model Representation ######################################### -In order to simplify implementation of optimization pipelines we provide a set of ready-to-use helpers. Here we also -describe internal representation of the DL model and how to work with it. +To simplify the implementation of optimization pipelines we provide a set of ready-to-use helpers. Here we also +describe an internal representation of the DL model and how to work with it. IEEngine ++++++++++++++++++++ @@ -276,12 +276,12 @@ represented as an instance of this class. The cascaded model is stored as a list *Properties* - ``models`` - list of models of the cascaded model. 
-- ``is_cascade`` - returns True if the loaded model is cascaded model. +- ``is_cascade`` - returns True if the loaded model is a cascaded model. Read model from OpenVINO IR ++++++++++++++++++++++++++++++ -The Python POT API provides the utility function to load model from the OpenVINO™ Intermediate Representation (IR): +The Python POT API provides the utility function to load the model from the OpenVINO™ Intermediate Representation (IR): .. code-block:: sh @@ -334,10 +334,10 @@ The Python POT API provides the utility function to load model from the OpenVINO - ``CompressedModel`` instance -Save model to IR ----------------- +Save a model to IR +---------------------- -The Python POT API provides the utility function to save model in the OpenVINO™ Intermediate Representation (IR): +The Python POT API provides the utility function to save a model in the OpenVINO™ Intermediate Representation (IR): .. code-block:: sh @@ -349,7 +349,7 @@ The Python POT API provides the utility function to save model in the OpenVINO&t - ``model`` - ``CompressedModel`` instance. - ``save_path`` - path to save the model. - ``model_name`` - name under which the model will be saved. -- ``for_stat_collection`` - whether model is saved to be used for statistic collection or for normal inference (affects only cascaded models). If set to False, removes model prefixes from node names. +- ``for_stat_collection`` - whether the model is saved to be used for statistic collection or for inference (affects only cascaded models). If set to False, removes model prefixes from node names. *Returns* @@ -378,7 +378,7 @@ Base class for all Samplers. Sampler provides a way to iterate over the dataset. -All subclasses overwrite ``__iter__()`` method, providing a way to iterate over the dataset, and a ``__len__()`` method +All subclasses overwrite the ``__iter__()`` method, providing a way to iterate over the dataset, and a ``__len__()`` method that returns the length of the returned iterators.
*Parameters* @@ -395,7 +395,7 @@ BatchSampler class openvino.tools.pot.samplers.batch_sampler.BatchSampler(data_loader, batch_size=1, subset_indices=None): Sampler provides an iterable over the dataset subset if ``subset_indices`` is specified -or over the whole dataset with given ``batch_size``. Returns a list of data items. +or over the whole dataset with a given ``batch_size``. Returns a list of data items. @endsphinxdirective diff --git a/tools/pot/openvino/tools/pot/api/samples/3d_segmentation/README.md b/tools/pot/openvino/tools/pot/api/samples/3d_segmentation/README.md index ad6110c9d45413..0c7b0c178efad8 100644 --- a/tools/pot/openvino/tools/pot/api/samples/3d_segmentation/README.md +++ b/tools/pot/openvino/tools/pot/api/samples/3d_segmentation/README.md @@ -3,15 +3,15 @@ @sphinxdirective This example demonstrates the use of the :doc:`Post-training Optimization Tool API ` for the task of quantizing a 3D segmentation model. -The `Brain Tumor Segmentation `__ model from PyTorch is used for this purpose. A custom ``DataLoader`` is created to load images in NIfTI format from `Medical Segmentation Decathlon BRATS 2017 `__ dataset for 3D semantic segmentation task and the implementation of Dice Index metric is used for the model evaluation. In addition, this example demonstrates how one can use image metadata obtained during image reading and preprocessing to post-process the model raw output. The code of the example is available on `GitHub `__. +The `Brain Tumor Segmentation `__ model from PyTorch is used for this purpose. A custom ``DataLoader`` is created to load images in NIfTI format from the `Medical Segmentation Decathlon BRATS 2017 `__ dataset for 3D semantic segmentation task and the implementation of the Dice Index metric is used for the model evaluation. In addition, this example demonstrates how one can use image metadata obtained during image reading and preprocessing to post-process the model raw output. 
The code of the example is available on `GitHub `__. -How to prepare the data +How to Prepare the Data ####################### To run this example, you will need to download the Brain Tumors 2017 part of the Medical Segmentation Decathlon image database http://medicaldecathlon.com/. 3D MRI data in NIfTI format can be found in the ``imagesTr`` folder, and segmentation masks are in ``labelsTr``. -How to Run the example +How to Run the Example ###################### 1. Launch :doc:`Model Downloader ` tool to download ``brain-tumor-segmentation-0002`` model from the Open Model Zoo repository. diff --git a/tools/pot/openvino/tools/pot/api/samples/README.md b/tools/pot/openvino/tools/pot/api/samples/README.md index 4de33a3622642e..1b521552c8533b 100644 --- a/tools/pot/openvino/tools/pot/api/samples/README.md +++ b/tools/pot/openvino/tools/pot/api/samples/README.md @@ -21,39 +21,39 @@ The following examples demonstrate the implementation of ``Engine``, ``Metric``, 1. :doc:`Quantizing Image Classification model ` - - Uses single ``MobilenetV2`` model from TensorFlow - - Implements ``DataLoader`` to load .JPEG images and annotations of Imagenet database + - Uses a single ``MobilenetV2`` model from TensorFlow + - Implements ``DataLoader`` to load .JPEG images and annotations of the Imagenet database - Implements ``Metric`` interface to calculate Accuracy at top-1 metric - Uses DefaultQuantization algorithm for quantization model 2. :doc:`Quantizing Object Detection Model with Accuracy Control ` - - Uses single ``MobileNetV1 FPN`` model from TensorFlow - - Implements ``Dataloader`` to load images of COCO database + - Uses a single ``MobileNetV1 FPN`` model from TensorFlow + - Implements ``Dataloader`` to load images of the COCO database - Implements ``Metric`` interface to calculate ``mAP@[.5:.95]`` metric - Uses ``AccuracyAwareQuantization`` algorithm for quantization model 3.
:doc:`Quantizing Semantic Segmentation Model ` - - Uses single ``DeepLabV3`` model from TensorFlow - - Implements ``DataLoader`` to load .JPEG images and annotations of Pascal VOC 2012 database + - Uses a single ``DeepLabV3`` model from TensorFlow + - Implements ``DataLoader`` to load .JPEG images and annotations of the Pascal VOC 2012 database - Implements ``Metric`` interface to calculate Mean Intersection Over Union metric - Uses DefaultQuantization algorithm for quantization model 4. :doc:`Quantizing 3D Segmentation Model ` - - Uses single ``Brain Tumor Segmentation`` model from PyTorch - - Implements ``DataLoader`` to load images in NIfTI format from Medical Segmentation Decathlon BRATS 2017 database + - Uses a single ``Brain Tumor Segmentation`` model from PyTorch + - Implements ``DataLoader`` to load images in NIfTI format from the Medical Segmentation Decathlon BRATS 2017 database - Implements ``Metric`` interface to calculate Dice Index metric - Demonstrates how to use image metadata obtained during data loading to post-process the raw model output - Uses DefaultQuantization algorithm for quantization model 5. :doc:`Quantizing Cascaded model ` - - Uses cascaded (composite) ``MTCNN`` model from Caffe that consists of three separate models in an OpenVINO™ Intermediate Representation (IR) - - Implements ``Dataloader`` to load .jpg images of WIDER FACE database + - Uses a cascaded (composite) ``MTCNN`` model from Caffe that consists of three separate models in an OpenVINO™ Intermediate Representation (IR) + - Implements ``Dataloader`` to load .jpg images of the WIDER FACE database - Implements ``Metric`` interface to calculate Recall metric - - Implements ``Engine`` class that is inherited from ``IEEngine`` to create a complex staged pipeline to sequentially execute each of the three stages of the MTCNN model, represented by multiple models in IR. 
It uses engine helpers to set model in OpenVINO Inference Engine and process raw model output for the correct statistics collection + - Implements ``Engine`` class that is inherited from ``IEEngine`` to create a complex staged pipeline to sequentially execute each of the three stages of the MTCNN model, represented by multiple models in IR. It uses engine helpers to set a model in OpenVINO Inference Engine and process raw model output for the correct statistics collection - Uses DefaultQuantization algorithm for quantization model 6. :doc:`Quantizing for GNA Device ` @@ -62,7 +62,7 @@ The following examples demonstrate the implementation of ``Engine``, ``Metric``, - Implements ``DataLoader`` to load data in .ark format - Uses DefaultQuantization algorithm for quantization model -After execution of each example above the quantized model is placed into the folder ``optimized``. The accuracy validation of the quantized model is performed right after the quantization. +After the execution of each example above, the quantized model is placed into the folder ``optimized``. The accuracy validation of the quantized model is performed right after the quantization. See the tutorials #################### diff --git a/tools/pot/openvino/tools/pot/api/samples/classification/README.md b/tools/pot/openvino/tools/pot/api/samples/classification/README.md index 5de99c8152880c..ff330fa2d37f73 100644 --- a/tools/pot/openvino/tools/pot/api/samples/classification/README.md +++ b/tools/pot/openvino/tools/pot/api/samples/classification/README.md @@ -6,14 +6,14 @@ This example demonstrates the use of the :doc:`Post-training Optimization Tool A The `MobilenetV2 `__ model from TensorFlow is used for this purpose. A custom ``DataLoader`` is created to load the `ImageNet `__ classification dataset and the implementation of Accuracy at top-1 metric is used for the model evaluation. The code of the example is available on `GitHub `__. 
-How to prepare the data +How to Prepare the Data ####################### To run this example, you need to `download `__ the validation part of the ImageNet image database and place it in a separate folder, -which will be later referred as ````. Annotations to images should be stored in a separate .txt file (````) in the format ``image_name label``. +which will be later referred to as ````. Annotations to images should be stored in a separate .txt file (````) in the format ``image_name label``. -How to Run the example +How to Run the Example ###################### 1. Launch :doc:`Model Downloader ` tool to download ``mobilenet-v2-1.0-224`` model from the Open Model Zoo repository. diff --git a/tools/pot/openvino/tools/pot/api/samples/face_detection/README.md b/tools/pot/openvino/tools/pot/api/samples/face_detection/README.md index 8b26a38b8ff88c..42648ffde196ff 100644 --- a/tools/pot/openvino/tools/pot/api/samples/face_detection/README.md +++ b/tools/pot/openvino/tools/pot/api/samples/face_detection/README.md @@ -4,12 +4,12 @@ This example demonstrates the use of the :doc:`Post-training Optimization Tool API ` for the task of quantizing a face detection model. The `MTCNN `__ model from Caffe is used for this purpose. -A custom ``DataLoader`` is created to load `WIDER FACE `__ dataset for a face detection task +A custom ``DataLoader`` is created to load the `WIDER FACE `__ dataset for a face detection task and the implementation of Recall metric is used for the model evaluation. In addition, this example demonstrates how one can implement an engine to infer a cascaded (composite) model that is represented by multiple submodels in an OpenVINO™ Intermediate Representation (IR) and has a complex staged inference pipeline. The code of the example is available on `GitHub `__. -How to prepare the data +How to Prepare the Data ####################### To run this example, you need to download the validation part of the Wider Face dataset http://shuoyang1213.me/WIDERFACE/. 
@@ -17,7 +17,7 @@ Images with faces divided into categories are placed in the ``WIDER_val/images`` Annotations in .txt format containing the coordinates of the face bounding boxes of the validation part of the dataset can be downloaded separately and are located in the ``wider_face_split/wider_face_val_bbx_gt.txt`` file. -How to Run the example +How to Run the Example ###################### 1. Launch :doc:`Model Downloader ` tool to download ``mtcnn`` model from the Open Model Zoo repository. diff --git a/tools/pot/openvino/tools/pot/api/samples/segmentation/README.md b/tools/pot/openvino/tools/pot/api/samples/segmentation/README.md index c70be837ac1c9c..b014d7976312d8 100644 --- a/tools/pot/openvino/tools/pot/api/samples/segmentation/README.md +++ b/tools/pot/openvino/tools/pot/api/samples/segmentation/README.md @@ -1,18 +1,20 @@ # Quantizing Semantic Segmentation Model {#pot_example_segmentation_README} +@sphinxdirective + This example demonstrates the use of the :doc:`Post-training Optimization Tool API ` for the task of quantizing a segmentation model. The `DeepLabV3 ` model from TensorFlow is used for this purpose. A custom `DataLoader` is created to load the `Pascal VOC 2012 `__ dataset for semantic segmentation task and the implementation of Mean Intersection Over Union metric is used for the model evaluation. The code of the example is available on `GitHub `__. -How to prepare the data +How to Prepare the Data ####################### To run this example, you will need to download the validation part of the Pascal VOC 2012 image database http://host.robots.ox.ac.uk/pascal/VOC/voc2012/#data. Images are placed in the ``JPEGImages`` folder, ImageSet file with the list of image names for the segmentation task can be found at ``ImageSets/Segmentation/val.txt`` and segmentation masks are kept in the ``SegmentationClass`` directory. -How to Run the example +How to Run the Example ###################### 1. 
Launch :doc:`Model Downloader ` tool to download ``deeplabv3`` model from the Open Model Zoo repository. @@ -37,3 +39,5 @@ How to Run the example Optional: you can specify .bin file of IR directly using the ``-w``, ``--weights`` options. + +@endsphinxdirective diff --git a/tools/pot/openvino/tools/pot/api/samples/speech/README.md b/tools/pot/openvino/tools/pot/api/samples/speech/README.md index d88f68bce68b7c..8fcc2be7197de0 100644 --- a/tools/pot/openvino/tools/pot/api/samples/speech/README.md +++ b/tools/pot/openvino/tools/pot/api/samples/speech/README.md @@ -2,18 +2,18 @@ @sphinxdirective -This example demonstrates the use of the :doc:`Post-training Optimization Tool API ` for the task of quantizing a speech model for :doc:`GNA ` device. Quantization for GNA is different from CPU quantization due to device specific: GNA supports quantized inputs in INT16 and INT32 (for activations) precision and quantized weights in INT8 and INT16 precision. +This example demonstrates the use of the :doc:`Post-training Optimization Tool API ` for the task of quantizing a speech model for :doc:`GNA ` device. Quantization for GNA is different from CPU quantization due to device specifics: GNA supports quantized inputs in INT16 and INT32 (for activations) precision and quantized weights in INT8 and INT16 precision. This example contains pre-selected quantization options based on the DefaultQuantization algorithm and created for models from `Kaldi `__ framework, and its data format. A custom ``ArkDataLoader`` is created to load the dataset from files with .ark extension for speech analysis task. -How to prepare the data +How to Prepare the Data ####################### To run this example, you will need to use the .ark files for each model input from your ````. For generating data from original formats to .ark, please, follow the `Kaldi data preparation tutorial `__. -How to Run the example +How to Run the Example ###################### 1. 
Launch :doc:`Model Optimizer ` with the necessary options (for details follow the :doc:`instructions for Kaldi ` to generate Intermediate Representation (IR) files for the model: @@ -32,14 +32,14 @@ How to Run the example Required parameters: - - ``-i``, ``--input_names`` option. Defines list of model inputs; - - ``-f``, ``--files_for_input`` option. Defines list of filenames (.ark) mapped with input names. You should define names without extension, for example: FILENAME_1, FILENAME_2 maps with INPUT_1, INPUT_2. + - ``-i``, ``--input_names`` option. Defines the list of model inputs; + - ``-f``, ``--files_for_input`` option. Defines the list of filenames (.ark) mapped with input names. You should define names without extension, for example: FILENAME_1, FILENAME_2 maps with INPUT_1, INPUT_2. Optional parameters: - - ``-p``, ``--preset`` option. Defines preset for quantization: ``performance`` for INT8 weights, ``accuracy`` for INT16 weights; - - ``-s``, ``--subset_size`` option. Defines subset size for calibration; - - ``-o``, ``--output`` option. Defines output folder for quantized model. + - ``-p``, ``--preset`` option. Defines preset for quantization: ``performance`` for INT8 weights, ``accuracy`` for INT16 weights; + - ``-s``, ``--subset_size`` option. Defines subset size for calibration; + - ``-o``, ``--output`` option. Defines output folder for the quantized model. 3. Validate your INT8 model using ``./speech_example`` from the Inference Engine examples. Follow the :doc:`speech example description link ` for details.
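The required and optional parameters listed above can be pictured as a small command-line parser. The following is a minimal sketch, not the code of the speech example itself: the flag names and the ``performance``/``accuracy`` meanings come from the README text, while the comma-separated value format, the default values, and the helper names ``build_parser`` and ``map_inputs_to_files`` are assumptions made for illustration only.

```python
import argparse

# Preset-to-weight-precision mapping as described in the README text.
WEIGHT_PRECISION = {"performance": "INT8", "accuracy": "INT16"}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the options named in the README (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Sketch of the speech example options")
    # Required parameters.
    parser.add_argument("-i", "--input_names", required=True,
                        help="List of model inputs (comma-separated here by assumption)")
    parser.add_argument("-f", "--files_for_input", required=True,
                        help="List of .ark filenames without extension, one per input")
    # Optional parameters; every default below is an assumption, not a documented value.
    parser.add_argument("-p", "--preset", choices=sorted(WEIGHT_PRECISION),
                        default="performance", help="Quantization preset")
    parser.add_argument("-s", "--subset_size", type=int, default=100,
                        help="Subset size for calibration")
    parser.add_argument("-o", "--output", default="./quantized_model",
                        help="Output folder for the quantized model")
    return parser


def map_inputs_to_files(input_names: str, files_for_input: str) -> dict:
    """Pair each input name with its file name by position (hypothetical helper)."""
    names = [name.strip() for name in input_names.split(",")]
    files = [name.strip() for name in files_for_input.split(",")]
    if len(names) != len(files):
        raise ValueError("Each model input needs exactly one .ark file name")
    return dict(zip(names, files))
```

With this sketch, passing ``-i INPUT_1,INPUT_2 -f FILENAME_1,FILENAME_2`` pairs ``INPUT_1`` with ``FILENAME_1`` and ``INPUT_2`` with ``FILENAME_2``, which matches the positional mapping the README describes.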