
Commit

Add quantization API doc and oneDNN to migration guide (#20813)
* Add quantization API doc

* Add oneDNN info to migration guide

* Apply suggestions from code review

Co-authored-by: bartekkuncer <[email protected]>

* Apply review

* Apply suggestions from code review

Co-authored-by: bartekkuncer <[email protected]>

* Apply suggestions from code review

Co-authored-by: Andrzej Kotłowski <[email protected]>

* Update link

* Apply suggestions from code review

Co-authored-by: bartekkuncer <[email protected]>

Co-authored-by: Bartlomiej Gawrych <[email protected]>
Co-authored-by: bartekkuncer <[email protected]>
Co-authored-by: Andrzej Kotłowski <[email protected]>
4 people authored Feb 5, 2022
1 parent 8d67dbb commit 1cb4d1d
Showing 4 changed files with 116 additions and 24 deletions.
5 changes: 5 additions & 0 deletions docs/python_docs/python/api/contrib/index.rst
@@ -67,6 +67,11 @@ Contributed modules

Functions for manipulating text data.

.. card::
   :title: contrib.quantization
   :link: quantization/index.html

   Functions for precision reduction.

.. toctree::
   :hidden:
23 changes: 23 additions & 0 deletions docs/python_docs/python/api/contrib/quantization/index.rst
@@ -0,0 +1,23 @@
.. Licensed to the Apache Software Foundation (ASF) under one
   or more contributor license agreements. See the NOTICE file
   distributed with this work for additional information
   regarding copyright ownership. The ASF licenses this file
   to you under the Apache License, Version 2.0 (the
   "License"); you may not use this file except in compliance
   with the License. You may obtain a copy of the License at
   http://www.apache.org/licenses/LICENSE-2.0
   Unless required by applicable law or agreed to in writing,
   software distributed under the License is distributed on an
   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   KIND, either express or implied. See the License for the
   specific language governing permissions and limitations
   under the License.

contrib.quantization
====================

.. automodule:: mxnet.contrib.quantization
   :members:
   :autosummary:
@@ -432,6 +432,67 @@ A new module called `mxnet.gluon.probability` has been introduced in Gluon 2.0.

3. [Transformation](https://github.com/apache/incubator-mxnet/tree/master/python/mxnet/gluon/probability/transformation): implement invertible transformations with computable log-det Jacobians.

## oneDNN Integration
### Operator Fusion
In MXNet 1.x, operator pattern fusion in the execution graph was enabled by default when MXNet was built with oneDNN library support, and it could be disabled by setting the `MXNET_SUBGRAPH_BACKEND` environment variable to `None`. MXNet 2.0 introduced changes in the forward inference flow which led to a refactor of the fusion mechanism. To fuse a model in MXNet 2.0, two requirements must be met:

- the model must be defined as a subclass of HybridBlock or Symbol,

- the model must contain operator patterns that can be fused.

Both the HybridBlock and Symbol classes provide an API to easily run operator fusion. Only one line of code is needed to run fusion passes on a model:
```{.python}
# on HybridBlock
net.optimize_for(data, backend='ONEDNN')
# on Symbol
optimized_symbol = sym.optimize_for(backend='ONEDNN')
```
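For context, a minimal, self-contained sketch of the HybridBlock flow is shown below - the toy network, layer sizes, and input shape are illustrative assumptions, not part of the original example:
```{.python}
import mxnet as mx
from mxnet.gluon import nn

# Toy network containing a fusable conv + ReLU pattern (illustrative assumption).
net = nn.HybridSequential()
net.add(nn.Conv2D(channels=8, kernel_size=3),
        nn.Activation('relu'))
net.initialize()

# Example input used to trace the graph during optimization.
data = mx.nd.random.uniform(shape=(1, 3, 224, 224))

# Run the oneDNN fusion passes on the block.
net.optimize_for(data, backend='ONEDNN')
```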

Controlling which patterns are fused can still be done by setting the appropriate environment variables. See [**oneDNN Environment Variables**](#oneDNN-Environment-Variables).

### INT8 Quantization / Precision reduction
The quantization API was also refactored to be consistent with other new features and mechanisms. Compared to the MXNet 1.x releases, the `quantize_net_v2` function has been removed in MXNet 2.0 and development focused mainly on the `quantize_net` function, making it easier to use and ultimately more flexible for the end user.
Quantization can be performed either on a subclass of HybridBlock with `quantize_net`, or on a Symbol with the deprecated `quantize_model` (`quantize_model` is kept only for backward compatibility and its usage is strongly discouraged).

```{.python}
import mxnet as mx
from mxnet.contrib.quantization import quantize_net
from mxnet.gluon.model_zoo.vision import resnet50_v1

batch_size = 32  # example value; match it to your calibration data
# load model
net = resnet50_v1(pretrained=True)
# prepare calibration data
dummy_data = mx.nd.random.uniform(-1.0, 1.0, (batch_size, 3, 224, 224))
calib_data_loader = mx.gluon.data.DataLoader(dummy_data, batch_size=batch_size)
# quantization
qnet = quantize_net(net, calib_mode='naive', calib_data=calib_data_loader)
```
`quantize_net` can do much more - all of the function's parameters are described in the [API documentation](../../api/contrib/quantization/index.rst).
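As a quick sanity check - a hedged sketch reusing the `qnet` and input shape from the example above - the quantized network can be called like any other Gluon block:
```{.python}
# Run inference with the quantized network (sketch).
test_batch = mx.nd.random.uniform(-1.0, 1.0, (1, 3, 224, 224))
out = qnet(test_batch)
out.wait_to_read()  # block until the computation finishes
print(out.shape)    # (1, 1000) for resnet50_v1
```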

### oneDNN Environment Variables
In MXNet 2.0, all references to MKLDNN (the former name of oneDNN) were replaced with ONEDNN. The table below lists all of the environment variables:

| MXNet 1.x | MXNet 2.0 |
| ------------------------------------ | ---------------------------------------|
| MXNET_MKLDNN_ENABLED | MXNET_ONEDNN_ENABLED |
| MXNET_MKLDNN_CACHE_NUM | MXNET_ONEDNN_CACHE_NUM |
| MXNET_MKLDNN_FORCE_FC_AB_FORMAT | MXNET_ONEDNN_FORCE_FC_AB_FORMAT |
| MXNET_MKLDNN_DEBUG | MXNET_ONEDNN_DEBUG |
| MXNET_USE_MKLDNN_RNN | MXNET_USE_ONEDNN_RNN |
| MXNET_DISABLE_MKLDNN_CONV_OPT | MXNET_DISABLE_ONEDNN_CONV_OPT |
| MXNET_DISABLE_MKLDNN_FUSE_CONV_BN | MXNET_DISABLE_ONEDNN_FUSE_CONV_BN |
| MXNET_DISABLE_MKLDNN_FUSE_CONV_RELU | MXNET_DISABLE_ONEDNN_FUSE_CONV_RELU |
| MXNET_DISABLE_MKLDNN_FUSE_CONV_SUM | MXNET_DISABLE_ONEDNN_FUSE_CONV_SUM |
| MXNET_DISABLE_MKLDNN_FC_OPT | MXNET_DISABLE_ONEDNN_FC_OPT |
| MXNET_DISABLE_MKLDNN_FUSE_FC_ELTWISE | MXNET_DISABLE_ONEDNN_FUSE_FC_ELTWISE |
| MXNET_DISABLE_MKLDNN_TRANSFORMER_OPT | MXNET_DISABLE_ONEDNN_TRANSFORMER_OPT |
| n/a | MXNET_DISABLE_ONEDNN_BATCH_DOT_FUSE |
| n/a | MXNET_ONEDNN_FUSE_REQUANTIZE |
| n/a | MXNET_ONEDNN_FUSE_DEQUANTIZE |
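As a hedged illustration of the rename (assuming, as is the safe convention, that the variable is set before MXNet reads it), a 1.x script migrates by swapping only the variable name:
```{.python}
import os

# MXNet 1.x spelling:
#   os.environ['MXNET_DISABLE_MKLDNN_FUSE_CONV_RELU'] = '1'
# MXNet 2.0 spelling - disables the conv + ReLU fusion pattern:
os.environ['MXNET_DISABLE_ONEDNN_FUSE_CONV_RELU'] = '1'

import mxnet as mx  # imported after setting the variable so it takes effect
```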

## Appendix
### NumPy Array Deprecated Attributes
| Deprecated Attributes | NumPy ndarray Equivalent |
51 changes: 27 additions & 24 deletions python/mxnet/contrib/quantization.py
@@ -46,7 +46,7 @@ def _quantize_params(qsym, params, min_max_dict):
qsym : Symbol
Quantized symbol from FP32 symbol.
params : dict of str->NDArray
-min_max_dict: dict of min/max pairs of layers' output
+min_max_dict : dict of min/max pairs of layers' output
"""
inputs_name = qsym.list_arguments()
quantized_params = {}
@@ -110,11 +110,11 @@ def _quantize_symbol(sym, device, excluded_symbols=None, excluded_operators=None
Names of the parameters that users want to quantize offline. It's always recommended to
quantize parameters offline so that quantizing parameters during the inference can be
avoided.
-quantized_dtype: str
+quantized_dtype : str
The quantized destination type for input data.
-quantize_mode: str
+quantize_mode : str
The mode that quantization pass to apply.
-quantize_granularity: str
+quantize_granularity : str
The granularity of quantization, currently supports 'tensor-wise' and 'channel-wise'
quantization. The default value is 'tensor-wise'.
"""
@@ -174,15 +174,16 @@ def __init__(self):
def collect(self, name, op_name, arr):
"""Function which is registered to Block as monitor callback. Names of layers
requiring calibration are stored in `self.include_layers` variable.
-Parameters
-----------
-name : str
-    Node name from which collected data comes from
-op_name : str
-    Operator name from which collected data comes from. Single operator
-    can have multiple inputs/ouputs nodes - each should have different name
-arr : NDArray
-    NDArray containing data of monitored node
+Parameters
+----------
+name : str
+    Node name from which collected data comes from.
+op_name : str
+    Operator name from which collected data comes from. Single operator
+    can have multiple input/output nodes - each should have different name.
+arr : NDArray
+    NDArray containing data of monitored node.
"""

def post_collect(self):
@@ -227,8 +228,7 @@ def post_collect(self):

@staticmethod
def combine_histogram(old_hist, arr, new_min, new_max, new_th):
""" Collect layer histogram for arr and combine it with old histogram.
"""
"""Collect layer histogram for arr and combine it with old histogram."""
(old_hist, old_hist_edges, old_min, old_max, old_th) = old_hist
if new_th <= old_th:
hist, _ = np.histogram(arr, bins=len(old_hist), range=(-old_th, old_th))
@@ -392,21 +392,22 @@ def quantize_model(sym, arg_params, aux_params, data_names=('data',),
The backend quantized operators are only enabled for Linux systems. Please do not run
inference using the quantized models on Windows for now.
The quantization implementation adopts the TensorFlow's approach:
-https://www.tensorflow.org/performance/quantization.
+https://www.tensorflow.org/lite/performance/post_training_quantization.
The calibration implementation borrows the idea of Nvidia's 8-bit Inference with TensorRT:
http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf
and adapts the method to MXNet.
.. _`quantize_model_params`:
Parameters
----------
sym : str or Symbol
sym : Symbol
Defines the structure of a neural network for FP32 data types.
arg_params : dict
Dictionary of name to `NDArray`.
aux_params : dict
Dictionary of name to `NDArray`.
-data_names : a list of strs
+data_names : list of strings
Data names required for creating a Module object to run forward propagation on the
calibration dataset.
device : Device
@@ -441,15 +442,15 @@
The mode that quantization pass to apply. Support 'full' and 'smart'.
'full' means quantize all operator if possible.
'smart' means quantization pass will smartly choice which operator should be quantized.
-quantize_granularity: str
+quantize_granularity : str
The granularity of quantization, currently supports 'tensor-wise' and 'channel-wise'
quantization. The default value is 'tensor-wise'.
logger : Object
A logging object for printing information during the process of quantization.
Returns
-------
-quantized_model: tuple
+quantized_model : tuple
A tuple of quantized symbol, quantized arg_params, and aux_params.
"""
warnings.warn('WARNING: This will be deprecated please use quantize_net with Gluon models')
@@ -582,9 +583,10 @@ def quantize_graph(sym, arg_params, aux_params, device=cpu(),
and a collector for naive or entropy calibration.
The backend quantized operators are only enabled for Linux systems. Please do not run
inference using the quantized models on Windows for now.
Parameters
----------
-sym : str or Symbol
+sym : Symbol
Defines the structure of a neural network for FP32 data types.
device : Device
Defines the device that users want to run forward propagation on the calibration
@@ -616,7 +618,7 @@ def quantize_graph(sym, arg_params, aux_params, device=cpu(),
The mode that quantization pass to apply. Support 'full' and 'smart'.
'full' means quantize all operator if possible.
'smart' means quantization pass will smartly choice which operator should be quantized.
-quantize_granularity: str
+quantize_granularity : str
The granularity of quantization, currently supports 'tensor-wise' and 'channel-wise'
quantization. The default value is 'tensor-wise'.
LayerOutputCollector : subclass of CalibrationCollector
@@ -700,13 +702,14 @@ def quantize_graph(sym, arg_params, aux_params, device=cpu(),
return qsym, qarg_params, aux_params, collector, calib_layers

def calib_graph(qsym, arg_params, aux_params, collector,
-calib_mode='entropy', logger=logging):
+calib_mode='entropy', logger=None):
"""User-level API for calibrating a quantized model using a filled collector.
The backend quantized operators are only enabled for Linux systems. Please do not run
inference using the quantized models on Windows for now.
Parameters
----------
-qsym : str or Symbol
+qsym : Symbol
Defines the structure of a neural network for INT8 data types.
arg_params : dict
Dictionary of name to `NDArray`.
