diff --git a/docs/python_docs/python/api/contrib/index.rst b/docs/python_docs/python/api/contrib/index.rst
index 1319239abf94..f49a8e978548 100644
--- a/docs/python_docs/python/api/contrib/index.rst
+++ b/docs/python_docs/python/api/contrib/index.rst
@@ -67,6 +67,11 @@ Contributed modules
 
       Functions for manipulating text data.
 
+   .. card::
+      :title: contrib.quantization
+      :link: quantization/index.html
+
+      Functions for precision reduction.
 
 .. toctree::
    :hidden:
diff --git a/docs/python_docs/python/api/contrib/quantization/index.rst b/docs/python_docs/python/api/contrib/quantization/index.rst
new file mode 100644
index 000000000000..a0f7ca56eefd
--- /dev/null
+++ b/docs/python_docs/python/api/contrib/quantization/index.rst
@@ -0,0 +1,23 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+   or more contributor license agreements.  See the NOTICE file
+   distributed with this work for additional information
+   regarding copyright ownership.  The ASF licenses this file
+   to you under the Apache License, Version 2.0 (the
+   "License"); you may not use this file except in compliance
+   with the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing,
+   software distributed under the License is distributed on an
+   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+   KIND, either express or implied.  See the License for the
+   specific language governing permissions and limitations
+   under the License.
+
+contrib.quantization
+====================
+
+.. automodule:: mxnet.contrib.quantization
+    :members:
+    :autosummary:
diff --git a/docs/python_docs/python/tutorials/getting-started/gluon_migration_guide.md b/docs/python_docs/python/tutorials/getting-started/gluon_migration_guide.md
index 81fa54dc3a8f..e142ccc32a68 100644
--- a/docs/python_docs/python/tutorials/getting-started/gluon_migration_guide.md
+++ b/docs/python_docs/python/tutorials/getting-started/gluon_migration_guide.md
@@ -432,6 +432,67 @@ A new module called `mxnet.gluon.probability` has been introduced in Gluon 2.0.
 
 3. [Transformation](https://github.com/apache/incubator-mxnet/tree/master/python/mxnet/gluon/probability/transformation): implement invertible transformation with computable log det jacobians.
 
+## oneDNN Integration
+### Operator Fusion
+In MXNet 1.x, operator pattern fusion in the execution graph was enabled by default when MXNet was built with oneDNN support, and it could be disabled by setting the `MXNET_SUBGRAPH_BACKEND` environment variable to `None`. MXNet 2.0 introduced changes to the forward inference flow which led to a refactor of the fusion mechanism. To fuse a model in MXNet 2.0, two requirements must be met:
+
+ - the model must be defined as a subclass of HybridBlock or Symbol,
+
+ - the model must contain specific operator patterns which can be fused.
+
+Both the HybridBlock and Symbol classes provide an API to run operator fusion. Only one line of code is needed to run the fusion passes on a model:
+```{.python}
+# on HybridBlock
+net.optimize_for(data, backend='ONEDNN')
+# on Symbol
+optimized_symbol = sym.optimize_for(backend='ONEDNN')
+```
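+
+As a fuller illustration, below is a minimal sketch of running fusion end to end on a small HybridBlock. The network, layer sizes and input shape are illustrative assumptions only; any model containing fusable patterns (e.g. convolution followed by batch norm or an activation) works the same way:
+```{.python}
+import mxnet as mx
+from mxnet.gluon import nn
+
+# toy network with a fusable Conv -> BatchNorm -> ReLU pattern
+net = nn.HybridSequential()
+net.add(nn.Conv2D(channels=8, kernel_size=3),
+        nn.BatchNorm(),
+        nn.Activation('relu'))
+net.initialize()
+
+# sample input used to trace the graph while the fusion passes run
+data = mx.nd.ones((1, 3, 224, 224))
+net.optimize_for(data, backend='ONEDNN')
+```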
+
+Controlling which patterns are fused can still be done by setting the proper environment variables. See [**oneDNN Environment Variables**](#oneDNN-Environment-Variables).
+
+### INT8 Quantization / Precision reduction
+The quantization API was also refactored to be consistent with other new features and mechanisms. Compared to MXNet 1.x releases, in MXNet 2.0 the `quantize_net_v2` function has been removed and development focused mainly on the `quantize_net` function, to make it easier to use and ultimately give the end user more flexibility.
+Quantization can be performed either on a subclass of HybridBlock with `quantize_net`, or on a Symbol with the deprecated `quantize_model` (`quantize_model` is kept only for backward compatibility and its use is strongly discouraged).
+
+```{.python}
+import mxnet as mx
+from mxnet.contrib.quantization import quantize_net
+from mxnet.gluon.model_zoo.vision import resnet50_v1
+
+# load model
+net = resnet50_v1(pretrained=True)
+
+# prepare calibration data
+batch_size = 16
+dummy_data = mx.nd.random.uniform(-1.0, 1.0, (batch_size, 3, 224, 224))
+calib_data_loader = mx.gluon.data.DataLoader(dummy_data, batch_size=batch_size)
+
+# quantization
+qnet = quantize_net(net, calib_mode='naive', calib_data=calib_data_loader)
+```
+`quantize_net` offers many more options - all function parameters are described in the [API](../../api/contrib/quantization/index.rst).
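+
+The quantization knobs documented for the underlying passes (`quantized_dtype`, `quantize_mode`, `quantize_granularity`, `calib_mode`) can be passed to `quantize_net` as keyword arguments as well. Below is a hedged sketch reusing `net` and `calib_data_loader` from the example above; the exact set of accepted parameters should be verified against the API reference:
+```{.python}
+# sketch only - verify the keyword names against the quantize_net API
+qnet = quantize_net(net,
+                    quantized_dtype='auto',               # assumed destination type
+                    quantize_mode='smart',                # 'full' or 'smart'
+                    quantize_granularity='channel-wise',  # or 'tensor-wise'
+                    calib_mode='entropy',
+                    calib_data=calib_data_loader)
+```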
+
+### oneDNN Environment Variables
+In MXNet 2.0 all references to MKLDNN (the former name of oneDNN) were replaced by ONEDNN. The table below lists all affected environment variables:
+
+| MXNet 1.x                            | MXNet 2.0                               |
+| ------------------------------------ | --------------------------------------- |
+| MXNET_MKLDNN_ENABLED                 | MXNET_ONEDNN_ENABLED                    |
+| MXNET_MKLDNN_CACHE_NUM               | MXNET_ONEDNN_CACHE_NUM                  |
+| MXNET_MKLDNN_FORCE_FC_AB_FORMAT      | MXNET_ONEDNN_FORCE_FC_AB_FORMAT         |
+| MXNET_MKLDNN_DEBUG                   | MXNET_ONEDNN_DEBUG                      |
+| MXNET_USE_MKLDNN_RNN                 | MXNET_USE_ONEDNN_RNN                    |
+| MXNET_DISABLE_MKLDNN_CONV_OPT        | MXNET_DISABLE_ONEDNN_CONV_OPT           |
+| MXNET_DISABLE_MKLDNN_FUSE_CONV_BN    | MXNET_DISABLE_ONEDNN_FUSE_CONV_BN       |
+| MXNET_DISABLE_MKLDNN_FUSE_CONV_RELU  | MXNET_DISABLE_ONEDNN_FUSE_CONV_RELU     |
+| MXNET_DISABLE_MKLDNN_FUSE_CONV_SUM   | MXNET_DISABLE_ONEDNN_FUSE_CONV_SUM      |
+| MXNET_DISABLE_MKLDNN_FC_OPT          | MXNET_DISABLE_ONEDNN_FC_OPT             |
+| MXNET_DISABLE_MKLDNN_FUSE_FC_ELTWISE | MXNET_DISABLE_ONEDNN_FUSE_FC_ELTWISE    |
+| MXNET_DISABLE_MKLDNN_TRANSFORMER_OPT | MXNET_DISABLE_ONEDNN_TRANSFORMER_OPT    |
+| n/a                                  | MXNET_DISABLE_ONEDNN_BATCH_DOT_FUSE     |
+| n/a                                  | MXNET_ONEDNN_FUSE_REQUANTIZE            |
+| n/a                                  | MXNET_ONEDNN_FUSE_DEQUANTIZE            |
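+
+For example, a single fusion pattern can be switched off through its variable. A minimal sketch follows; since the exact moment the variable is read is an implementation detail, it is safest to set it before importing mxnet (or in the shell that launches the process):
+```{.python}
+import os
+# disable the conv + relu fusion pattern
+os.environ['MXNET_DISABLE_ONEDNN_FUSE_CONV_RELU'] = '1'
+
+import mxnet as mx
+from mxnet.gluon import nn
+
+net = nn.HybridSequential()
+net.add(nn.Conv2D(channels=8, kernel_size=3), nn.Activation('relu'))
+net.initialize()
+# conv + relu stays unfused during the oneDNN optimization passes
+net.optimize_for(mx.nd.ones((1, 3, 224, 224)), backend='ONEDNN')
+```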
+
 ## Appendix
 ### NumPy Array Deprecated Attributes
 | Deprecated Attributes | NumPy ndarray Equivalent |
diff --git a/python/mxnet/contrib/quantization.py b/python/mxnet/contrib/quantization.py
index be0282fe8a81..4ad354a7d2e2 100644
--- a/python/mxnet/contrib/quantization.py
+++ b/python/mxnet/contrib/quantization.py
@@ -46,7 +46,7 @@ def _quantize_params(qsym, params, min_max_dict):
     qsym : Symbol
         Quantized symbol from FP32 symbol.
     params : dict of str->NDArray
-    min_max_dict: dict of min/max pairs of layers' output
+    min_max_dict : dict of min/max pairs of layers' output
     """
     inputs_name = qsym.list_arguments()
     quantized_params = {}
@@ -110,11 +110,11 @@ def _quantize_symbol(sym, device, excluded_symbols=None, excluded_operators=None
         Names of the parameters that users want to quantize offline. It's always recommended to
         quantize parameters offline so that quantizing parameters during the inference can be
         avoided.
-    quantized_dtype: str
+    quantized_dtype : str
         The quantized destination type for input data.
-    quantize_mode: str
+    quantize_mode : str
         The mode that quantization pass to apply.
-    quantize_granularity: str
+    quantize_granularity : str
         The granularity of quantization, currently supports 'tensor-wise' and
         'channel-wise' quantization. The default value is 'tensor-wise'.
     """
@@ -174,15 +174,16 @@ def __init__(self):
     def collect(self, name, op_name, arr):
         """Function which is registered to Block as monitor callback. Names of layers
         requiring calibration are stored in `self.include_layers` variable.
-        Parameters
-        ----------
-        name : str
-            Node name from which collected data comes from
-        op_name : str
-            Operator name from which collected data comes from. Single operator
-            can have multiple inputs/ouputs nodes - each should have different name
-        arr : NDArray
-            NDArray containing data of monitored node
+
+        Parameters
+        ----------
+        name : str
+            Node name from which the collected data comes.
+        op_name : str
+            Operator name from which the collected data comes. A single operator
+            can have multiple input/output nodes - each should have a different name.
+        arr : NDArray
+            NDArray containing data of the monitored node.
         """
 
     def post_collect(self):
@@ -227,8 +228,7 @@ def post_collect(self):
 
     @staticmethod
     def combine_histogram(old_hist, arr, new_min, new_max, new_th):
-        """ Collect layer histogram for arr and combine it with old histogram.
-        """
+        """Collect layer histogram for arr and combine it with old histogram."""
         (old_hist, old_hist_edges, old_min, old_max, old_th) = old_hist
         if new_th <= old_th:
             hist, _ = np.histogram(arr, bins=len(old_hist), range=(-old_th, old_th))
@@ -392,21 +392,22 @@ def quantize_model(sym, arg_params, aux_params, data_names=('data',),
     The backend quantized operators are only enabled for Linux systems. Please do not run
     inference using the quantized models on Windows for now.
     The quantization implementation adopts the TensorFlow's approach:
-    https://www.tensorflow.org/performance/quantization.
+    https://www.tensorflow.org/lite/performance/post_training_quantization.
     The calibration implementation borrows the idea of Nvidia's 8-bit Inference with TensorRT:
     http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf
     and adapts the method to MXNet.
 
     .. _`quantize_model_params`:
+
     Parameters
     ----------
-    sym : str or Symbol
+    sym : Symbol
         Defines the structure of a neural network for FP32 data types.
     arg_params : dict
         Dictionary of name to `NDArray`.
     aux_params : dict
         Dictionary of name to `NDArray`.
-    data_names : a list of strs
+    data_names : list of strings
         Data names required for creating a Module object to run forward propagation on the
         calibration dataset.
     device : Device
@@ -441,7 +442,7 @@
         The mode that quantization pass to apply. Support 'full' and 'smart'.
         'full' means quantize all operator if possible. 'smart' means quantization pass will
         smartly choice which operator should be quantized.
-    quantize_granularity: str
+    quantize_granularity : str
         The granularity of quantization, currently supports 'tensor-wise' and
         'channel-wise' quantization. The default value is 'tensor-wise'.
     logger : Object
@@ -449,7 +450,7 @@
         A logging object for printing information during the process of quantization.
     Returns
     -------
-    quantized_model: tuple
+    quantized_model : tuple
         A tuple of quantized symbol, quantized arg_params, and aux_params.
     """
     warnings.warn('WARNING: This will be deprecated please use quantize_net with Gluon models')
@@ -582,9 +583,10 @@ def quantize_graph(sym, arg_params, aux_params, device=cpu(),
     and a collector for naive or entropy calibration.
     The backend quantized operators are only enabled for Linux systems. Please do not run
     inference using the quantized models on Windows for now.
+
     Parameters
     ----------
-    sym : str or Symbol
+    sym : Symbol
         Defines the structure of a neural network for FP32 data types.
     device : Device
         Defines the device that users want to run forward propagation on the calibration
@@ -616,7 +618,7 @@
         The mode that quantization pass to apply. Support 'full' and 'smart'.
         'full' means quantize all operator if possible. 'smart' means quantization pass will
         smartly choice which operator should be quantized.
-    quantize_granularity: str
+    quantize_granularity : str
         The granularity of quantization, currently supports 'tensor-wise' and
         'channel-wise' quantization. The default value is 'tensor-wise'.
     LayerOutputCollector : subclass of CalibrationCollector
@@ -700,13 +702,14 @@
     return qsym, qarg_params, aux_params, collector, calib_layers
 
 def calib_graph(qsym, arg_params, aux_params, collector,
-                calib_mode='entropy', logger=logging):
+                calib_mode='entropy', logger=None):
     """User-level API for calibrating a quantized model using a filled collector.
     The backend quantized operators are only enabled for Linux systems. Please do not run
     inference using the quantized models on Windows for now.
+
     Parameters
     ----------
-    qsym : str or Symbol
+    qsym : Symbol
         Defines the structure of a neural network for INT8 data types.
     arg_params : dict
         Dictionary of name to `NDArray`.