- Clone the repo (change the `FASTDIR` as preferred):

```shell
export FASTDIR=/workspace
cd $FASTDIR/git/
git clone https://github.com/aim-uofa/model-quantization
git clone https://github.com/blueardour/pytorch-utils
cd model-quantization
ln -s ../pytorch-utils utils
# create separate log and weight folders (optional; if the symbolic links are not created, the script will create these folders under the project path)
#mkdir -p /data/pretrained/pytorch/model-quantization/{exp,weights}
#ln -s /data/pretrained/pytorch/model-quantization/exp .
#ln -s /data/pretrained/pytorch/model-quantization/weights .
```
- Install prerequisite packages:

```shell
cd $FASTDIR/git/model-quantization
# python 3 is required
pip install -r requirement.txt
```
Quantization for the classification task has no strict requirement on the PyTorch version. However, other tasks such as detection and segmentation require a more recent PyTorch; for example, detectron2 currently requires PyTorch 1.4+. Besides, the CUDA version on the machine is advised to match the one used to compile PyTorch.
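A quick way to check the installed PyTorch and CUDA versions (plain PyTorch calls, not a script from this repo):

```python
import torch

print(torch.__version__)          # detection/segmentation need a sufficiently recent version
print(torch.version.cuda)         # CUDA version PyTorch was built with; keep it consistent with the machine's toolkit
print(torch.cuda.is_available())  # confirm the GPU is visible
```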
- Install the NVIDIA image pre-processing packages and mixed-precision training packages (optional, highly recommended).
This repo supports the ImageNet and CIFAR datasets. Create the necessary folders and prepare the datasets. Example:

```shell
# dataset
mkdir -p /data/cifar
mkdir -p /data/imagenet
# download imagenet and move the train and validation data into /data/imagenet/{train,val}, respectively.
# the cifar dataset can be downloaded on the fly
```
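For reference, the layout above is what a standard torchvision pipeline expects (illustration only; the repo's own data loading, e.g. with the NVIDIA pre-processing packages enabled, may differ):

```python
import torchvision
from torchvision import transforms

# ImageNet: /data/imagenet/train and /data/imagenet/val in ImageFolder layout
imagenet_train = torchvision.datasets.ImageFolder(
    '/data/imagenet/train', transform=transforms.ToTensor())

# CIFAR-10: downloaded on the fly into /data/cifar
cifar_train = torchvision.datasets.CIFAR10(
    '/data/cifar', train=True, download=True, transform=transforms.ToTensor())
```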
Some of the quantization results are listed in `result_cls.md`. We provide pretrained models on Google Drive.
Both training and testing use the `train.sh` script. Calling `main.py` directly is also possible:

```shell
bash train.sh config.xxxx
```

`config.xxxx` is the configuration file, which contains the network architecture, quantization-related and training-related parameters. For more about the supported options, refer to the Training script options below and `config.md`. Also refer to the examples in the `config` subfolder.
Training is often time-consuming. Try our `start_on_terminate.sh` script, which can be used to queue a second task: a new round of training starts automatically once the current training process terminates.

```shell
# wait in a screen shell
screen -S next-round
bash start_on_terminate.sh [current training thread pid] [next round config.xxxx]
# Ctrl+A D to detach the screen to the background
```
Besides, `tools.py` provides many useful functions for debugging, verbose output, and model conversion. Refer to `tools.md` for detailed usage.

See known issues.
- Since 2020.07.28, dynamic loading of the training options from a policy file is supported.
- Option parsing

Common options are parsed in `util/config.py`. Quantization-related options are defined separately in `main.py`.
- Keyword (choosing the quantization method)

The `--keyword` option is one of the most important variables for controlling the model architecture and the choice of quantization algorithm. We currently support the following quantization algorithms by adding the corresponding word to the `keyword`:

a. `lq` for LQ-Nets

b. `pact` for PACT

c. `dorefa` for DoReFa-Net. Besides, the additional keyword `lsq` selects the learned step size, and `non-uniform` selects FATNN (see the sketch after this list).

d. `xnor` for XNOR-Net. If `gamma` is combined with `xnor` in the keyword, a separate learnable scale coefficient is added (this becomes XNOR-Net++).
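For intuition about the `lsq` keyword, the following is a minimal sketch of a learned-step-size quantizer with a straight-through estimator. It is not the implementation used in this repo (per-layer step initialization and gradient scaling are omitted), and the class name and default bit width are made up for the example:

```python
import torch

class LearnedStepQuantizer(torch.nn.Module):
    """Minimal LSQ-style quantizer sketch (illustration only)."""
    def __init__(self, bit=2):
        super().__init__()
        self.qn = -(2 ** (bit - 1))         # most negative quantization level
        self.qp = 2 ** (bit - 1) - 1        # most positive quantization level
        self.step = torch.nn.Parameter(torch.tensor(1.0))  # learned step size

    def forward(self, x):
        v = torch.clamp(x / self.step, self.qn, self.qp)
        # round with a straight-through estimator so gradients reach x and step
        v_q = v + (torch.round(v) - v).detach()
        return v_q * self.step
```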
- Keyword (structure control)

The network structure is chosen by `--arch` or `--model`. For ResNet, the official ResNet model is provided with `pytorch-resnetxx`, and a more flexible ResNet architecture can be obtained by setting `--arch` or `--model` to `resnetxx`. For the latter case, many options can be combined to customize the network structure:

a. Whether `origin` exists in the `keyword` chooses whether the Bi-Real-style skip connection is preferred (block-wise skip connection versus layer-wise skip connection).

b. `bacs`, `cbas`, etc., indicate the layer order within a ResNet block. For example, `bacs` is a kind of pre-activation structure: in a ResNet block, first the normalization layer, then the activation layer, then the convolution layer, and finally the skip connection. For pre-activation structures, `preBN` is required for the first ResNet block. Refer to `resnet.md` for more information.

c. By default, all layers except the first and last are quantized. `real_skip` can be added to keep the skip-connection layers in ResNet at full precision, which is widely used in XNOR-Net and Bi-Real Net.

d. For the normalization and activation layers, we also provide some `keyword` options for different variants. For example, `NRelU` means no ReLU activation is included in the network, and `PRelU` indicates PReLU is employed. Refer to `model/layer.py` for details.

e. Padding and quantization order. I think it is an error to pad the feature map with 0 after quantization, especially in BNNs; from my perspective, that strategy turns a BNN into a TNN. Thus, I advocate padding the feature map with zeros first and then going through the quantization step. To keep compatible with the publications as well as to provide the revised method, `padding_after_quant` can be set to control the order between padding and quantization (see the sketch after this list). Refer to line 445 in `model/quant.py` for the implementation.

f. Skip-connection realization. Two choices are provided: one is average pooling with stride followed by a conv1x1 with stride 1; the other is a single conv1x1 with the demanded stride. `singleconv` in the `keyword` selects the single-convolution variant.

g. `fixup` enables the architecture from Fixup Initialization.

h. The `base` option, which is a standalone option rather than a word in the `keyword` list, is used to realize the branch configuration in Group-Net.

Self-defined `keyword`s are supported and can easily be realized according to the user's own needs. As introduced above, the options can be combined to build up different architecture variants. Examples can be found in the `config` subfolder.
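To make the padding-order argument in item e concrete, here is a small illustration (not the repo's code; the binarization function and tensor sizes are just for demonstration):

```python
import torch
import torch.nn.functional as F

def binarize(x):
    # map to {-1, +1} with a straight-through estimator for the gradient
    b = torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))
    return x + (b - x).detach()

x = torch.randn(1, 8, 4, 4)

# pad first, then quantize: the padded zeros are binarized as well,
# so every value stays in {-1, +1}
pad_then_quant = binarize(F.pad(x, (1, 1, 1, 1), value=0.0))

# quantize first, then pad with zeros: the border keeps literal zeros, so the
# tensor now holds three values {-1, 0, +1}, i.e. the BNN effectively becomes a TNN
quant_then_pad = F.pad(binarize(x), (1, 1, 1, 1), value=0.0)

print(pad_then_quant.unique())   # tensor([-1., 1.])
print(quant_then_pad.unique())   # tensor([-1., 0., 1.])
```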
- Activation and weight quantization options

The script provides independent configurations for activations and weights, respectively. Some advanced options:

  - `xx_quant_group` sets the number of groups for the quantization parameters along the channel dimension.

  - `xx_adaptive` in most cases indicates an additional normalization operation, which shows great potential to improve performance.

  - `xx_grad_type` defines a custom gradient approximation method. In general, the quantization step is not differentiable, so techniques such as the STE are used to approximate the gradient. Other types of approximation exist. Besides, some works advocate adding a scale coefficient to the gradient in order to stabilize training.
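As a rough sketch of what per-group quantization along the channel dimension means (an illustration of the idea behind `xx_quant_group` only, not the repo's implementation; the function name and defaults are made up):

```python
import torch

def quantize_weight_per_group(w, bit=4, groups=4):
    # split the output channels into `groups` groups, each with its own scale
    qmax = 2 ** (bit - 1) - 1
    w_g = w.reshape(groups, -1)
    scale = (w_g.abs().max(dim=1, keepdim=True)[0] / qmax).clamp_min(1e-8)
    q = torch.clamp(torch.round(w_g / scale), -qmax - 1, qmax)
    w_q = (q * scale).reshape_as(w)
    # straight-through estimator: forward returns w_q, backward behaves like identity
    return w + (w_q - w).detach()

w = torch.randn(16, 8, 3, 3)   # e.g. a conv weight with 16 output channels
w_q = quantize_weight_per_group(w, bit=4, groups=4)
```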
- Weight decay

Three major related options:

  - `--wd` sets the default L2 weight decay value.

  - Weight decay was originally proposed to avoid overfitting caused by the large number of parameters. For some small tensors, for example the parameters in a BatchNorm layer (as well as custom quantization parameters such as clip values), the weight decay is advocated to be zero. `--decay_small` controls whether to decay those small tensors or not.

  - `--custom_decay_list` and `--custom_decay` are combined to give certain parameters a specific custom decay value. For example, in PACT, the clip boundary can have its own independent weight decay for regularization.
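These options map onto PyTorch's standard parameter groups; a minimal sketch of the idea (the name test `'clip' in name` is hypothetical and this is not the repo's optimizer setup):

```python
import torch

def build_param_groups(model, wd=1e-4, custom_decay=1e-5):
    # default decay for weights, no decay for small (1-D) tensors such as
    # BatchNorm parameters, and a custom decay for quantization clip values
    decay, no_decay, custom = [], [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if 'clip' in name:            # hypothetical name of a PACT-style clip value
            custom.append(p)
        elif p.dim() <= 1:            # biases / BatchNorm affine parameters
            no_decay.append(p)
        else:
            decay.append(p)
    return [
        {'params': decay, 'weight_decay': wd},
        {'params': no_decay, 'weight_decay': 0.0},
        {'params': custom, 'weight_decay': custom_decay},
    ]

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.BatchNorm2d(8))
optimizer = torch.optim.SGD(build_param_groups(model), lr=0.1, momentum=0.9)
```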
- Learning rate

Several schedules are supported:

  - multi-step decay

  - poly decay

  - SGDR (with restart)

  - `--custom_lr_list` and `--custom_lr` are provided, similarly to the weight decay options above, to give certain parameters a specific custom learning rate.
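For intuition, the three schedules map an epoch to a learning-rate multiplier roughly as follows (a sketch of the shapes only; the repo's option names and exact formulas may differ):

```python
import math

def multi_step(epoch, milestones=(30, 60, 90), gamma=0.1):
    # drop the learning rate by `gamma` at each milestone
    return gamma ** sum(epoch >= m for m in milestones)

def poly(epoch, total=120, power=2.0):
    # polynomial decay towards zero at the last epoch
    return (1.0 - epoch / total) ** power

def sgdr(epoch, period=30):
    # cosine decay that restarts every `period` epochs
    return 0.5 * (1.0 + math.cos(math.pi * (epoch % period) / period))
```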
- Mixed precision training options

`--fp16` and `--opt_level [O1]` are provided for mixed-precision training:

  - FP32

  - FP16 with a custom level; the `O1` level is recommended.
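The `--opt_level` values correspond to NVIDIA apex's amp optimization levels. A minimal sketch of what O1 initialization looks like in plain apex (not this repo's training loop; it requires apex to be installed and a CUDA device):

```python
import torch
from apex import amp

model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# O1: cast to FP16 where safe while keeping FP32 master weights
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

loss = model(torch.randn(4, 128).cuda()).mean()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()      # loss scaling avoids FP16 gradient underflow
optimizer.step()
```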