This is a model zoo project built on PyTorch. In this repo I will implement some basic classification models that perform well on ImageNet, train them as fairly as possible, and try my best to get SOTA models on ImageNet. In this repo I only consider FP16.
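Since everything here is trained in FP16, the sketch below shows what a single mixed-precision training step with `torch.cuda.amp` (PyTorch >= 1.6) might look like. The tiny linear model, the random batch, and the hyperparameters are placeholders for illustration, not this repo's training loop; on a machine without CUDA the sketch falls back to plain FP32.

```python
import torch
from torch import nn
from torch.cuda.amp import GradScaler, autocast

# Fall back to FP32 on machines without a GPU so the sketch still runs.
use_amp = torch.cuda.is_available()
device = torch.device("cuda" if use_amp else "cpu")

# Placeholder model and optimizer (assumptions, not taken from this repo).
model = nn.Linear(16, 4).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)
scaler = GradScaler(enabled=use_amp)  # scales the loss to avoid FP16 gradient underflow

# Random stand-in batch: 8 samples, 16 features, 4 classes.
inputs = torch.randn(8, 16, device=device)
targets = torch.randint(0, 4, (8,), device=device)


def train_step(inputs, targets):
    optimizer.zero_grad()
    # The forward pass and loss run in mixed precision when AMP is enabled.
    with autocast(enabled=use_amp):
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales gradients, then steps the optimizer
    scaler.update()                # adjusts the scale factor for the next step
    return loss.item()


loss_value = train_step(inputs, targets)
print(f"loss: {loss_value:.4f}")
```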
- OS: Ubuntu 18.04
- CUDA: 10.1, CuDNN: 7.6
- Devices: I use 8 * RTX 2080ti (8 * V100 would be much better /cry). This project uses FP16 precision, so FP16-friendly devices such as the RTX series or V100 are recommended. If you want to fully reproduce my results, you should use the same batch size as I do.
- PyTorch: >= 1.6.0 (torch.cuda.amp, available from version 1.6, is required)
- TorchToolbox: stable version. It provides helper functions that make your code simpler and more readable. It is optional; if you don't want to use it, just write the helpers yourself.
- Not necessary.
If you find an IO bottleneck, please use the LMDB-format dataset. A good approach is to try both formats and find out which is faster.
I provide a conversion script here.
python distribute_train_script.py --params
Here is an example:
python distribute_train_script.py --data-path /s4/piston/ImageNet --batch-size 256 --dtype float16 \
-j 48 --epochs 360 --lr 2.6 --warmup-epochs 5 --label-smoothing \
--no-wd --wd 0.00003 --model GhostNet --log-interval 150 --model-info \
--dist-url tcp://127.0.0.1:26548 --world-size 1 --rank 0
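The `--no-wd` flag combined with `--wd 0.00003` in the command above presumably corresponds to the "No decay bias" trick [7]: weight decay is applied to convolution and linear weights only, not to biases or normalization parameters. A minimal sketch of that idea using optimizer parameter groups follows. The helper name `split_decay_params` and the toy model are hypothetical, and the training script may implement this differently.

```python
import torch
from torch import nn


def split_decay_params(model):
    """Return (decay, no_decay) parameter lists.

    Assumption: every 1-D parameter (biases, BatchNorm gamma/beta) is
    excluded from weight decay; everything else is decayed.
    """
    decay, no_decay = [], []
    for param in model.parameters():
        if not param.requires_grad:
            continue
        if param.ndim <= 1:
            no_decay.append(param)
        else:
            decay.append(param)
    return decay, no_decay


# Toy model used only to show which parameters land in which group.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
)

decay, no_decay = split_decay_params(model)
optimizer = torch.optim.SGD(
    [
        {"params": decay, "weight_decay": 3e-5},   # conv / linear weights
        {"params": no_decay, "weight_decay": 0.0},  # biases and BN parameters
    ],
    lr=2.6,
    momentum=0.9,
    nesterov=True,
)
print(len(decay), "decayed tensors,", len(no_decay), "non-decayed tensors")
```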
- Resume training
- Try Nvidia-DALI
- Multi-node (distributed) training with Apex, BytePS, or PyTorch
- I may try AutoAugment. This project aims to train models ourselves in order to observe and learn; it's impossible for me to train this myself, and simply copying it feels meaningless.
model | epochs | dtype | batch size* | gpus | lr | tricks | Params(M)/FLOPs | top1/top5 | params/logs |
---|---|---|---|---|---|---|---|---|---|
resnet50 | 120 | FP16 | 128 | 8 | 0.4 | - | 25.6/4.1G | 77.36/- | Google Drive |
resnet101 | 120 | FP16 | 128 | 8 | 0.4 | - | 44.7/7.8G | 79.13/94.38 | Google Drive |
resnet50v2 | 120 | FP16 | 128 | 8 | 0.4 | - | 25.6/4.1G | 77.06/93.44 | Google Drive |
resnet101v2 | 120 | FP16 | 128 | 8 | 0.4 | - | 44.6/7.8G | 78.90/94.39 | Google Drive |
ResNext50_32x4d | 120 | FP16 | 128 | 8 | 0.4 | - | 25.1/4.2G | 79.00/94.39 | |
RegNetX4_0GF | 120 | FP16 | 128 | 8 | 0.4 | - | 22.2/4.0G | 78.40/94.04 | |
RegNetY4_0GF | 120 | FP16 | 128 | 8 | 0.4 | - | 22.1/4.0G | 79.22/94.57 | |
RegNetY6_4GF | 120 | FP16 | 128 | 8 | 0.4 | - | 31.2/6.4G | 79.69/94.82 | |
ResNeST50 | 120 | FP16 | 128 | 8 | 0.4 | - | 27.5/4.1G | 78.62/94.28 | |
mobilenetv1 | 150 | FP16 | 256 | 8 | 0.4 | - | 4.3/572.2M | 72.17/90.70 | Google Drive |
mobilenetv2 | 150 | FP16 | 256 | 8 | 0.4 | - | 3.5/305.3M | 71.94/90.59 | Google Drive |
mobilenetv3 Large | 360 | FP16 | 256 | 8 | 2.6 | Label smoothing, No decay bias, Dropout | 5.5/219M | 75.64/92.61 | Google Drive |
mobilenetv3 Small | 360 | FP16 | 256 | 8 | 2.6 | Label smoothing, No decay bias, Dropout | 3.0/57.8M | 67.83/87.78 | |
GhostNet1.3 | 360 | FP16 | 400 | 8 | 2.6 | Label smoothing, No decay bias, Dropout | 7.4/230.4M | 75.78/92.77 | Google Drive |
- I use Nesterov SGD and cosine lr decay with 5 warmup epochs by default[2][3] (to save time); this setup is more common and effective.
- *Batch size is the batch size held by each GPU. The total batch size is (batch size * gpus).
- In progress.
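To make the default schedule concrete, here is a minimal sketch of linear warmup followed by cosine decay, updated once per iteration, using the resnet50 settings from the table (lr 0.4, 120 epochs, 128 * 8 total batch size). The function name `lr_at_step`, the warmup starting near zero, the decay ending at zero, and the ImageNet-1k training-set size of 1,281,167 images are assumptions for illustration; the scheduler actually used in this repo may differ in these details.

```python
import math


def lr_at_step(step, base_lr, warmup_steps, total_steps):
    """Linear warmup up to base_lr, then cosine decay towards 0 (assumed shape)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


per_gpu_batch, gpus = 128, 8
total_batch = per_gpu_batch * gpus  # total batch size = batch size * gpus

# Assumes the last partial batch of each epoch is kept.
steps_per_epoch = math.ceil(1_281_167 / total_batch)

epochs, warmup_epochs, base_lr = 120, 5, 0.4
total_steps = epochs * steps_per_epoch
warmup_steps = warmup_epochs * steps_per_epoch

# Print the learning rate at the first iteration of a few epochs.
for epoch in (0, 5, 60, 119):
    step = epoch * steps_per_epoch
    lr = lr_at_step(step, base_lr, warmup_steps, total_steps)
    print(f"epoch {epoch:3d}: lr = {lr:.4f}")
```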
Many tricks for improving accuracy have appeared in recent years (if you have another idea, please open an issue). I want to verify them in a fair way.
Tricks: RandomRotation, OctConv[14], Dropout, Label Smoothing[4], Sync BN, SwitchNorm[6], Mixup[17], no decay bias[7],
Cutout[5], Relu6[18], swish activation[10], Stochastic Depth[9], Lookahead Optimizer[11], Pre-active(ResnetV2)[12],
DCNv2[13], LIP[16].
- Strikethrough means the trick made me run out of memory.
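As an illustration of one trick from the list, label smoothing [4] replaces the one-hot target with a mixture of the one-hot vector and a uniform distribution. The sketch below uses plain Python with smoothing=0.1. The helper names are hypothetical, and the formulation (uniform mass spread over all classes, including the true one) is one common variant, not necessarily the one used in this repo.

```python
import math


def smooth_targets(label, num_classes, smoothing=0.1):
    """(1 - smoothing) on the true class plus smoothing / num_classes on every class."""
    uniform = smoothing / num_classes
    return [
        uniform + (1.0 - smoothing if k == label else 0.0)
        for k in range(num_classes)
    ]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def smoothed_cross_entropy(logits, label, smoothing=0.1):
    """Cross-entropy between the smoothed target and the softmax of the logits."""
    targets = smooth_targets(label, len(logits), smoothing)
    return -sum(t * lp for t, lp in zip(targets, log_softmax(logits)))


logits = [2.0, 0.5, -1.0, 0.1]
print(smooth_targets(1, 4))
print(smoothed_cross_entropy(logits, label=0))
```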
Special: zero-initialize the last BN, which I just call 'Zero γ'; it applies only to post-activation models.
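A minimal sketch of 'Zero γ' on a simplified post-activation residual block follows. The block definition is a toy stand-in, not this repo's ResNet code. Because the scale γ of the last BN starts at zero, the residual branch initially outputs zero and the block reduces to ReLU(x). In a pre-activation block (ResNetV2) the branch ends with a convolution rather than a BN, which is presumably why the trick is limited to post-activation models.

```python
import torch
from torch import nn


class ResidualBlock(nn.Module):
    """Toy post-activation block: conv-BN-ReLU-conv-BN, add shortcut, ReLU."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()
        # Zero γ: the last BN of the residual branch starts with scale 0.
        nn.init.zeros_(self.bn2.weight)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)


block = ResidualBlock(8)
x = torch.randn(2, 8, 4, 4)
out = block(x)

# At initialization the residual branch contributes nothing.
print(torch.allclose(out, torch.relu(x)))
```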
I'll only use 120 epochs and a 128*8 batch size to train them. I know some tricks may need longer training or a larger batch size, but that would not be fair to the others. You can think of the results as performance under the current setup.
model | epochs | dtype | batch size* | gpus | lr | tricks | degree | top1/top5 | improve | params/logs |
---|---|---|---|---|---|---|---|---|---|---|
resnet50 | 120 | FP16 | 128 | 8 | 0.4 | - | - | 77.36/- | baseline | Google Drive |
resnet50 | 120 | FP16 | 128 | 8 | 0.4 | Label smoothing | smoothing=0.1 | 77.78/93.80 | +0.42 | Google Drive |
resnet50 | 120 | FP16 | 128 | 8 | 0.4 | No decay bias | - | 77.28/93.61 | -0.08 | Google Drive |
resnet50 | 120 | FP16 | 128 | 8 | 0.4 | Sync BN | - | 77.31/93.49 | -0.05 | Google Drive |
resnet50 | 120 | FP16 | 128 | 8 | 0.4 | Mixup | alpha=0.2 | 77.49/93.73 | +0.13 | missing |
resnet50 | 120 | FP16 | 128 | 8 | 0.4 | RandomRotation | degree=15 | 76.64/93.28 | -1.15 | Google Drive |
resnet50 | 120 | FP16 | 128 | 8 | 0.4 | Cutout | read code | 77.44/93.62 | +0.08 | Google Drive |
resnet50 | 120 | FP16 | 128 | 8 | 0.4 | Dropout | rate=0.3 | 77.11/93.58 | -0.25 | Google Drive |
resnet50 | 120 | FP16 | 128 | 8 | 0.4 | Lookahead-SGD | - | 77.23/93.39 | -0.13 | Google Drive |
resnet50v2 | 120 | FP16 | 128 | 8 | 0.4 | pre-active | - | 77.06/93.44 | -0.30 | Google Drive |
oct_resnet50 | 120 | FP16 | 128 | 8 | 0.4 | OctConv | alpha=0.125 | - | - | |
resnet50 | 120 | FP16 | 128 | 8 | 0.4 | Relu6 | - | 77.28/93.5 | -0.08 | Google Drive |
resnet50 | 120 | FP16 | 128 | 8 | 0.4 | - | - | 77.00/- | DDP baseline | |
resnet50 | 120 | FP16 | 128 | 8 | 0.4 | Gradient Centralization | Conv only | 77.40/93.57 | +0.40 | |
resnet50 | 120 | FP16 | 128 | 8 | 0.4 | Zero γ | - | 77.24/- | +0.24 | |
resnet50 | 120 | FP16 | 128 | 8 | 0.4 | No decay bias | - | 77.74/93.77 | +0.74 | |
resnet50 | 120 | FP16 | 128 | 8 | 0.4 | RandAugment | n=2,m=9 | 76.44/93.18 | -0.96 | |
resnet50 | 120 | FP16 | 128 | 8 | 0.4 | AutoAugment | - | 76.50/93.23 | -0.50 | |
- More epochs for `Mixup`, `Cutout`, `Dropout` may get better results.
- Auto/Rand Augment may do better when trained for 180 epochs.
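For reference, Mixup [17] with alpha=0.2 (the setting in the table) can be sketched in plain Python as below: draw λ from Beta(α, α), then blend two inputs and their one-hot labels with the same λ. The function name and the list-based "images" are placeholders for illustration; real implementations typically mix whole batches of tensors.

```python
import random


def mixup_pair(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two samples and their one-hot labels with lambda ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


rng = random.Random(0)  # fixed seed so repeated runs give the same draw

# Two toy "images" (flat lists) with one-hot labels for a 2-class problem.
x_a, y_a = [1.0, 1.0, 1.0], [1.0, 0.0]
x_b, y_b = [0.0, 0.0, 0.0], [0.0, 1.0]

mixed_x, mixed_y, lam = mixup_pair(x_a, y_a, x_b, y_b, alpha=0.2, rng=rng)
print(lam, mixed_x, mixed_y)
```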
@misc{ModelZoo.pytorch,
title = {Basic deep conv neural network reproduce and explore},
author = {X.Yang},
URL = {https://github.com/PistonY/ModelZoo.pytorch},
year = {2019}
}
- [1] Deep Residual Learning for Image Recognition
- [2] Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
- [3] Bag of Tricks for Image Classification with Convolutional Neural Networks
- [4] Rethinking the Inception Architecture for Computer Vision
- [5] Improved Regularization of Convolutional Neural Networks with Cutout
- [6] Differentiable Learning-to-Normalize via Switchable Normalization (OpenSource)
- [7] Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes
- [8] Mixed Precision Training
- [9] Deep Networks with Stochastic Depth
- [10] Searching for Activation Functions
- [11] Lookahead Optimizer: k steps forward, 1 step back
- [12] Identity Mappings in Deep Residual Networks
- [13] Deformable ConvNets v2: More Deformable, Better Results
- [14] Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution
- [15] Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups
- [16] LIP: Local Importance-based Pooling
- [17] mixup: Beyond Empirical Risk Minimization
- [18] Gradient Centralization: A New Optimization Technique for Deep Neural Networks