A PyTorch implementation of DeepLabv3+.
Mean IoU and Overall Accuracy are calculated from the confusion matrix (see metrics/stream_metrics.py for details).
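For reference, here is a minimal NumPy sketch of how these two scores fall out of a confusion matrix (the helper name `scores_from_confusion` is illustrative, not the repo's API; see metrics/stream_metrics.py for the actual implementation):

```python
import numpy as np

def scores_from_confusion(hist):
    """hist[i, j] = number of pixels with ground-truth class i predicted as class j."""
    tp = np.diag(hist)
    overall_acc = tp.sum() / hist.sum()
    # Per-class IoU = TP / (TP + FP + FN); NaN for classes absent
    # from both ground truth and prediction, which nanmean skips.
    iou = tp / (hist.sum(axis=1) + hist.sum(axis=0) - tp)
    return overall_acc, np.nanmean(iou)
```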
The models are trained with a small batch size (8) and fixed batch normalization due to GPU memory limitations; training DeepLab requires about 8 GB on a single Quadro P5000. Use a larger batch size and fine-tune batch norm if you want better performance.
Backbone | Output Stride (Train/Val) | Overall Acc | Mean IoU | Fix BN | Separable Conv |
---|---|---|---|---|---|
ResNet101 | 16/16 | 94.03% | 76.88% | Yes | No |
ResNet101 | 16/16 | 94.01% | 76.74% | Yes | Yes |
ResNet101 (Paper) | 16/16 | - | 78.85% | No | Yes |
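The "Fix BN" setting above keeps batch norm statistics frozen during training. A minimal sketch of what such freezing could look like in PyTorch (the helper name `fix_batchnorm` is hypothetical; see the repo for the actual mechanism):

```python
import torch.nn as nn

def fix_batchnorm(model):
    # Put every BatchNorm2d layer in eval mode (so running stats stop
    # updating) and freeze its affine parameters.
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.eval()
            for p in m.parameters():
                p.requires_grad = False
```

Note that calling `model.train()` puts batch norm layers back into training mode, so a helper like this has to be re-applied after every `train()` call.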
- Pytorch
- Torchvision
- Numpy
- Pillow
- scikit-learn
- tqdm
- matplotlib
- visdom
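Assuming a pip-based environment, the dependencies above could be installed with something like (package names are the usual PyPI ones; adjust versions to your setup):

```shell
pip install torch torchvision numpy pillow scikit-learn tqdm matplotlib visdom
```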
You can run train.py with the "--download" option to download and extract the Pascal VOC dataset. The default path is './datasets/data', which should look like this:
/data
/VOCdevkit
/VOC2012
/SegmentationClass
/JPEGImages
...
...
/VOCtrainval_11-May-2012.tar
...
See Section 4 of [2]: "The original dataset contains 1464 (train), 1449 (val), and 1456 (test) pixel-level annotated images. We augment the dataset by the extra annotations provided by [76], resulting in 10582 (trainaug) training images. The performance is measured in terms of pixel intersection-over-union averaged across the 21 classes (mIOU)."
./datasets/data/train_aug.txt lists the names of the 10582 trainaug images (val images are excluded). You need to download the extra annotations from Dropbox or Tencent Weiyun; those annotations come from DrSleep's repo.
Extract the trainaug files (SegmentationClassAug) into the VOC2012 directory:
/DATA_DIR
/VOCdevkit
/VOC2012
/SegmentationClass
/SegmentationClassAug
/JPEGImages
...
...
/VOCtrainval_11-May-2012.tar
...
Start the visdom server for visualization. Remove '--enable_vis' if visualization is not needed.
# Run visdom server on port 8097
visdom -port 8097
Run train.py with "--year 2012_aug"
python train.py --backbone resnet101 --dataset voc --year 2012_aug --data_root ./datasets/data --lr 4e-4 --epochs 30 --batch_size 8 --use_separable_conv --fix_bn --crop_val --enable_vis --vis_env deeplab --vis_port 8097
Run train.py with "--year 2012"
python train.py --backbone resnet101 --dataset voc --year 2012 --data_root ./datasets/data --lr 4e-4 --epochs 30 --batch_size 8 --use_separable_conv --fix_bn --crop_val --enable_vis --vis_env deeplab --vis_port 8097
Results and images will be saved at ./results.
python test.py --backbone resnet101 --dataset voc --year 2012 --data_root ./datasets/data --batch_size 8 --use_separable_conv --crop_val --ckpt checkpoints/best_resnet101_voc.pkl --save_path ./results
- If GPU memory is limited, try reducing the crop size or batch size. Note that batch norm needs a large batch size; as an alternative, you can try Group Normalization (GN).
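As a sketch, swapping BatchNorm2d for GroupNorm could look like the following (a hypothetical helper, not part of this repo; `num_groups=32` follows the GN paper's default):

```python
import math
import torch.nn as nn

def bn_to_gn(module, num_groups=32):
    # Recursively replace BatchNorm2d with GroupNorm, which normalizes
    # over channel groups and is therefore batch-size independent.
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            # gcd guarantees the channel count is divisible by the group count
            g = math.gcd(num_groups, child.num_features)
            setattr(module, name, nn.GroupNorm(g, child.num_features))
        else:
            bn_to_gn(child, num_groups)
    return module
```

Keep in mind that GN layers created this way are freshly initialized, so pretrained batch norm weights are discarded.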
- Multi-Grid is not used in this repo, following Section 4.3 of [2]: "Note that we do not employ the multi-grid method [77,78,23], which we found does not improve the performance."
- About data augmentation, see Section 4.1 of [1]: "Data augmentation: We apply data augmentation by randomly scaling the input images (from 0.5 to 2.0) and randomly left-right flipping during training."
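A minimal Pillow-based sketch of that augmentation, applied jointly to image and mask (the helper is illustrative, not the repo's transform code):

```python
import random
from PIL import Image

def random_scale_and_flip(img, mask, scale_range=(0.5, 2.0)):
    # Rescale image and mask by the same random factor in [0.5, 2.0],
    # then apply the same random horizontal flip to both.
    scale = random.uniform(*scale_range)
    w, h = img.size
    new_size = (int(w * scale), int(h * scale))
    img = img.resize(new_size, Image.BILINEAR)   # bilinear for RGB images
    mask = mask.resize(new_size, Image.NEAREST)  # nearest keeps label ids intact
    if random.random() < 0.5:
        img = img.transpose(Image.FLIP_LEFT_RIGHT)
        mask = mask.transpose(Image.FLIP_LEFT_RIGHT)
    return img, mask
```

Using nearest-neighbor interpolation for the mask matters: bilinear resizing would blend label ids into invalid class values.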
[1] Rethinking Atrous Convolution for Semantic Image Segmentation
[2] Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation