NOTE: This code is based on HRNet-Object-Detection.
- SyncBN: uses the NVIDIA/apex SyncBN implementation.
- Multi-scale training (pad): training detection models with SyncBN and multi-scale training collapses to unusable results (mAP = 0.0, or no boxes are predicted). Fortunately, we found a way to fix this: pad input images of different scales (`600*1000`, `800*1333`, `1000*1600`) to a fixed scale of `1000*1600` while keeping the original aspect ratios.
- Multi-scale training (SimpleDet version): we've implemented the multi-scale training strategy used in SimpleDet.
- Multi-node & multi-GPU training: we've tested our code when training with multiple nodes (ONLY on AZURE!). You need to provide a master IP and port. Training without SyncBN reaches normal results, while training with SyncBN fails.
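The padding idea from the list above can be sketched in a few lines. This is a minimal illustration, not the repository's code: the helper name `pad_to_fixed`, the top-left placement, and the zero fill value are assumptions made for the example. Images are taken to be already resized with their aspect ratio preserved, so only padding remains.

```python
import numpy as np


def pad_to_fixed(img, pad_hw=(1000, 1600), pad_val=0):
    """Copy `img` (H, W, C) into the top-left corner of a fixed-size canvas.

    Hypothetical helper: placement and fill value are assumptions.
    """
    h, w = img.shape[:2]
    pad_h, pad_w = pad_hw
    if h > pad_h or w > pad_w:
        raise ValueError(f"image {(h, w)} does not fit into canvas {pad_hw}")
    canvas = np.full((pad_h, pad_w) + img.shape[2:], pad_val, dtype=img.dtype)
    canvas[:h, :w] = img
    return canvas


# Three dummy images at the scales listed above, as (height, width, channels).
shapes = [(600, 1000, 3), (800, 1333, 3), (1000, 1600, 3)]
batch = np.stack([pad_to_fixed(np.ones(s, dtype=np.uint8)) for s in shapes])
print(batch.shape)  # (3, 1000, 1600, 3)
```

After padding, every image in the batch has the same spatial size regardless of which training scale was sampled for it.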
- ResNet: see `configs/syncbn/faster_rcnn_r50_fpn_sync_1x.py`. Change `normalize` in the model config:
  `normalize = dict(type='SyncBN', frozen=False)`
- HRNet: HRNet in this repo doesn't support `normalize`, but HRNet in mmdetection supports it. See `configs/hrnet/faster_rcnn_hrnetv2p_w18_sync_1x.py`. Change `backbone.type` to `SyncHighResolutionNet`.
- results
Backbone | LR Schd | SyncBN | mAP |
---|---|---|---|
HRNetV2-W18 | 1x | No | 36.1 |
HRNetV2-W18 | 1x | Yes | 37.2 |
- relevant files
mmdet/models/backbones/hrnet_sync.py
mmdet/models/utils/norm.py
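The two switches described above can be expressed as plain dictionary edits. The sketch below is hypothetical: `with_syncbn` is not part of the repository, and it assumes the `normalize` dict is passed to the backbone under the key `normalize`; check the referenced config files for where `normalize` is actually consumed.

```python
import copy


def with_syncbn(model_cfg, hrnet=False):
    """Return a copy of `model_cfg` switched to SyncBN (hypothetical helper)."""
    cfg = copy.deepcopy(model_cfg)
    if hrnet:
        # HRNet in this repo: swap the backbone type instead of `normalize`.
        cfg['backbone']['type'] = 'SyncHighResolutionNet'
    else:
        # ResNet: assumption that the backbone receives the `normalize` dict.
        cfg['backbone']['normalize'] = dict(type='SyncBN', frozen=False)
    return cfg


resnet_cfg = with_syncbn(dict(backbone=dict(type='ResNet', depth=50)))
hrnet_cfg = with_syncbn(dict(backbone=dict(type='HighResolutionNet')), hrnet=True)
print(resnet_cfg['backbone'])
print(hrnet_cfg['backbone'])
```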
See `configs/hrnet/faster_rcnn_hrnetv2p_w18_syncbn_16batch_mstrain_pad_1x.py`:
- set the maximum padding size with `pad_size`
- set the scales for multi-scale training with `img_scale`
data = dict(
    imgs_per_gpu=4,
    workers_per_gpu=8,
    pad_size=(1600, 1024),
    train=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_train2017.json',
        img_prefix=data_root + 'images/train2017.zip',
        img_scale=[(1600, 1000), (1000, 600), (1333, 800)],
        img_norm_cfg=img_norm_cfg,
        size_divisor=32,
        flip_ratio=0.5,
        with_mask=False,
        with_crowd=True,
        with_label=True),
- results
Backbone | LR Schd | SyncBN | MSTrain | mAP | model |
---|---|---|---|---|---|
HRNetV2-W18 | 1x | No | Yes | 36.5 | |
HRNetV2-W18 | 1x | Yes | Yes | 37.6 | model |
HRNetV2-W18 | 2x | Yes | Yes | 39.4 | model |
HRNetV2-W32 | 1x | Yes | Yes | 41.0 | model |
HRNetV2-W32 | 2x | Yes | Yes | 42.6 | model |
- relevant files
mmdet/datasets/loader/collate.py
mmdet/datasets/loader/build_loader.py
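One way to read the numbers in the config above: `pad_size=(1600, 1024)` matches the largest width and height in `img_scale`, each rounded up to a multiple of `size_divisor=32`. The sketch below only illustrates that arithmetic; the helper names are hypothetical, and the source does not state that `pad_size` must be derived this way.

```python
import math


def round_up(value, divisor):
    """Round `value` up to the nearest multiple of `divisor`."""
    return int(math.ceil(value / divisor)) * divisor


def max_padded_size(img_scales, size_divisor):
    """Largest (width, height) over all scales, aligned to `size_divisor`."""
    max_w = max(w for w, _ in img_scales)
    max_h = max(h for _, h in img_scales)
    return round_up(max_w, size_divisor), round_up(max_h, size_divisor)


img_scale = [(1600, 1000), (1000, 600), (1333, 800)]
print(max_padded_size(img_scale, 32))  # (1600, 1024)
```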
See `configs/hrnet/faster_rcnn_hrnetv2p_w18_randresizecrop_1x.py`:
- set the maximum padding size, `pad_size=(1200,800)`
- set the scales for multi-scale training
- add the extra data augmentation
data = dict(
    imgs_per_gpu=2,
    workers_per_gpu=4,
    pad_size=(1216, 800),
    train=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_train2017.json',
        img_prefix=data_root + 'train2017.zip',
        img_scale=(1200, 800),
        img_norm_cfg=img_norm_cfg,
        size_divisor=1,
        extra_aug=dict(
            rand_resize_crop=dict(
                scales=[[1400, 600], [1400, 800], [1400, 1000]],
                size=[1200, 800]
            )),
        flip_ratio=0.5,
        with_mask=False,
        with_crowd=True,
        with_label=True),
- results
Backbone | LR Schd | SyncBN | MSTrain | mAP | model |
---|---|---|---|---|---|
HRNetV2-W18 | 1x | No | Yes | 36.3 | |
HRNetV2-W18 | 1x | Yes | Yes | 37.6 | model |
- relevant files
mmdet/datasets/loader/collate.py
mmdet/datasets/loader/build_loader.py
mmdet/datasets/zip_dataset.py
mmdet/datasets/extra_aug.py
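To make the `rand_resize_crop` settings above more concrete, here is a rough sketch of the geometry such an augmentation could use: pick one of `scales` at random, resize with the aspect ratio kept, then crop a window no larger than `size`. It is an assumption-laden illustration, not the code in `mmdet/datasets/extra_aug.py`: the function names are hypothetical, `size` is read as (width, height), and each scale is read as (long edge, short edge).

```python
import random


def rescale_size(h, w, scale):
    """Largest (h, w) that fits `scale` = (long edge, short edge), same aspect ratio."""
    max_long, max_short = max(scale), min(scale)
    factor = min(max_long / max(h, w), max_short / min(h, w))
    return int(h * factor + 0.5), int(w * factor + 0.5)


def rand_resize_crop_plan(h, w, scales, size, rng=random):
    """Choose a scale and a crop box; returns geometry only, no pixels are touched."""
    scale = rng.choice(scales)
    new_h, new_w = rescale_size(h, w, scale)
    crop_w, crop_h = size  # assumption: size is (width, height)
    out_w, out_h = min(new_w, crop_w), min(new_h, crop_h)
    x0 = rng.randint(0, new_w - out_w)  # random left edge of the crop
    y0 = rng.randint(0, new_h - out_h)  # random top edge of the crop
    return dict(scale=tuple(scale), resized=(new_w, new_h),
                crop=(x0, y0, x0 + out_w, y0 + out_h))


scales = [[1400, 600], [1400, 800], [1400, 1000]]
plan = rand_resize_crop_plan(480, 640, scales, [1200, 800], rng=random.Random(0))
print(plan)
```

Because the crop never exceeds `size`, every training image ends up no larger than 1200 by 800, which is consistent with a fixed `pad_size` on the data loader side.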
Note: ONLY on AZURE!
- Set environment variables
export NCCL_SOCKET_IFNAME=eth0  # set the network interface
export NCCL_DEBUG=INFO          # print more NCCL information
export NCCL_IB_DISABLE=1        # disable IB (InfiniBand)
- Run (each node uses a different `node_rank`)
python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 tools/train.py <CONFIG-FILE> --launcher pytorch --master_ip_port tcp://xx.xx.xx.xx:1234
- relevant files
mmdet/apis/env.py
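Since the launch command differs between nodes only in `node_rank`, it can be convenient to generate it. The sketch below is a hypothetical helper that just assembles the command string shown above, using the same flags and placeholder address; it does not start any training.

```python
import shlex

# Environment variables from the steps above.
NCCL_ENV = {
    "NCCL_SOCKET_IFNAME": "eth0",
    "NCCL_DEBUG": "INFO",
    "NCCL_IB_DISABLE": "1",
}


def launch_command(config_file, node_rank, nnodes=2, nproc_per_node=4,
                   master="tcp://xx.xx.xx.xx:1234"):
    """Build the per-node launch command (hypothetical helper, string only)."""
    args = [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc_per_node}",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        "tools/train.py", config_file,
        "--launcher", "pytorch",
        "--master_ip_port", master,
    ]
    return " ".join(shlex.quote(a) for a in args)


exports = " ".join(f"{key}={value}" for key, value in NCCL_ENV.items())
for rank in range(2):
    print(f"# node {rank}")
    print(exports, launch_command("<CONFIG-FILE>", rank))
```

Replace `<CONFIG-FILE>` and the placeholder master address with real values before running the printed commands on each node.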