Accurately ranking the vast number of candidate detections is crucial for dense object detectors to achieve high performance. In this work, we propose to learn IoU-aware classification scores (IACS) that simultaneously represent the object presence confidence and the localization accuracy, to produce a more accurate ranking of detections in dense object detectors. In particular, we design a new loss function, named Varifocal Loss (VFL), for training a dense object detector to predict the IACS, and a new, efficient star-shaped bounding box feature representation (the features at nine sampling points) for estimating the IACS and refining coarse bounding boxes. Combining these two new components and a bounding box refinement branch, we build a new IoU-aware dense object detector based on the FCOS+ATSS architecture, which we call VarifocalNet, or VFNet for short. Extensive experiments on the MS COCO benchmark show that our VFNet consistently surpasses the strong baseline by ~2.0 AP with different backbones. Our best model, VFNet-X-1200 with a Res2Net-101-DCN backbone, reaches a single-model, single-scale AP of 55.1 on COCO test-dev, achieving state-of-the-art performance among various object detectors.
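For readers who want the mechanics of the Varifocal Loss, the sketch below implements the weighting scheme described in the paper in PyTorch: positives are weighted by their target IoU score q, while negatives are down-weighted by the focal term α·p^γ. The function name and the unreduced summation are illustrative; refer to the released code for the exact implementation.

```python
import torch
import torch.nn.functional as F

def varifocal_loss(pred_logits, target_score, alpha=0.75, gamma=2.0):
    """Sketch of the Varifocal Loss.

    pred_logits:  raw classification logits (before sigmoid).
    target_score: the IACS target, i.e. the IoU with the ground-truth
                  box for positive examples and 0 for negatives.
    """
    p = pred_logits.sigmoid()
    pos_mask = (target_score > 0).float()
    # Positives keep their IoU target q as the weight; negatives get the
    # focal down-weighting alpha * p^gamma.
    focal_weight = target_score * pos_mask + alpha * p.pow(gamma) * (1 - pos_mask)
    bce = F.binary_cross_entropy_with_logits(
        pred_logits, target_score, reduction='none')
    return (focal_weight * bce).sum()
```

Note the asymmetry: unlike Focal Loss, VFL does not down-weight positives, since rare high-quality positives carry the most information for learning the IACS.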
@inproceedings{zhang2020varifocalnet,
  title={VarifocalNet: An IoU-aware Dense Object Detector},
  author={Zhang, Haoyang and Wang, Ying and Dayoub, Feras and S{\"u}nderhauf, Niko},
  booktitle={CVPR},
  year={2021}
}
Backbone | Style | DCN | MS train | Lr schd | Inf time (fps) | box AP (val) | box AP (test-dev) | Download |
---|---|---|---|---|---|---|---|---|
R-50 | pytorch | N | N | 1x | 19.4 | 41.6 | 41.6 | model \| log |
R-50 | pytorch | N | Y | 2x | 19.3 | 44.5 | 44.8 | model \| log |
R-50 | pytorch | Y | Y | 2x | 16.3 | 47.8 | 48.0 | model \| log |
R-101 | pytorch | N | N | 1x | 15.5 | 43.0 | 43.6 | model \| log |
R-101 | pytorch | N | N | 2x | 15.6 | 43.5 | 43.9 | model \| log |
R-101 | pytorch | N | Y | 2x | 15.6 | 46.2 | 46.7 | model \| log |
R-101 | pytorch | Y | Y | 2x | 12.6 | 49.0 | 49.2 | model \| log |
X-101-32x4d | pytorch | N | Y | 2x | 13.1 | 47.4 | 47.6 | model \| log |
X-101-32x4d | pytorch | Y | Y | 2x | 10.1 | 49.7 | 50.0 | model \| log |
X-101-64x4d | pytorch | N | Y | 2x | 9.2 | 48.2 | 48.5 | model \| log |
X-101-64x4d | pytorch | Y | Y | 2x | 6.7 | 50.4 | 50.8 | model \| log |
R2-101 | pytorch | N | Y | 2x | 13.0 | 49.2 | 49.3 | model \| log |
R2-101 | pytorch | Y | Y | 2x | 10.3 | 51.1 | 51.3 | model \| log |
Notes:
- The MS-train scale range is 1333x[480:960] (`range` mode) and the inference scale is kept at 1333x800.
- The R2-101 backbone is Res2Net-101.
- DCN means using `DCNv2` in both the backbone and the head.
- The inference speed is tested with an NVIDIA V100 GPU on HPC (log file).
We also provide RetinaNet, FoveaBox, RepPoints and ATSS models trained with the Focal Loss (FL) and with our Varifocal Loss (VFL).
Method | Backbone | MS train | Lr schd | box AP (val) | Download |
---|---|---|---|---|---|
RetinaNet + FL | R-50 | N | 1x | 36.5 | model \| log |
RetinaNet + VFL | R-50 | N | 1x | 37.4 | model \| log |
FoveaBox + FL | R-50 | N | 1x | 36.3 | model \| log |
FoveaBox + VFL | R-50 | N | 1x | 37.2 | model \| log |
RepPoints + FL | R-50 | N | 1x | 38.3 | model \| log |
RepPoints + VFL | R-50 | N | 1x | 39.7 | model \| log |
ATSS + FL | R-50 | N | 1x | 39.3 | model \| log |
ATSS + VFL | R-50 | N | 1x | 40.2 | model \| log |
Notes:
- We use 4 P100 GPUs to train these models (except ATSS, which is trained with 8x2) with a mini-batch size of 16 images (4 images per GPU), as we found 4x4 training yields slightly better results than 8x2 training.
- You can find the corresponding config files in configs/vfnet. The `use_vfl` flag in those config files controls whether the Varifocal Loss is used in training, as sketched below.
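For illustration only, the snippet below sketches how such a switch might appear in an MMDetection-style config. The field names shown here (`use_vfl` aside, which the text above confirms) and the loss settings are assumptions based on this description; check the actual files in configs/vfnet for the authoritative keys and values.

```python
# Hypothetical excerpt of an MMDetection-style config.
# Field names and defaults are illustrative; see configs/vfnet
# for the exact settings used in this repo.
model = dict(
    bbox_head=dict(
        use_vfl=True,  # train with Varifocal Loss instead of Focal Loss
        loss_cls=dict(
            type='VarifocalLoss',
            use_sigmoid=True,
            alpha=0.75,       # down-weighting factor for negatives
            gamma=2.0,        # focusing parameter for negatives
            iou_weighted=True,
            loss_weight=1.0)))
```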