
Vision Longformer for Object Detection

This project provides the source code for the object detection part of the Vision Longformer paper. It is based on detectron2.

Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding

The classification part of the code and checkpoints can be found here.

Updates

  • 03/29/2021: First version of the Vision Longformer paper posted on arXiv.
  • 05/17/2021: Performance improved by adding relative positional bias, inspired by Swin Transformer! First version of the object detection code released.

Usage

Here is an example command for evaluating a pretrained vision-longformer small model on COCO:

```bash
python -m pip install -e .

ln -s /mnt/data_storage datasets

DETECTRON2_DATASETS=datasets python train_net.py --num-gpus 1 --eval-only \
    --config configs/msvit_maskrcnn_fpn_1x_small_sparse.yaml \
    MODEL.TRANSFORMER.MSVIT.ARCH "l1,h3,d96,n1,s1,g1,p4,f7,a0_l2,h3,d192,n2,s1,g1,p2,f7,a0_l3,h6,d384,n8,s1,g1,p2,f7,a0_l4,h12,d768,n1,s1,g0,p2,f7,a0" \
    SOLVER.AMP.ENABLED True \
    MODEL.WEIGHTS /mnt/model_storage/msvit_det/visionlongformer/vilsmall/maskrcnn1x/model_final.pth
```
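
The MODEL.TRANSFORMER.MSVIT.ARCH string packs the per-stage backbone hyperparameters into a single flag: each underscore-separated group configures one stage (note the dimension d and head count h doubling across the four stages). The exact meaning of each single-letter field is defined by the vision-longformer code; the sketch below only splits the string into per-stage key/value pairs and makes no further assumptions.

```python
# Minimal sketch: split an MSVIT.ARCH string into per-stage key/value dicts.
# The semantics of each single-letter key are defined by the vision-longformer code;
# this only separates stages ("_") and fields (",").
import re


def parse_msvit_arch(arch: str):
    stages = []
    for stage_spec in arch.split("_"):
        fields = {}
        for token in stage_spec.split(","):
            match = re.fullmatch(r"([a-z])(\d+)", token)
            if match:
                fields[match.group(1)] = int(match.group(2))
        stages.append(fields)
    return stages


arch = ("l1,h3,d96,n1,s1,g1,p4,f7,a0_l2,h3,d192,n2,s1,g1,p2,f7,a0_"
        "l3,h6,d384,n8,s1,g1,p2,f7,a0_l4,h12,d768,n1,s1,g0,p2,f7,a0")
for stage in parse_msvit_arch(arch):
    print(stage)  # e.g. {'l': 1, 'h': 3, 'd': 96, 'n': 1, 's': 1, 'g': 1, 'p': 4, 'f': 7, 'a': 0}
```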

Here is an example command for training the vision-longformer small model on COCO:

```bash
python -m pip install -e .

ln -s /mnt/data_storage datasets

# convert the classification checkpoint into a detection checkpoint for initialization
python3 converter.py --source_model "/mnt/model_storage/msvit/visionlongformer/small1281_relative/model_best.pth" \
    --output_model msvit_pretrain.pth --config configs/msvit_maskrcnn_fpn_3xms_small_sparse.yaml \
    MODEL.TRANSFORMER.MSVIT.ARCH "l1,h3,d96,n1,s1,g1,p4,f7,a0_l2,h3,d192,n2,s1,g1,p2,f7,a0_l3,h6,d384,n8,s1,g1,p2,f7,a0_l4,h12,d768,n1,s1,g0,p2,f7,a0"
```
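
converter.py is the tool that actually performs this classification-to-detection conversion. As a rough illustration of what such a conversion typically involves (dropping the classification head and re-prefixing backbone keys so detectron2 can match them), one might write something like the hedged sketch below; the "head." and "backbone.bottom_up." names are assumptions for illustration, not the converter's actual behavior.

```python
# Hedged sketch of a classification-to-detection checkpoint conversion.
# converter.py is the authoritative tool; this only illustrates the usual pattern of
# dropping the classification head and re-prefixing backbone keys. The "head." and
# "backbone.bottom_up." names below are assumptions, not the converter's actual keys.
import torch

src = torch.load("model_best.pth", map_location="cpu")
state = src.get("model", src.get("state_dict", src)) if isinstance(src, dict) else src

converted = {}
for key, value in state.items():
    if key.startswith("head."):                      # assumed classification-head prefix
        continue
    converted["backbone.bottom_up." + key] = value   # assumed detectron2 backbone prefix

# detectron2 loads a plain dict stored under the "model" key;
# "matching_heuristics" relaxes key matching when shapes allow it.
torch.save({"model": converted, "matching_heuristics": True}, "msvit_pretrain_sketch.pth")
```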

```bash
# train with the converted detection checkpoint as initialization
DETECTRON2_DATASETS=datasets python train_net.py --num-gpus 8 --config configs/msvit_maskrcnn_fpn_3xms_small_sparse.yaml \
    MODEL.WEIGHTS msvit_pretrain.pth MODEL.TRANSFORMER.DROP_PATH 0.2 \
    MODEL.TRANSFORMER.MSVIT.ATTN_TYPE longformerhand \
    MODEL.TRANSFORMER.MSVIT.ARCH "l1,h3,d96,n1,s1,g1,p4,f7,a0_l2,h3,d192,n2,s1,g1,p2,f7,a0_l3,h6,d384,n8,s1,g1,p2,f7,a0_l4,h12,d768,n1,s1,g0,p2,f7,a0" \
    SOLVER.AMP.ENABLED True SOLVER.BASE_LR 1e-4 SOLVER.WEIGHT_DECAY 0.1 \
    TEST.EVAL_PERIOD 7330 SOLVER.IMS_PER_BATCH 16
```
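
The command above pairs SOLVER.IMS_PER_BATCH 16 with SOLVER.BASE_LR 1e-4. If you change the total batch size (for example, when training on fewer GPUs), a common detectron2 convention, not something prescribed by this repository, is to scale the base learning rate linearly, as in the sketch below.

```python
# Hedged sketch: linear learning-rate scaling when the total batch size changes.
# The released setting pairs IMS_PER_BATCH=16 with BASE_LR=1e-4; this scaling rule is a
# common detectron2 convention, not something the paper prescribes.
REFERENCE_BATCH = 16
REFERENCE_LR = 1e-4


def scaled_lr(ims_per_batch: int) -> float:
    return REFERENCE_LR * ims_per_batch / REFERENCE_BATCH


print(scaled_lr(8))   # 5e-05 for an 8-image batch (e.g. 4 GPUs x 2 images each)
print(scaled_lr(16))  # 0.0001, the released setting
```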

Model Zoo on COCO

Vision Longformer with relative positional bias

| Backbone | Method | pretrain | drop_path | Lr Schd | box mAP | mask mAP | #params | FLOPs | checkpoint | config | log |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViL-Tiny | Mask R-CNN | ImageNet-1K | 0.05 | 1x | 41.4 | 38.1 | 26.9M | 145.6G | ckpt | config | log |
| ViL-Tiny | Mask R-CNN | ImageNet-1K | 0.1 | 3x | 44.2 | 40.6 | 26.9M | 145.6G | ckpt | config | log |
| ViL-Small | Mask R-CNN | ImageNet-1K | 0.2 | 1x | 44.9 | 41.1 | 45.0M | 218.3G | ckpt | config | log |
| ViL-Small | Mask R-CNN | ImageNet-1K | 0.2 | 3x | 47.1 | 42.7 | 45.0M | 218.3G | ckpt | config | log |
| ViL-Medium (D) | Mask R-CNN | ImageNet-21K | 0.2 | 1x | 47.6 | 43.0 | 60.1M | 293.8G | ckpt | config | log |
| ViL-Medium (D) | Mask R-CNN | ImageNet-21K | 0.3 | 3x | 48.9 | 44.2 | 60.1M | 293.8G | ckpt | config | log |
| ViL-Base (D) | Mask R-CNN | ImageNet-21K | 0.3 | 1x | 48.6 | 43.6 | 76.1M | 384.4G | ckpt | config | log |
| ViL-Base (D) | Mask R-CNN | ImageNet-21K | 0.3 | 3x | 49.6 | 44.5 | 76.1M | 384.4G | ckpt | config | log |
| ViL-Tiny | RetinaNet | ImageNet-1K | 0.05 | 1x | 40.8 | -- | 16.64M | 182.7G | ckpt | config | log |
| ViL-Tiny | RetinaNet | ImageNet-1K | 0.1 | 3x | 43.6 | -- | 16.64M | 182.7G | ckpt | config | log |
| ViL-Small | RetinaNet | ImageNet-1K | 0.1 | 1x | 44.2 | -- | 35.68M | 254.8G | ckpt | config | log |
| ViL-Small | RetinaNet | ImageNet-1K | 0.2 | 3x | 45.9 | -- | 35.68M | 254.8G | ckpt | config | log |
| ViL-Medium (D) | RetinaNet | ImageNet-21K | 0.2 | 1x | 46.8 | -- | 50.77M | 330.4G | ckpt | config | log |
| ViL-Medium (D) | RetinaNet | ImageNet-21K | 0.3 | 3x | 47.9 | -- | 50.77M | 330.4G | ckpt | config | log |
| ViL-Base (D) | RetinaNet | ImageNet-21K | 0.3 | 1x | 47.8 | -- | 66.74M | 420.9G | ckpt | config | log |
| ViL-Base (D) | RetinaNet | ImageNet-21K | 0.3 | 3x | 48.6 | -- | 66.74M | 420.9G | ckpt | config | log |

See more fine-grained results in Tables 6 and 7 of the Vision Longformer paper. We use weight decay 0.05 for all experiments, but search for the best drop path rate in [0.05, 0.1, 0.2, 0.3].
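
For reference, a sweep over the drop path values listed above could be scripted as in the hedged sketch below; the OUTPUT_DIR naming is purely illustrative, and the rest of the command mirrors the released ViL-Small training example.

```python
# Hedged sketch: sweep MODEL.TRANSFORMER.DROP_PATH over the values searched above.
# The config, ARCH string, and attention type mirror the released ViL-Small training
# command; the OUTPUT_DIR naming is purely illustrative.
import os
import subprocess

ARCH = ("l1,h3,d96,n1,s1,g1,p4,f7,a0_l2,h3,d192,n2,s1,g1,p2,f7,a0_"
        "l3,h6,d384,n8,s1,g1,p2,f7,a0_l4,h12,d768,n1,s1,g0,p2,f7,a0")

for drop_path in (0.05, 0.1, 0.2, 0.3):
    subprocess.run(
        ["python", "train_net.py", "--num-gpus", "8",
         "--config", "configs/msvit_maskrcnn_fpn_3xms_small_sparse.yaml",
         "MODEL.WEIGHTS", "msvit_pretrain.pth",
         "MODEL.TRANSFORMER.DROP_PATH", str(drop_path),
         "MODEL.TRANSFORMER.MSVIT.ATTN_TYPE", "longformerhand",
         "MODEL.TRANSFORMER.MSVIT.ARCH", ARCH,
         "SOLVER.AMP.ENABLED", "True",
         "OUTPUT_DIR", f"output/drop_path_{drop_path}"],
        check=True,
        env={**os.environ, "DETECTRON2_DATASETS": "datasets"},
    )
```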

Comparison of various efficient attention mechanisms with absolute positional embedding (Small size)

| Backbone | Method | pretrain | drop_path | Lr Schd | box mAP | mask mAP | #params | FLOPs | Memory | checkpoint | config | log |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| srformer/64 | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 36.4 | 34.6 | 73.3M | 224.1G | 7.1G | ckpt | config | log |
| srformer/32 | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 39.9 | 37.3 | 51.5M | 268.3G | 13.6G | ckpt | config | log |
| Partial srformer/32 | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 42.4 | 39.0 | 46.8M | 352.1G | 22.6G | ckpt | config | log |
| global | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 34.8 | 33.4 | 45.2M | 226.4G | 7.6G | ckpt | config | log |
| Partial global | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 42.5 | 39.2 | 45.1M | 326.5G | 20.1G | ckpt | config | log |
| performer | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 36.1 | 34.3 | 45.0M | 251.5G | 8.4G | ckpt | config | log |
| Partial performer | Mask R-CNN | ImageNet-1K | 0.05 | 1x | 42.3 | 39.1 | 45.0M | 343.7G | 20.0G | ckpt | config | log |
| ViL | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 42.9 | 39.6 | 45.0M | 218.3G | 7.4G | ckpt | config | log |
| Partial ViL | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 43.3 | 39.8 | 45.0M | 326.8G | 19.5G | ckpt | config | log |

We use weight decay 0.05 for all experiments, but search for the best drop path rate in [0.05, 0.1, 0.2].
