
Vision Longformer for Object Detection

This project provides the source code for the object detection part of the Vision Longformer paper. It is based on detectron2.

Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding

The classification part of the code and checkpoints can be found here.

Updates

  • 03/29/2021: First version of the Vision Longformer paper posted on arXiv.
  • 05/17/2021: Performance improved by adding relative positional bias, inspired by Swin Transformer! First version of the object detection code released.

Usage

Here is an example command for evaluating a pretrained vision-longformer small model on COCO:

```bash
python -m pip install -e .

ln -s /mnt/data_storage datasets

DETECTRON2_DATASETS=datasets python train_net.py --num-gpus 1 --eval-only \
    --config configs/msvit_maskrcnn_fpn_1x_small_sparse.yaml \
    MODEL.TRANSFORMER.MSVIT.ARCH "l1,h3,d96,n1,s1,g1,p4,f7,a0_l2,h3,d192,n2,s1,g1,p2,f7,a0_l3,h6,d384,n8,s1,g1,p2,f7,a0_l4,h12,d768,n1,s1,g0,p2,f7,a0" \
    SOLVER.AMP.ENABLED True \
    MODEL.WEIGHTS /mnt/model_storage/msvit_det/visionlongformer/vilsmall/maskrcnn1x/model_final.pth
```
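
The MODEL.TRANSFORMER.MSVIT.ARCH string packs the per-stage backbone hyperparameters into a single flag: each underscore-separated group configures one stage (note the dimension d and head count h doubling across the four stages). The exact meaning of each single-letter field is defined by the vision-longformer code; the sketch below only splits the string into per-stage key/value pairs and makes no further assumptions.

```python
# Minimal sketch: split an MSVIT.ARCH string into per-stage key/value dicts.
# The semantics of each single-letter key are defined by the vision-longformer code;
# this only separates stages ("_") and fields (",").
import re


def parse_msvit_arch(arch: str):
    stages = []
    for stage_spec in arch.split("_"):
        fields = {}
        for token in stage_spec.split(","):
            match = re.fullmatch(r"([a-z])(\d+)", token)
            if match:
                fields[match.group(1)] = int(match.group(2))
        stages.append(fields)
    return stages


arch = ("l1,h3,d96,n1,s1,g1,p4,f7,a0_l2,h3,d192,n2,s1,g1,p2,f7,a0_"
        "l3,h6,d384,n8,s1,g1,p2,f7,a0_l4,h12,d768,n1,s1,g0,p2,f7,a0")
for stage in parse_msvit_arch(arch):
    print(stage)  # e.g. {'l': 1, 'h': 3, 'd': 96, 'n': 1, 's': 1, 'g': 1, 'p': 4, 'f': 7, 'a': 0}
```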

Here is an example command for training the vision-longformer small model on COCO:

```bash
python -m pip install -e .

ln -s /mnt/data_storage datasets

# convert the classification checkpoint into a detection checkpoint for initialization
python3 converter.py --source_model "/mnt/model_storage/msvit/visionlongformer/small1281_relative/model_best.pth" \
    --output_model msvit_pretrain.pth --config configs/msvit_maskrcnn_fpn_3xms_small_sparse.yaml \
    MODEL.TRANSFORMER.MSVIT.ARCH "l1,h3,d96,n1,s1,g1,p4,f7,a0_l2,h3,d192,n2,s1,g1,p2,f7,a0_l3,h6,d384,n8,s1,g1,p2,f7,a0_l4,h12,d768,n1,s1,g0,p2,f7,a0"
```
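
converter.py is the tool that actually performs this classification-to-detection conversion. As a rough illustration of what such a conversion typically involves (dropping the classification head and re-prefixing backbone keys so detectron2 can match them), one might write something like the hedged sketch below; the "head." and "backbone.bottom_up." names are assumptions for illustration, not the converter's actual behavior.

```python
# Hedged sketch of a classification-to-detection checkpoint conversion.
# converter.py is the authoritative tool; this only illustrates the usual pattern of
# dropping the classification head and re-prefixing backbone keys. The "head." and
# "backbone.bottom_up." names below are assumptions, not the converter's actual keys.
import torch

src = torch.load("model_best.pth", map_location="cpu")
state = src.get("model", src.get("state_dict", src)) if isinstance(src, dict) else src

converted = {}
for key, value in state.items():
    if key.startswith("head."):                      # assumed classification-head prefix
        continue
    converted["backbone.bottom_up." + key] = value   # assumed detectron2 backbone prefix

# detectron2 loads a plain dict stored under the "model" key;
# "matching_heuristics" relaxes key matching when shapes allow it.
torch.save({"model": converted, "matching_heuristics": True}, "msvit_pretrain_sketch.pth")
```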

```bash
# train with the converted detection checkpoint as initialization
DETECTRON2_DATASETS=datasets python train_net.py --num-gpus 8 --config configs/msvit_maskrcnn_fpn_3xms_small_sparse.yaml \
    MODEL.WEIGHTS msvit_pretrain.pth MODEL.TRANSFORMER.DROP_PATH 0.2 \
    MODEL.TRANSFORMER.MSVIT.ATTN_TYPE longformerhand \
    MODEL.TRANSFORMER.MSVIT.ARCH "l1,h3,d96,n1,s1,g1,p4,f7,a0_l2,h3,d192,n2,s1,g1,p2,f7,a0_l3,h6,d384,n8,s1,g1,p2,f7,a0_l4,h12,d768,n1,s1,g0,p2,f7,a0" \
    SOLVER.AMP.ENABLED True SOLVER.BASE_LR 1e-4 SOLVER.WEIGHT_DECAY 0.1 \
    TEST.EVAL_PERIOD 7330 SOLVER.IMS_PER_BATCH 16
```
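
The command above pairs SOLVER.IMS_PER_BATCH 16 with SOLVER.BASE_LR 1e-4. If you change the total batch size (for example, when training on fewer GPUs), a common detectron2 convention, not something prescribed by this repository, is to scale the base learning rate linearly, as in the sketch below.

```python
# Hedged sketch: linear learning-rate scaling when the total batch size changes.
# The released setting pairs IMS_PER_BATCH=16 with BASE_LR=1e-4; this scaling rule is a
# common detectron2 convention, not something the paper prescribes.
REFERENCE_BATCH = 16
REFERENCE_LR = 1e-4


def scaled_lr(ims_per_batch: int) -> float:
    return REFERENCE_LR * ims_per_batch / REFERENCE_BATCH


print(scaled_lr(8))   # 5e-05 for an 8-image batch (e.g. 4 GPUs x 2 images each)
print(scaled_lr(16))  # 0.0001, the released setting
```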

Model Zoo on COCO

Vision Longformer with relative positional bias

| Backbone | Method | pretrain | drop_path | Lr Schd | box mAP | mask mAP | #params | FLOPs | checkpoint | config | log |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViL-Tiny | Mask R-CNN | ImageNet-1K | 0.05 | 1x | 41.4 | 38.1 | 26.9M | 145.6G | ckpt | config | log |
| ViL-Tiny | Mask R-CNN | ImageNet-1K | 0.1 | 3x | 44.2 | 40.6 | 26.9M | 145.6G | ckpt | config | log |
| ViL-Small | Mask R-CNN | ImageNet-1K | 0.2 | 1x | 44.9 | 41.1 | 45.0M | 218.3G | ckpt | config | log |
| ViL-Small | Mask R-CNN | ImageNet-1K | 0.2 | 3x | 47.1 | 42.7 | 45.0M | 218.3G | ckpt | config | log |
| ViL-Medium (D) | Mask R-CNN | ImageNet-21K | 0.2 | 1x | 47.6 | 43.0 | 60.1M | 293.8G | ckpt | config | log |
| ViL-Medium (D) | Mask R-CNN | ImageNet-21K | 0.3 | 3x | 48.9 | 44.2 | 60.1M | 293.8G | ckpt | config | log |
| ViL-Base (D) | Mask R-CNN | ImageNet-21K | 0.3 | 1x | 48.6 | 43.6 | 76.1M | 384.4G | ckpt | config | log |
| ViL-Base (D) | Mask R-CNN | ImageNet-21K | 0.3 | 3x | 49.6 | 44.5 | 76.1M | 384.4G | ckpt | config | log |
| ViL-Tiny | RetinaNet | ImageNet-1K | 0.05 | 1x | 40.8 | -- | 16.64M | 182.7G | ckpt | config | log |
| ViL-Tiny | RetinaNet | ImageNet-1K | 0.1 | 3x | 43.6 | -- | 16.64M | 182.7G | ckpt | config | log |
| ViL-Small | RetinaNet | ImageNet-1K | 0.1 | 1x | 44.2 | -- | 35.68M | 254.8G | ckpt | config | log |
| ViL-Small | RetinaNet | ImageNet-1K | 0.2 | 3x | 45.9 | -- | 35.68M | 254.8G | ckpt | config | log |
| ViL-Medium (D) | RetinaNet | ImageNet-21K | 0.2 | 1x | 46.8 | -- | 50.77M | 330.4G | ckpt | config | log |
| ViL-Medium (D) | RetinaNet | ImageNet-21K | 0.3 | 3x | 47.9 | -- | 50.77M | 330.4G | ckpt | config | log |
| ViL-Base (D) | RetinaNet | ImageNet-21K | 0.3 | 1x | 47.8 | -- | 66.74M | 420.9G | ckpt | config | log |
| ViL-Base (D) | RetinaNet | ImageNet-21K | 0.3 | 3x | 48.6 | -- | 66.74M | 420.9G | ckpt | config | log |

See more fine-grained results in Tables 6 and 7 of the Vision Longformer paper. We use weight decay 0.05 for all experiments, but search for the best drop path rate in [0.05, 0.1, 0.2, 0.3].
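
For reference, a sweep over the drop path values listed above could be scripted as in the hedged sketch below; the OUTPUT_DIR naming is purely illustrative, and the rest of the command mirrors the released ViL-Small training example.

```python
# Hedged sketch: sweep MODEL.TRANSFORMER.DROP_PATH over the values searched above.
# The config, ARCH string, and attention type mirror the released ViL-Small training
# command; the OUTPUT_DIR naming is purely illustrative.
import os
import subprocess

ARCH = ("l1,h3,d96,n1,s1,g1,p4,f7,a0_l2,h3,d192,n2,s1,g1,p2,f7,a0_"
        "l3,h6,d384,n8,s1,g1,p2,f7,a0_l4,h12,d768,n1,s1,g0,p2,f7,a0")

for drop_path in (0.05, 0.1, 0.2, 0.3):
    subprocess.run(
        ["python", "train_net.py", "--num-gpus", "8",
         "--config", "configs/msvit_maskrcnn_fpn_3xms_small_sparse.yaml",
         "MODEL.WEIGHTS", "msvit_pretrain.pth",
         "MODEL.TRANSFORMER.DROP_PATH", str(drop_path),
         "MODEL.TRANSFORMER.MSVIT.ATTN_TYPE", "longformerhand",
         "MODEL.TRANSFORMER.MSVIT.ARCH", ARCH,
         "SOLVER.AMP.ENABLED", "True",
         "OUTPUT_DIR", f"output/drop_path_{drop_path}"],
        check=True,
        env={**os.environ, "DETECTRON2_DATASETS": "datasets"},
    )
```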

Comparison of various efficient attention mechanisms with absolute positional embedding (Small size)

| Backbone | Method | pretrain | drop_path | Lr Schd | box mAP | mask mAP | #params | FLOPs | Memory | checkpoint | config | log |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| srformer/64 | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 36.4 | 34.6 | 73.3M | 224.1G | 7.1G | ckpt | config | log |
| srformer/32 | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 39.9 | 37.3 | 51.5M | 268.3G | 13.6G | ckpt | config | log |
| Partial srformer/32 | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 42.4 | 39.0 | 46.8M | 352.1G | 22.6G | ckpt | config | log |
| global | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 34.8 | 33.4 | 45.2M | 226.4G | 7.6G | ckpt | config | log |
| Partial global | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 42.5 | 39.2 | 45.1M | 326.5G | 20.1G | ckpt | config | log |
| performer | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 36.1 | 34.3 | 45.0M | 251.5G | 8.4G | ckpt | config | log |
| Partial performer | Mask R-CNN | ImageNet-1K | 0.05 | 1x | 42.3 | 39.1 | 45.0M | 343.7G | 20.0G | ckpt | config | log |
| ViL | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 42.9 | 39.6 | 45.0M | 218.3G | 7.4G | ckpt | config | log |
| Partial ViL | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 43.3 | 39.8 | 45.0M | 326.8G | 19.5G | ckpt | config | log |

We use weight decay 0.05 for all experiments, but search for the best drop path rate in [0.05, 0.1, 0.2].
