This project provides the source code for the object detection part of the Vision Longformer paper. It is based on detectron2.
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding
The classification part of the code and checkpoints can be found here.
- 03/29/2021: First version of the Vision Longformer paper posted on arXiv.
- 05/17/2021: Performance improved by adding relative positional bias, inspired by Swin Transformer! First version of Object Detection code released.
Here is an example command for evaluating a pretrained Vision Longformer small model on COCO:
python -m pip install -e .
ln -s /mnt/data_storage datasets
DETECTRON2_DATASETS=datasets python train_net.py --num-gpus 1 --eval-only --config configs/msvit_maskrcnn_fpn_1x_small_sparse.yaml \
  MODEL.TRANSFORMER.MSVIT.ARCH "l1,h3,d96,n1,s1,g1,p4,f7,a0_l2,h3,d192,n2,s1,g1,p2,f7,a0_l3,h6,d384,n8,s1,g1,p2,f7,a0_l4,h12,d768,n1,s1,g0,p2,f7,a0" \
  SOLVER.AMP.ENABLED True \
  MODEL.WEIGHTS /mnt/model_storage/msvit_det/visionlongformer/vilsmall/maskrcnn1x/model_final.pth
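The `MODEL.TRANSFORMER.MSVIT.ARCH` value is a compact string: stages are separated by `_`, and each stage is a comma-separated list of single-letter fields followed by an integer. The following is a minimal sketch for inspecting such a string, not the repository's own parser. The function name `parse_arch` is hypothetical, and the sketch deliberately does not interpret what each letter means, since that is not stated here.

```python
from typing import Dict, List


def parse_arch(arch: str) -> List[Dict[str, int]]:
    """Split an ARCH string into one {letter: value} dict per stage.

    Hypothetical helper for illustration only; it assumes every field is
    a single letter followed by an integer, as in the commands above.
    """
    stages = []
    for stage_spec in arch.split("_"):
        fields = {}
        for token in stage_spec.split(","):
            key, value = token[0], token[1:]
            fields[key] = int(value)
        stages.append(fields)
    return stages


ARCH = (
    "l1,h3,d96,n1,s1,g1,p4,f7,a0_"
    "l2,h3,d192,n2,s1,g1,p2,f7,a0_"
    "l3,h6,d384,n8,s1,g1,p2,f7,a0_"
    "l4,h12,d768,n1,s1,g0,p2,f7,a0"
)

for stage in parse_arch(ARCH):
    print(stage)
```

Printing the parsed stages side by side makes it easier to spot a mistyped field before launching a long evaluation or training run.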
Here is an example command for training the Vision Longformer small model on COCO:
python -m pip install -e .
ln -s /mnt/data_storage datasets
# convert the classification checkpoint into a detection checkpoint for initialization
python3 converter.py --source_model "/mnt/model_storage/msvit/visionlongformer/small1281_relative/model_best.pth" \
  --output_model msvit_pretrain.pth --config configs/msvit_maskrcnn_fpn_3xms_small_sparse.yaml \
  MODEL.TRANSFORMER.MSVIT.ARCH "l1,h3,d96,n1,s1,g1,p4,f7,a0_l2,h3,d192,n2,s1,g1,p2,f7,a0_l3,h6,d384,n8,s1,g1,p2,f7,a0_l4,h12,d768,n1,s1,g0,p2,f7,a0"
# train with the converted detection checkpoint as initialization
DETECTRON2_DATASETS=datasets python train_net.py --num-gpus 8 --config configs/msvit_maskrcnn_fpn_3xms_small_sparse.yaml \
  MODEL.WEIGHTS msvit_pretrain.pth \
  MODEL.TRANSFORMER.DROP_PATH 0.2 \
  MODEL.TRANSFORMER.MSVIT.ATTN_TYPE longformerhand \
  MODEL.TRANSFORMER.MSVIT.ARCH "l1,h3,d96,n1,s1,g1,p4,f7,a0_l2,h3,d192,n2,s1,g1,p2,f7,a0_l3,h6,d384,n8,s1,g1,p2,f7,a0_l4,h12,d768,n1,s1,g0,p2,f7,a0" \
  SOLVER.AMP.ENABLED True SOLVER.BASE_LR 1e-4 SOLVER.WEIGHT_DECAY 0.1 \
  TEST.EVAL_PERIOD 7330 SOLVER.IMS_PER_BATCH 16
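The `TEST.EVAL_PERIOD 7330` setting in the training command appears to correspond to roughly one pass over the COCO training set at `SOLVER.IMS_PER_BATCH 16`. This reading is an inference, not something the command itself states. The sketch below shows the arithmetic, assuming the commonly reported count of 117,266 COCO train2017 images that remain after images without usable annotations are filtered out.

```python
import math

# Assumption: COCO train2017 size after images without usable
# annotations are dropped (118,287 images before filtering).
COCO_TRAIN2017_IMAGES = 117_266

# Taken from the training command: SOLVER.IMS_PER_BATCH 16
IMS_PER_BATCH = 16

# Number of iterations needed to see every training image once.
iters_per_epoch = math.ceil(COCO_TRAIN2017_IMAGES / IMS_PER_BATCH)
print(iters_per_epoch)
```

Under these assumptions the result matches the evaluation period in the command, so evaluation would run about once per epoch. If `SOLVER.IMS_PER_BATCH` is changed, the same calculation gives a starting point for adjusting `TEST.EVAL_PERIOD`.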
Vision Longformer with relative positional bias
Backbone | Method | pretrain | drop_path | Lr Schd | box mAP | mask mAP | #params | FLOPs | checkpoints | log |
---|---|---|---|---|---|---|---|---|---|---|
ViL-Tiny | Mask R-CNN | ImageNet-1K | 0.05 | 1x | 41.4 | 38.1 | 26.9M | 145.6G | ckpt config | log |
ViL-Tiny | Mask R-CNN | ImageNet-1K | 0.1 | 3x | 44.2 | 40.6 | 26.9M | 145.6G | ckpt config | log |
ViL-Small | Mask R-CNN | ImageNet-1K | 0.2 | 1x | 44.9 | 41.1 | 45.0M | 218.3G | ckpt config | log |
ViL-Small | Mask R-CNN | ImageNet-1K | 0.2 | 3x | 47.1 | 42.7 | 45.0M | 218.3G | ckpt config | log |
ViL-Medium (D) | Mask R-CNN | ImageNet-21K | 0.2 | 1x | 47.6 | 43.0 | 60.1M | 293.8G | ckpt config | log |
ViL-Medium (D) | Mask R-CNN | ImageNet-21K | 0.3 | 3x | 48.9 | 44.2 | 60.1M | 293.8G | ckpt config | log |
ViL-Base (D) | Mask R-CNN | ImageNet-21K | 0.3 | 1x | 48.6 | 43.6 | 76.1M | 384.4G | ckpt config | log |
ViL-Base (D) | Mask R-CNN | ImageNet-21K | 0.3 | 3x | 49.6 | 44.5 | 76.1M | 384.4G | ckpt config | log |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
ViL-Tiny | RetinaNet | ImageNet-1K | 0.05 | 1x | 40.8 | -- | 16.64M | 182.7G | ckpt config | log |
ViL-Tiny | RetinaNet | ImageNet-1K | 0.1 | 3x | 43.6 | -- | 16.64M | 182.7G | ckpt config | log |
ViL-Small | RetinaNet | ImageNet-1K | 0.1 | 1x | 44.2 | -- | 35.68M | 254.8G | ckpt config | log |
ViL-Small | RetinaNet | ImageNet-1K | 0.2 | 3x | 45.9 | -- | 35.68M | 254.8G | ckpt config | log |
ViL-Medium (D) | RetinaNet | ImageNet-21K | 0.2 | 1x | 46.8 | -- | 50.77M | 330.4G | ckpt config | log |
ViL-Medium (D) | RetinaNet | ImageNet-21K | 0.3 | 3x | 47.9 | -- | 50.77M | 330.4G | ckpt config | log |
ViL-Base (D) | RetinaNet | ImageNet-21K | 0.3 | 1x | 47.8 | -- | 66.74M | 420.9G | ckpt config | log |
ViL-Base (D) | RetinaNet | ImageNet-21K | 0.3 | 3x | 48.6 | -- | 66.74M | 420.9G | ckpt config | log |
See Table 6 and Table 7 in the Vision Longformer paper for more fine-grained results. We use weight decay 0.05 for all experiments, but search for the best drop path rate in [0.05, 0.1, 0.2, 0.3].
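A drop path search like the one described can be scripted by generating one training command per candidate value. The following is a minimal sketch rather than the procedure used for the reported results. The helper `build_command` and the output directory layout are hypothetical. `OUTPUT_DIR` is a standard detectron2 config key, but its use here is an assumption. The remaining overrides from the example training command (architecture string, learning rate, and so on) would need to be appended.

```python
import shlex

# Candidate values from the search range quoted above.
DROP_PATHS = [0.05, 0.1, 0.2, 0.3]

# Config file name taken from the evaluation example.
CONFIG = "configs/msvit_maskrcnn_fpn_1x_small_sparse.yaml"


def build_command(drop_path, num_gpus=8):
    """Return the argument list for one training run (hypothetical helper)."""
    return [
        "python", "train_net.py",
        "--num-gpus", str(num_gpus),
        "--config", CONFIG,
        "MODEL.WEIGHTS", "msvit_pretrain.pth",
        "MODEL.TRANSFORMER.DROP_PATH", str(drop_path),
        # Assumption: keep each run's checkpoints and logs separate.
        "OUTPUT_DIR", f"output/drop_path_{drop_path}",
    ]


commands = [shlex.join(build_command(dp)) for dp in DROP_PATHS]
for command in commands:
    print(command)
```

The script only prints the commands, so they can be reviewed or handed to a job scheduler before anything is launched.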
Comparison of various efficient attention mechanisms with absolute positional embedding (Small size)
Backbone | Method | pretrain | drop_path | Lr Schd | box mAP | mask mAP | #params | FLOPs | Memory | checkpoints | log |
---|---|---|---|---|---|---|---|---|---|---|---|
srformer/64 | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 36.4 | 34.6 | 73.3M | 224.1G | 7.1G | ckpt config | log |
srformer/32 | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 39.9 | 37.3 | 51.5M | 268.3G | 13.6G | ckpt config | log |
Partial srformer/32 | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 42.4 | 39.0 | 46.8M | 352.1G | 22.6G | ckpt config | log |
global | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 34.8 | 33.4 | 45.2M | 226.4G | 7.6G | ckpt config | log |
Partial global | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 42.5 | 39.2 | 45.1M | 326.5G | 20.1G | ckpt config | log |
performer | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 36.1 | 34.3 | 45.0M | 251.5G | 8.4G | ckpt config | log |
Partial performer | Mask R-CNN | ImageNet-1K | 0.05 | 1x | 42.3 | 39.1 | 45.0M | 343.7G | 20.0G | ckpt config | log |
ViL | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 42.9 | 39.6 | 45.0M | 218.3G | 7.4G | ckpt config | log |
Partial ViL | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 43.3 | 39.8 | 45.0M | 326.8G | 19.5G | ckpt config | log |
We use weight decay 0.05 for all experiments, but search for the best drop path rate in [0.05, 0.1, 0.2].