Integrally Pre-Trained Transformer Pyramid Networks

Fast-iTPN: Integrally Pre-Trained Transformer Pyramid Network with Token Migration

[CVPR2023/TPAMI2024]

(A Simple Hierarchical Vision Transformer Meets Masked Image Modeling)

Figure 1: Comparison between a conventional pre-training framework (left) and the proposed integral pre-training framework (right). We use a feature pyramid as the unified neck module and apply masked feature modeling to pre-train it. Green and red blocks indicate network weights that are pre-trained and un-trained (i.e., randomly initialized for fine-tuning), respectively.

Updates

11/Jul./2024

Fast-iTPN is accepted by TPAMI2024.

08/Jan./2024

Fast-iTPN is publicly available on arXiv. Fast-iTPN is a more powerful version of iTPN.

26/Dec./2023

model	Params (M)	Pre-train	teacher	input/patch	21K ft?	Acc on IN.1K	checkpoint	checkpoint (21K)
Fast-iTPN-T	24	IN.1K	CLIP-L	224/16	N	85.1%	baidu/google
Fast-iTPN-T	24	IN.1K	CLIP-L	384/16	N	86.2%
Fast-iTPN-T	24	IN.1K	CLIP-L	512/16	N	86.5%
Fast-iTPN-S	40	IN.1K	CLIP-L	224/16	N	86.4%	baidu/google
Fast-iTPN-S	40	IN.1K	CLIP-L	384/16	N	86.95%
Fast-iTPN-S	40	IN.1K	CLIP-L	512/16	N	87.8%
Fast-iTPN-B	85	IN.1K	CLIP-L	224/16	N	87.4%	baidu/google
Fast-iTPN-B	85	IN.1K	CLIP-L	512/16	N	88.5%
Fast-iTPN-B	85	IN.1K	CLIP-L	512/16	Y	88.75%		baidu/google
Fast-iTPN-L	312	IN.1K	CLIP-L	640/16	N	89.5%	baidu/google

All the pre-trained Fast-iTPN models are now available (password: itpn)! To the best of our knowledge, the tiny/small/base-scale models report the best performance on ImageNet-1K. Use them for your own tasks! See Details.
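As a rough aid for reading the input/patch column above, the sketch below computes how many tokens a given setting yields, assuming the second number is the effective patch stride at the coarsest stage (an assumption about the notation, not something the table states). The function name `token_grid` is hypothetical and is not part of the iTPN code base.

```python
def token_grid(input_size: int, patch_size: int) -> tuple:
    """Return (tokens per side, total tokens) for a square input image.

    Assumes `patch_size` is the effective stride at the coarsest stage.
    """
    if input_size % patch_size != 0:
        raise ValueError("input size must be divisible by patch size")
    side = input_size // patch_size
    return side, side * side


# Settings taken from the input/patch column of the table above.
for setting in ("224/16", "384/16", "512/16", "640/16"):
    size, patch = (int(value) for value in setting.split("/"))
    side, total = token_grid(size, patch)
    print(f"{setting}: {side}x{side} = {total} tokens")
```

The token count grows quadratically with the input size, so the higher-resolution rows trade extra compute for their accuracy gains.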

30/May/2023

model	Pre-train	teacher	input/patch	21K ft?	Acc on IN.1K
EVA-02-B	IN.21K	EVA-CLIP-g	196/14	N	87.0%
EVA-02-B	IN.21K	EVA-CLIP-g	448/14	N	88.3%
EVA-02-B	IN.21K	EVA-CLIP-g	448/14	Y	88.6%
Fast-iTPN-B	IN.1K	CLIP-L	224/16	N	87.4%
Fast-iTPN-B	IN.1K	CLIP-L	512/16	N	88.5%
Fast-iTPN-B	IN.1K	CLIP-L	512/16	Y	88.7%

The Fast-iTPN models above are pre-trained only on ImageNet-1K and will be available soon.

29/May/2023

The iTPN-L-CLIP/16 intermediate fine-tuned models are available (password: itpn): one pre-trained on 21K, and one fine-tuned on 1K. Evaluating the latter on ImageNet-1K obtains 89.2% accuracy.

28/Feb./2023

iTPN is accepted by CVPR2023!

08/Feb./2023

The iTPN-L-CLIP/16 model reaches 89.2% fine-tuning performance on ImageNet-1K.

configurations: intermediate fine-tuning on ImageNet-21K + 384 input size

21/Jan./2023

Our HiViT is accepted by ICLR2023!

HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer

08/Dec./2022

Get checkpoints (password: abcd):

	iTPN-B-pixel	iTPN-B-CLIP	iTPN-L-pixel	iTPN-L-CLIP/16
baidu drive	download	download	download	download
google drive	download	download	download	download

25/Nov./2022

The preprint version is publicly available on arXiv.

Requirements

Ubuntu
Python 3.7+
CUDA 10.2+
GCC 5+
PyTorch 1.7+
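The minimum versions listed above can be checked with a small script before installing anything else. This is a minimal sketch: the helper names `version_tuple` and `meets_minimum` are hypothetical and not part of this repository, and only the Python and PyTorch entries are checked.

```python
import platform
import re


def version_tuple(version: str) -> tuple:
    """Parse the leading numeric part, e.g. '1.7.1+cu102' -> (1, 7, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric version in {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(version: str, minimum: str) -> bool:
    """Return True if `version` is at least `minimum`."""
    have, need = version_tuple(version), version_tuple(minimum)
    width = max(len(have), len(need))
    # Pad with zeros so that '3.7' and '3.7.0' compare as equal.
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


if __name__ == "__main__":
    print("Python OK:", meets_minimum(platform.python_version(), "3.7"))
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        print("PyTorch OK:", meets_minimum(torch.__version__, "1.7"))
        print("CUDA build:", torch.version.cuda,
              "| CUDA available:", torch.cuda.is_available())
```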

Dataset

ImageNet-1K
COCO2017
ADE20K

Get Started

Prepare the environment:

conda create --name itpn python=3.8 -y
conda activate itpn

git clone git@github.com:sunsmarterjie/iTPN.git
cd iTPN

pip install torch==1.7.1 torchvision==0.8.2 timm==0.3.2 tensorboard einops

iTPN supports pre-training with either pixel or CLIP supervision. For the latter, please first download the CLIP models (we use the CLIP-B/16 and CLIP-L/14 models in the paper).
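One possible way to fetch the two CLIP models is through the OpenAI `clip` package (installable with `pip install git+https://github.com/openai/CLIP.git`). This is only a sketch under assumptions: the helper `download_clip_teachers` and the `./clip_weights` directory are hypothetical, and the pre-training scripts in this repository may expect the weights in a different location or format.

```python
# Model names as registered in the OpenAI `clip` package.
CLIP_TEACHERS = ("ViT-B/16", "ViT-L/14")


def download_clip_teachers(download_root: str = "./clip_weights") -> dict:
    """Download the CLIP models and return {name: parameter count}.

    `download_root` is a hypothetical cache directory, not a path that
    the iTPN scripts are known to read from.
    """
    import clip  # imported lazily so this file loads without the package

    sizes = {}
    for name in CLIP_TEACHERS:
        # clip.load downloads the weights on first use and caches them.
        model, _preprocess = clip.load(
            name, device="cpu", download_root=download_root
        )
        sizes[name] = sum(p.numel() for p in model.parameters())
    return sizes
```

Calling `download_clip_teachers()` once requires network access; later calls reuse the files cached under `download_root`.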

Main Results

Table 1: Top-1 classification accuracy (%) by fine-tuning the pre-trained models on ImageNet-1K. We compare models of different levels and supervisions (e.g., with and without CLIP) separately.

Table 2: Visual recognition results (%) on COCO and ADE20K. Mask R-CNN (abbr. MR, 1x/3x) and Cascade Mask R-CNN (abbr. CMR, 1x) are used on COCO, and UPerHead with 512x512 input is used on ADE20K. For the base-level models, each cell of COCO results contains object detection (box) and instance segmentation (mask) APs. For the large-level models, the accuracy of 1x Mask R-CNN surpasses all existing methods.

License

iTPN is released under the License.

Citation

@article{tian2024fast,
  title={Fast-iTPN: Integrally pre-trained transformer pyramid network with token migration},
  author={Tian, Yunjie and Xie, Lingxi and Qiu, Jihao and Jiao, Jianbin and Wang, Yaowei and Tian, Qi and Ye, Qixiang},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2024},
  publisher={IEEE}
}

@inproceedings{tian2023integrally,
  title={Integrally pre-trained transformer pyramid networks},
  author={Tian, Yunjie and Xie, Lingxi and Wang, Zhaozhi and Wei, Longhui and Zhang, Xiaopeng and Jiao, Jianbin and Wang, Yaowei and Tian, Qi and Ye, Qixiang},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={18610--18620},
  year={2023}
}

@inproceedings{zhang2023hivit,
  title={HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer},
  author={Zhang, Xiaosong and Tian, Yunjie and Xie, Lingxi and Huang, Wei and Dai, Qi and Ye, Qixiang and Tian, Qi},
  booktitle={International Conference on Learning Representations},
  year={2023}
}
