Skip to content

[NeurIPS 2022 Spotlight] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

License

Notifications You must be signed in to change notification settings

GuozhenZhang1999/VideoMAE

ย 
ย 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

50 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

Official PyTorch Implementation of VideoMAE (NeurIPS 2022 Spotlight).

VideoMAE Framework

License: CC BY-NC 4.0
Hugging Face ModelsHugging Face SpacesColab
PWC
PWC
PWC
PWC
PWC

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Zhan Tong, Yibing Song, Jue Wang, Limin Wang
Nanjing University, Tencent AI Lab

๐Ÿ“ฐ News

[2023.1.16] Code and pre-trained models for Action Detection are available!
[2022.11.20] ๐Ÿ‘€ VideoMAE is integrated into Hugging Face Spaces and Colab, supported by @Sayak Paul.
[2022.10.25] ๐Ÿ‘€ VideoMAE is integrated into MMAction2, the results on Kinetics-400 can be reproduced successfully.
[2022.10.20] The pre-trained models and scripts of ViT-S and ViT-H are available!
[2022.10.19] The pre-trained models and scripts on UCF101 are available!
[2022.9.15] VideoMAE is accepted by NeurIPS 2022 as a spotlight presentation! ๐ŸŽ‰
[2022.8.8] ๐Ÿ‘€ VideoMAE is integrated into official ๐Ÿค—HuggingFace Transformers now! Hugging Face Models
[2022.7.7] We have updated new results on downstream AVA 2.2 benchmark. Please refer to our paper for details.
[2022.4.24] Code and pre-trained models are available now!
[2022.3.24] Code and pre-trained models will be released here. Welcome to watch this repository for the latest updates.

โœจ Highlights

๐Ÿ”ฅ Masked Video Modeling for Video Pre-Training

VideoMAE performs the task of masked video modeling for video pre-training. We propose the extremely high masking ratio (90%-95%) and tube masking strategy to create a challenging task for self-supervised video pre-training.

โšก๏ธ A Simple, Efficient and Strong Baseline in SSVP

VideoMAE uses the simple masked autoencoder and plain ViT backbone to perform video self-supervised learning. Due to the extremely high masking ratio, the pre-training time of VideoMAE is much shorter than contrastive learning methods (3.2x speedup). VideoMAE can serve as a simple but strong baseline for future research in self-supervised video pre-training.

๐Ÿ˜ฎ High performance, but NO extra data required

VideoMAE works well for video datasets of different scales and can achieve 87.4% on Kinects-400, 75.4% on Something-Something V2, 91.3% on UCF101, and 62.6% on HMDB51. To our best knowledge, VideoMAE is the first to achieve the state-of-the-art performance on these four popular benchmarks with the vanilla ViT backbones while doesn't need any extra data or pre-trained models.

๐Ÿš€ Main Results

โœจ Something-Something V2

Method Extra Data Backbone Resolution #Frames x Clips x Crops Top-1 Top-5
VideoMAE no ViT-S 224x224 16x2x3 66.8 90.3
VideoMAE no ViT-B 224x224 16x2x3 70.8 92.4
VideoMAE no ViT-L 224x224 16x2x3 74.3 94.6
VideoMAE no ViT-L 224x224 32x1x3 75.4 95.2

โœจ Kinetics-400

Method Extra Data Backbone Resolution #Frames x Clips x Crops Top-1 Top-5
VideoMAE no ViT-S 224x224 16x5x3 79.0 93.8
VideoMAE no ViT-B 224x224 16x5x3 81.5 95.1
VideoMAE no ViT-L 224x224 16x5x3 85.2 96.8
VideoMAE no ViT-H 224x224 16x5x3 86.6 97.1
VideoMAE no ViT-L 320x320 32x4x3 86.1 97.3
VideoMAE no ViT-H 320x320 32x4x3 87.4 97.6

โœจ AVA 2.2

Please check the code and checkpoints in VideoMAE-Action-Detection.

Method Extra Data Extra Label Backbone #Frame x Sample Rate mAP
VideoMAE Kinetics-400 โœ— ViT-S 16x4 22.5
VideoMAE Kinetics-400 โœ“ ViT-S 16x4 28.4
VideoMAE Kinetics-400 โœ— ViT-B 16x4 26.7
VideoMAE Kinetics-400 โœ“ ViT-B 16x4 31.8
VideoMAE Kinetics-400 โœ— ViT-L 16x4 34.3
VideoMAE Kinetics-400 โœ“ ViT-L 16x4 37.0
VideoMAE Kinetics-400 โœ— ViT-H 16x4 36.5
VideoMAE Kinetics-400 โœ“ ViT-H 16x4 39.5
VideoMAE Kinetics-700 โœ— ViT-L 16x4 36.1
VideoMAE Kinetics-700 โœ“ ViT-L 16x4 39.3

โœจ UCF101 & HMDB51

Method Extra Data Backbone UCF101 HMDB51
VideoMAE no ViT-B 91.3 62.6
VideoMAE Kinetics-400 ViT-B 96.1 73.3

๐Ÿ”จ Installation

Please follow the instructions in INSTALL.md.

โžก๏ธ Data Preparation

Please follow the instructions in DATASET.md for data preparation.

๐Ÿ”„ Pre-training

The pre-training instruction is in PRETRAIN.md.

โคด๏ธ Fine-tuning with pre-trained models

The fine-tuning instruction is in FINETUNE.md.

๐Ÿ“Model Zoo

We provide pre-trained and fine-tuned models in MODEL_ZOO.md.

๐Ÿ‘€ Visualization

We provide the script for visualization in vis.sh. Colab notebook for better visualization is coming soon.

โ˜Ž๏ธ Contact

Zhan Tong: [email protected]

๐Ÿ‘ Acknowledgements

Thanks to Ziteng Gao, Lei Chen, Chongjian Ge, and Zhiyu Zhao for their kind support.
This project is built upon MAE-pytorch and BEiT. Thanks to the contributors of these great codebases.

๐Ÿ”’ License

The majority of this project is released under the CC-BY-NC 4.0 license as found in the LICENSE file. Portions of the project are available under separate license terms: SlowFast and pytorch-image-models are licensed under the Apache 2.0 license. BEiT is licensed under the MIT license.

โœ๏ธ Citation

If you think this project is helpful, please feel free to leave a starโญ๏ธ and cite our paper:

@inproceedings{tong2022videomae,
  title={Video{MAE}: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  author={Zhan Tong and Yibing Song and Jue Wang and Limin Wang},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}

@article{videomae,
  title={VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  author={Tong, Zhan and Song, Yibing and Wang, Jue and Wang, Limin},
  journal={arXiv preprint arXiv:2203.12602},
  year={2022}
}

About

[NeurIPS 2022 Spotlight] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 92.4%
  • Shell 7.6%