diff --git a/README.md b/README.md
index 75597cc887..edc7fe097c 100644
--- a/README.md
+++ b/README.md
@@ -179,7 +179,7 @@ Results and models are available in the [model zoo](https://mmaction2.readthedoc
MViT V2 (CVPR'2022) |
UniFormer V1 (ICLR'2022) |
UniFormer V2 (Arxiv'2022) |
- |
+ VideoMAE V2 (CVPR'2023) |
|
diff --git a/README_zh-CN.md b/README_zh-CN.md
index 5e866c3402..9481526796 100644
--- a/README_zh-CN.md
+++ b/README_zh-CN.md
@@ -155,6 +155,7 @@ pip install -v -e .
MViT V2 (CVPR'2022) |
UniFormer V1 (ICLR'2022) |
UniFormer V2 (Arxiv'2022) |
+ VideoMAE V2 (CVPR'2023) |
|
|
diff --git a/configs/recognition/videomaev2/README.md b/configs/recognition/videomaev2/README.md
new file mode 100644
index 0000000000..3686950c1c
--- /dev/null
+++ b/configs/recognition/videomaev2/README.md
@@ -0,0 +1,63 @@
+# VideoMAE V2
+
+[VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking](https://arxiv.org/abs/2303.16727)
+
+
+
+## Abstract
+
+
+
+Scale is the primary factor for building a powerful foundation model that could well generalize to a variety of downstream tasks. However, it is still challenging to train video foundation models with billions of parameters. This paper shows that video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models. We scale the VideoMAE in both model and data with a core design. Specifically, we present a dual masking strategy for efficient pre-training, with an encoder operating on a subset of video tokens and a decoder processing another subset of video tokens. Although VideoMAE is very efficient due to high masking ratio in encoder, masking decoder can still further reduce the overall computational cost. This enables the efficient pre-training of billion-level models in video. We also use a progressive training paradigm that involves an initial pre-training on a diverse multi-sourced unlabeled dataset, followed by a post-pre-training on a mixed labeled dataset. Finally, we successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance on the datasets of Kinetics (90.0% on K400 and 89.9% on K600) and Something-Something (68.7% on V1 and 77.0% on V2). In addition, we extensively verify the pre-trained video ViT models on a variety of downstream tasks, demonstrating its effectiveness as a general video representation learner.
+
+
+
+
+
+
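+The dual masking described above can be pictured with a short, hypothetical PyTorch sketch (not the authors' implementation; the masking ratios below are illustrative): the encoder only receives a small subset of visible tokens, while the decoder reconstructs only a subset of the masked tokens rather than all of them.
+
+```python
+# Illustrative sketch of dual masking, NOT the authors' code: the encoder sees
+# only `encoder_keep` of the tokens, and reconstruction targets are limited to
+# `decoder_keep` of the remaining masked tokens.
+import torch
+
+
+def dual_masking(tokens, encoder_keep=0.1, decoder_keep=0.5):
+    """tokens: (B, N, C) video patch tokens."""
+    B, N, C = tokens.shape
+    order = torch.rand(B, N).argsort(dim=1)  # a random permutation per sample
+
+    def take(idx):
+        return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, C))
+
+    num_visible = int(N * encoder_keep)
+    visible_idx = order[:, :num_visible]   # tokens fed to the encoder
+    masked_idx = order[:, num_visible:]    # everything the encoder never sees
+
+    num_target = int(masked_idx.shape[1] * decoder_keep)
+    target_idx = masked_idx[:, :num_target]  # only these are reconstructed
+    return take(visible_idx), take(target_idx)
+
+
+x = torch.randn(2, 8 * 14 * 14, 768)  # e.g. 8 temporal x 14x14 spatial patches
+visible, targets = dual_masking(x)
+print(visible.shape, targets.shape)   # encoder input vs. decoder targets
+```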
+
+## Results and Models
+
+### Kinetics-400
+
+| frame sampling strategy | resolution | backbone | top1 acc | top5 acc | reference top1 acc | reference top5 acc | testing protocol | FLOPs | params | config | ckpt |
+| :---------------------: | :------------: | :------: | :------: | :------: | :--------------------------------: | :--------------------------------: | :---------------: | :---: | :----: | :--------------------: | :-------------------: |
+| 16x4x1 | short-side 320 | ViT-S | 83.6 | 96.3 | 83.7 \[[VideoMAE V2](https://github.com/OpenGVLab/VideoMAEv2/blob/master/docs/MODEL_ZOO.md)\] | 96.2 \[[VideoMAE V2](https://github.com/OpenGVLab/VideoMAEv2/blob/master/docs/MODEL_ZOO.md)\] | 5 clips x 3 crops | 57G | 22M | [config](/configs/recognition/videomaev2/vit-small-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/videomaev2/vit-small-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400/vit-small-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400_20230510-25c748fd.pth) \[1\] |
+| 16x4x1 | short-side 320 | ViT-B | 86.6 | 97.3 | 86.6 \[[VideoMAE V2](https://github.com/OpenGVLab/VideoMAEv2/blob/master/docs/MODEL_ZOO.md)\] | 97.3 \[[VideoMAE V2](https://github.com/OpenGVLab/VideoMAEv2/blob/master/docs/MODEL_ZOO.md)\] | 5 clips x 3 crops | 180G | 87M | [config](/configs/recognition/videomaev2/vit-base-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/videomaev2/vit-base-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400/vit-base-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400_20230510-3e7f93b2.pth) \[1\] |
+
+\[1\] The models are distilled from the VideoMAE V2-g model: they are initialized with VideoMAE V2 pre-trained weights and then distilled on the Kinetics-710 dataset. Both checkpoints are ported from the [VideoMAE V2](https://github.com/OpenGVLab/VideoMAEv2) repo and tested on our data. The VideoMAE V2-g model itself can be obtained from the original repository. Currently, we only support testing VideoMAE V2 models.
+
+1. The values in the "reference" columns are the results reported by the original repo.
+2. The validation set of Kinetics-400 we use consists of 19,796 videos. These videos are available at [Kinetics400-Validation](https://mycuhk-my.sharepoint.com/:u:/g/personal/1155136485_link_cuhk_edu_hk/EbXw2WX94J1Hunyt3MWNDJUBz-nHvQYhO9pvKqm6g39PMA?e=a9QldB). The corresponding [data list](https://download.openmmlab.com/mmaction/dataset/k400_val/kinetics_val_list.txt) (each line is of the format 'video_id, num_frames, label_index') and the [label map](https://download.openmmlab.com/mmaction/dataset/k400_val/kinetics_class2ind.txt) are also available.
+
+For more details on data preparation, you can refer to [preparing_kinetics](/tools/data/kinetics/README.md).
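+
+If you want to sanity-check the downloaded data list, a minimal snippet such as the one below can be used; it assumes the fields of each line are whitespace-separated, as in other MMAction2 lists.
+
+```python
+# Quick sanity check of kinetics_val_list.txt; each line is assumed to hold
+# whitespace-separated 'video_id num_frames label_index' fields.
+from collections import Counter
+
+with open('kinetics_val_list.txt') as f:
+    rows = [line.split() for line in f if line.strip()]
+
+labels = Counter(int(row[-1]) for row in rows)
+print(len(rows), 'videos')     # expected: 19796
+print(len(labels), 'classes')  # expected: 400
+```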
+
+## Test
+
+You can use the following command to test a model.
+
+```shell
+python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
+```
+
+Example: test the ViT-B model on the Kinetics-400 dataset and dump the result to a pkl file.
+
+```shell
+python tools/test.py configs/recognition/videomaev2/vit-base-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400.py \
+ checkpoints/SOME_CHECKPOINT.pth --dump result.pkl
+```
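+
+The dumped file can then be inspected offline. The sketch below assumes each entry is a dict containing `pred_score` and `gt_label`; the exact key names may differ between MMAction2 versions.
+
+```python
+# Inspect the dumped predictions; the key names ('pred_score', 'gt_label') are
+# an assumption and may vary between MMAction2 versions.
+import pickle
+
+with open('result.pkl', 'rb') as f:
+    results = pickle.load(f)
+
+correct = sum(
+    int(res['pred_score'].argmax()) == int(res['gt_label']) for res in results)
+print(f'top-1 accuracy: {correct / len(results):.4f}')
+```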
+
+For more details, you can refer to the **Test** part in the [Training and Test Tutorial](/docs/en/user_guides/train_test.md).
+
+## Citation
+
+```BibTeX
+@misc{wang2023videomaev2,
+ title={VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking},
+ author={Limin Wang and Bingkun Huang and Zhiyu Zhao and Zhan Tong and Yinan He and Yi Wang and Yali Wang and Yu Qiao},
+ year={2023},
+ eprint={2303.16727},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+}
+```
diff --git a/configs/recognition/videomaev2/metafile.yml b/configs/recognition/videomaev2/metafile.yml
new file mode 100644
index 0000000000..463b3c360f
--- /dev/null
+++ b/configs/recognition/videomaev2/metafile.yml
@@ -0,0 +1,43 @@
+Collections:
+- Name: VideoMAEv2
+ README: configs/recognition/videomaev2/README.md
+ Paper:
+ URL: https://arxiv.org/abs/2303.16727
+ Title: "VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking"
+
+Models:
+ - Name: vit-small-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400
+ Config: configs/recognition/videomaev2/vit-small-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400.py
+ In Collection: VideoMAEv2
+ Metadata:
+ Architecture: ViT-S
+ Resolution: short-side 320
+ Modality: RGB
+ Converted From:
+ Weights: https://github.com/OpenGVLab/VideoMAEv2/blob/master/docs/MODEL_ZOO.md
+ Code: https://github.com/OpenGVLab/VideoMAEv2/
+ Results:
+ - Dataset: Kinetics-400
+ Task: Action Recognition
+ Metrics:
+ Top 1 Accuracy: 83.6
+ Top 5 Accuracy: 96.3
+      Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/videomaev2/vit-small-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400/vit-small-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400_20230510-25c748fd.pth
+
+ - Name: vit-base-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400
+ Config: configs/recognition/videomaev2/vit-base-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400.py
+ In Collection: VideoMAEv2
+ Metadata:
+ Architecture: ViT-B
+ Resolution: short-side 320
+ Modality: RGB
+ Converted From:
+ Weights: https://github.com/OpenGVLab/VideoMAEv2/blob/master/docs/MODEL_ZOO.md
+ Code: https://github.com/OpenGVLab/VideoMAEv2/
+ Results:
+ - Dataset: Kinetics-400
+ Task: Action Recognition
+ Metrics:
+ Top 1 Accuracy: 86.6
+ Top 5 Accuracy: 97.3
+ Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/videomaev2/vit-base-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400/vit-base-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400_20230510-3e7f93b2.pth
diff --git a/configs/recognition/videomaev2/vit-base-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400.py b/configs/recognition/videomaev2/vit-base-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400.py
new file mode 100644
index 0000000000..d6f6e26a5f
--- /dev/null
+++ b/configs/recognition/videomaev2/vit-base-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400.py
@@ -0,0 +1,61 @@
+_base_ = ['../../_base_/default_runtime.py']
+
+# model settings
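+# Only testing is currently supported for VideoMAE V2 in MMAction2, so this
+# config defines the model and the test pipeline only (no training schedule).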
+model = dict(
+ type='Recognizer3D',
+ backbone=dict(
+ type='VisionTransformer',
+ img_size=224,
+ patch_size=16,
+ embed_dims=768,
+ depth=12,
+ num_heads=12,
+ mlp_ratio=4,
+ qkv_bias=True,
+ num_frames=16,
+ norm_cfg=dict(type='LN', eps=1e-6)),
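+    # single linear classification head; prediction scores of the test views
+    # are averaged as probabilities (average_clips='prob')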
+ cls_head=dict(
+ type='TimeSformerHead',
+ num_classes=400,
+ in_channels=768,
+ average_clips='prob'),
+ data_preprocessor=dict(
+ type='ActionDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ format_shape='NCTHW'))
+
+# dataset settings
+dataset_type = 'VideoDataset'
+data_root_val = 'data/kinetics400/videos_val'
+ann_file_test = 'data/kinetics400/kinetics400_val_list_videos.txt'
+
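+# 16x4x1 sampling: 16 frames with a frame interval of 4; at test time each
+# video is evaluated with 5 temporal clips x 3 spatial crops = 15 views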
+test_pipeline = [
+ dict(type='DecordInit'),
+ dict(
+ type='SampleFrames',
+ clip_len=16,
+ frame_interval=4,
+ num_clips=5,
+ test_mode=True),
+ dict(type='DecordDecode'),
+ dict(type='Resize', scale=(-1, 224)),
+ dict(type='ThreeCrop', crop_size=224),
+ dict(type='FormatShape', input_format='NCTHW'),
+ dict(type='PackActionInputs')
+]
+
+test_dataloader = dict(
+ batch_size=1,
+ num_workers=8,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ dataset=dict(
+ type=dataset_type,
+ ann_file=ann_file_test,
+ data_prefix=dict(video=data_root_val),
+ pipeline=test_pipeline,
+ test_mode=True))
+
+test_evaluator = dict(type='AccMetric')
+test_cfg = dict(type='TestLoop')
diff --git a/configs/recognition/videomaev2/vit-small-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400.py b/configs/recognition/videomaev2/vit-small-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400.py
new file mode 100644
index 0000000000..e4d94d1cc3
--- /dev/null
+++ b/configs/recognition/videomaev2/vit-small-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400.py
@@ -0,0 +1,6 @@
+_base_ = ['vit-base-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400.py']
+
+# model settings
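+# ViT-S/16: same depth as ViT-B but a narrower width (384-dim embeddings,
+# 6 heads); everything else is inherited from the ViT-B testing config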
+model = dict(
+ backbone=dict(embed_dims=384, depth=12, num_heads=6),
+ cls_head=dict(in_channels=384))
diff --git a/model-index.yml b/model-index.yml
index ebf462e3f9..adf85c0fc7 100644
--- a/model-index.yml
+++ b/model-index.yml
@@ -15,6 +15,7 @@ Import:
- configs/recognition/trn/metafile.yml
- configs/recognition/swin/metafile.yml
- configs/recognition/c2d/metafile.yml
+- configs/recognition/videomaev2/metafile.yml
- configs/detection/slowfast/metafile.yml
- configs/detection/slowonly/metafile.yml
- configs/detection/acrn/metafile.yml