From ff96dde748a4c9eb53dd2ebf2f3685151584a8fb Mon Sep 17 00:00:00 2001 From: congee524 Date: Mon, 8 May 2023 19:08:56 +0800 Subject: [PATCH 1/2] add readme and config for videomaev2 fix lint add accuracy and ckpt link refine readme fix typos fix typo --- README.md | 2 +- configs/recognition/videomaev2/README.md | 63 +++++++++++++++++++ configs/recognition/videomaev2/metafile.yml | 43 +++++++++++++ ...vit-g-dist-k710-pre_16x4x1_kinetics-400.py | 61 ++++++++++++++++++ ...vit-g-dist-k710-pre_16x4x1_kinetics-400.py | 6 ++ model-index.yml | 1 + 6 files changed, 175 insertions(+), 1 deletion(-) create mode 100644 configs/recognition/videomaev2/README.md create mode 100644 configs/recognition/videomaev2/metafile.yml create mode 100644 configs/recognition/videomaev2/vit-base-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400.py create mode 100644 configs/recognition/videomaev2/vit-small-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400.py diff --git a/README.md b/README.md index 75597cc887..edc7fe097c 100644 --- a/README.md +++ b/README.md @@ -179,7 +179,7 @@ Results and models are available in the [model zoo](https://mmaction2.readthedoc MViT V2 (CVPR'2022) UniFormer V1 (ICLR'2022) UniFormer V2 (Arxiv'2022) - + VideoMAE V2 (CVPR'2023) diff --git a/configs/recognition/videomaev2/README.md b/configs/recognition/videomaev2/README.md new file mode 100644 index 0000000000..3686950c1c --- /dev/null +++ b/configs/recognition/videomaev2/README.md @@ -0,0 +1,63 @@ +# VideoMAE V2 + +[VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking](https://arxiv.org/abs/2303.16727) + + + +## Abstract + + + +Scale is the primary factor for building a powerful foundation model that could well generalize to a variety of downstream tasks. However, it is still challenging to train video foundation models with billions of parameters. This paper shows that video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models. We scale the VideoMAE in both model and data with a core design. Specifically, we present a dual masking strategy for efficient pre-training, with an encoder operating on a subset of video tokens and a decoder processing another subset of video tokens. Although VideoMAE is very efficient due to high masking ratio in encoder, masking decoder can still further reduce the overall computational cost. This enables the efficient pre-training of billion-level models in video. We also use a progressive training paradigm that involves an initial pre-training on a diverse multi-sourced unlabeled dataset, followed by a post-pre-training on a mixed labeled dataset. Finally, we successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance on the datasets of Kinetics (90.0% on K400 and 89.9% on K600) and Something-Something (68.7% on V1 and 77.0% on V2). In addition, we extensively verify the pre-trained video ViT models on a variety of downstream tasks, demonstrating its effectiveness as a general video representation learner. + + + +
+ +
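The dual masking strategy described in the abstract can be illustrated with a short sketch. The snippet below is only a toy illustration of the idea, complementary masks where the encoder sees a small set of visible tokens and the decoder reconstructs only part of the masked ones; the helper name, token count, uniform random masking and the ratios are assumptions for illustration (the paper uses structured tube and running-cell masks), not code from the VideoMAE V2 repository.

```python
import torch


def dual_masks(num_tokens, encoder_mask_ratio=0.9, decoder_keep_ratio=0.5, seed=0):
    """Toy illustration of dual masking (hypothetical helper, not VideoMAE V2 code).

    The encoder only processes the tokens left visible by ``encoder_mask``,
    while the decoder reconstructs only the subset of masked tokens selected
    by ``decoder_mask`` instead of all of them, which cuts its cost further.
    """
    g = torch.Generator().manual_seed(seed)
    order = torch.randperm(num_tokens, generator=g)

    num_masked = int(num_tokens * encoder_mask_ratio)
    masked_idx = order[:num_masked]                # hidden from the encoder
    encoder_mask = torch.zeros(num_tokens, dtype=torch.bool)
    encoder_mask[masked_idx] = True                # True = masked out

    num_decoded = int(num_masked * decoder_keep_ratio)
    decoder_mask = torch.zeros(num_tokens, dtype=torch.bool)
    decoder_mask[masked_idx[:num_decoded]] = True  # True = reconstructed by the decoder

    return encoder_mask, decoder_mask


# Illustrative numbers: a 16-frame 224x224 clip with patch size 16 and a
# temporal patch size of 2 gives 8 * 14 * 14 = 1568 tokens; with the ratios
# above the encoder sees 157 tokens and the decoder reconstructs 705 of the
# 1411 masked ones.
enc_mask, dec_mask = dual_masks(num_tokens=1568)
print(int((~enc_mask).sum()), int(dec_mask.sum()))  # 157 705
```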
+ +## Results and Models + +### Kinetics-400 + +| frame sampling strategy | resolution | backbone | top1 acc | top5 acc | reference top1 acc | reference top5 acc | testing protocol | FLOPs | params | config | ckpt | +| :---------------------: | :------------: | :------: | :------: | :------: | :--------------------------------: | :--------------------------------: | :---------------: | :---: | :----: | :--------------------: | :-------------------: | +| 16x4x1 | short-side 320 | ViT-S | 83.6 | 96.3 | 83.7 \[[VideoMAE V2](https://github.com/OpenGVLab/VideoMAEv2/blob/master/docs/MODEL_ZOO.md)\] | 96.2 \[[VideoMAE V2](https://github.com/OpenGVLab/VideoMAEv2/blob/master/docs/MODEL_ZOO.md)\] | 5 clips x 3 crops | 57G | 22M | [config](/configs/recognition/videomaev2/vit-small-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/videomaev2/vit-small-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400/vit-small-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400_20230510-25c748fd.pth) \[1\] | +| 16x4x1 | short-side 320 | ViT-B | 86.6 | 97.3 | 86.6 \[[VideoMAE V2](https://github.com/OpenGVLab/VideoMAEv2/blob/master/docs/MODEL_ZOO.md)\] | 97.3 \[[VideoMAE V2](https://github.com/OpenGVLab/VideoMAEv2/blob/master/docs/MODEL_ZOO.md)\] | 5 clips x 3 crops | 180G | 87M | [config](/configs/recognition/videomaev2/vit-base-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/videomaev2/vit-base-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400/vit-base-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400_20230510-3e7f93b2.pth) \[1\] | + +\[1\] The models were distilled from the VideoMAE V2-g model. Specifically, models are initialized with VideoMAE V2 pretraining, then distilled on Kinetics 710 dataset. They are ported from the repo [VideoMAE V2](https://github.com/OpenGVLab/VideoMAEv2) and tested on our data. The VideoMAE V2-g model can be obtained from the original repository. Currently, we only support the testing of VideoMAE V2 models. + +1. The values in columns named after "reference" are the results of the original repo. +2. The validation set of Kinetics400 we used consists of 19796 videos. These videos are available at [Kinetics400-Validation](https://mycuhk-my.sharepoint.com/:u:/g/personal/1155136485_link_cuhk_edu_hk/EbXw2WX94J1Hunyt3MWNDJUBz-nHvQYhO9pvKqm6g39PMA?e=a9QldB). The corresponding [data list](https://download.openmmlab.com/mmaction/dataset/k400_val/kinetics_val_list.txt) (each line is of the format 'video_id, num_frames, label_index') and the [label map](https://download.openmmlab.com/mmaction/dataset/k400_val/kinetics_class2ind.txt) are also available. + +For more details on data preparation, you can refer to [preparing_kinetics](/tools/data/kinetics/README.md). + +## Test + +You can use the following command to test a model. + +```shell +python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments] +``` + +Example: test ViT-base model on Kinetics-400 dataset and dump the result to a pkl file. + +```shell +python tools/test.py configs/recognition/videomaev2/vit-base-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400.py \ + checkpoints/SOME_CHECKPOINT.pth --dump result.pkl +``` + +For more details, you can refer to the **Test** part in the [Training and Test Tutorial](/docs/en/user_guides/train_test.md). 
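If you prefer to call the model from Python rather than through `tools/test.py`, the following is a minimal sketch using MMAction2's high-level inference API. The checkpoint path and demo video are placeholders, and the exact attribute holding the class scores may differ slightly between MMAction2 1.x releases.

```python
import torch
from mmaction.apis import inference_recognizer, init_recognizer

config = ('configs/recognition/videomaev2/'
          'vit-base-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400.py')
checkpoint = 'checkpoints/SOME_CHECKPOINT.pth'  # ckpt downloaded from the table above

model = init_recognizer(config, checkpoint, device='cuda:0')  # or device='cpu'
result = inference_recognizer(model, 'demo/demo.mp4')         # any Kinetics-style clip

scores = result.pred_score  # ActionDataSample field; name may vary by version
top5 = torch.topk(scores, k=5)
for label, score in zip(top5.indices.tolist(), top5.values.tolist()):
    print(f'class {label}: {score:.4f}')
```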
+

## Citation

```BibTeX
@misc{wang2023videomaev2,
      title={VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking},
      author={Limin Wang and Bingkun Huang and Zhiyu Zhao and Zhan Tong and Yinan He and Yi Wang and Yali Wang and Yu Qiao},
      year={2023},
      eprint={2303.16727},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```
diff --git a/configs/recognition/videomaev2/metafile.yml b/configs/recognition/videomaev2/metafile.yml
new file mode 100644
index 0000000000..463b3c360f
--- /dev/null
+++ b/configs/recognition/videomaev2/metafile.yml
@@ -0,0 +1,43 @@
Collections:
- Name: VideoMAEv2
  README: configs/recognition/videomaev2/README.md
  Paper:
    URL: https://arxiv.org/abs/2303.16727
    Title: "VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking"

Models:
  - Name: vit-small-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400
    Config: configs/recognition/videomaev2/vit-small-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400.py
    In Collection: VideoMAEv2
    Metadata:
      Architecture: ViT-S
      Resolution: short-side 320
      Modality: RGB
      Converted From:
        Weights: https://github.com/OpenGVLab/VideoMAEv2/blob/master/docs/MODEL_ZOO.md
        Code: https://github.com/OpenGVLab/VideoMAEv2/
    Results:
      - Dataset: Kinetics-400
        Task: Action Recognition
        Metrics:
          Top 1 Accuracy: 83.6
          Top 5 Accuracy: 96.3
    Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/videomaev2/vit-small-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400/vit-small-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400_20230510-25c748fd.pth

  - Name: vit-base-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400
    Config: configs/recognition/videomaev2/vit-base-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400.py
    In Collection: VideoMAEv2
    Metadata:
      Architecture: ViT-B
      Resolution: short-side 320
      Modality: RGB
      Converted From:
        Weights: https://github.com/OpenGVLab/VideoMAEv2/blob/master/docs/MODEL_ZOO.md
        Code: https://github.com/OpenGVLab/VideoMAEv2/
    Results:
      - Dataset: Kinetics-400
        Task: Action Recognition
        Metrics:
          Top 1 Accuracy: 86.6
          Top 5 Accuracy: 97.3
    Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/videomaev2/vit-base-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400/vit-base-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400_20230510-3e7f93b2.pth
diff --git a/configs/recognition/videomaev2/vit-base-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400.py b/configs/recognition/videomaev2/vit-base-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400.py
new file mode 100644
index 0000000000..d6f6e26a5f
--- /dev/null
+++ b/configs/recognition/videomaev2/vit-base-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400.py
@@ -0,0 +1,61 @@
_base_ = ['../../_base_/default_runtime.py']

# model settings
model = dict(
    type='Recognizer3D',
    backbone=dict(
        type='VisionTransformer',
        img_size=224,
        patch_size=16,
        embed_dims=768,
        depth=12,
        num_heads=12,
        mlp_ratio=4,
        qkv_bias=True,
        num_frames=16,
        norm_cfg=dict(type='LN', eps=1e-6)),
    cls_head=dict(
        type='TimeSformerHead',
        num_classes=400,
        in_channels=768,
        average_clips='prob'),
    data_preprocessor=dict(
        type='ActionDataPreprocessor',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        format_shape='NCTHW'))

# dataset settings
dataset_type = 'VideoDataset'
data_root_val = 'data/kinetics400/videos_val'
ann_file_test = 'data/kinetics400/kinetics400_val_list_videos.txt'

test_pipeline = [
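    # Test-time pipeline: open the video with Decord, sample 5 clips of
    # 16 frames with a frame interval of 4, decode them, resize the short
    # side to 224, take three 224x224 crops per clip (the "5 clips x 3 crops"
    # protocol reported in the README) and pack the result into NCTHW tensors.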
dict(type='DecordInit'), + dict( + type='SampleFrames', + clip_len=16, + frame_interval=4, + num_clips=5, + test_mode=True), + dict(type='DecordDecode'), + dict(type='Resize', scale=(-1, 224)), + dict(type='ThreeCrop', crop_size=224), + dict(type='FormatShape', input_format='NCTHW'), + dict(type='PackActionInputs') +] + +test_dataloader = dict( + batch_size=1, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + ann_file=ann_file_test, + data_prefix=dict(video=data_root_val), + pipeline=test_pipeline, + test_mode=True)) + +test_evaluator = dict(type='AccMetric') +test_cfg = dict(type='TestLoop') diff --git a/configs/recognition/videomaev2/vit-small-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400.py b/configs/recognition/videomaev2/vit-small-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400.py new file mode 100644 index 0000000000..e4d94d1cc3 --- /dev/null +++ b/configs/recognition/videomaev2/vit-small-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400.py @@ -0,0 +1,6 @@ +_base_ = ['vit-base-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400.py'] + +# model settings +model = dict( + backbone=dict(embed_dims=384, depth=12, num_heads=6), + cls_head=dict(in_channels=384)) diff --git a/model-index.yml b/model-index.yml index ebf462e3f9..adf85c0fc7 100644 --- a/model-index.yml +++ b/model-index.yml @@ -15,6 +15,7 @@ Import: - configs/recognition/trn/metafile.yml - configs/recognition/swin/metafile.yml - configs/recognition/c2d/metafile.yml +- configs/recognition/videomaev2/metafile.yml - configs/detection/slowfast/metafile.yml - configs/detection/slowonly/metafile.yml - configs/detection/acrn/metafile.yml From c051c2764dc83f7991bb8c3f1d3036e51844ebad Mon Sep 17 00:00:00 2001 From: congee524 Date: Thu, 11 May 2023 15:34:19 +0800 Subject: [PATCH 2/2] add videomaev2 model name in README_zh_CN --- README_zh-CN.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README_zh-CN.md b/README_zh-CN.md index 5e866c3402..9481526796 100644 --- a/README_zh-CN.md +++ b/README_zh-CN.md @@ -155,6 +155,7 @@ pip install -v -e . MViT V2 (CVPR'2022) UniFormer V1 (ICLR'2022) UniFormer V2 (Arxiv'2022) + VideoMAE V2 (CVPR'2023)
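The ViT-S config above only overrides the backbone width, the number of attention heads and the classification-head input channels; everything else is inherited from the ViT-B file through `_base_`. Below is a small sketch, assuming it is run from the root of an MMAction2 checkout that contains these two configs, that loads the merged config with MMEngine and prints a few of the resolved fields.

```python
from mmengine.config import Config

cfg = Config.fromfile(
    'configs/recognition/videomaev2/'
    'vit-small-p16_videomaev2-vit-g-dist-k710-pre_16x4x1_kinetics-400.py')

# Fields explicitly overridden by the ViT-S config.
print(cfg.model.backbone.embed_dims, cfg.model.backbone.num_heads)  # 384 6
print(cfg.model.cls_head.in_channels)                               # 384

# Everything else (test pipeline, dataloader, evaluator) comes from the
# ViT-B base config unchanged, e.g. the 5-clip sampling used for testing.
print(cfg.test_pipeline[1].type, cfg.test_pipeline[1].num_clips)    # SampleFrames 5
```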