Our paper (AdaMAE) has been accepted for presentation at CVPR'23.
- We propose AdaMAE, a novel, adaptive, and end-to-end trainable token sampling strategy for MAEs that takes into account the spatiotemporal properties of all input tokens to sample fewer but more informative tokens.
- We empirically show that AdaMAE samples more tokens from regions of the input with high spatiotemporal information, resulting in meaningful representations for downstream tasks.
- We demonstrate the efficiency of AdaMAE in terms of performance and GPU memory against random patch, tube, and frame sampling by conducting a thorough ablation study on the SSv2 dataset.
- We show that AdaMAE outperforms the state-of-the-art (SOTA) with top-1 accuracy improvements of 0.7% on SSv2 and 1.1% on Kinetics-400, respectively.
*Figure: Comparison of our adaptive masking with existing random patch, tube, and frame masking at a masking ratio of 80% (columns: Video | Pred. | Error | CAT | Mask).* Our adaptive masking approach selects more tokens from regions with high spatiotemporal information and only a small number of tokens from the background.
We use ViT-Base as the backbone for all experiments.
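The adaptive sampling idea above can be illustrated with a minimal sketch. This is not the paper's exact sampling network: the `token_scores` input (per-token "information" logits, which AdaMAE predicts with a lightweight network) and the softmax weighting here are illustrative assumptions. Visible tokens are drawn from a categorical distribution over tokens, so high-information regions are sampled more often.

```python
import numpy as np

def adaptive_mask(token_scores, keep_ratio=0.2, rng=None):
    """Sample visible-token indices biased toward high-score tokens.

    token_scores: (N,) unnormalized per-token information scores
                  (hypothetical stand-in for the sampling network's logits).
    keep_ratio:   fraction of tokens to keep visible (1 - masking ratio).
    Returns (visible_indices, boolean mask with True = masked).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(token_scores)
    # Softmax over scores gives a categorical distribution over tokens.
    probs = np.exp(token_scores - np.max(token_scores))
    probs /= probs.sum()
    n_keep = max(1, int(n * keep_ratio))
    # Draw visible tokens without replacement; high-probability
    # (high-information) tokens are selected more often.
    visible = rng.choice(n, size=n_keep, replace=False, p=probs)
    mask = np.ones(n, dtype=bool)
    mask[visible] = False
    return visible, mask
```

For example, with 100 tokens and `keep_ratio=0.2` (an 80% masking ratio, as in the figure above), 20 tokens stay visible, and their average score is well above the overall average because sampling is weighted toward informative tokens.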
- We closely follow the VideoMAE pre-training recipe, but with our adaptive masking instead of tube masking. To pre-train AdaMAE, please follow the steps in DATASET.md and PRETRAIN.md.
- To evaluate the performance of pre-trained AdaMAE, please follow the steps in DATASET.md and FINETUNE.md.
- To set up the conda environment, please refer to FINETUNE.md.
- Download the pre-trained model weights for the SSv2 and K400 datasets here.
Our AdaMAE codebase is based on the implementation of the VideoMAE paper. We thank the authors of VideoMAE for making their code publicly available.
@InProceedings{Bandara_2023_CVPR,
author = {Bandara, Wele Gedara Chaminda and Patel, Naman and Gholami, Ali and Nikkhah, Mehdi and Agrawal, Motilal and Patel, Vishal M.},
title = {AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning With Masked Autoencoders},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2023},
pages = {14507-14517}
}