A PyTorch implementation of the [CVPR 2023] paper: MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering
The project requires the following:
- PyTorch (version 1.9.0 or higher): The project was tested on PyTorch 1.11.0 with CUDA 11.3 support.
- Hardware: We performed experiments on an NVIDIA RTX A5000 with 24 GB of GPU memory. Similar or higher specifications are recommended for optimal performance.
- Python packages: Additional Python packages specified in the requirements.txt file are necessary. Instructions for installing them are given below.
Begin by creating and activating a Conda virtual environment:
conda create --name mistenv python=3.7
conda activate mistenv
Then, clone this repository and install the requirements.
$ git clone https://github.com/showlab/mist.git
$ cd mist
$ pip install -r requirements.txt
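After installation, you may want to run a quick sanity check (not part of the original instructions) to confirm that PyTorch is importable and that it sees your GPU:
$ python -c "import torch; print(torch.__version__, torch.cuda.is_available())"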
Next, obtain the necessary dataset annotations and features. You can choose one of the following options to do so:
- Download the dataset annotation files and features directly from our online drive. In our experiments, we placed the downloaded data folder in the same root directory as the code folder.
- Alternatively, extract the features on your own using the provided script located in the extract/ directory: extract/extract_clip_features.ipynb. This script extracts patch features from video frames and includes a checking routine to verify the correctness of the extracted features (see the sketch after this list for a rough idea of what it does).
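For orientation only, here is a minimal sketch of frame-level CLIP feature extraction, assuming the openai/CLIP package and a directory of pre-decoded frame images; the frame directory, output file, and ViT-B/32 backbone are illustrative assumptions, and the notebook above (which extracts patch-level features) remains the authoritative reference.
# Illustrative sketch only; see extract/extract_clip_features.ipynb for the actual patch-feature extraction.
import glob
import torch
import clip  # assumes the openai/CLIP package is installed
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # backbone choice is an assumption

frame_paths = sorted(glob.glob("frames/*.jpg"))  # hypothetical directory of decoded video frames
features = []
with torch.no_grad():
    for path in frame_paths:
        image = preprocess(Image.open(path)).unsqueeze(0).to(device)
        features.append(model.encode_image(image).float().cpu())  # one 512-d vector per frame
features = torch.cat(features, dim=0)  # shape: [num_frames, 512]
torch.save(features, "clip_frame_features.pt")  # hypothetical output path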
With your environment set up and the data ready, you can start training the model. To begin training, run the agqa_v2_mist.sh shell script located in the shells/ directory.
./shells/agqa_v2_mist.sh
Alternatively, run the command below in the terminal to start training.
CUDA_VISIBLE_DEVICES=6 python main_agqa_v2.py --dataset_dir='../data/datasets/' \
--feature_dir='../data/feats/' \
--checkpoint_dir=agqa \
--dataset=agqa \
--mc=0 \
--epochs=30 \
--lr=0.00003 \
--qmax_words=30 \
--amax_words=38 \
--max_feats=32 \
--batch_size=128 \
--batch_size_val=128 \
--num_thread_reader=8 \
--mlm_prob=0 \
--n_layers=2 \
--embd_dim=512 \
--ff_dim=1024 \
--feature_dim=512 \
--dropout=0.3 \
--seed=100 \
--freq_display=150 \
--save_dir='../data/save_models/agqa/mist_agqa_v2/'
Make sure to modify the dataset_dir, feature_dir, and save_dir parameters in the command above to match the locations where you have stored the downloaded data and features. Also set CUDA_VISIBLE_DEVICES to the index of the GPU you want to use.
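Based on the relative paths used in the command above, the expected layout is roughly the following (subdirectory names beyond those referenced in the command are illustrative):
parent_directory/
  mist/                                # this repository
  data/
    datasets/                          # --dataset_dir
    feats/                             # --feature_dir
    save_models/agqa/mist_agqa_v2/     # --save_dir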
To verify that your training process is running as expected, you can refer to our training logs located in the logs/ directory.
We are grateful to just-ask, an excellent VQA codebase, on which our code is developed.
@inproceedings{gao2023mist,
title={MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering},
author={Difei Gao and Luowei Zhou and Lei Ji and Linchao Zhu and Yi Yang and Mike Zheng Shou},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={14773--14783},
year={2023}
}