Official PyTorch implementation of the papers "Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring" and "Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding".
The original code is based on mmcv 1.4. Because all of its data processing pipelines were built on private I/O, that training code cannot be open-sourced; we have therefore reproduced the results with mmcv 2.0.
Git clone our repository, create a Python environment, and activate it via the following commands:

```shell
git clone https://github.com/farewellthree/STAN.git
cd STAN
conda create --name stan python=3.10
conda activate stan
bash install.sh
```
You can follow CLIP4Clip for the acquisition of videos and annotations.
Once the dataset is ready, set its path in each config. For stan-b/32 on MSRVTT, for instance, set the video path at Line 25 of the config.
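As a rough illustration of what that config entry can look like (the field names below follow common mmengine/mmaction conventions and are assumptions; check the actual config file for the exact keys):

```python
# Hypothetical fragment of configs/exp/stan/stan_msrvtt_b32_hf.py.
# Point ann_file at your annotation file and the video prefix at
# the directory containing the MSRVTT videos.
train_dataloader = dict(
    batch_size=16,
    dataset=dict(
        type='MsrvttDataset',
        ann_file='data/msrvtt/train_9k.json',           # your annotations
        data_prefix=dict(video='data/msrvtt/videos'),   # <- your video path
    ),
)
```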
Because multiple versions of the annotations exist for each dataset, our code may not be compatible with yours. In that case, modify the corresponding dataset class in video_text_dataset.py so that it outputs the paths of all videos along with their corresponding captions.
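A minimal sketch of such a dataset class is shown below. It is not the actual class from video_text_dataset.py; the JSON annotation format (a list of `{"video": ..., "caption": ...}` entries) is an assumption, and you should adapt the parsing to whatever format your annotations use:

```python
import json


class MyVideoTextDataset:
    """Hypothetical dataset: reads a JSON list of
    {"video": "relative/path.mp4", "caption": "text"} entries and
    yields absolute video paths with their captions."""

    def __init__(self, ann_file, video_root):
        with open(ann_file) as f:
            self.anns = json.load(f)
        self.video_root = video_root

    def __len__(self):
        return len(self.anns)

    def __getitem__(self, idx):
        item = self.anns[idx]
        return {
            "video_path": f"{self.video_root}/{item['video']}",
            "caption": item["caption"],
        }
```

The key contract is simply that each item exposes a resolvable video path and its caption; the rest of the pipeline (decoding, sampling, tokenization) can stay unchanged.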
To train stan-b/32 on MSRVTT, run

```shell
torchrun --nproc_per_node=8 --master_port=20001 tools/train.py configs/exp/stan/stan_msrvtt_b32_hf.py --launcher pytorch
```
The same principle applies to the other datasets and model scales.
To train mug-stan-b/32 on MSRVTT, run

```shell
torchrun --nproc_per_node=8 --master_port=20001 tools/train.py configs/exp/stan/mugstan_msrvt_b32_hf.py --launcher pytorch
```
The same principle applies to the other datasets and model scales.
To post-pretrain mug-stan-b/32 on WebVid-10M, run

```shell
torchrun --nproc_per_node=16 --master_port=20001 tools/train.py configs/exp/stan/mugstan_webvid10m_b32_pretrain.py --launcher pytorch
```
If you find the code useful for your research, please consider citing our paper:
```bibtex
@inproceedings{liu2023revisiting,
  title={Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring},
  author={Liu, Ruyang and Huang, Jingjia and Li, Ge and Feng, Jiashi and Wu, Xinglong and Li, Thomas H},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2023}
}
```