Ziyun Zeng, Yixiao Ge, Zhan Tong, Xihui Liu, Shu-Tao Xia, Ying Shan
This repo is the official implementation of the paper TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale.
Before you start, run the following command to set up your Python environment.
pip install -r requirement.txt
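After installation, you may want to confirm that PyTorch can see your GPUs. A minimal check (assuming a CUDA build of PyTorch, which the GPU scripts below rely on):

```python
# Quick sanity check of the environment (assumes a CUDA build of PyTorch).
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available: ", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Visible GPUs:   ", torch.cuda.device_count())
```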
We have uploaded the dataset splits to Google Drive. Download them from this link and unzip them in the root directory.
- Download YT-Temporal from here, and put the dataset under the folder `data/YTTemporal`.
- Download WebVid-2M from here, and put the dataset under the folder `data/WebVid`.

- Download MSR-VTT from here, and put the dataset under the folder `data/msrvtt`.
- Download DiDeMo from here, and put the dataset under the folder `data/didemo`.
- Download LSMDC from here, and put the dataset under the folder `data/lsmdc`.

- Download HMDB-51 from here, and put the dataset under the folder `data/hmdb51`.
- Download UCF-101 from here, and put the dataset under the folder `data/ucf101`.
- Download Kinetics-400 from here, and put the dataset under the folder `data/k400`.
- Download SSV2 from here, and put the dataset under the folder `data/SSV2`.
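After downloading, a quick way to confirm the layout is to check that each folder exists. A minimal sketch, assuming you kept the default folder names listed above and run it from the repo root:

```python
# Optional: verify that every dataset folder from the list above is in place.
import os

expected = [
    "data/YTTemporal", "data/WebVid",                          # pre-training
    "data/msrvtt", "data/didemo", "data/lsmdc",                # zero-shot retrieval
    "data/hmdb51", "data/ucf101", "data/k400", "data/SSV2",    # zero-shot recognition
]
for path in expected:
    print(f"{path}: {'ok' if os.path.isdir(path) else 'MISSING'}")
```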
We use up to 80 NVIDIA V100 GPUs for pre-training. The detailed hyper-parameters can be found in the Appendix.
- Download CLIP-B/32 and CLIP-B/16 weights from OpenAI’s official repo, and put them into `CLIP/models`.
- Download OpenCLIP-H/14 weights from the official repo, and put it into `OpenCLIP/models`.
- Run the following scripts to pre-train different models on the YT-Temporal and WebVid datasets jointly.

bash scripts/train_dist_TVTSv2_ViT_B_32.sh  # for ViT-B/32, no mask
bash scripts/train_dist_TVTSv2_ViT_B_16.sh  # for ViT-B/16, mask 50%
bash scripts/train_dist_TVTSv2_ViT_H_14.sh  # for ViT-H/14, mask 70%
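For intuition, the mask ratios above mean that a corresponding fraction of video patch tokens is randomly dropped before being fed to the encoder. A minimal sketch of random token masking (illustrative only, not the repo's exact implementation):

```python
# Illustrative random token masking (not the repo's exact implementation).
import torch

def random_mask_tokens(tokens: torch.Tensor, mask_ratio: float) -> torch.Tensor:
    """Randomly keep (1 - mask_ratio) of the tokens. tokens: [batch, num_tokens, dim]."""
    b, n, d = tokens.shape
    num_keep = int(n * (1 - mask_ratio))
    keep_idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :num_keep]
    return tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, d))

x = torch.randn(2, 196, 768)             # e.g., patch tokens of one ViT-B/16 frame
print(random_mask_tokens(x, 0.5).shape)  # torch.Size([2, 98, 768])
```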
We have released our pre-trained models on Google Drive at the following links so you can quickly reproduce the results reported in our paper.
- TVTSv2_B_32: https://drive.google.com/file/d/1zNHgqioo-aRUwZXPyTDiRT2uaRrnk386/view?usp=sharing
- TVTSv2_B_16: https://drive.google.com/file/d/1HKc7aGwMd5jhVaYztuY-jbmYqiz_wvWF/view?usp=sharing
- TVTSv2_H_14: https://drive.google.com/file/d/1nxNSaQKm2jt9NSZ3eLnKx7ATTumV-6D5/view?usp=sharing
Download the pre-trained models and put them in the root directory. All zero-shot evaluation scripts can be run on a single GPU. Try our powerful models now 😎!
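If you would like to inspect a downloaded checkpoint before running the scripts, here is a minimal sketch (assuming standard PyTorch serialization; the file name and the keys inside are only illustrative):

```python
# Peek inside a downloaded checkpoint (file name is illustrative; adjust to your download).
import torch

ckpt = torch.load("TVTSv2_B_32.pth", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])  # show a few top-level keys
```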
# MSR-VTT Zero-shot Text-to-Video Retrieval
bash scripts/zero_ret_msrvtt_TVTSv2_ViT_B_32.sh # for ViT-B/32
bash scripts/zero_ret_msrvtt_TVTSv2_ViT_B_16.sh # for ViT-B/16
bash scripts/zero_ret_msrvtt_TVTSv2_ViT_H_14.sh # for ViT-H/14
# DiDeMo Zero-shot Text-to-Video Retrieval
bash scripts/zero_ret_didemo_TVTSv2_ViT_B_32.sh # for ViT-B/32
bash scripts/zero_ret_didemo_TVTSv2_ViT_B_16.sh # for ViT-B/16
bash scripts/zero_ret_didemo_TVTSv2_ViT_H_14.sh # for ViT-H/14
# LSMDC Zero-shot Text-to-Video Retrieval
bash scripts/zero_ret_lsmdc_TVTSv2_ViT_B_32.sh # for ViT-B/32
bash scripts/zero_ret_lsmdc_TVTSv2_ViT_B_16.sh # for ViT-B/16
bash scripts/zero_ret_lsmdc_TVTSv2_ViT_H_14.sh # for ViT-H/14
# HMDB-51 Zero-shot Action Recognition
bash scripts/zero_recognition_hmdb51_TVTSv2_ViT_B_32.sh # for ViT-B/32
bash scripts/zero_recognition_hmdb51_TVTSv2_ViT_B_16.sh # for ViT-B/16
bash scripts/zero_recognition_hmdb51_TVTSv2_ViT_H_14.sh # for ViT-H/14
# UCF-101 Zero-shot Action Recognition
bash scripts/zero_recognition_ucf101_TVTSv2_ViT_B_32.sh # for ViT-B/32
bash scripts/zero_recognition_ucf101_TVTSv2_ViT_B_16.sh # for ViT-B/16
bash scripts/zero_recognition_ucf101_TVTSv2_ViT_H_14.sh # for ViT-H/14
# Kinetics-400 Zero-shot Action Recognition
bash scripts/zero_recognition_k400_TVTSv2_ViT_B_32.sh # for ViT-B/32
bash scripts/zero_recognition_k400_TVTSv2_ViT_B_16.sh # for ViT-B/16
bash scripts/zero_recognition_k400_TVTSv2_ViT_H_14.sh # for ViT-H/14
# SSV2-MC Zero-shot Action Recognition
bash scripts/zero_ssv2_mc_TVTSv2_ViT_B_32.sh # for ViT-B/32
bash scripts/zero_ssv2_mc_TVTSv2_ViT_B_16.sh # for ViT-B/16
bash scripts/zero_ssv2_mc_TVTSv2_ViT_H_14.sh # for ViT-H/14
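For reference, zero-shot action recognition boils down to comparing a video embedding with text embeddings of the class names. A minimal sketch of that matching step, with random tensors standing in for the actual model outputs:

```python
# Zero-shot classification by video-text similarity (random tensors stand in for model outputs).
import torch
import torch.nn.functional as F

classes = ["brushing hair", "riding a bike", "playing guitar"]
video_feat = F.normalize(torch.randn(1, 512), dim=-1)              # [1, 512] video embedding
text_feats = F.normalize(torch.randn(len(classes), 512), dim=-1)   # one embedding per class name

similarity = video_feat @ text_feats.t()                           # cosine similarities, [1, num_classes]
print("predicted class:", classes[similarity.argmax(dim=-1).item()])
```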
Tip: Performance may differ slightly (either higher or lower) from the numbers reported in our paper due to differences in hardware environments.
Our model can serve as a standalone video feature extractor, and we provide simple scripts for out-of-the-box usage. Give it a try on your own video 😜!
cd downstream
python feature_extraction_TVTSv2_B_32.py --video_path /path/to/video.mp4 # for ViT-B/32, feature shape: [1 x 512]
python feature_extraction_TVTSv2_B_16.py --video_path /path/to/video.mp4 # for ViT-B/16, feature shape: [1 x 512]
python feature_extraction_TVTSv2_H_14.py --video_path /path/to/video.mp4 # for ViT-H/14, feature shape: [1 x 1024]
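The resulting embeddings can be used directly, e.g. for video-to-video retrieval by cosine similarity. A minimal sketch, with random tensors standing in for features produced by the scripts above:

```python
# Video-to-video retrieval over extracted features (random tensors stand in for real features).
import torch
import torch.nn.functional as F

gallery = {name: F.normalize(torch.randn(1, 512), dim=-1)
           for name in ["video_a", "video_b", "video_c"]}
query = F.normalize(torch.randn(1, 512), dim=-1)   # e.g., the [1 x 512] ViT-B/32 feature of your video

scores = {name: (feat @ query.t()).item() for name, feat in gallery.items()}
best = max(scores, key=scores.get)
print(f"most similar video: {best} (score {scores[best]:.3f})")
```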
- The pre-training code is based on the official implementation of Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval.
If you find our work helpful, please cite our paper.
@misc{zeng2023tvtsv2,
title={TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale},
author={Ziyun Zeng and Yixiao Ge and Zhan Tong and Xihui Liu and Shu-Tao Xia and Ying Shan},
year={2023},
eprint={2305.14173},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
This project makes reference to several open-source projects, and credits are given to them. See License.txt for details.