Ziyun Zeng, Yixiao Ge, Zhan Tong, Xihui Liu, Shu-Tao Xia, Ying Shan
This repo is the official implementation of the paper TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale.
Before you start, run the following command to set up your Python environment.
pip install -r requirement.txt
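After installation, you may want to confirm that PyTorch can see your GPUs. A minimal check (assuming a CUDA build of PyTorch, which the GPU scripts below rely on):

```python
# Quick sanity check of the environment (assumes a CUDA build of PyTorch).
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available: ", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Visible GPUs:   ", torch.cuda.device_count())
```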
We have uploaded the dataset splits to Google Drive. Download them from this link and unzip them in the root directory.
- Download YT-Temporal from here, and put the dataset under the folder `data/YTTemporal`.
- Download WebVid-2M from here, and put the dataset under the folder `data/WebVid`.

- Download MSR-VTT from here, and put the dataset under the folder `data/msrvtt`.
- Download DiDeMo from here, and put the dataset under the folder `data/didemo`.
- Download LSMDC from here, and put the dataset under the folder `data/lsmdc`.

- Download HMDB-51 from here, and put the dataset under the folder `data/hmdb51`.
- Download UCF-101 from here, and put the dataset under the folder `data/ucf101`.
- Download Kinetics-400 from here, and put the dataset under the folder `data/k400`.
- Download SSV2 from here, and put the dataset under the folder `data/SSV2`.
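After downloading, a quick way to confirm the layout is to check that each folder exists. A minimal sketch, assuming you kept the default folder names listed above and run it from the repo root:

```python
# Optional: verify that every dataset folder from the list above is in place.
import os

expected = [
    "data/YTTemporal", "data/WebVid",                          # pre-training
    "data/msrvtt", "data/didemo", "data/lsmdc",                # zero-shot retrieval
    "data/hmdb51", "data/ucf101", "data/k400", "data/SSV2",    # zero-shot recognition
]
for path in expected:
    print(f"{path}: {'ok' if os.path.isdir(path) else 'MISSING'}")
```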
We use up to 80 NVIDIA V100 GPUs for pre-training. The detailed hyper-parameters can be found in the Appendix.
- Download CLIP-B/32 and CLIP-B/16 weights from OpenAI’s official repo, and put them into `CLIP/models`.
- Download OpenCLIP-H/14 weights from the official repo, and put it into `OpenCLIP/models`.
- Run the following scripts to pre-train different models on the YT-Temporal and WebVid datasets jointly.

bash scripts/train_dist_TVTSv2_ViT_B_32.sh  # for ViT-B/32, no mask
bash scripts/train_dist_TVTSv2_ViT_B_16.sh  # for ViT-B/16, mask 50%
bash scripts/train_dist_TVTSv2_ViT_H_14.sh  # for ViT-H/14, mask 70%
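For intuition, the mask ratios above mean that a corresponding fraction of video patch tokens is randomly dropped before being fed to the encoder. A minimal sketch of random token masking (illustrative only, not the repo's exact implementation):

```python
# Illustrative random token masking (not the repo's exact implementation).
import torch

def random_mask_tokens(tokens: torch.Tensor, mask_ratio: float) -> torch.Tensor:
    """Randomly keep (1 - mask_ratio) of the tokens. tokens: [batch, num_tokens, dim]."""
    b, n, d = tokens.shape
    num_keep = int(n * (1 - mask_ratio))
    keep_idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :num_keep]
    return tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, d))

x = torch.randn(2, 196, 768)             # e.g., patch tokens of one ViT-B/16 frame
print(random_mask_tokens(x, 0.5).shape)  # torch.Size([2, 98, 768])
```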
We have released our pre-trained models on Google Drive at the following links so you can quickly reproduce the results reported in our paper.
- TVTSv2_B_32: https://drive.google.com/file/d/1zNHgqioo-aRUwZXPyTDiRT2uaRrnk386/view?usp=sharing
- TVTSv2_B_16: https://drive.google.com/file/d/1HKc7aGwMd5jhVaYztuY-jbmYqiz_wvWF/view?usp=sharing
- TVTSv2_H_14: https://drive.google.com/file/d/1nxNSaQKm2jt9NSZ3eLnKx7ATTumV-6D5/view?usp=sharing
Download the pre-trained models and put them in the root directory. All zero-shot evaluation scripts can be run on a single GPU. Try our powerful models now 😎!
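If you would like to inspect a downloaded checkpoint before running the scripts, here is a minimal sketch (assuming standard PyTorch serialization; the file name and the keys inside are only illustrative):

```python
# Peek inside a downloaded checkpoint (file name is illustrative; adjust to your download).
import torch

ckpt = torch.load("TVTSv2_B_32.pth", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])  # show a few top-level keys
```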
# MSR-VTT Zero-shot Text-to-Video Retrieval
bash scripts/zero_ret_msrvtt_TVTSv2_ViT_B_32.sh # for ViT-B/32
bash scripts/zero_ret_msrvtt_TVTSv2_ViT_B_16.sh # for ViT-B/16
bash scripts/zero_ret_msrvtt_TVTSv2_ViT_H_14.sh # for ViT-H/14
# DiDeMo Zero-shot Text-to-Video Retrieval
bash scripts/zero_ret_didemo_TVTSv2_ViT_B_32.sh # for ViT-B/32
bash scripts/zero_ret_didemo_TVTSv2_ViT_B_16.sh # for ViT-B/16
bash scripts/zero_ret_didemo_TVTSv2_ViT_H_14.sh # for ViT-H/14
# LSMDC Zero-shot Text-to-Video Retrieval
bash scripts/zero_ret_lsmdc_TVTSv2_ViT_B_32.sh # for ViT-B/32
bash scripts/zero_ret_lsmdc_TVTSv2_ViT_B_16.sh # for ViT-B/16
bash scripts/zero_ret_lsmdc_TVTSv2_ViT_H_14.sh # for ViT-H/14
# HMDB-51 Zero-shot Action Recognition
bash scripts/zero_recognition_hmdb51_TVTSv2_ViT_B_32.sh # for ViT-B/32
bash scripts/zero_recognition_hmdb51_TVTSv2_ViT_B_16.sh # for ViT-B/16
bash scripts/zero_recognition_hmdb51_TVTSv2_ViT_H_14.sh # for ViT-H/14
# UCF-101 Zero-shot Action Recognition
bash scripts/zero_recognition_ucf101_TVTSv2_ViT_B_32.sh # for ViT-B/32
bash scripts/zero_recognition_ucf101_TVTSv2_ViT_B_16.sh # for ViT-B/16
bash scripts/zero_recognition_ucf101_TVTSv2_ViT_H_14.sh # for ViT-H/14
# Kinetics-400 Zero-shot Action Recognition
bash scripts/zero_recognition_k400_TVTSv2_ViT_B_32.sh # for ViT-B/32
bash scripts/zero_recognition_k400_TVTSv2_ViT_B_16.sh # for ViT-B/16
bash scripts/zero_recognition_k400_TVTSv2_ViT_H_14.sh # for ViT-H/14
# SSV2-MC Zero-shot Action Recognition
bash scripts/zero_ssv2_mc_TVTSv2_ViT_B_32.sh # for ViT-B/32
bash scripts/zero_ssv2_mc_TVTSv2_ViT_B_16.sh # for ViT-B/16
bash scripts/zero_ssv2_mc_TVTSv2_ViT_H_14.sh # for ViT-H/14
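For reference, zero-shot action recognition boils down to comparing a video embedding with text embeddings of the class names. A minimal sketch of that matching step, with random tensors standing in for the actual model outputs:

```python
# Zero-shot classification by video-text similarity (random tensors stand in for model outputs).
import torch
import torch.nn.functional as F

classes = ["brushing hair", "riding a bike", "playing guitar"]
video_feat = F.normalize(torch.randn(1, 512), dim=-1)              # [1, 512] video embedding
text_feats = F.normalize(torch.randn(len(classes), 512), dim=-1)   # one embedding per class name

similarity = video_feat @ text_feats.t()                           # cosine similarities, [1, num_classes]
print("predicted class:", classes[similarity.argmax(dim=-1).item()])
```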
Tip: Performance may differ slightly (either higher or lower) from the numbers reported in our paper due to differences in hardware environments.
Our model can serve as a standalone video feature extractor, and we provide simple scripts for out-of-the-box usage. Give it a try on your own video 😜!
cd downstream
python feature_extraction_TVTSv2_B_32.py --video_path /path/to/video.mp4 # for ViT-B/32, feature shape: [1 x 512]
python feature_extraction_TVTSv2_B_16.py --video_path /path/to/video.mp4 # for ViT-B/16, feature shape: [1 x 512]
python feature_extraction_TVTSv2_H_14.py --video_path /path/to/video.mp4 # for ViT-H/14, feature shape: [1 x 1024]
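The resulting embeddings can be used directly, e.g. for video-to-video retrieval by cosine similarity. A minimal sketch, with random tensors standing in for features produced by the scripts above:

```python
# Video-to-video retrieval over extracted features (random tensors stand in for real features).
import torch
import torch.nn.functional as F

gallery = {name: F.normalize(torch.randn(1, 512), dim=-1)
           for name in ["video_a", "video_b", "video_c"]}
query = F.normalize(torch.randn(1, 512), dim=-1)   # e.g., the [1 x 512] ViT-B/32 feature of your video

scores = {name: (feat @ query.t()).item() for name, feat in gallery.items()}
best = max(scores, key=scores.get)
print(f"most similar video: {best} (score {scores[best]:.3f})")
```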
- The pre-training code is based on the official implementation of Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval.
If you find our work helpful, please cite our paper.
@misc{zeng2023tvtsv2,
title={TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale},
author={Ziyun Zeng and Yixiao Ge and Zhan Tong and Xihui Liu and Shu-Tao Xia and Ying Shan},
year={2023},
eprint={2305.14173},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
This project makes reference to several open-source projects, and credits are given to them. See License.txt for details.