
[Technical Report] TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale

Ziyun Zeng, Yixiao Ge, Zhan Tong, Xihui Liu, Shu-Tao Xia, Ying Shan

This repo is the official implementation of the paper TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale.

(Figure 2 of the paper.)

Main Results

Zero-shot Text-to-Video Retrieval

(Table 2 of the paper: zero-shot text-to-video retrieval results.)

Zero-shot Action Recognition

(Table 3 of the paper: zero-shot action recognition results.)

Linear Probe

(Table 4 of the paper: linear probe results.)

Instructions

Environment Setup

Before you start, run the following command to set up your Python environment.

pip install -r requirement.txt
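
Optionally, create an isolated environment first. This is a minimal sketch, not a repo requirement; the Python version is an assumption, as the repo does not pin one.

python3 -m venv tvtsv2-env          # any recent Python 3 should work
source tvtsv2-env/bin/activate
pip install -r requirement.txt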

Dataset Preparation

Dataset Splits

We have uploaded the dataset splits to Google Drive. Download them from this link and unzip the archive in the root directory.
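
A minimal command-line sketch for this step, assuming the third-party gdown tool; the file ID placeholder comes from the Google Drive link above, and the archive name is an assumption, so adjust it to the actual file name:

pip install gdown                                                # third-party Google Drive downloader
gdown "https://drive.google.com/uc?id=<FILE_ID>" -O splits.zip   # <FILE_ID> is taken from the link above
unzip splits.zip -d .                                            # unzip into the repository root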

Pre-training Datasets

  1. Download YT-Temporal from here, and put the dataset under the folder data/YTTemporal.
  2. Download WebVid-2M from here, and put the dataset under the folder data/WebVid.

Downstream Datasets

Text-to-Video Retrieval
  1. Download MSR-VTT from here, and put the dataset under the folder data/msrvtt.
  2. Download DiDeMo from here, and put the dataset under the folder data/didemo.
  3. Download LSMDC from here, and put the dataset under the folder data/lsmdc.
Action Recognition
  1. Download HMDB-51 from here, and put the dataset under the folder data/hmdb51.
  2. Download UCF-101 from here, and put the dataset under the folder data/ucf101.
  3. Download Kinetics-400 from here, and put the dataset under the folder data/k400.
  4. Download SSV2 from here, and put the dataset under the folder data/SSV2.
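
After downloading, the data/ directory is expected to look roughly like the sketch below. The subfolder names come from the steps above; the internal layout of each dataset is not shown and follows each dataset's own release format.

data/
├── YTTemporal/   # pre-training: YT-Temporal
├── WebVid/       # pre-training: WebVid-2M
├── msrvtt/       # retrieval: MSR-VTT
├── didemo/       # retrieval: DiDeMo
├── lsmdc/        # retrieval: LSMDC
├── hmdb51/       # recognition: HMDB-51
├── ucf101/       # recognition: UCF-101
├── k400/         # recognition: Kinetics-400
└── SSV2/         # recognition: Something-Something V2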

Training and Evaluation

We use up to 80 NVIDIA V100 GPUs for pre-training. The detailed hyper-parameters can be found in the Appendix of the paper.

Pre-training

  1. Download the CLIP-B/32 and CLIP-B/16 weights from OpenAI’s official repo, and put them into CLIP/models.

  2. Download the OpenCLIP-H/14 weights from the official repo, and put them into OpenCLIP/models. (A hedged download sketch for steps 1 and 2 follows the scripts below.)

  3. Run the following scripts to pre-train the different models on the YT-Temporal and WebVid datasets jointly.

    bash scripts/train_dist_TVTSv2_ViT_B_32.sh # for ViT-B/32, no mask
    bash scripts/train_dist_TVTSv2_ViT_B_16.sh # for ViT-B/16, mask 50%
    bash scripts/train_dist_TVTSv2_ViT_H_14.sh # for ViT-H/14, mask 70%
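
For steps 1 and 2, the checkpoints can also be fetched programmatically. The sketch below is one possible way to do it, assuming the standard clip and open_clip packages; the cached file names and the LAION pretrained tag are assumptions, not something the repo prescribes.

# pip install git+https://github.com/openai/CLIP.git open_clip_torch
import clip        # OpenAI CLIP; downloads the official .pt weights on first use
import open_clip   # OpenCLIP; downloads LAION-trained weights on first use

# Step 1: cache ViT-B/32 and ViT-B/16 weights under CLIP/models
clip.load("ViT-B/32", device="cpu", download_root="CLIP/models")
clip.load("ViT-B/16", device="cpu", download_root="CLIP/models")

# Step 2: cache ViT-H/14 weights under OpenCLIP/models
# (the 'laion2b_s32b_b79k' tag is an assumption about which OpenCLIP release is used)
open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k", cache_dir="OpenCLIP/models"
)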

Downstream Evaluation

We have released our pre-trained models on Google Drive at the following links so that the results reported in our paper can be quickly reproduced.

  1. TVTSv2_B_32: https://drive.google.com/file/d/1zNHgqioo-aRUwZXPyTDiRT2uaRrnk386/view?usp=sharing
  2. TVTSv2_B_16: https://drive.google.com/file/d/1HKc7aGwMd5jhVaYztuY-jbmYqiz_wvWF/view?usp=sharing
  3. TVTSv2_H_14: https://drive.google.com/file/d/1nxNSaQKm2jt9NSZ3eLnKx7ATTumV-6D5/view?usp=sharing
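
For convenience, the checkpoints can be fetched from the command line. The sketch below uses the third-party gdown tool with the file IDs from the links above; the output file names are assumptions, so keep whatever names the evaluation scripts expect.

pip install gdown
gdown "https://drive.google.com/uc?id=1zNHgqioo-aRUwZXPyTDiRT2uaRrnk386" -O TVTSv2_B_32.pth  # assumed file name
gdown "https://drive.google.com/uc?id=1HKc7aGwMd5jhVaYztuY-jbmYqiz_wvWF" -O TVTSv2_B_16.pth  # assumed file name
gdown "https://drive.google.com/uc?id=1nxNSaQKm2jt9NSZ3eLnKx7ATTumV-6D5" -O TVTSv2_H_14.pth  # assumed file name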

Download the pre-trained models and put them in the root directory. All zero-shot evaluation scripts run on a single GPU. Try our powerful models now 😎!

# MSR-VTT Zero-shot Text-to-Video Retrieval
bash scripts/zero_ret_msrvtt_TVTSv2_ViT_B_32.sh # for ViT-B/32
bash scripts/zero_ret_msrvtt_TVTSv2_ViT_B_16.sh # for ViT-B/16
bash scripts/zero_ret_msrvtt_TVTSv2_ViT_H_14.sh # for ViT-H/14
# DiDeMo Zero-shot Text-to-Video Retrieval
bash scripts/zero_ret_didemo_TVTSv2_ViT_B_32.sh # for ViT-B/32
bash scripts/zero_ret_didemo_TVTSv2_ViT_B_16.sh # for ViT-B/16
bash scripts/zero_ret_didemo_TVTSv2_ViT_H_14.sh # for ViT-H/14
# LSMDC Zero-shot Text-to-Video Retrieval
bash scripts/zero_ret_lsmdc_TVTSv2_ViT_B_32.sh # for ViT-B/32
bash scripts/zero_ret_lsmdc_TVTSv2_ViT_B_16.sh # for ViT-B/16
bash scripts/zero_ret_lsmdc_TVTSv2_ViT_H_14.sh # for ViT-H/14
# HMDB-51 Zero-shot Action Recognition
bash scripts/zero_recognition_hmdb51_TVTSv2_ViT_B_32.sh # for ViT-B/32
bash scripts/zero_recognition_hmdb51_TVTSv2_ViT_B_16.sh # for ViT-B/16
bash scripts/zero_recognition_hmdb51_TVTSv2_ViT_H_14.sh # for ViT-H/14
# UCF-101 Zero-shot Action Recognition
bash scripts/zero_recognition_ucf101_TVTSv2_ViT_B_32.sh # for ViT-B/32
bash scripts/zero_recognition_ucf101_TVTSv2_ViT_B_16.sh # for ViT-B/16
bash scripts/zero_recognition_ucf101_TVTSv2_ViT_H_14.sh # for ViT-H/14
# Kinetics-400 Zero-shot Action Recognition
bash scripts/zero_recognition_k400_TVTSv2_ViT_B_32.sh # for ViT-B/32
bash scripts/zero_recognition_k400_TVTSv2_ViT_B_16.sh # for ViT-B/16
bash scripts/zero_recognition_k400_TVTSv2_ViT_H_14.sh # for ViT-H/14
# SSV2-MC Zero-shot Action Recognition
bash scripts/zero_ssv2_mc_TVTSv2_ViT_B_32.sh # for ViT-B/32
bash scripts/zero_ssv2_mc_TVTSv2_ViT_B_16.sh # for ViT-B/16
bash scripts/zero_ssv2_mc_TVTSv2_ViT_H_14.sh # for ViT-H/14

Tip: The performance may differ slightly (either higher or lower) from the numbers reported in our paper due to differences in hardware environments.

Video Feature Extraction

Our model can act as a standalone video feature extractor, and we provide simple scripts for out-of-the-box usage. Give it a try on your own video 😜!

cd downstream
python feature_extraction_TVTSv2_B_32.py --video_path /path/to/video.mp4 # for ViT-B/32, feature shape: [1 x 512]
python feature_extraction_TVTSv2_B_16.py --video_path /path/to/video.mp4 # for ViT-B/16, feature shape: [1 x 512]
python feature_extraction_TVTSv2_H_14.py --video_path /path/to/video.mp4 # for ViT-H/14, feature shape: [1 x 1024]
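
Once extracted, the features behave like ordinary CLIP-style embeddings and can be compared directly. The sketch below is only a usage illustration, not part of the repo: it assumes you have saved two features (shape [1 x 512] for the B models, [1 x 1024] for the H model) as PyTorch tensors in hypothetical files feat_a.pt and feat_b.pt, and scores their similarity with a cosine product.

import torch
import torch.nn.functional as F

# feat_a.pt / feat_b.pt are hypothetical files holding features saved from the scripts above
feat_a = torch.load("feat_a.pt")           # shape: [1, 512] for ViT-B, [1, 1024] for ViT-H
feat_b = torch.load("feat_b.pt")

feat_a = F.normalize(feat_a, dim=-1)       # L2-normalize so the dot product is a cosine similarity
feat_b = F.normalize(feat_b, dim=-1)
similarity = (feat_a @ feat_b.t()).item()  # in [-1, 1]; higher means more similar videos
print(f"cosine similarity: {similarity:.4f}")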

Acknowledgement

Citation

If you find our work helpful, please cite our paper.

@misc{zeng2023tvtsv2,
      title={TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale}, 
      author={Ziyun Zeng and Yixiao Ge and Zhan Tong and Xihui Liu and Shu-Tao Xia and Ying Shan},
      year={2023},
      eprint={2305.14173},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

License

This work makes use of several open-source projects, and credit is given to them. See License.txt for details.