🔮 Welcome to the official code repository for CG-STVG: Context-Guided Spatio-Temporal Video Grounding. We're excited to share our work with you. Please bear with us as we prepare the code, and stay tuned for the reveal!
💡 A picture is worth a thousand words!
Can we explore visual context from videos to enhance target localization for STVG? Yes!
Figure: Comparison between (a) existing methods, which localize the target using object information from the text query, and (b) our CG-STVG, which uses both object information from the text query and guidance from instance context for STVG.
Figure: Overview of our method, which consists of a multimodal encoder for feature extraction and a context-guided decoder that cascades a set of decoding stages for grounding. In each decoding stage, instance context is mined (by the ICG and ICR modules) to guide query learning for better localization. More details can be found in the paper.
The datasets used are placed in the data folder with the following structure.
data
|_ vidstg
| |_ videos
| | |_ [video name 0].mp4
| | |_ [video name 1].mp4
| | |_ ...
| |_ vstg_annos
| | |_ train.json
| | |_ ...
| |_ sent_annos
| | |_ train_annotations.json
| | |_ ...
| |_ data_cache
| | |_ ...
|_ hc-stvg2
| |_ v2_video
| | |_ [video name 0].mp4
| | |_ [video name 1].mp4
| | |_ ...
| |_ annos
| | |_ hcstvg_v2
| | | |_ train.json
| | | |_ test.json
| |_ data_cache
| | |_ ...
|_ hc-stvg
| |_ v1_video
| | |_ [video name 0].mp4
| | |_ [video name 1].mp4
| | |_ ...
| |_ annos
| | |_ hcstvg_v1
| | | |_ train.json
| | | |_ test.json
| |_ data_cache
| | |_ ...
The download links for the files listed above are as follows:
hc-stvg: v1_video, annos, data_cache
hc-stvg2: v2_video, annos, data_cache
vidstg: videos, vstg_annos, sent_annos, data_cache
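Once the files are downloaded, the layout can be checked before launching a long run. The snippet below is a minimal sketch and not part of the repository: `EXPECTED_LAYOUT` and `missing_dirs` are hypothetical names, and the folder list is copied from the tree above.

```python
from pathlib import Path

# Expected sub-folders per dataset, copied from the directory tree above.
# EXPECTED_LAYOUT and missing_dirs are illustrative names, not repository code.
EXPECTED_LAYOUT = {
    "vidstg": ["videos", "vstg_annos", "sent_annos", "data_cache"],
    "hc-stvg2": ["v2_video", "annos/hcstvg_v2", "data_cache"],
    "hc-stvg": ["v1_video", "annos/hcstvg_v1", "data_cache"],
}


def missing_dirs(data_root, datasets=None):
    """Return the expected directories that do not exist under data_root."""
    root = Path(data_root)
    names = datasets if datasets is not None else list(EXPECTED_LAYOUT)
    missing = []
    for name in names:
        for sub in EXPECTED_LAYOUT[name]:
            path = root / name / sub
            if not path.is_dir():
                # Report paths relative to the data root, with forward slashes.
                missing.append(path.relative_to(root).as_posix())
    return missing
```

Calling `missing_dirs("data", ["hc-stvg"])` returns an empty list when the HC-STVG folders are in place. Only directories are checked; individual videos and annotation files are not.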
The pretrained models used are placed in the model_zoo folder:
ResNet-101, VidSwin-T, roberta-base
The code has been tested and verified with PyTorch 2.0.1 and CUDA 11.7. Other versions are likely to work as well. To install the necessary requirements, use the commands below:
pip3 install -r requirements.txt
apt install ffmpeg -y
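After installation, a quick check of the environment can save a failed launch. The following is a minimal sketch under the assumption that only the PyTorch build, GPU visibility, and the `ffmpeg` binary matter; `check_environment` is a hypothetical helper, not something the repository provides.

```python
import importlib.util
import shutil


def check_environment():
    """Collect basic facts about the local setup without failing if torch is absent."""
    report = {"ffmpeg": shutil.which("ffmpeg") is not None}
    if importlib.util.find_spec("torch") is None:
        report["torch"] = None  # PyTorch is not installed
        return report

    import torch

    report["torch"] = torch.__version__        # this README was tested with 2.0.1
    report["cuda_build"] = torch.version.cuda  # this README was tested with 11.7
    report["cuda_available"] = torch.cuda.is_available()
    report["gpu_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

The commands in this README launch 8 processes per node, so `gpu_count` is worth comparing against the `--nproc_per_node` value in use.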
To train the model, use the scripts below:
# run for HC-STVG
python3 -m torch.distributed.launch \
--nproc_per_node=8 \
scripts/train_net.py \
--config-file "experiments/hcstvg.yaml" \
INPUT.RESOLUTION 420 \
OUTPUT_DIR output/hcstvg \
TENSORBOARD_DIR output/hcstvg
# run for HC-STVG2
python3 -m torch.distributed.launch \
--nproc_per_node=8 \
scripts/train_net.py \
--config-file "experiments/hcstvg2.yaml" \
INPUT.RESOLUTION 420 \
OUTPUT_DIR output/hcstvg2 \
TENSORBOARD_DIR output/hcstvg2
# run for VidSTG
python3 -m torch.distributed.launch \
--nproc_per_node=8 \
scripts/train_net.py \
--config-file "experiments/vidstg.yaml" \
INPUT.RESOLUTION 420 \
OUTPUT_DIR output/vidstg \
TENSORBOARD_DIR output/vidstg
For additional training options, such as different hyper-parameters, adjust the configuration files as needed: experiments/hcstvg.yaml, experiments/hcstvg2.yaml and experiments/vidstg.yaml.
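The commands in this README also pass trailing `KEY VALUE` pairs such as `INPUT.RESOLUTION 420`, which suggests that values from the YAML file can be overridden on the command line. How the repository parses them is not described here, so the snippet below only illustrates the general pattern with a plain nested dictionary. `apply_overrides` is a hypothetical function, and the starting value `320` is a made-up placeholder.

```python
import ast


def apply_overrides(config, pairs):
    """Merge a flat [KEY, VALUE, KEY, VALUE, ...] list into a nested dict."""
    if len(pairs) % 2 != 0:
        raise ValueError("overrides must come in KEY VALUE pairs")
    for key, raw in zip(pairs[0::2], pairs[1::2]):
        node = config
        *parents, leaf = key.split(".")
        for part in parents:
            # Walk (or create) the nested sections named by the dotted key.
            node = node.setdefault(part, {})
        try:
            value = ast.literal_eval(raw)  # "420" becomes the integer 420
        except (ValueError, SyntaxError):
            value = raw                    # paths and names stay as strings
        node[leaf] = value
    return config


# Hypothetical starting config; 320 is a placeholder, not a repository default.
cfg = apply_overrides(
    {"INPUT": {"RESOLUTION": 320}},
    ["INPUT.RESOLUTION", "420", "OUTPUT_DIR", "output/hcstvg"],
)
print(cfg)
```

Under this reading, a one-off change can be passed on the command line, while permanent changes belong in the YAML file.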
To evaluate a trained model, use the scripts below:
# run for HC-STVG
python3 -m torch.distributed.launch \
--nproc_per_node=8 \
scripts/test_net.py \
--config-file "experiments/hcstvg.yaml" \
INPUT.RESOLUTION 420 \
MODEL.WEIGHT [Pretrained Model Weights] \
OUTPUT_DIR output/hcstvg
# run for HC-STVG2
python3 -m torch.distributed.launch \
--nproc_per_node=8 \
scripts/test_net.py \
--config-file "experiments/hcstvg2.yaml" \
INPUT.RESOLUTION 420 \
MODEL.WEIGHT [Pretrained Model Weights] \
OUTPUT_DIR output/hcstvg2
# run for VidSTG
python3 -m torch.distributed.launch \
--nproc_per_node=8 \
scripts/test_net.py \
--config-file "experiments/vidstg.yaml" \
INPUT.RESOLUTION 420 \
MODEL.WEIGHT [Pretrained Model Weights] \
OUTPUT_DIR output/vidstg
We provide our trained checkpoints so that the results can be reproduced.
Dataset | resolution | url | m_vIoU/vIoU@0.3/vIoU@0.5 | size |
---|---|---|---|---|
HC-STVG | 420 | Model | 38.4/61.5/36.3 | 3.4 GB |
HC-STVG2 | 420 | Model | 39.5/64.5/36.3 | 3.4 GB |
VidSTG | 420 | Model | 34.0/47.7/33.1 | 3.4 GB |
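For readers unfamiliar with the metrics, the sketch below follows the definitions commonly used in the STVG literature: tIoU is the ratio of shared to combined frames between the predicted and ground-truth segments, vIoU sums the per-frame box IoU over the shared frames and divides by the number of combined frames, and vIoU@R is the fraction of samples whose vIoU exceeds R. It is an illustration only; the repository's evaluation code may differ in details such as frame sampling or threshold strictness, and all function names here are hypothetical.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def video_ious(pred, gt):
    """Return (tIoU, vIoU) for one sample; pred and gt map frame index -> box."""
    shared = pred.keys() & gt.keys()    # frames in both tubes
    combined = pred.keys() | gt.keys()  # frames in either tube
    if not combined:
        return 0.0, 0.0
    tiou = len(shared) / len(combined)
    viou = sum(box_iou(pred[t], gt[t]) for t in shared) / len(combined)
    return tiou, viou


def summarize(samples, thresholds=(0.3, 0.5)):
    """Average the per-sample scores over a non-empty list of (pred, gt) pairs."""
    scores = [video_ious(pred, gt) for pred, gt in samples]
    n = len(scores)
    result = {
        "m_tIoU": sum(s[0] for s in scores) / n,
        "m_vIoU": sum(s[1] for s in scores) / n,
    }
    for r in thresholds:
        result[f"vIoU@{r}"] = sum(s[1] > r for s in scores) / n
    return result
```

The values returned are fractions in [0, 1]; the tables in this README report them as percentages.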
🎏 CG-STVG achieves state-of-the-art performance on three challenging benchmarks, HC-STVG (v1), HC-STVG2 (v2), and VidSTG, as shown in the three tables below, in that order. Note that the baseline is our CG-STVG without context generation and refinement.
Methods | M_tIoU | m_vIoU | vIoU@0.3 | vIoU@0.5 |
---|---|---|---|---|
STGVT[TCSVT'2021] | - | 18.2 | 26.8 | 9.5 |
STVGBert[ICCV'2021] | - | 20.4 | 29.4 | 11.3 |
TubeDETR[CVPR'2022] | 43.7 | 32.4 | 49.8 | 23.5 |
STCAT[NeurIPS'2022] | 49.4 | 35.1 | 57.7 | 30.1 |
CSDVL[CVPR'2023] | - | 36.9 | 62.2 | 34.8 |
Baseline (ours) | 50.4 | 36.5 | 58.6 | 32.3 |
CG-STVG (ours) | 52.8(+2.4) | 38.4(+1.9) | 61.5(+2.9) | 36.3(+4.0) |
Methods | M_tIoU | m_vIoU | vIoU@0.3 | vIoU@0.5 |
---|---|---|---|---|
PCC[arxiv'2021] | - | 30.0 | - | - |
2D-Tan[arxiv'2021] | - | 30.4 | 50.4 | 18.8 |
MMN[AAAI'2022] | - | 30.3 | 49.0 | 25.6 |
TubeDETR[CVPR'2022] | - | 36.4 | 58.8 | 30.6 |
CSDVL[CVPR'2023] | 58.1 | 38.7 | 65.5 | 33.8 |
Baseline (ours) | 58.6 | 37.8 | 62.4 | 32.1 |
CG-STVG (ours) | 60.0(+1.4) | 39.5(+1.7) | 64.5(+2.1) | 36.3(+4.2) |
Methods | Declarative M_tIoU | Declarative m_vIoU | Declarative vIoU@0.3 | Declarative vIoU@0.5 | Interrogative M_tIoU | Interrogative m_vIoU | Interrogative vIoU@0.3 | Interrogative vIoU@0.5 |
---|---|---|---|---|---|---|---|---|
STGRN[CVPR'2020] | 48.5 | 19.8 | 25.8 | 14.6 | 47.0 | 18.3 | 21.1 | 12.8 |
OMRN[IJCAI'2020] | 50.7 | 23.1 | 32.6 | 16.4 | 49.2 | 20.6 | 28.4 | 14.1 |
STGVT[TCSVT'2021] | - | 21.6 | 29.8 | 18.9 | - | - | - | - |
STVGBert[ICCV'2021] | - | 24.0 | 30.9 | 18.4 | - | 22.5 | 26.0 | 16.0 |
TubeDETR[CVPR'2022] | 48.1 | 30.4 | 42.5 | 28.2 | 46.9 | 25.7 | 35.7 | 23.2 |
STCAT[NeurIPS'2022] | 50.8 | 33.1 | 46.2 | 32.6 | 49.7 | 28.2 | 39.2 | 26.6 |
CSDVL[CVPR'2023] | - | 33.7 | 47.2 | 32.8 | - | 28.5 | 39.9 | 26.2 |
Baseline (ours) | 49.7 | 32.4 | 45.0 | 31.4 | 48.8 | 27.7 | 38.7 | 25.6 |
CG-STVG (ours) | 51.4 (+1.7) | 34.0 (+1.6) | 47.7 (+2.7) | 33.1 (+1.7) | 49.9 (+1.1) | 29.0 (+1.3) | 40.5 (+1.8) | 27.5 (+1.9) |
This repo is partly based on the open-source release from STCAT, and the evaluation metric implementation is borrowed from TubeDETR for a fair comparison.
⭐ If you find this repository useful, please consider giving it a star and citing it:
@inproceedings{gu2024context,
title={Context-Guided Spatio-Temporal Video Grounding},
author={Gu, Xin and Fan, Heng and Huang, Yan and Luo, Tiejian and Zhang, Libo},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={18330--18339},
year={2024}
}