In this paper, we leverage the human perception process, which involves vision and language interaction, to generate coherent paragraph descriptions of untrimmed videos. We propose vision-language (VL) features consisting of two modalities: (i) a vision modality to capture the global visual content of the entire scene, and (ii) a language modality to extract descriptions of scene elements, covering both human and non-human objects (e.g., animals, vehicles) as well as visual and non-visual elements (e.g., relations, activities). Furthermore, we train the proposed VLCap with a contrastive learning VL loss. Experiments and ablation studies on the ActivityNet Captions and YouCookII datasets show that VLCap outperforms existing SOTA methods on both accuracy and diversity metrics.
- Clone this repository

```bash
git clone https://github.com/UARK-AICV/VLCAP.git
cd VLCAP
```
- Prepare the Conda environment

```bash
conda env create -f environment.yml
conda activate pytorch
```
- Add the project root to `PYTHONPATH`. Note that you need to do this each time you start a new session.

```bash
source setup.sh
```
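If `setup.sh` is not available in your shell session, the following is a minimal manual equivalent, under the assumption that the script only needs to put the repository root on `PYTHONPATH`:

```shell
# Assumption: setup.sh only prepends the repository root to PYTHONPATH.
# Run this from the VLCAP repository root.
export PYTHONPATH="$PWD:$PYTHONPATH"
```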
Download features from Google Drive: env feature and lang feature.

```bash
mkdir -p data/anet && cd data/anet
unzip anet_c3d
unzip anet_clip_b16
```
To train our VLCap model on ActivityNet Captions or YouCook2:
```bash
bash scripts/train.sh [anet/yc2] [true/false]
```

Here you can specify the dataset (ActivityNet: `anet` or YouCook2: `yc2`) and whether to use the proposed language feature (`true`/`false`). For example, `bash scripts/train.sh anet true` trains on ActivityNet Captions with the language feature enabled. Training logs and model checkpoints are saved at `results/anet_re_*`.
Once you have a trained model, you can follow the instructions below to generate captions.
- Generate captions

```bash
bash scripts/translate_greedy.sh anet_re_* [val/test]
```

Replace `anet_re_*` with your own model directory name. The generated captions are saved at `results/anet_re_*/greedy_pred_[val/test].json`.
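To inspect the generated captions, you can load the prediction file with a short Python snippet. Note that the JSON layout assumed below (a `"results"` map from video ID to a list of `{"sentence": ...}` segments) is an assumption; adjust the keys to match your file:

```python
import json


def load_captions(path):
    """Load a prediction JSON and return {video_id: [sentence, ...]}.

    Assumed layout: {"results": {video_id: [{"sentence": ...}, ...]}}.
    Adjust the keys if your prediction file differs.
    """
    with open(path) as f:
        data = json.load(f)
    results = data.get("results", data)
    return {vid: [seg["sentence"] for seg in segs]
            for vid, segs in results.items()}
```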
- Evaluate generated captions

```bash
bash scripts/eval.sh anet [val/test] results/anet_re_*/greedy_pred_[val/test].json
```

The results should be comparable with those reported in Table 5 of the paper.
If you find this code useful for your research, please cite our papers:
```bibtex
@inproceedings{kashu_vlcap,
  author    = {Yamazaki, Kashu and Truong, Sang and Vo, Khoa and Kidd, Michael and Rainwater, Chase and Luu, Khoa and Le, Ngan},
  booktitle = {2022 IEEE International Conference on Image Processing (ICIP)},
  title     = {VLCAP: Vision-Language with Contrastive Learning for Coherent Video Paragraph Captioning},
  year      = {2022},
  pages     = {3656--3661},
  doi       = {10.1109/ICIP46576.2022.9897766}
}

@article{kashu_vltint,
  title   = {VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning},
  author  = {Yamazaki, Kashu and Vo, Khoa and Truong, Quang Sang and Raj, Bhiksha and Le, Ngan},
  journal = {Proceedings of the AAAI Conference on Artificial Intelligence},
  volume  = {37},
  number  = {3},
  pages   = {3081--3090},
  year    = {2023},
  month   = {Jun.},
  url     = {https://ojs.aaai.org/index.php/AAAI/article/view/25412},
  doi     = {10.1609/aaai.v37i3.25412}
}
```
We acknowledge the following open-source projects on which our work is based: