- The VSDv2 dataset is now available.
This repository contains code and data for our paper *Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation*.
**Note:** Please go into VLT5 and follow the README there for Pretrained Models and Feature Extraction.
```bash
# Create python environment (optional)
conda create -n vsd python=3.7
source activate vsd

# Install python dependencies
pip install -r requirements.txt

# For captioning evaluation
python -c "import language_evaluation; language_evaluation.download('coco')"
```
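As a quick sanity check that the captioning metrics were installed correctly, you can score a toy prediction. This sketch assumes the `CocoEvaluator` interface of the `language_evaluation` package; it is not part of this repository's training or evaluation scripts:

```python
import language_evaluation

# Toy prediction/reference pair, only to confirm the metric backends load.
predicts = ["a man riding a horse on the beach"]
answers = ["a man rides a horse along the beach"]

evaluator = language_evaluation.CocoEvaluator()
results = evaluator.run_evaluation(predicts, answers)
print(results)  # dict of BLEU / METEOR / ROUGE_L / CIDEr scores
```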
```
# Store images, features, and annotations
./datasets

# Image feature extraction
./feature_extraction

# Train VL-T5
./VL-T5/
    src/
        modeling_t5.py modeling_bart.py   <= VL-T5/VL-BART model classes
        caption_sp.py, vrd_caption.py     <= fine-tuning
        param.py                          <= (argparse) configuration
        tokenization.py                   <= custom tokenizer
        utils.py, dist_utils.py           <= utility functions
    snap/                                 <= store weight checkpoints
```
- Pretrained VL-BART and VL-T5 are provided by [1].
- Download `snap/` from Google Drive:

```bash
gdrive download 1_SBj4sZ0gUqfBon1gFBiNRAmfHv5w_ph --recursive
```
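To verify the download, you can load one of the checkpoints and inspect its keys. The path below is only an illustration; point it at whatever file actually sits under `snap/` on your machine, and note this sketch assumes the checkpoint is a plain PyTorch state dict:

```python
import torch

# Hypothetical checkpoint path; replace with an actual file under snap/.
ckpt_path = "snap/pretrain/VLT5/Epoch30.pth"

state_dict = torch.load(ckpt_path, map_location="cpu")
print(f"{len(state_dict)} tensors; first keys: {list(state_dict)[:5]}")
```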
```bash
bash ./baseline.sh gpu_num
bash ./end2end.sh gpu_num
```
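Here `gpu_num` is presumably the number of GPUs to use for training; for example, to train the end-to-end model on 2 GPUs:

```bash
bash ./end2end.sh 2
```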
This repo is adapted from VLT5.
Please cite our paper if you use our models or data in your project.
```bibtex
@inproceedings{zhao2022vsd,
  title     = {Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation},
  author    = {Yu Zhao and Jianguo Wei and Zhichao Lin and Yueheng Sun and Meishan Zhang and Min Zhang},
  booktitle = {EMNLP},
  year      = {2022}
}
```