Modern image captioning relies heavily on extracting knowledge from images, such as objects, to capture the concept of a static story in the image. In this paper, we propose a textual visual context dataset for image captioning, in which the publicly available COCO Captions dataset (Lin et al., 2014) is extended with information about the scene (such as the objects in the image). Since this information is in textual form, it can be used to bring NLP methods, such as text similarity or semantic relatedness, into captioning systems, either in an end-to-end training strategy or as a post-processing step.
This repository contains the implementation of the paper Visual Semantic Relatedness Dataset for Image Captioning.
Update: v2 adds the recent SoTA SwinV2 classifier, with both soft and hard labels (visual_caption_cosine_score_v2) and the person label, at thresholds 0.2, 0.3 and 0.4. Please refer to the Hugging Face repository.
- Overview
- Visual semantic with BERT
- Dataset
- Visual semantic with pre-trained model
- Evaluation
- Citation
We enrich COCO-Captions with Textual Visual Context information. We use ResNet152, CLIP and Faster R-CNN to extract object information for each COCO-caption image. We use three filtering approaches to ensure the quality of the dataset: (1) Threshold: to filter out predictions where the object classifier is not confident enough; (2) Semantic alignment: to remove duplicated objects using semantic similarity; and (3) Semantic relatedness score as soft label: to guarantee that the visual context and the caption are strongly related, we use Sentence RoBERTa-sts to assign a soft label via cosine similarity, and then apply a threshold to annotate the final hard label (with th = 0.2, 0.3 or 0.4, the label is 1 if the score ≥ th and 0 otherwise). Finally, to take advantage of the overlap between the visual context and the caption, and to extract global information from each visual context, we use BERT followed by a shallow CNN (Kim, 2014) to estimate the visual relatedness score.
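The soft-label and threshold step described above can be sketched as follows. This is a minimal illustration, not the repository's code: the function names `cosine_similarity` and `hard_label` are hypothetical, and the toy vectors stand in for the Sentence RoBERTa-sts embeddings of the visual context and the caption, which are assumed to be computed elsewhere.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors (an assumption)
    return dot / (norm_u * norm_v)


def hard_label(soft_score, threshold):
    """Return 1 if the soft (cosine) score reaches the threshold, else 0."""
    return 1 if soft_score >= threshold else 0


# Toy vectors standing in for sentence embeddings (hypothetical values).
visual_vec = [1.0, 0.0, 1.0]
caption_vec = [1.0, 1.0, 0.0]

soft = cosine_similarity(visual_vec, caption_vec)
# One hard label per threshold used in the dataset.
labels = {th: hard_label(soft, th) for th in (0.2, 0.3, 0.4)}
print(soft, labels)
```

In practice the two vectors would come from a sentence encoder; only the thresholding logic is meant to carry over.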
For a quick start please have a look at this project page and Demo
VC1 | VC2 | VC3 | human annotated caption |
---|---|---|---|
cheeseburger | plate | hotdog | a plate with a hamburger fries and tomatoes |
bakery | dining table | website | a table having tea and a cake on it |
gown | groom | apron | its time to cut the cake at this couples wedding |
- Download Raw data with ID and Visual context -> original dataset with related ID caption train2014
- Download Data with cosine score -> soft cosine label with th 0.2, 0.3, 0.4 and 0.5, and hard label
- Download Overlapping visual with caption -> overlap between the visual context and the human annotated caption
- Download Dataset (tsv file) 0.0 -> raw data with hard label without cosine similarity, and with a cosine similarity threshold (degree of the relation between the visual context and the caption) of 0.2, 0.3, 0.4
- Download Dataset GenderBias -> man/woman replaced with person class label
Fine-tune BERT on the created dataset.
- Tensorflow 1.15.0
- Python 3.6
conda create -n BERT_visual python=3.6 anaconda
conda activate BERT_visual
pip install tensorflow==1.15.0
pip install --upgrade tensorflow_hub==0.7.0
Download the BERT checkpoint uncased_L-12_H-768_A-12
wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
unzip uncased_L-12_H-768_A-12.zip
git clone https://github.com/gaphex/bert_experimental/
Place both inside the BERT-CNN directory, like this: BERT-CNN/uncased_L-12_H-768_A-12
and BERT-CNN/bert_experimental
Download dataset
wget https://www.dropbox.com/s/dh38xibtjpohbeg/train_all.zip
unzip train_all.zip
For training
parser.add_argument('--train', default='train.tsv', help='training data (TSV file)', type=str, required=False)
parser.add_argument('--num_bert_layer', default=12, help='number of tuned BERT layers', type=int, required=False)
parser.add_argument('--batch_size', default=128, help='batch size', type=int, required=False)
parser.add_argument('--epochs', default=5, help='number of training epochs', type=int, required=False)
parser.add_argument('--seq_len', default=64, help='maximum sequence length', type=int, required=False)
parser.add_argument('--CNN_kernel_size', default=3, help='CNN kernel size', type=int, required=False)
parser.add_argument('--CNN_filters', default=32, help='number of CNN filters', type=int, required=False)
python BERT_CNN.py --train /train_0.4.tsv --epochs 5
For inference only, download the pre-trained model
wget https://www.dropbox.com/s/ip7p0wiwkwvph5k/0.4_bert-cnn.zip
unzip 0.4_bert-cnn.zip
python eval.py --testset test_demo.tsv --model 0.4_bert-cnn/frozen_graph.pb
Re-rank the most related caption to the image using the visual context information.
visual information, candidate caption (beam search)
standard poodle shopping cart footwear, a close up of shoes and a dog in a basket, 0.99774158
standard poodle shopping cart footwear, a brown teddy bear laying on top of a pair of shoes, 0.0621758029
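The re-ranking step illustrated by the example above can be sketched as a sort over the scored candidates. This is a minimal illustration under stated assumptions: the `rerank` helper is hypothetical and is not part of `eval.py`, and the scores are copied from the example rather than computed by the model.

```python
def rerank(candidates):
    """Sort (caption, score) pairs by relatedness score, highest first."""
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Beam-search candidates with the visual relatedness scores from the example.
candidates = [
    ("a brown teddy bear laying on top of a pair of shoes", 0.0621758029),
    ("a close up of shoes and a dog in a basket", 0.99774158),
]

best_caption, best_score = rerank(candidates)[0]
print(best_caption, best_score)
```

The caption with the highest score against the visual context (here, "standard poodle shopping cart footwear") is selected as the final output.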
Although this approach is proposed to take advantage of the dataset (e.g. visual semantic model), we also investigate the use of out-of-the-box tools to estimate the relatedness score between the short text (i.e. caption) and its environmental visual context (we call it visual classifier).
For this, we follow a similarity-to-probability based approach, but we use only the cosine similarity from a pre-trained model and the top-3 averaged probability (confidence) from the object classifier, as:
- $\text{sim}(w,c)$ is the similarity/relatedness between the caption and the object context
- $\text{P}(c)$ is the confidence of the object classifier for the object in the image
$\text{P}(w \mid \text{object})$
with Pre-trained SBERT
python model.py --vis visual-context_label.txt --vis_prob visual-context_prob.txt --c caption.txt
Please refer to this repository for more information about pre-trained visual re-ranker probability from similarity
pip install pycocoevalcap
Then run
python Evaluation/coco_eval.py --f Result_tune_BERT_0.4.json
For more evaluation (Lexical and Semantic Diversity)
The details of this repo are described in the following paper. If you find this repo useful, please kindly cite it:
@article{sabir2023visual,
title={Visual Semantic Relatedness Dataset for Image Captioning},
author={Sabir, Ahmed and Moreno-Noguer, Francesc and Padr{\'o}, Llu{\'\i}s},
journal={arXiv preprint arXiv:2301.08784},
year={2023}
}