CIC-BART-SSA: Controlled Image Captioning with Structured Semantic Augmentation

Figure 1. Existing captioning datasets contain captions that describe the entirety of an image. This is reflected in the narrow distributions of both the entities that appear in those captions and the corresponding caption lengths (the red histograms). CIC aims to generate diverse descriptions by controllably re-focusing on different spatiosemantic aspects of an image, such as semantically coherent subsets of image objects. Our proposed CIC-BART-SSA is designed to produce diverse, controlled captions that range from brief and concise to detailed and comprehensive. Sentences 1-15 are example outputs of our approach, where the highlighted text indicates the focus of a controllable caption. The histograms show that our approach generates high-quality descriptions over a wider range of scene focus (number of visual entities) and caption length than the original captions. The image is licensed under a Creative Commons CC BY-SA 2.0 license.

Structured Semantic Augmentation (SSA) Dataset Generation

Figure 2. An example of our Structured Semantic Augmentation (SSA) approach. Visually-grounded captions (1)-(5) are used to create a meta-vgAMR graph, which combines all available image information in one representation. Sub-graphs of the meta-vgAMR are then sampled to generate a new and diverse set of captions, such as sentences (a)-(e). Our approach takes advantage of both linguistic and spatial diversity, with the latter creating descriptions for new combinations of visual entities. For instance, caption (a) focuses only on the 'boat', and captions (c) and (d) focus on the 'dock' and 'house', combinations that are not explored in the original captions. The image is licensed under a Creative Commons CC BY-SA 2.0 license.

1. Caption to AMR conversion

Convert captions to their AMR representations. The AMRs should include alignment information linking each AMR node to the corresponding word of the original phrase. In our current version we use the pretrained Text-to-AMR parser with alignment from https://github.com/IBM/transition-amr-parser. Example datasets with entity visual grounding information are (1) MSCOCO Entities and (2) Flickr-30k Entities.
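
A minimal parsing sketch, assuming the Python interface documented in the transition-amr-parser repository (the checkpoint name 'AMR3-structbart-L' and the exact call signatures below come from that README and may differ across parser versions):

from transition_amr_parser.parse import AMRParser

# Load a pretrained Text-to-AMR checkpoint (name is an example; see the
# transition-amr-parser README for the available checkpoints).
parser = AMRParser.from_pretrained('AMR3-structbart-L')

caption = 'A white boat is moored at a wooden dock'
tokens, positions = parser.tokenize(caption)

# Parse one tokenized caption; the returned machine exposes the AMR graph
# together with node-to-token alignments.
annotations, machines = parser.parse_sentence(tokens)
amr = machines.get_amr()

# Penman string with ISI alignment annotations, i.e. each node carries the
# index of the caption word it is aligned to.
print(amr.to_penman(jamr=False, isi=True))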

2. Event-focused vgAMRs Generation

Visually ground the caption AMR nodes to the corresponding image entities to derive the vgAMR of each image-caption pair in the dataset. Merge all individual caption vgAMRs of an image into a single meta-vgAMR. Sample event-focused vgAMRs from the global meta-vgAMR structure (a conceptual sketch of these steps follows the commands below). For the MSCOCO Entities dataset:

cd ssa-coco
python ssa.py

and for Flickr-30K Entities:

cd ssa-flickr
python ssa.py
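
The sketch below illustrates, in simplified form, the grounding, merging, and sampling steps described above. It is a conceptual outline with assumed data structures and helper names, not the ssa.py implementation.

def ground_amr(amr_nodes, node_to_phrase, phrase_to_box):
    """vgAMR: attach a bounding box to every AMR node whose aligned phrase
    is visually grounded in the image (None for ungrounded nodes)."""
    return {n: phrase_to_box.get(node_to_phrase.get(n)) for n in amr_nodes}

def merge_vgamrs(vgamr_list):
    """meta-vgAMR: union the nodes and edges of all caption vgAMRs of one
    image, collapsing nodes that are grounded to the same box."""
    meta_nodes, meta_edges = {}, set()
    for nodes, edges in vgamr_list:
        meta_nodes.update(nodes)   # nodes keyed by (concept, bounding box)
        meta_edges.update(edges)   # (source, relation, target) triples
    return meta_nodes, meta_edges

def sample_event_vgamrs(meta_nodes, meta_edges, event_roots):
    """Event-focused vgAMRs: for each event (verb) node, keep the subgraph
    reachable from it."""
    subgraphs = []
    for root in event_roots:
        keep, frontier = {root}, [root]
        while frontier:
            src = frontier.pop()
            for s, _, t in meta_edges:
                if s == src and t not in keep:
                    keep.add(t)
                    frontier.append(t)
        subgraphs.append((
            {n: meta_nodes[n] for n in keep},
            {(s, r, t) for (s, r, t) in meta_edges if s in keep and t in keep},
        ))
    return subgraphs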

3. SSA dataset for downstream tasks

Using the sampled event-focused vgAMRs and an AMR-to-Text parser, convert the graph representations to event-focused captions. In our pipeline we used the SPRING AMR-to-Text model, which we trained from scratch on a dataset composed of AMR 2.0 plus the MSCOCO training captions paired with their (automatically generated) AMRs.
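
As a stand-in for the SPRING model used in our pipeline (which is trained from scratch and not bundled here), the sketch below uses the publicly available amrlib graph-to-sentence model to show the shape of this step; the library, its default model, and the example graph are assumptions, not the setup described above.

# AMR-to-Text sketch using amrlib as a stand-in for the SPRING model.
# Requires an amrlib generation model to be installed (see amrlib docs).
import amrlib

gtos = amrlib.load_gtos_model()  # loads the installed graph-to-text model

graphs = [
    '(m / moor-01 :ARG1 (b / boat :mod (w / white)) :location (d / dock))',
]
sentences, _ = gtos.generate(graphs)
print(sentences[0])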

For each generated sentence we compute its GRUEN score and filter out sentences with poor text quality, i.e., those with a GRUEN score below a pre-defined threshold (t = 0.7).
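
A minimal filtering sketch; gruen_score below is a hypothetical placeholder for whichever GRUEN implementation is used, assumed to return one score in [0, 1] per sentence.

GRUEN_THRESHOLD = 0.7  # pre-defined quality threshold t = 0.7

def filter_by_gruen(sentences, gruen_score, threshold=GRUEN_THRESHOLD):
    """Keep only sentences whose GRUEN score meets the threshold."""
    scores = gruen_score(sentences)  # hypothetical scorer: one score per sentence
    return [s for s, sc in zip(sentences, scores) if sc >= threshold]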

Finally, we save the SSA event-focused captions along with their visual grounding information so they can be used for downstream applications. We follow a JSON format similar to that of ASG:

{
    "region_id": {
        "objects": [
            {
                "object_id": int,
                "name": str,
                "attributes": [str],
                "xmin": int,
                "ymin": int,
                "xmax": int,
                "ymax": int
            }
        ],
        "relationships": [
            {
                "relationship_id": int,
                "subject_id": int,
                "object_id": int,
                "name": str
            }
        ],
        "phrase": str
    }
}
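
Below is a minimal example of writing one record in this format with Python's json module; every id, name, box coordinate, and the output file name are made-up illustrative values.

# Illustrative example of one SSA caption record in the format above.
import json

record = {
    "1234": {  # region_id
        "objects": [
            {"object_id": 0, "name": "boat", "attributes": ["white"],
             "xmin": 12, "ymin": 40, "xmax": 220, "ymax": 310},
            {"object_id": 1, "name": "dock", "attributes": ["wooden"],
             "xmin": 0, "ymin": 250, "xmax": 480, "ymax": 360},
        ],
        "relationships": [
            {"relationship_id": 0, "subject_id": 0, "object_id": 1,
             "name": "moored at"},
        ],
        "phrase": "A white boat moored at the wooden dock.",
    }
}

with open("ssa_captions.json", "w") as f:
    json.dump(record, f, indent=2)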

To generate the SSA augmentations for MSCOCO Entities use:

cd ssa-coco
python dssa-generation.py

and for Flickr-30k Entities:

cd ssa-flickr
python dssa-generation.py

CIC-BART-SSA model for Controllable Image Captioning

1. Set up

1.1 Stanford Tagger

For the Stanford Tagger, follow the installation instructions from here. Keep track of the location of the downloaded english-bidirectional-distsim.tagger and stanford-postagger.jar files, and pass them to the --spos_model and --spos_jar CIC-BART-SSA parameters.
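
A quick way to verify the two files, assuming NLTK's StanfordPOSTagger wrapper (how CIC-BART-SSA invokes the tagger internally may differ):

# Sanity-check the tagger files with NLTK's StanfordPOSTagger wrapper
# (requires Java; the paths below are placeholders).
from nltk.tag import StanfordPOSTagger

spos_model = '/path/to/english-bidirectional-distsim.tagger'  # --spos_model
spos_jar = '/path/to/stanford-postagger.jar'                  # --spos_jar

tagger = StanfordPOSTagger(model_filename=spos_model, path_to_jar=spos_jar)
print(tagger.tag('A white boat is moored at a wooden dock'.split()))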

1.2 Precomputed image feature vectors

To prepare the precomputed image feature vectors, first set up Detectron2 (see the README file in cic-bart-ssa\feature_extraction) and then run the script for:

  1. MSCOCO Entities dataset images:
cd cic-bart-ssa\feature_extraction
python coco_gtboxes_cic-bart-ssa.py --cocoroot <MSCOCO Entities images folder> --coco_entities <MSCOCO Entities dataset>
  2. Flickr-30k Entities dataset images:
cd cic-bart-ssa\feature_extraction
python flickr30k_gtbboxes-cic-bart-ssa.py --flickrroot <Flickr-30k Entities images folder> --entities_annotations <Flickr-30k Entities dataset 'Annotations' folder> --entities_sentences <Flickr-30k Entities dataset 'Sentences' folder>

Use the precomputed features h5 files for the --precomp_features parameter of CIC-BART-SSA.
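
To sanity-check the generated h5 files, something like the sketch below can be used; the internal key and dataset layout is an assumption, so consult the feature_extraction scripts for the exact structure.

# Inspect a precomputed-feature h5 file (file name and internal layout are
# assumptions; see the feature_extraction scripts for the exact structure).
import h5py

with h5py.File('coco_entities_features.h5', 'r') as f:
    keys = list(f.keys())                 # e.g. one entry per image/region id
    print(len(keys), 'entries; first key:', keys[0])
    print(f[keys[0]])                     # shape/dtype of the first entry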

1.3 GloVe vectors

You can download GloVe vectors from here. For --glove_vectors, use the path of the glove.6B.300d.txt file.
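
For reference, glove.6B.300d.txt is a plain-text file with one token followed by 300 space-separated floats per line, and can be loaded as follows:

# Load glove.6B.300d.txt into a {token: 300-d vector} dictionary.
import numpy as np

def load_glove(path='glove.6B.300d.txt'):
    vectors = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            token, *values = line.rstrip().split(' ')
            vectors[token] = np.asarray(values, dtype=np.float32)
    return vectors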

1.4 Visual Genome vocabulary files

Download the Visual Genome objects and attributes vocabulary from here. Pass their locations via --vg_objects_name and --vg_attrs_name.
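
The vocabulary files are assumed to be plain text with one object or attribute name per line (as in the standard bottom-up-attention release), so they can be read with:

# Read the Visual Genome vocabulary files (assumed: one name per line).
def load_vocab(path):
    with open(path, encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]

vg_objects = load_vocab('objects_vocab.txt')     # --vg_objects_name
vg_attrs = load_vocab('attributes_vocab.txt')    # --vg_attrs_name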

2. Training

For training our CIC model, you can use the following scripts:

  1. For MSCOCO Entities and their SSA augmentations
cic-bart-ssa\scripts\cic-bart-coco.sh
  2. For Flickr-30k Entities and their SSA augmentations
cic-bart-ssa\scripts\cic-bart-flickr.sh

Acknowledgements

Parts of our codebase are taken or adapted from other open-source repositories.
