This project introduces our work "Diverse Image Annotation" published at CVPR 2017.
@inproceedings{wu-dia,
title={Diverse Image Annotation},
author={Wu, Baoyuan and Jia, Fan and Liu, Wei and Ghanem, Bernard},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
year={2017}
}
The goal of diverse image annotation (DIA) is to cover as much useful information of an image as possible using a limited number of tags. It requires the tags to be not only representative to the image, but also diverse from each other to reduce redundancy. DIA can be seen as a redefinition of the automatic image annotation (AIA), to encourage the results of AIA to be more close to human annotations.
In current automatic image annotation (AIA), the general target is to predict most relevant tags of the image. However, we find that there are obvious differences between the results of AIA and the ones of human annotations, at both the number of predicted tags and the quality of the predicted tag subset.
Let's start with an observation of how human annotates images. We conduct a subject study by asking 3 human annotators to annotate 500 test images of IAPRTC-12 independently, with the instruction "cover the main contents of one image using a few tags". Based on the human annotation results, we have two observations:
1. In each annotation, it is rare to give redundant tags together (see Fig 1).
2. The number of annotated tags of all annotations lay in a small range, with the average around 4 (see Fig 2).
Fig 1. Three independent human annotations of one image Fig 2. Statistics of tag numbers of three human annotators
Observation 1 demonstrates that the semantic reducdancy between tags is considered in human annotations, not only just the relevance of the tag to the image. As a result, the tag subset is semantically compact due to the reduction of redundancy, leading to observation 2.
In constrast, from many benchmark datasets for AIA (such as ESP Game and IAPRTC-12), we find that in their ground-truth tag subsets: there are many redundant tags in the same tag subset (e.g., person and people); the averaged number of tags are larger than the ones of human annotations (see Fig 2). Consequently, the current AIA models that only consider the relevance and is trained on such benchmark datasets will predict the redundant tag subset with larger-size. This demonstrates that both the collection of the ground-truth tag subsets in benchmark datasets and the learning target in current AIA only consider the relevance, while the redundancy between tags are ignored.
Now it is clear that the gap between AIA and human annotation is due to the tag redundancy. Another word to describe the avoiding of tag redundancy is diversity, which encourages different tags to cover different contents. Thus, we can summarize that human annotators considers both relevance and diversity, while AIA only considers the relevance. This is why we propose DIA to redefine the current AIA.
Determinantal point process (DPP) is a probabilistic model to formulate the diversity of a subset. Treating the image annotation as a subset selection problem and the image as the condition, the conditional DPP is adopted to formulate DIA, as follows: $$ \mathcal{P}_{\mathbf{W}}(\mathcal{Y}|\mathbf{x}) = \frac{ \text{det}(\mathbf{L}_{\mathcal{Y}}(\mathbf{x}; \mathbf{W}))}{\text{det}(\mathbf{L}(\mathbf{x}; \mathbf{W}) + \mathbf{I})} $$
where
Fig 3. (Left) The kernel matrix
We modify the standard k-DPP sampling process (see 'k-DPPs: Fixed-Size Determinantal Point Processes') by inserting the weighted semantic paths. Specifically, during the sampling process, we check that whether the newly sampled tag is from the same path with any existing sampled tag. If is, then discard it and do the next sampling until convergence; otherwise, adding it into the sampled tag subset. Thus, it can be avoided that two tags from the same path are sampled together.
Fig 4. An brief example of semantic hiearchy and semantic paths.We propose three semantic metrics by modifying the traditional precision, recall and F1 according to the
weighted semantic paths. Instead of calculating the scores on independent tags in traditional metrics, we
compute the scores on the weighted semantic paths. The following is an brief example. You can find more
detailed definitions from our paper and the implementation from the folder code
.