CLEVR-XAI is a benchmark dataset for the quantitative evaluation of XAI explanations (a.k.a. heatmaps) in computer vision.
It consists of visual question answering (VQA) questions derived from the original CLEVR task, where each question is accompanied by several ground truth (GT) masks that serve as a realistic, selective, and controlled testbed for evaluating heatmaps on the input image.
CLEVR-XAI was introduced in an Information Fusion paper, in which several XAI methods were also tested against the benchmark, in particular Layer-wise Relevance Propagation (LRP), Integrated Gradients, Guided Backprop, Guided Grad-CAM, SmoothGrad, VarGrad, Gradient, Gradient×Input, Deconvnet, and Grad-CAM.
The CLEVR-XAI dataset consists of 39,761 simple questions (CLEVR-XAI-simple) and 100,000 complex questions (CLEVR-XAI-complex), which are based on the same underlying set of 10,000 images (i.e., there are approx. 4 simple questions and 10 complex questions per image).
CLEVR-XAI-simple contains the following Ground Truths:
- GT Single Object (for all questions)
- GT All Objects (for all questions)
CLEVR-XAI-complex contains the following Ground Truths:
- GT Unique (for 89,873 questions)
- GT Unique First-non-empty (for 99,786 questions)
- GT Union (for 99,786 questions)
- GT All Objects (for all questions)
Note: For some complex questions a few GT masks are unavailable, since for these questions the masks are undefined/empty.
*Example from CLEVR-XAI-simple (question: "What is the small yellow sphere made of?", answer: metal): the input image with its GT Single Object and GT All Objects masks, alongside heatmaps from LRP, Integrated Gradients, Guided Backprop, and Grad-CAM.*
*Example from CLEVR-XAI-complex (question: "Is there any other thing that has the same size as the shiny sphere?", answer: yes): the input image with its GT Unique and GT Unique First-non-empty masks, alongside heatmaps from LRP, Integrated Gradients, Guided Backprop, and Grad-CAM.*
For more details on the definition of each GT, please refer to the paper. More broadly, note that simple questions always involve a single target object, whereas complex questions can involve several objects.
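Given a model's heatmap and one of these GT masks, a typical localization score (the paper calls one such metric relevance mass accuracy) is the fraction of the total positive relevance that falls inside the mask. A minimal sketch, assuming heatmap and mask are NumPy arrays of the same shape (the function name is ours, not from this repository):

```python
import numpy as np

def relevance_mass_accuracy(heatmap, gt_mask):
    """Fraction of the total positive relevance that falls inside the GT mask."""
    pos = np.maximum(heatmap, 0.0)            # keep positive relevance only
    total = pos.sum()
    if total == 0:
        return 0.0
    inside = pos[gt_mask.astype(bool)].sum()  # relevance within the mask
    return float(inside / total)

# Toy example: half of the positive relevance lies inside the mask.
heatmap = np.array([[1.0, 0.0],
                    [0.0, 1.0]])
gt_mask = np.array([[1, 0],
                    [0, 0]])
print(relevance_mass_accuracy(heatmap, gt_mask))  # → 0.5
```

A perfect explanation, whose positive relevance is entirely concentrated on the GT objects, would score 1.0.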
The dataset can be downloaded from the releases section of this repository.
For the sake of completeness and to promote future research, we additionally provide the code to generate the CLEVR-XAI dataset. If you are only interested in using the released version of the dataset, you do not need to re-generate it yourself: simply download it as described above and skip the following dataset generation steps.
Our code to generate CLEVR-XAI is built upon the original CLEVR generator.
To keep the prerequisites to a minimum, all our generation steps run inside containers with Singularity; thus Singularity is the only requirement to run the code. Here is a Singularity quick start guide.
Please refer to the README in the `image_generation` folder.
Please refer to the README in the `question_generation` folder.
Please refer to the README in the `eval` folder.
This last step also includes resizing the masks, which is useful if your model takes input images of a different size than the original CLEVR images (size 320x480).
In our released version of the CLEVR-XAI benchmark dataset, the masks were resized to 128x128, since the Relation Network model we use to evaluate XAI methods takes input images of size 128x128; see Appendix D of our paper for more details on this step.
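If you need to resize the binary GT masks yourself, nearest-neighbor sampling keeps them strictly binary, unlike bilinear interpolation, which introduces intermediate values. A minimal NumPy sketch, independent of the repository's own resizing code:

```python
import numpy as np

def resize_mask_nearest(mask, out_h, out_w):
    """Nearest-neighbor resize of a 2D binary mask; output stays strictly binary."""
    in_h, in_w = mask.shape
    rows = np.arange(out_h) * in_h // out_h   # source row for each output row
    cols = np.arange(out_w) * in_w // out_w   # source column for each output column
    return mask[rows[:, None], cols]

# Toy example: a 320x480 mask whose top half is 1, resized to 128x128.
mask = np.zeros((320, 480), dtype=np.uint8)
mask[:160, :] = 1
small = resize_mask_nearest(mask, 128, 128)
print(small.shape)        # → (128, 128)
print(np.unique(small))   # → [0 1]
```

Any image library's nearest-neighbor mode would achieve the same effect; the point is simply to avoid interpolation modes that blur the mask boundary.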
The code to generate heatmaps on a Relation Network model trained on the original CLEVR dataset, which we used to evaluate the various XAI methods against the CLEVR-XAI benchmark in the paper, will be made publicly available (admittedly with some delay, but it will be released).
The code to evaluate heatmaps is currently available as a stand-alone gist.
(In the future we may automate this step and integrate it into the `eval` folder of this repository for more convenience.)
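As a rough sketch of what such an evaluation computes, relevance rank accuracy (one of the metrics used in the paper) measures the fraction of the K highest-relevance pixels that fall inside the GT mask, where K is the number of mask pixels. The function below is an illustrative reimplementation, not the gist's actual code:

```python
import numpy as np

def relevance_rank_accuracy(heatmap, gt_mask):
    """Fraction of the K top-relevance pixels inside the GT mask (K = mask size)."""
    k = int(gt_mask.sum())
    if k == 0:
        return 0.0
    flat = heatmap.ravel()
    topk = np.argsort(flat)[-k:]              # indices of the K most relevant pixels
    return float(gt_mask.ravel()[topk].sum() / k)

# Toy example: the two most relevant pixels coincide exactly with the mask.
heatmap = np.array([[0.9, 0.1],
                    [0.2, 0.8]])
gt_mask = np.array([[1, 0],
                    [0, 1]])
print(relevance_rank_accuracy(heatmap, gt_mask))  # → 1.0
```

Unlike the mass-based metric, rank accuracy only cares about the ordering of pixel relevances, not their magnitudes.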
If you find our dataset or code useful, please cite our paper:
@article{Arras_etal:2022,
  title   = {{CLEVR-XAI: A benchmark dataset for the ground truth evaluation of neural network explanations}},
  author  = {Leila Arras and Ahmed Osman and Wojciech Samek},
  journal = {Information Fusion},
  volume  = {81},
  pages   = {14--40},
  year    = {2022},
  url     = {https://doi.org/10.1016/j.inffus.2021.11.008}
}