[feat] Add TextVQA dataset
This would be the first classification-based vision-and-language dataset in the datasets library.

Currently, the dataset downloads everything you need beforehand. See the [paper](https://arxiv.org/abs/1904.08920) for more details.

Test Plan:
- Ran the full and the dummy data test locally
apsdehal committed Mar 19, 2022
1 parent 64a7757 commit 518ae6d
Showing 7 changed files with 349 additions and 1 deletion.
208 changes: 208 additions & 0 deletions datasets/textvqa/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,208 @@
---
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
languages:
- en-US
licenses:
- cc-by-4.0
multilinguality:
- monolingual
pretty_name: TextVQA
size_categories:
- unknown
source_datasets:
- original
task_categories:
- question-answering
task_ids:
- visual-question-answering
---

# Dataset Card for TextVQA

## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://textvqa.org
- **Repository:** https://github.com/facebookresearch/mmf
- **Paper:** https://arxiv.org/abs/1904.08920
- **Leaderboard:** https://eval.ai/web/challenges/challenge-page/874/overview
- **Point of Contact:** mailto:[email protected]

### Dataset Summary

TextVQA requires models to read and reason about text in images to answer questions about them.
Specifically, models need to incorporate a new modality of text present in the images and reason
over it to answer TextVQA questions. The TextVQA dataset contains 45,336 questions over 28,408 images
from the OpenImages dataset. The dataset uses the [VQA accuracy](https://visualqa.org/evaluation.html) metric for evaluation.
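The VQA accuracy metric compares a predicted answer against the 10 human answers. A minimal sketch of the scoring rule follows (leave-one-annotator-out averaging; the official evaluator additionally normalizes case, articles, punctuation, and number words before matching, which is omitted here):

```python
from statistics import mean

def vqa_accuracy(prediction, answers):
    """Score one predicted answer against the 10 human answers.

    For each leave-one-out subset of 9 annotators, the prediction
    scores min(#matching answers / 3, 1); the final accuracy is the
    mean over all 10 subsets. Answer normalization is omitted from
    this sketch.
    """
    scores = []
    for i in range(len(answers)):
        rest = answers[:i] + answers[i + 1:]          # the other 9 annotators
        matches = sum(1 for a in rest if a == prediction)
        scores.append(min(matches / 3.0, 1.0))
    return mean(scores)
```

With the 10 answers from the sample instance shown under Data Instances, the majority answer `'simon clancy'` scores 1.0, while the minority answer `'the brand is bayard'` scores 0.3.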

### Supported Tasks and Leaderboards

- `visual-question-answering`: The dataset can be used for Visual Question Answering tasks where, given an image, a model must answer a question about that image. For TextVQA specifically, the questions require reading and reasoning about the scene text in the given image.

### Languages

The questions in the dataset are in English.

## Dataset Structure

### Data Instances

A typical sample contains the question in the `question` field, an image object in the `image` field, the OpenImages image ID in `image_id`, and other useful metadata. Ten answers per question are contained in the `answers` attribute.

An example looks like this:

```
{'question': 'who is this copyrighted by?',
'image_id': '00685bc495504d61',
'image':
'image_classes': ['Vehicle', 'Tower', 'Airplane', 'Aircraft'],
'flickr_original_url': 'https://farm2.staticflickr.com/5067/5620759429_4ea686e643_o.jpg',
'flickr_300k_url': 'https://c5.staticflickr.com/6/5067/5620759429_f43a649fb5_z.jpg',
'image_width': 786,
'image_height': 1024,
'answers': ['simon clancy',
'simon ciancy',
'simon clancy',
'simon clancy',
'the brand is bayard',
'simon clancy',
'simon clancy',
'simon clancy',
'simon clancy',
'simon clancy'],
'question_tokens': ['who', 'is', 'this', 'copyrighted', 'by'],
'question_id': 3,
'set_name': 'train'
},
```

### Data Fields

```
{
  "question_id": "INT, incremental unique ID for the question",
  "question": "what is ....?",
  "question_tokens": ["token_1", "token_2", "...", "token_N"],
  "image_id": "OpenImages image ID",
  "image_classes": ["OpenImages_class_1", "OpenImages_class_2", "...", "OpenImages_class_n"],
  "flickr_original_url": "OpenImages original Flickr URL",
  "flickr_300k_url": "OpenImages Flickr 300k thumbnail URL",
  "image_width": "INT, width of the image",
  "image_height": "INT, height of the image",
  "set_name": "Dataset split the question belongs to",
  "answers": ["answer_1", "answer_2", "...", "answer_10"]
}
```
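A quick way to sanity-check a loaded sample against these fields (a sketch: the field names come from the schema above, the Python types are assumptions, and the decoded `image` object is skipped since its type depends on the loader):

```python
# Expected per-sample fields and (assumed) Python types; `image` is
# deliberately excluded because it decodes to a loader-specific object.
EXPECTED_TYPES = {
    "question_id": int,
    "question": str,
    "question_tokens": list,
    "image_id": str,
    "image_classes": list,
    "flickr_original_url": str,
    "flickr_300k_url": str,
    "image_width": int,
    "image_height": int,
    "set_name": str,
    "answers": list,
}

def check_sample(sample):
    """Return a list of schema problems; an empty list means the
    sample matches the documented fields."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in sample:
            problems.append("missing field: " + field)
        elif not isinstance(sample[field], expected):
            problems.append(field + ": expected " + expected.__name__)
    if len(sample.get("answers", [])) != 10:
        problems.append("expected exactly 10 answers")
    return problems
```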

### Data Splits

There are three splits: `train`, `validation`, and `test`. The `train` and `validation` splits share the same image download and both have their answers available. For test-set scores, you need to upload your predictions to the [EvalAI leaderboard](https://eval.ai/web/challenges/challenge-page/874/overview); see the instructions at [https://textvqa.org/challenge/](https://textvqa.org/challenge/).

| Split        | Examples |
|--------------|---------:|
| `train`      |   34,602 |
| `validation` |    5,000 |
| `test`       |    5,734 |
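Test-split predictions are uploaded as a JSON file. The sketch below uses a `[{"question_id": ..., "answer": ...}]` record layout, which is an assumption based on common VQA challenge formats; check the challenge instructions for the authoritative schema before submitting.

```python
import json

def write_predictions(predictions, path):
    """Serialize a {question_id: answer} dict to a JSON file for upload.

    NOTE: the [{"question_id": ..., "answer": ...}] layout is an
    assumed format, not taken from the official challenge docs.
    """
    payload = [
        {"question_id": qid, "answer": answer}
        for qid, answer in sorted(predictions.items())
    ]
    with open(path, "w") as f:
        json.dump(payload, f)
    return payload
```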

## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

[Needs More Information]

#### Who are the source language producers?

English-speaking crowdsourced annotators.

### Annotations

#### Annotation process

See the [paper](https://arxiv.org/abs/1904.08920).

#### Who are the annotators?

See the [paper](https://arxiv.org/abs/1904.08920).

### Personal and Sensitive Information

See the [paper](https://arxiv.org/abs/1904.08920).

## Considerations for Using the Data

### Social Impact of Dataset

See the [paper](https://arxiv.org/abs/1904.08920).

### Discussion of Biases

See the [paper](https://arxiv.org/abs/1904.08920).

### Other Known Limitations

See the [paper](https://arxiv.org/abs/1904.08920).

## Additional Information

### Dataset Curators

[Amanpreet Singh](https://github.com/apsdehal)

### Licensing Information

CC BY 4.0

### Citation Information

```
@inproceedings{singh2019towards,
title={Towards VQA Models That Can Read},
author={Singh, Amanpreet and Natarajan, Vivek and Shah, Meet and Jiang, Yu and Chen, Xinlei and Parikh, Devi and Rohrbach, Marcus},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
pages={8317--8326},
year={2019}
}
```

### Contributions

Thanks to [@apsdehal](https://github.com/apsdehal) for adding this dataset.
1 change: 1 addition & 0 deletions datasets/textvqa/dataset_infos.json
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
{"train": {"description": "TextVQA requires models to read and reason about text in images to answer questions about them. \nSpecifically, models need to incorporate a new modality of text present in the images and reason \nover it to answer TextVQA questions. TextVQA dataset contains 45,336 questions over 28,408 images\nfrom the OpenImages dataset. \n", "citation": "\n@inproceedings{singh2019towards,\n title={Towards VQA Models That Can Read},\n author={Singh, Amanpreet and Natarjan, Vivek and Shah, Meet and Jiang, Yu and Chen, Xinlei and Parikh, Devi and Rohrbach, Marcus},\n booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},\n pages={8317-8326},\n year={2019}\n}\n", "homepage": "https://textvqa.org", "license": "CC BY 4.0", "features": {"image_id": {"dtype": "string", "id": null, "_type": "Value"}, "question_id": {"dtype": "int32", "id": null, "_type": "Value"}, "question": {"dtype": "string", "id": null, "_type": "Value"}, "question_tokens": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "image": {"decode": true, "id": null, "_type": "Image"}, "image_width": {"dtype": "int32", "id": null, "_type": "Value"}, "image_height": {"dtype": "int32", "id": null, "_type": "Value"}, "flickr_original_url": {"dtype": "string", "id": null, "_type": "Value"}, "flickr_300k_url": {"dtype": "string", "id": null, "_type": "Value"}, "answers": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "image_classes": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "set_name": {"dtype": "string", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "textvqa", "config_name": "train", "version": {"version_str": "0.5.1", "description": "", "major": 0, "minor": 5, "patch": 1}, "splits": {"train": 
{"name": "train", "num_bytes": 21381310, "num_examples": 34602, "dataset_name": "textvqa"}, "validation": {"name": "validation", "num_bytes": 3077854, "num_examples": 5000, "dataset_name": "textvqa"}, "test": {"name": "test", "num_bytes": 3025046, "num_examples": 5734, "dataset_name": "textvqa"}}, "download_checksums": {"https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_train.json": {"num_bytes": 21634937, "checksum": "95f5c407db56cba56a177799dcd685a7cc0ec7c0d851b59910acf7786d31b68a"}, "https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json": {"num_bytes": 3116162, "checksum": "4ceb5aadc1a41719d0a3e4dfdf06838bcfee1db569a9a65ee67d31c99893081d"}, "https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_test.json": {"num_bytes": 2770520, "checksum": "d8d4b738101087bac5a6182d22d9aef3772e08e77827e6cf6116808910b75db2"}, "https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip": {"num_bytes": 7072297970, "checksum": "ecf35005640d0708eae185aab1c0a10f89b2db7420b29185a1ed92a8f4290498"}, "https://dl.fbaipublicfiles.com/textvqa/images/test_images.zip": {"num_bytes": 970296721, "checksum": "1276b908994c444c46484fb21e9e15fcda1be9c675f6ad727489e52eea68cbcd"}}, "download_size": 8070116310, "post_processing_size": null, "dataset_size": 27484210, "size_in_bytes": 8097600520}, "val": {"description": "TextVQA requires models to read and reason about text in images to answer questions about them. \nSpecifically, models need to incorporate a new modality of text present in the images and reason \nover it to answer TextVQA questions. TextVQA dataset contains 45,336 questions over 28,408 images\nfrom the OpenImages dataset. 
\n", "citation": "\n@inproceedings{singh2019towards,\n title={Towards VQA Models That Can Read},\n author={Singh, Amanpreet and Natarjan, Vivek and Shah, Meet and Jiang, Yu and Chen, Xinlei and Parikh, Devi and Rohrbach, Marcus},\n booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},\n pages={8317-8326},\n year={2019}\n}\n", "homepage": "https://textvqa.org", "license": "CC BY 4.0", "features": {"image_id": {"dtype": "string", "id": null, "_type": "Value"}, "question_id": {"dtype": "int32", "id": null, "_type": "Value"}, "question": {"dtype": "string", "id": null, "_type": "Value"}, "question_tokens": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "image": {"decode": true, "id": null, "_type": "Image"}, "image_width": {"dtype": "int32", "id": null, "_type": "Value"}, "image_height": {"dtype": "int32", "id": null, "_type": "Value"}, "flickr_original_url": {"dtype": "string", "id": null, "_type": "Value"}, "flickr_300k_url": {"dtype": "string", "id": null, "_type": "Value"}, "answers": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "image_classes": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "set_name": {"dtype": "string", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "textvqa", "config_name": "val", "version": {"version_str": "0.5.1", "description": "", "major": 0, "minor": 5, "patch": 1}, "splits": {"train": {"name": "train", "num_bytes": 21381310, "num_examples": 34602, "dataset_name": "textvqa"}, "validation": {"name": "validation", "num_bytes": 3077854, "num_examples": 5000, "dataset_name": "textvqa"}, "test": {"name": "test", "num_bytes": 3025046, "num_examples": 5734, "dataset_name": "textvqa"}}, "download_checksums": 
{"https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_train.json": {"num_bytes": 21634937, "checksum": "95f5c407db56cba56a177799dcd685a7cc0ec7c0d851b59910acf7786d31b68a"}, "https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json": {"num_bytes": 3116162, "checksum": "4ceb5aadc1a41719d0a3e4dfdf06838bcfee1db569a9a65ee67d31c99893081d"}, "https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_test.json": {"num_bytes": 2770520, "checksum": "d8d4b738101087bac5a6182d22d9aef3772e08e77827e6cf6116808910b75db2"}, "https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip": {"num_bytes": 7072297970, "checksum": "ecf35005640d0708eae185aab1c0a10f89b2db7420b29185a1ed92a8f4290498"}, "https://dl.fbaipublicfiles.com/textvqa/images/test_images.zip": {"num_bytes": 970296721, "checksum": "1276b908994c444c46484fb21e9e15fcda1be9c675f6ad727489e52eea68cbcd"}}, "download_size": 8070116310, "post_processing_size": null, "dataset_size": 27484210, "size_in_bytes": 8097600520}, "test": {"description": "TextVQA requires models to read and reason about text in images to answer questions about them. \nSpecifically, models need to incorporate a new modality of text present in the images and reason \nover it to answer TextVQA questions. TextVQA dataset contains 45,336 questions over 28,408 images\nfrom the OpenImages dataset. 
\n", "citation": "\n@inproceedings{singh2019towards,\n title={Towards VQA Models That Can Read},\n author={Singh, Amanpreet and Natarjan, Vivek and Shah, Meet and Jiang, Yu and Chen, Xinlei and Parikh, Devi and Rohrbach, Marcus},\n booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},\n pages={8317-8326},\n year={2019}\n}\n", "homepage": "https://textvqa.org", "license": "CC BY 4.0", "features": {"image_id": {"dtype": "string", "id": null, "_type": "Value"}, "question_id": {"dtype": "int32", "id": null, "_type": "Value"}, "question": {"dtype": "string", "id": null, "_type": "Value"}, "question_tokens": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "image": {"decode": true, "id": null, "_type": "Image"}, "image_width": {"dtype": "int32", "id": null, "_type": "Value"}, "image_height": {"dtype": "int32", "id": null, "_type": "Value"}, "flickr_original_url": {"dtype": "string", "id": null, "_type": "Value"}, "flickr_300k_url": {"dtype": "string", "id": null, "_type": "Value"}, "answers": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "image_classes": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "set_name": {"dtype": "string", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "textvqa", "config_name": "test", "version": {"version_str": "0.5.1", "description": "", "major": 0, "minor": 5, "patch": 1}, "splits": {"train": {"name": "train", "num_bytes": 21381310, "num_examples": 34602, "dataset_name": "textvqa"}, "validation": {"name": "validation", "num_bytes": 3077854, "num_examples": 5000, "dataset_name": "textvqa"}, "test": {"name": "test", "num_bytes": 3025046, "num_examples": 5734, "dataset_name": "textvqa"}}, "download_checksums": 
{"https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_train.json": {"num_bytes": 21634937, "checksum": "95f5c407db56cba56a177799dcd685a7cc0ec7c0d851b59910acf7786d31b68a"}, "https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json": {"num_bytes": 3116162, "checksum": "4ceb5aadc1a41719d0a3e4dfdf06838bcfee1db569a9a65ee67d31c99893081d"}, "https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_test.json": {"num_bytes": 2770520, "checksum": "d8d4b738101087bac5a6182d22d9aef3772e08e77827e6cf6116808910b75db2"}, "https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip": {"num_bytes": 7072297970, "checksum": "ecf35005640d0708eae185aab1c0a10f89b2db7420b29185a1ed92a8f4290498"}, "https://dl.fbaipublicfiles.com/textvqa/images/test_images.zip": {"num_bytes": 970296721, "checksum": "1276b908994c444c46484fb21e9e15fcda1be9c675f6ad727489e52eea68cbcd"}}, "download_size": 8070116310, "post_processing_size": null, "dataset_size": 27484210, "size_in_bytes": 8097600520}}
Binary file added datasets/textvqa/dummy/test/0.5.1/dummy_data.zip
Binary file not shown.
Binary file added datasets/textvqa/dummy/train/0.5.1/dummy_data.zip
Binary file not shown.
Binary file added datasets/textvqa/dummy/val/0.5.1/dummy_data.zip
Binary file not shown.