This would be the first classification-based vision-and-language dataset in the `datasets` library. Currently, the dataset downloads everything you need beforehand. See the [paper](https://arxiv.org/abs/1904.08920) for more details.

Test Plan:
- Ran the full and the dummy data tests locally
Showing 7 changed files with 349 additions and 1 deletion.
---
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
languages:
- en-US
licenses:
- cc-by-4.0
multilinguality:
- monolingual
pretty_name: TextVQA
size_categories:
- unknown
source_datasets:
- original
task_categories:
- question-answering
task_ids:
- visual-question-answering
---

# Dataset Card for TextVQA

## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://textvqa.org
- **Repository:** https://github.com/facebookresearch/mmf
- **Paper:** https://arxiv.org/abs/1904.08920
- **Leaderboard:** https://eval.ai/web/challenges/challenge-page/874/overview
- **Point of Contact:** mailto:[email protected]

### Dataset Summary

TextVQA requires models to read and reason about text in images to answer questions about them. Specifically, models need to incorporate a new modality of text present in the images and reason over it to answer TextVQA questions. The TextVQA dataset contains 45,336 questions over 28,408 images from the OpenImages dataset. The dataset uses the [VQA accuracy](https://visualqa.org/evaluation.html) metric for evaluation.

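Each question comes with 10 human answers, and a prediction is scored by how many annotators gave it. A minimal sketch of the commonly used simplified form of VQA accuracy (`vqa_accuracy` is a hypothetical helper, not a dataset API; the official metric additionally averages over all 9-answer subsets of the 10 answers):

```python
def vqa_accuracy(prediction: str, answers: list[str]) -> float:
    """Simplified VQA accuracy: min(#annotators who gave this answer / 3, 1)."""
    prediction = prediction.strip().lower()
    matches = sum(a.strip().lower() == prediction for a in answers)
    return min(matches / 3.0, 1.0)

# 8 of 10 annotators agreeing gives the prediction full credit.
answers = ["simon clancy"] * 8 + ["simon ciancy", "the brand is bayard"]
print(vqa_accuracy("simon clancy", answers))  # 1.0
print(vqa_accuracy("bayard", answers))        # 0.0
```

An answer given by at least 3 of the 10 annotators scores 1.0; fewer matches earn partial credit.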
### Supported Tasks and Leaderboards

- `visual-question-answering`: The dataset can be used for visual question answering, where, given an image, a model must answer a question about it. For TextVQA specifically, the questions require reading and reasoning about the scene text in the given image.

### Languages

The questions in the dataset are in English.

## Dataset Structure

### Data Instances

A typical sample contains the question in the `question` field, an image object in the `image` field, the OpenImages image id in `image_id`, and other useful metadata. Ten answers per question are contained in the `answers` attribute.

An example looks like this:

```
{'question': 'who is this copyrighted by?',
 'image_id': '00685bc495504d61',
 'image': <PIL.Image.Image image mode=RGB size=786x1024>,
 'image_classes': ['Vehicle', 'Tower', 'Airplane', 'Aircraft'],
 'flickr_original_url': 'https://farm2.staticflickr.com/5067/5620759429_4ea686e643_o.jpg',
 'flickr_300k_url': 'https://c5.staticflickr.com/6/5067/5620759429_f43a649fb5_z.jpg',
 'image_width': 786,
 'image_height': 1024,
 'answers': ['simon clancy',
             'simon ciancy',
             'simon clancy',
             'simon clancy',
             'the brand is bayard',
             'simon clancy',
             'simon clancy',
             'simon clancy',
             'simon clancy',
             'simon clancy'],
 'question_tokens': ['who', 'is', 'this', 'copyrighted', 'by'],
 'question_id': 3,
 'set_name': 'train'}
```

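Because the 10 crowd answers can disagree, as in the sample above, a single reference answer is often derived by majority vote. A small sketch (`majority_answer` is a hypothetical helper, not part of the dataset):

```python
from collections import Counter

def majority_answer(answers: list[str]) -> str:
    """Return the most frequent answer after light normalization."""
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][0]

answers = ["simon clancy", "simon ciancy", "simon clancy", "simon clancy",
           "the brand is bayard", "simon clancy", "simon clancy",
           "simon clancy", "simon clancy", "simon clancy"]
print(majority_answer(answers))  # simon clancy
```

Note that official evaluation scores against all 10 answers rather than a single reference, so majority vote is only a convenience for inspection or training targets.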
### Data Fields

{
  "question_id": "INT, incremental unique ID for the question",
  "question": "what is ....?",
  "question_tokens": [
    "token_1",
    "token_2",
    "...",
    "token_N"
  ],
  "image_id": "OpenImages Image ID",
  "image": "Decoded image object",
  "image_classes": [
    "OpenImages_class_1",
    "OpenImages_class_2",
    "...",
    "OpenImages_class_n"
  ],
  "flickr_original_url": "OpenImages original flickr url",
  "flickr_300k_url": "OpenImages flickr 300k thumbnail url",
  "image_width": "INT, Width of the image",
  "image_height": "INT, Height of the image",
  "set_name": "Dataset split the question belongs to",
  "answers": [
    "answer_1",
    "answer_2",
    "...",
    "answer_10"
  ]
}

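The relationship between `question` and `question_tokens` appears to be a simple lowercased, punctuation-stripped whitespace split. The sketch below reproduces it for the sample above, but this is an assumption about the preprocessing, not a documented tokenizer:

```python
import re

def tokenize_question(question: str) -> list[str]:
    # Assumed preprocessing: lowercase, keep alphanumeric/apostrophe runs,
    # drop punctuation. Matches the sample shown in Data Instances.
    return re.findall(r"[a-z0-9']+", question.lower())

print(tokenize_question("who is this copyrighted by?"))
# ['who', 'is', 'this', 'copyrighted', 'by']
```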
### Data Splits

There are three splits: `train`, `validation` and `test`. The `train` and `validation` splits share an image download and have their answers available. For test-set predictions, you need to upload your predictions to the [EvalAI leaderboard](https://eval.ai/web/challenges/challenge-page/874/overview). Please see the instructions at [https://textvqa.org/challenge/](https://textvqa.org/challenge/).

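As a quick consistency check, the per-split `num_examples` counts recorded in the dataset metadata later in this diff sum to the 45,336 questions quoted in the summary:

```python
# Example counts per split, copied from the dataset metadata (num_examples).
splits = {"train": 34602, "validation": 5000, "test": 5734}
print(sum(splits.values()))  # 45336
```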
## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

[Needs More Information]

#### Who are the source language producers?

English-speaking crowdsourced annotators.

### Annotations

#### Annotation process

See the [paper](https://arxiv.org/abs/1904.08920).

#### Who are the annotators?

See the [paper](https://arxiv.org/abs/1904.08920).

### Personal and Sensitive Information

See the [paper](https://arxiv.org/abs/1904.08920).

## Considerations for Using the Data

### Social Impact of Dataset

See the [paper](https://arxiv.org/abs/1904.08920).

### Discussion of Biases

See the [paper](https://arxiv.org/abs/1904.08920).

### Other Known Limitations

See the [paper](https://arxiv.org/abs/1904.08920).

## Additional Information

### Dataset Curators

[Amanpreet Singh](https://github.com/apsdehal)

### Licensing Information

CC BY 4.0

### Citation Information

```
@inproceedings{singh2019towards,
  title={Towards VQA Models That Can Read},
  author={Singh, Amanpreet and Natarajan, Vivek and Shah, Meet and Jiang, Yu and Chen, Xinlei and Parikh, Devi and Rohrbach, Marcus},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={8317--8326},
  year={2019}
}
```

### Contributions

Thanks to [@apsdehal](https://github.com/apsdehal) for adding this dataset.
The commit also adds the dataset's metadata (features, splits, and download checksums) as a single-line JSON file:
{"train": {"description": "TextVQA requires models to read and reason about text in images to answer questions about them. \nSpecifically, models need to incorporate a new modality of text present in the images and reason \nover it to answer TextVQA questions. TextVQA dataset contains 45,336 questions over 28,408 images\nfrom the OpenImages dataset. \n", "citation": "\n@inproceedings{singh2019towards,\n title={Towards VQA Models That Can Read},\n author={Singh, Amanpreet and Natarjan, Vivek and Shah, Meet and Jiang, Yu and Chen, Xinlei and Parikh, Devi and Rohrbach, Marcus},\n booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},\n pages={8317-8326},\n year={2019}\n}\n", "homepage": "https://textvqa.org", "license": "CC BY 4.0", "features": {"image_id": {"dtype": "string", "id": null, "_type": "Value"}, "question_id": {"dtype": "int32", "id": null, "_type": "Value"}, "question": {"dtype": "string", "id": null, "_type": "Value"}, "question_tokens": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "image": {"decode": true, "id": null, "_type": "Image"}, "image_width": {"dtype": "int32", "id": null, "_type": "Value"}, "image_height": {"dtype": "int32", "id": null, "_type": "Value"}, "flickr_original_url": {"dtype": "string", "id": null, "_type": "Value"}, "flickr_300k_url": {"dtype": "string", "id": null, "_type": "Value"}, "answers": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "image_classes": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "set_name": {"dtype": "string", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "textvqa", "config_name": "train", "version": {"version_str": "0.5.1", "description": "", "major": 0, "minor": 5, "patch": 1}, "splits": {"train": {"name": "train", "num_bytes": 21381310, "num_examples": 34602, "dataset_name": "textvqa"}, "validation": {"name": "validation", "num_bytes": 3077854, "num_examples": 5000, "dataset_name": "textvqa"}, "test": {"name": "test", "num_bytes": 3025046, "num_examples": 5734, "dataset_name": "textvqa"}}, "download_checksums": {"https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_train.json": {"num_bytes": 21634937, "checksum": "95f5c407db56cba56a177799dcd685a7cc0ec7c0d851b59910acf7786d31b68a"}, "https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json": {"num_bytes": 3116162, "checksum": "4ceb5aadc1a41719d0a3e4dfdf06838bcfee1db569a9a65ee67d31c99893081d"}, "https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_test.json": {"num_bytes": 2770520, "checksum": "d8d4b738101087bac5a6182d22d9aef3772e08e77827e6cf6116808910b75db2"}, "https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip": {"num_bytes": 7072297970, "checksum": "ecf35005640d0708eae185aab1c0a10f89b2db7420b29185a1ed92a8f4290498"}, "https://dl.fbaipublicfiles.com/textvqa/images/test_images.zip": {"num_bytes": 970296721, "checksum": "1276b908994c444c46484fb21e9e15fcda1be9c675f6ad727489e52eea68cbcd"}}, "download_size": 8070116310, "post_processing_size": null, "dataset_size": 27484210, "size_in_bytes": 8097600520}, "val": {"description": "TextVQA requires models to read and reason about text in images to answer questions about them. \nSpecifically, models need to incorporate a new modality of text present in the images and reason \nover it to answer TextVQA questions. TextVQA dataset contains 45,336 questions over 28,408 images\nfrom the OpenImages dataset. \n", "citation": "\n@inproceedings{singh2019towards,\n title={Towards VQA Models That Can Read},\n author={Singh, Amanpreet and Natarjan, Vivek and Shah, Meet and Jiang, Yu and Chen, Xinlei and Parikh, Devi and Rohrbach, Marcus},\n booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},\n pages={8317-8326},\n year={2019}\n}\n", "homepage": "https://textvqa.org", "license": "CC BY 4.0", "features": {"image_id": {"dtype": "string", "id": null, "_type": "Value"}, "question_id": {"dtype": "int32", "id": null, "_type": "Value"}, "question": {"dtype": "string", "id": null, "_type": "Value"}, "question_tokens": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "image": {"decode": true, "id": null, "_type": "Image"}, "image_width": {"dtype": "int32", "id": null, "_type": "Value"}, "image_height": {"dtype": "int32", "id": null, "_type": "Value"}, "flickr_original_url": {"dtype": "string", "id": null, "_type": "Value"}, "flickr_300k_url": {"dtype": "string", "id": null, "_type": "Value"}, "answers": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "image_classes": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "set_name": {"dtype": "string", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "textvqa", "config_name": "val", "version": {"version_str": "0.5.1", "description": "", "major": 0, "minor": 5, "patch": 1}, "splits": {"train": {"name": "train", "num_bytes": 21381310, "num_examples": 34602, "dataset_name": "textvqa"}, "validation": {"name": "validation", "num_bytes": 3077854, "num_examples": 5000, "dataset_name": "textvqa"}, "test": {"name": "test", "num_bytes": 3025046, "num_examples": 5734, "dataset_name": "textvqa"}}, "download_checksums": {"https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_train.json": {"num_bytes": 21634937, "checksum": "95f5c407db56cba56a177799dcd685a7cc0ec7c0d851b59910acf7786d31b68a"}, "https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json": {"num_bytes": 3116162, "checksum": "4ceb5aadc1a41719d0a3e4dfdf06838bcfee1db569a9a65ee67d31c99893081d"}, "https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_test.json": {"num_bytes": 2770520, "checksum": "d8d4b738101087bac5a6182d22d9aef3772e08e77827e6cf6116808910b75db2"}, "https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip": {"num_bytes": 7072297970, "checksum": "ecf35005640d0708eae185aab1c0a10f89b2db7420b29185a1ed92a8f4290498"}, "https://dl.fbaipublicfiles.com/textvqa/images/test_images.zip": {"num_bytes": 970296721, "checksum": "1276b908994c444c46484fb21e9e15fcda1be9c675f6ad727489e52eea68cbcd"}}, "download_size": 8070116310, "post_processing_size": null, "dataset_size": 27484210, "size_in_bytes": 8097600520}, "test": {"description": "TextVQA requires models to read and reason about text in images to answer questions about them. \nSpecifically, models need to incorporate a new modality of text present in the images and reason \nover it to answer TextVQA questions. TextVQA dataset contains 45,336 questions over 28,408 images\nfrom the OpenImages dataset. \n", "citation": "\n@inproceedings{singh2019towards,\n title={Towards VQA Models That Can Read},\n author={Singh, Amanpreet and Natarjan, Vivek and Shah, Meet and Jiang, Yu and Chen, Xinlei and Parikh, Devi and Rohrbach, Marcus},\n booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},\n pages={8317-8326},\n year={2019}\n}\n", "homepage": "https://textvqa.org", "license": "CC BY 4.0", "features": {"image_id": {"dtype": "string", "id": null, "_type": "Value"}, "question_id": {"dtype": "int32", "id": null, "_type": "Value"}, "question": {"dtype": "string", "id": null, "_type": "Value"}, "question_tokens": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "image": {"decode": true, "id": null, "_type": "Image"}, "image_width": {"dtype": "int32", "id": null, "_type": "Value"}, "image_height": {"dtype": "int32", "id": null, "_type": "Value"}, "flickr_original_url": {"dtype": "string", "id": null, "_type": "Value"}, "flickr_300k_url": {"dtype": "string", "id": null, "_type": "Value"}, "answers": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "image_classes": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "set_name": {"dtype": "string", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "textvqa", "config_name": "test", "version": {"version_str": "0.5.1", "description": "", "major": 0, "minor": 5, "patch": 1}, "splits": {"train": {"name": "train", "num_bytes": 21381310, "num_examples": 34602, "dataset_name": "textvqa"}, "validation": {"name": "validation", "num_bytes": 3077854, "num_examples": 5000, "dataset_name": "textvqa"}, "test": {"name": "test", "num_bytes": 3025046, "num_examples": 5734, "dataset_name": "textvqa"}}, "download_checksums": {"https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_train.json": {"num_bytes": 21634937, "checksum": "95f5c407db56cba56a177799dcd685a7cc0ec7c0d851b59910acf7786d31b68a"}, "https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json": {"num_bytes": 3116162, "checksum": "4ceb5aadc1a41719d0a3e4dfdf06838bcfee1db569a9a65ee67d31c99893081d"}, "https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_test.json": {"num_bytes": 2770520, "checksum": "d8d4b738101087bac5a6182d22d9aef3772e08e77827e6cf6116808910b75db2"}, "https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip": {"num_bytes": 7072297970, "checksum": "ecf35005640d0708eae185aab1c0a10f89b2db7420b29185a1ed92a8f4290498"}, "https://dl.fbaipublicfiles.com/textvqa/images/test_images.zip": {"num_bytes": 970296721, "checksum": "1276b908994c444c46484fb21e9e15fcda1be9c675f6ad727489e52eea68cbcd"}}, "download_size": 8070116310, "post_processing_size": null, "dataset_size": 27484210, "size_in_bytes": 8097600520}}
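A quick sanity check on the metadata above: the reported `download_size` is simply the sum of the per-file `num_bytes` entries under `download_checksums`:

```python
# num_bytes values copied from the download_checksums in the metadata above.
num_bytes = {
    "TextVQA_0.5.1_train.json": 21634937,
    "TextVQA_0.5.1_val.json": 3116162,
    "TextVQA_0.5.1_test.json": 2770520,
    "train_val_images.zip": 7072297970,
    "test_images.zip": 970296721,
}
print(sum(num_bytes.values()))  # 8070116310, matching "download_size"
```

Most of the roughly 8 GB download is the two image archives; the question/answer JSON files account for under 30 MB.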