[feat] Add TextVQA dataset
This would be the first classification-based vision-and-language dataset in the datasets library.

Currently, the dataset downloads everything you need beforehand. See the [paper](https://arxiv.org/abs/1904.08920) for more details.

Test Plan:
- Ran the full and the dummy data test locally
apsdehal committed Mar 19, 2022
1 parent 64a7757 commit 518ae6d
Showing 7 changed files with 349 additions and 1 deletion.
208 changes: 208 additions & 0 deletions datasets/textvqa/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,208 @@
---
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
languages:
- en-US
licenses:
- cc-by-4.0
multilinguality:
- monolingual
pretty_name: TextVQA
size_categories:
- unknown
source_datasets:
- original
task_categories:
- question-answering
task_ids:
- visual-question-answering
---

# Dataset Card for TextVQA

## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://textvqa.org
- **Repository:** https://github.com/facebookresearch/mmf
- **Paper:** https://arxiv.org/abs/1904.08920
- **Leaderboard:** https://eval.ai/web/challenges/challenge-page/874/overview
- **Point of Contact:** mailto:[email protected]

### Dataset Summary

TextVQA requires models to read and reason about text in images to answer questions about them.
Specifically, models need to incorporate a new modality of text present in the images and reason
over it to answer TextVQA questions. The TextVQA dataset contains 45,336 questions over 28,408 images
from the OpenImages dataset. The dataset uses the [VQA accuracy](https://visualqa.org/evaluation.html) metric for evaluation.
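The VQA accuracy metric compares a predicted answer against the 10 human answers. A minimal sketch of the scoring rule follows (leave-one-annotator-out averaging; the official evaluator additionally normalizes case, articles, punctuation, and number words before matching, which is omitted here):

```python
from statistics import mean

def vqa_accuracy(prediction, answers):
    """Score one predicted answer against the 10 human answers.

    For each leave-one-out subset of 9 annotators, the prediction
    scores min(#matching answers / 3, 1); the final accuracy is the
    mean over all 10 subsets. Answer normalization is omitted from
    this sketch.
    """
    scores = []
    for i in range(len(answers)):
        rest = answers[:i] + answers[i + 1:]          # the other 9 annotators
        matches = sum(1 for a in rest if a == prediction)
        scores.append(min(matches / 3.0, 1.0))
    return mean(scores)
```

With the 10 answers from the sample instance shown under Data Instances, the majority answer `'simon clancy'` scores 1.0, while the minority answer `'the brand is bayard'` scores 0.3.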

### Supported Tasks and Leaderboards

- `visual-question-answering`: The dataset can be used for Visual Question Answering tasks where, given an image, a model must answer a question about that image. For TextVQA specifically, the questions require reading and reasoning about the scene text in the given image.

### Languages

The questions in the dataset are in English.

## Dataset Structure

### Data Instances

A typical sample contains the question in the `question` field, an image object in the `image` field, the OpenImages image ID in `image_id`, and other useful metadata. Ten answers per question are contained in the `answers` attribute.

An example looks like this:

```
{'question': 'who is this copyrighted by?',
'image_id': '00685bc495504d61',
'image':
'image_classes': ['Vehicle', 'Tower', 'Airplane', 'Aircraft'],
'flickr_original_url': 'https://farm2.staticflickr.com/5067/5620759429_4ea686e643_o.jpg',
'flickr_300k_url': 'https://c5.staticflickr.com/6/5067/5620759429_f43a649fb5_z.jpg',
'image_width': 786,
'image_height': 1024,
'answers': ['simon clancy',
'simon ciancy',
'simon clancy',
'simon clancy',
'the brand is bayard',
'simon clancy',
'simon clancy',
'simon clancy',
'simon clancy',
'simon clancy'],
'question_tokens': ['who', 'is', 'this', 'copyrighted', 'by'],
'question_id': 3,
'set_name': 'train'
},
```

### Data Fields

```
{
  "question_id": "INT, incremental unique ID for the question",
  "question": "what is ....?",
  "question_tokens": ["token_1", "token_2", "...", "token_N"],
  "image_id": "OpenImages image ID",
  "image_classes": ["OpenImages_class_1", "OpenImages_class_2", "...", "OpenImages_class_n"],
  "flickr_original_url": "OpenImages original Flickr URL",
  "flickr_300k_url": "OpenImages Flickr 300k thumbnail URL",
  "image_width": "INT, width of the image",
  "image_height": "INT, height of the image",
  "set_name": "Dataset split the question belongs to",
  "answers": ["answer_1", "answer_2", "...", "answer_10"]
}
```
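A quick way to sanity-check a loaded sample against these fields (a sketch: the field names come from the schema above, the Python types are assumptions, and the decoded `image` object is skipped since its type depends on the loader):

```python
# Expected per-sample fields and (assumed) Python types; `image` is
# deliberately excluded because it decodes to a loader-specific object.
EXPECTED_TYPES = {
    "question_id": int,
    "question": str,
    "question_tokens": list,
    "image_id": str,
    "image_classes": list,
    "flickr_original_url": str,
    "flickr_300k_url": str,
    "image_width": int,
    "image_height": int,
    "set_name": str,
    "answers": list,
}

def check_sample(sample):
    """Return a list of schema problems; an empty list means the
    sample matches the documented fields."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in sample:
            problems.append("missing field: " + field)
        elif not isinstance(sample[field], expected):
            problems.append(field + ": expected " + expected.__name__)
    if len(sample.get("answers", [])) != 10:
        problems.append("expected exactly 10 answers")
    return problems
```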

### Data Splits

There are three splits: `train`, `validation`, and `test`. The `train` and `validation` splits share the same image download and both have their answers available. For test-set scores, you need to upload your predictions to the [EvalAI leaderboard](https://eval.ai/web/challenges/challenge-page/874/overview); see the instructions at [https://textvqa.org/challenge/](https://textvqa.org/challenge/).

| Split        | Examples |
|--------------|---------:|
| `train`      |   34,602 |
| `validation` |    5,000 |
| `test`       |    5,734 |
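Test-split predictions are uploaded as a JSON file. The sketch below uses a `[{"question_id": ..., "answer": ...}]` record layout, which is an assumption based on common VQA challenge formats; check the challenge instructions for the authoritative schema before submitting.

```python
import json

def write_predictions(predictions, path):
    """Serialize a {question_id: answer} dict to a JSON file for upload.

    NOTE: the [{"question_id": ..., "answer": ...}] layout is an
    assumed format, not taken from the official challenge docs.
    """
    payload = [
        {"question_id": qid, "answer": answer}
        for qid, answer in sorted(predictions.items())
    ]
    with open(path, "w") as f:
        json.dump(payload, f)
    return payload
```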

## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

[Needs More Information]

#### Who are the source language producers?

English-speaking crowdsourced annotators.

### Annotations

#### Annotation process

See the [paper](https://arxiv.org/abs/1904.08920).

#### Who are the annotators?

See the [paper](https://arxiv.org/abs/1904.08920).

### Personal and Sensitive Information

See the [paper](https://arxiv.org/abs/1904.08920).

## Considerations for Using the Data

### Social Impact of Dataset

See the [paper](https://arxiv.org/abs/1904.08920).

### Discussion of Biases

See the [paper](https://arxiv.org/abs/1904.08920).

### Other Known Limitations

See the [paper](https://arxiv.org/abs/1904.08920).

## Additional Information

### Dataset Curators

[Amanpreet Singh](https://github.com/apsdehal)

### Licensing Information

CC BY 4.0

### Citation Information

```
@inproceedings{singh2019towards,
title={Towards VQA Models That Can Read},
author={Singh, Amanpreet and Natarajan, Vivek and Shah, Meet and Jiang, Yu and Chen, Xinlei and Parikh, Devi and Rohrbach, Marcus},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
pages={8317--8326},
year={2019}
}
```

### Contributions

Thanks to [@apsdehal](https://github.com/apsdehal) for adding this dataset.
1 change: 1 addition & 0 deletions datasets/textvqa/dataset_infos.json
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
{"train": {"description": "TextVQA requires models to read and reason about text in images to answer questions about them. \nSpecifically, models need to incorporate a new modality of text present in the images and reason \nover it to answer TextVQA questions. TextVQA dataset contains 45,336 questions over 28,408 images\nfrom the OpenImages dataset. \n", "citation": "\n@inproceedings{singh2019towards,\n title={Towards VQA Models That Can Read},\n author={Singh, Amanpreet and Natarjan, Vivek and Shah, Meet and Jiang, Yu and Chen, Xinlei and Parikh, Devi and Rohrbach, Marcus},\n booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},\n pages={8317-8326},\n year={2019}\n}\n", "homepage": "https://textvqa.org", "license": "CC BY 4.0", "features": {"image_id": {"dtype": "string", "id": null, "_type": "Value"}, "question_id": {"dtype": "int32", "id": null, "_type": "Value"}, "question": {"dtype": "string", "id": null, "_type": "Value"}, "question_tokens": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "image": {"decode": true, "id": null, "_type": "Image"}, "image_width": {"dtype": "int32", "id": null, "_type": "Value"}, "image_height": {"dtype": "int32", "id": null, "_type": "Value"}, "flickr_original_url": {"dtype": "string", "id": null, "_type": "Value"}, "flickr_300k_url": {"dtype": "string", "id": null, "_type": "Value"}, "answers": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "image_classes": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "set_name": {"dtype": "string", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "textvqa", "config_name": "train", "version": {"version_str": "0.5.1", "description": "", "major": 0, "minor": 5, "patch": 1}, "splits": {"train": 
{"name": "train", "num_bytes": 21381310, "num_examples": 34602, "dataset_name": "textvqa"}, "validation": {"name": "validation", "num_bytes": 3077854, "num_examples": 5000, "dataset_name": "textvqa"}, "test": {"name": "test", "num_bytes": 3025046, "num_examples": 5734, "dataset_name": "textvqa"}}, "download_checksums": {"https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_train.json": {"num_bytes": 21634937, "checksum": "95f5c407db56cba56a177799dcd685a7cc0ec7c0d851b59910acf7786d31b68a"}, "https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json": {"num_bytes": 3116162, "checksum": "4ceb5aadc1a41719d0a3e4dfdf06838bcfee1db569a9a65ee67d31c99893081d"}, "https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_test.json": {"num_bytes": 2770520, "checksum": "d8d4b738101087bac5a6182d22d9aef3772e08e77827e6cf6116808910b75db2"}, "https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip": {"num_bytes": 7072297970, "checksum": "ecf35005640d0708eae185aab1c0a10f89b2db7420b29185a1ed92a8f4290498"}, "https://dl.fbaipublicfiles.com/textvqa/images/test_images.zip": {"num_bytes": 970296721, "checksum": "1276b908994c444c46484fb21e9e15fcda1be9c675f6ad727489e52eea68cbcd"}}, "download_size": 8070116310, "post_processing_size": null, "dataset_size": 27484210, "size_in_bytes": 8097600520}, "val": {"description": "TextVQA requires models to read and reason about text in images to answer questions about them. \nSpecifically, models need to incorporate a new modality of text present in the images and reason \nover it to answer TextVQA questions. TextVQA dataset contains 45,336 questions over 28,408 images\nfrom the OpenImages dataset. 
\n", "citation": "\n@inproceedings{singh2019towards,\n title={Towards VQA Models That Can Read},\n author={Singh, Amanpreet and Natarjan, Vivek and Shah, Meet and Jiang, Yu and Chen, Xinlei and Parikh, Devi and Rohrbach, Marcus},\n booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},\n pages={8317-8326},\n year={2019}\n}\n", "homepage": "https://textvqa.org", "license": "CC BY 4.0", "features": {"image_id": {"dtype": "string", "id": null, "_type": "Value"}, "question_id": {"dtype": "int32", "id": null, "_type": "Value"}, "question": {"dtype": "string", "id": null, "_type": "Value"}, "question_tokens": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "image": {"decode": true, "id": null, "_type": "Image"}, "image_width": {"dtype": "int32", "id": null, "_type": "Value"}, "image_height": {"dtype": "int32", "id": null, "_type": "Value"}, "flickr_original_url": {"dtype": "string", "id": null, "_type": "Value"}, "flickr_300k_url": {"dtype": "string", "id": null, "_type": "Value"}, "answers": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "image_classes": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "set_name": {"dtype": "string", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "textvqa", "config_name": "val", "version": {"version_str": "0.5.1", "description": "", "major": 0, "minor": 5, "patch": 1}, "splits": {"train": {"name": "train", "num_bytes": 21381310, "num_examples": 34602, "dataset_name": "textvqa"}, "validation": {"name": "validation", "num_bytes": 3077854, "num_examples": 5000, "dataset_name": "textvqa"}, "test": {"name": "test", "num_bytes": 3025046, "num_examples": 5734, "dataset_name": "textvqa"}}, "download_checksums": 
{"https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_train.json": {"num_bytes": 21634937, "checksum": "95f5c407db56cba56a177799dcd685a7cc0ec7c0d851b59910acf7786d31b68a"}, "https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json": {"num_bytes": 3116162, "checksum": "4ceb5aadc1a41719d0a3e4dfdf06838bcfee1db569a9a65ee67d31c99893081d"}, "https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_test.json": {"num_bytes": 2770520, "checksum": "d8d4b738101087bac5a6182d22d9aef3772e08e77827e6cf6116808910b75db2"}, "https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip": {"num_bytes": 7072297970, "checksum": "ecf35005640d0708eae185aab1c0a10f89b2db7420b29185a1ed92a8f4290498"}, "https://dl.fbaipublicfiles.com/textvqa/images/test_images.zip": {"num_bytes": 970296721, "checksum": "1276b908994c444c46484fb21e9e15fcda1be9c675f6ad727489e52eea68cbcd"}}, "download_size": 8070116310, "post_processing_size": null, "dataset_size": 27484210, "size_in_bytes": 8097600520}, "test": {"description": "TextVQA requires models to read and reason about text in images to answer questions about them. \nSpecifically, models need to incorporate a new modality of text present in the images and reason \nover it to answer TextVQA questions. TextVQA dataset contains 45,336 questions over 28,408 images\nfrom the OpenImages dataset. 
\n", "citation": "\n@inproceedings{singh2019towards,\n title={Towards VQA Models That Can Read},\n author={Singh, Amanpreet and Natarjan, Vivek and Shah, Meet and Jiang, Yu and Chen, Xinlei and Parikh, Devi and Rohrbach, Marcus},\n booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},\n pages={8317-8326},\n year={2019}\n}\n", "homepage": "https://textvqa.org", "license": "CC BY 4.0", "features": {"image_id": {"dtype": "string", "id": null, "_type": "Value"}, "question_id": {"dtype": "int32", "id": null, "_type": "Value"}, "question": {"dtype": "string", "id": null, "_type": "Value"}, "question_tokens": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "image": {"decode": true, "id": null, "_type": "Image"}, "image_width": {"dtype": "int32", "id": null, "_type": "Value"}, "image_height": {"dtype": "int32", "id": null, "_type": "Value"}, "flickr_original_url": {"dtype": "string", "id": null, "_type": "Value"}, "flickr_300k_url": {"dtype": "string", "id": null, "_type": "Value"}, "answers": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "image_classes": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "set_name": {"dtype": "string", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "textvqa", "config_name": "test", "version": {"version_str": "0.5.1", "description": "", "major": 0, "minor": 5, "patch": 1}, "splits": {"train": {"name": "train", "num_bytes": 21381310, "num_examples": 34602, "dataset_name": "textvqa"}, "validation": {"name": "validation", "num_bytes": 3077854, "num_examples": 5000, "dataset_name": "textvqa"}, "test": {"name": "test", "num_bytes": 3025046, "num_examples": 5734, "dataset_name": "textvqa"}}, "download_checksums": 
{"https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_train.json": {"num_bytes": 21634937, "checksum": "95f5c407db56cba56a177799dcd685a7cc0ec7c0d851b59910acf7786d31b68a"}, "https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json": {"num_bytes": 3116162, "checksum": "4ceb5aadc1a41719d0a3e4dfdf06838bcfee1db569a9a65ee67d31c99893081d"}, "https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_test.json": {"num_bytes": 2770520, "checksum": "d8d4b738101087bac5a6182d22d9aef3772e08e77827e6cf6116808910b75db2"}, "https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip": {"num_bytes": 7072297970, "checksum": "ecf35005640d0708eae185aab1c0a10f89b2db7420b29185a1ed92a8f4290498"}, "https://dl.fbaipublicfiles.com/textvqa/images/test_images.zip": {"num_bytes": 970296721, "checksum": "1276b908994c444c46484fb21e9e15fcda1be9c675f6ad727489e52eea68cbcd"}}, "download_size": 8070116310, "post_processing_size": null, "dataset_size": 27484210, "size_in_bytes": 8097600520}}
Binary file added datasets/textvqa/dummy/test/0.5.1/dummy_data.zip
Binary file not shown.
Binary file added datasets/textvqa/dummy/train/0.5.1/dummy_data.zip
Binary file not shown.
Binary file added datasets/textvqa/dummy/val/0.5.1/dummy_data.zip
Binary file not shown.