We provide the extracted image region features, object tags, and the original text annotations for each downstream task.
path/to/azcopy copy 'https://biglmdiag.blob.core.windows.net/vinvl/datasets/TASK_NAME' <target folder> --recursive
TASK_NAME could be coco_caption, nocaps, coco_ir, vqa, gqa, nlvr2.
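To fetch several tasks in one go, the command above can be generated once per task. The following is a minimal sketch rather than part of the released tooling: the helper name `build_download_commands`, the default `azcopy` path, and the example target folder are assumptions, while the base URL and task names are the ones listed above. `--recursive` is the AzCopy flag for copying a whole folder.

```python
import shlex

# Base URL and task names as listed above.
DATASET_BASE_URL = "https://biglmdiag.blob.core.windows.net/vinvl/datasets"
TASK_NAMES = ["coco_caption", "nocaps", "coco_ir", "vqa", "gqa", "nlvr2"]


def build_download_commands(target_folder, azcopy="path/to/azcopy", tasks=TASK_NAMES):
    """Return one azcopy argument list per task (hypothetical helper)."""
    commands = []
    for task in tasks:
        if task not in TASK_NAMES:
            raise ValueError(f"unknown task name: {task}")
        source = f"{DATASET_BASE_URL}/{task}"
        commands.append([azcopy, "copy", source, target_folder, "--recursive"])
    return commands


if __name__ == "__main__":
    # Only print the commands for review. To execute one, pass the list to
    # subprocess.run(cmd, check=True).
    for cmd in build_download_commands("./vinvl_datasets"):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to check the URLs before starting a large transfer.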
We provide pre-trained Oscar+ models with BERT-base and BERT-large architectures; their names start with base and large, respectively.
path/to/azcopy copy 'https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/TASK_NAME' <target folder> --recursive
TASK_NAME could be coco_caption, nocaps, coco_ir, vqa, gqa, nlvr2, od_models.
The models are trained with both image region features and object tags. The image region features are extracted by the Faster R-CNN with
ResNet-101, using object and attribute annotations from Visual Genome.
The object tags are from:
1) the same Visual Genome model, named -vg-labels; or
2) a model trained on object annotations from Open Images V5, named -oid-labels; or
3) no object tags provided, serving as a baseline, named -no-labels.
For ease of use, we make pretrained features available for all pretraining datasets and downstream tasks. Features are stored in TSV (tab-separated values) format and can be used in pretraining and in downstream tasks such as COCO Image-Text Retrieval.
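As a rough illustration of consuming such a file, the sketch below iterates over a TSV file row by row. The column layout is an assumption (first column a row key such as an image ID, remaining columns the payload) and is not a documented schema, so check the downloaded files before relying on it. The function name `iter_tsv_rows` is hypothetical.

```python
import csv


def iter_tsv_rows(path):
    """Yield (key, payload_columns) for each row of a tab-separated file.

    Assumes the first column is a row key; this layout is an illustration,
    not a documented schema.
    """
    # Feature columns can be very long, so raise the csv field size limit.
    csv.field_size_limit(2**31 - 1)
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps quote characters (e.g. inside JSON payloads) as-is.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if not row:
                continue  # skip blank lines
            yield row[0], row[1:]
```

Because the function is a generator, it streams rows one at a time instead of loading a multi-gigabyte file into memory.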
Note that each link below points to a folder. We recommend using the following AzCopy command to download.
path/to/azcopy copy <folder-link> <target-address> --recursive
COCO 2014 Train/Val Image Features (~50G)
COCO 2014 Test Image Features (~16G)
COCO 2015 Test Image Features (~32G)
NLVR2 Train/Dev/Test Image Features (~28G)
Flickr30k All Image Features (~14G)
Google Conceptual Captions Image Features (Huge, ~960G, split into 12 chunks)
SBU Image Features (Huge, ~280G, split into 4 chunks)
Open Images Detection Image Features (Huge, ~530G, split into 8 chunks)
We have tried our best to make sure that there is no data contamination between the pretraining corpus and the test sets for downstream tasks. More specifically, we use two methods to achieve this. (1) We use the COCO image IDs of Visual Genome and Flickr30k images. (2) For COCO, Visual Genome, and Flickr30k, we calculate the pairwise L2 distance between images after resizing them to the same size.
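The second check can be sketched as follows. This is a minimal illustration, not the authors' code: the nearest-neighbour resize, the 8x8 comparison size, the grayscale list-of-rows image representation, and the distance threshold are all assumptions chosen to keep the example dependency-free.

```python
import math


def resize_nearest(image, height, width):
    """Resize a 2D grayscale image (list of rows) with nearest-neighbour sampling."""
    src_h, src_w = len(image), len(image[0])
    return [
        [image[r * src_h // height][c * src_w // width] for c in range(width)]
        for r in range(height)
    ]


def l2_distance(image_a, image_b, size=(8, 8)):
    """L2 norm of the pixel-wise difference after resizing both images to `size`."""
    a = resize_nearest(image_a, *size)
    b = resize_nearest(image_b, *size)
    return math.sqrt(
        sum((pa - pb) ** 2 for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))
    )


def find_near_duplicates(pretrain_images, test_images, threshold=1.0):
    """Return (pretrain_key, test_key) pairs whose distance is below `threshold`.

    Both arguments map an image key to a 2D grayscale image. The threshold is
    an illustrative value, not one taken from the text.
    """
    return [
        (p_key, t_key)
        for p_key, p_img in pretrain_images.items()
        for t_key, t_img in test_images.items()
        if l2_distance(p_img, t_img) < threshold
    ]
```

Pairs returned by `find_near_duplicates` would be candidates for removal from the pretraining corpus. Comparing every pretraining image against every test image is quadratic, so a real run over these datasets would need batching or an approximate nearest-neighbour index.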
It is recommended to download large files with AzCopy for faster speed. AzCopy executable tools can be downloaded here. Decompress the tar file and put the executable in any path. To download from any URL above, the command is:
path/to/azcopy copy <URL> <local_path>
# for example, downloading coco_caption.zip
path/to/azcopy copy https://biglmdiag.blob.core.windows.net/oscar/datasets/coco_caption.zip <local_path>