Awesome Vision Language Pretraining Glossary

Table of Contents

1. Models
   1.1 Encoder-Only Models
   1.2 Encoder-Decoder Models
2. Tasks & Datasets
   2.1 Pretraining Datasets
   2.2 Image Captioning
   2.3 Visual Question Answering
   2.4 Visual Reasoning
3. Surveys
4. Tutorials and Other Resources

1. Models

1.1 Encoder-Only Models

  • BERT-like Pretrained Family

| Model Name | arXiv Date | Paper | Code | Resources |
| --- | --- | --- | --- | --- |
| ViLBERT | Aug 6 2019 | paper | official | |
| VisualBERT | Aug 9 2019 | paper | official | huggingface |
| LXMERT | Aug 20 2019 | paper | official | huggingface |
| VL-BERT | Aug 22 2019 | paper | official | |
| UNITER | Sep 25 2019 | paper | official | |
| PixelBERT | Apr 2 2020 | paper | | |
| Oscar | Apr 4 2020 | paper | official | |
| VinVL | Jan 2 2021 | paper | official | |
| ViLT | Feb 5 2021 | paper | official | huggingface |
| CLIP-ViL | Jul 13 2021 | paper | official | |
| METER | Nov 3 2021 | paper | official | |

  • Contrastive Learning Family

| Model Name | arXiv Date | Paper | Code | Comment |
| --- | --- | --- | --- | --- |
| CLIP | Feb 26 2021 | paper | official | Powerful representations learned from large-scale image-text contrastive pairs. |
| ALIGN | Feb 11 2021 | paper | | Impressive image-text retrieval ability. |
| FILIP | Nov 9 2021 | paper | | Finer-grained representations learned through patch-level contrast. |
| LiT | Nov 15 2021 | paper | official | Shows that frozen (locked) image encoders are effective. |
| Florence | Nov 22 2021 | paper | | Large-scale contrastive pretraining, adapted to downstream vision tasks. |
| FLIP | Dec 1 2022 | paper | official | Further scales up negative samples by masking out 95% of image patches. |
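
The core idea shared by this contrastive family is to embed images and texts in one space and pull matched pairs together while pushing apart the other pairs in the batch. A minimal PyTorch sketch of that symmetric InfoNCE objective is below; the random tensors stand in for real encoder outputs, and the temperature value is only illustrative.

```python
# Minimal sketch of a CLIP-style image-text contrastive (InfoNCE) loss.
# Real models produce the embeddings with a ViT/ResNet image tower and a
# transformer text tower; random tensors are used here as stand-ins.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities of paired embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature    # (B, B) similarity matrix
    targets = torch.arange(image_emb.size(0))          # matched pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)        # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)    # text -> image direction
    return (loss_i2t + loss_t2i) / 2

batch, dim = 8, 512
loss = clip_contrastive_loss(torch.randn(batch, dim), torch.randn(batch, dim))
print(loss.item())
```
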
  • Large-scale Representation Learning Family

| Model Name | arXiv Date | Paper | Code | Comment |
| --- | --- | --- | --- | --- |
| MDETR | Apr 26 2021 | paper | official | Impressive visual grounding abilities achieved with DETR and RoBERTa. |
| ALBEF | Jul 16 2021 | paper | official | BLIP's predecessor. Contrastive learning for unimodal representations followed by a multimodal transformer-based encoder. |
| UniT | Aug 3 2021 | paper | official | |
| VLMo | Nov 3 2021 | paper | official | Mixture of unimodal experts before multimodal experts. |
| UFO | Nov 19 2021 | paper | | |
| FLAVA | Dec 8 2021 | paper | official | Multitask training for unimodal and multimodal representations. Can be finetuned for a variety of downstream tasks. |
| BEiT-3 | Aug 8 2022 | paper | official | VLMo scaled up. |
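
VLMo's "mixture of unimodal experts before multimodal experts" (scaled up in BEiT-3) routes every token through shared self-attention but selects the feed-forward expert by the token's modality. The block below is a rough, self-contained sketch of that idea; the dimensions and the three always-available experts are simplifications, not the papers' exact configuration.

```python
# Rough sketch of a mixture-of-modality-experts transformer block:
# shared self-attention, modality-specific feed-forward experts.
import torch
import torch.nn as nn

class ModalityExpertBlock(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # One FFN expert per modality (vision, language, fused vision-language).
        self.experts = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for name in ["vision", "language", "fusion"]
        })

    def forward(self, x, modality):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]    # shared attention
        x = x + self.experts[modality](self.norm2(x))        # modality-specific FFN
        return x

block = ModalityExpertBlock()
image_tokens = torch.randn(2, 197, 768)      # e.g. ViT patch tokens
print(block(image_tokens, modality="vision").shape)
```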

1.2 Encoder-Decoder Models

  • Medium-scaled Encoder-Decoder Family

| Model Name | arXiv Date | Paper | Code | Comment |
| --- | --- | --- | --- | --- |
| VL-T5 | Feb 4 2021 | paper | official | Unifies image-text tasks as text generation; also capable of grounding. |
| SimVLM | Aug 24 2021 | paper | official | Pretrained on large-scale image-text pairs and image-text tasks with a prefix LM objective. |
| UniTab | Nov 23 2021 | paper | official | Unifies text generation with bounding-box outputs. |
| BLIP | Jan 28 2022 | paper | official | CapFilt method for bootstrapping image-text pair generation. Contrastive learning, image-text matching, and LM as objectives. |
| CoCa | May 4 2022 | paper | pytorch | Large-scale image-text contrastive learning combined with text generation (LM). |
| GIT | May 27 2022 | paper | official | GPT-like language model conditioned on visual features extracted by a pretrained ViT (SoTA on image captioning tasks). |
| DaVinci | Jun 15 2022 | paper | official | Output generation conditioned on prefix texts or prefix images; supports both text and image generation. |
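
Most of this family trains a decoder with a language-modeling loss on captions conditioned on visual features (SimVLM's prefix LM, GIT's GPT-like decoder, CoCa's captioning branch). The sketch below shows the shape of that objective with toy stand-in modules; it is not any particular model's architecture.

```python
# Toy image-conditioned captioning loss: visual features act as the condition,
# and the decoder is trained to predict the next caption token.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim = 1000, 256
text_embed = nn.Embedding(vocab, dim)
visual_proj = nn.Linear(768, dim)             # maps vision-encoder features into LM space
layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=2)
lm_head = nn.Linear(dim, vocab)

image_feats = torch.randn(2, 197, 768)        # patch features from a vision encoder
caption = torch.randint(0, vocab, (2, 12))    # tokenised caption

seq_len = caption.size(1)
causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
hidden = decoder(text_embed(caption), memory=visual_proj(image_feats), tgt_mask=causal)
logits = lm_head(hidden)

# Next-token prediction: the logits at position t predict caption token t+1.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab), caption[:, 1:].reshape(-1))
print(loss.item())
```
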
  • Vision Signal Aligned with LLMs (Base Models)

| Model Name | arXiv Date | Paper | Code |
| --- | --- | --- | --- |
| Frozen | Jun 25 2021 | paper | |
| Flamingo | Apr 29 2022 | paper | OpenFlamingo |
| MetaLM | Jun 13 2022 | paper | official |
| PaLI | Sep 14 2022 | paper | |
| BLIP-2 | Jan 30 2023 | paper | official |
| KOSMOS | Feb 27 2023 | paper | official |
| PaLM-E | Mar 6 2023 | paper | |
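
A common thread in this family is keeping both the vision encoder and the LLM frozen and training only a lightweight bridge between them: a linear projection in the Frozen/LiMBeR style, a Q-Former in BLIP-2, gated cross-attention layers in Flamingo. The sketch below shows the simplest of these bridges, a trained projection whose outputs are prepended to the text embeddings; all shapes are illustrative.

```python
# Toy "frozen LLM + trained projection" bridge. Only the projector would
# receive gradients; the vision encoder and the LLM stay frozen.
import torch
import torch.nn as nn

llm_dim, vit_dim = 1024, 768
projector = nn.Linear(vit_dim, llm_dim)          # the only trained parameters here

frozen_vit_feats = torch.randn(1, 197, vit_dim)  # output of a frozen image encoder
text_embeds = torch.randn(1, 20, llm_dim)        # from the frozen LLM's embedding table

visual_tokens = projector(frozen_vit_feats)      # visual "soft prompt" in LLM space
inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
# inputs_embeds would be fed to the frozen LLM in place of token embeddings.
print(inputs_embeds.shape)                       # (1, 197 + 20, 1024)
```
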
  • VLMs Aligned with Human Instructions (SFT, Instruction Tuning)

| Model Name | arXiv Date | Paper | Code |
| --- | --- | --- | --- |
| LLaVA | Apr 17 2023 | paper | official |
| Mini-GPT4 | Apr 20 2023 | paper | official |
| Otter | May 5 2023 | paper | official |
| InstructBLIP | May 11 2023 | paper | official |
| VisionLLM | May 18 2023 | paper | official |
| KOSMOS-2 | Jun 26 2023 | paper | official |
| Emu | Jul 11 2023 | paper | official |
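
These models are tuned on image-grounded instruction-response conversations. Below is an illustrative, made-up sample in the multi-turn format popularised by LLaVA-style data; the field names and file path are assumptions for illustration, not the exact schema of any released dataset.

```python
# Hypothetical visual instruction-tuning sample (illustrative only).
sample = {
    "image": "coco/train2017/000000123456.jpg",   # made-up path
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is unusual about this photo?"},
        {"from": "assistant", "value": "A man is ironing clothes on the roof of a moving taxi."},
        {"from": "human", "value": "Is this safe?"},
        {"from": "assistant", "value": "No, standing on a moving vehicle is dangerous."},
    ],
}
# During SFT the loss is typically computed only on the assistant turns, with the
# image features inserted where the <image> placeholder appears.
```
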
  • Parameter-Efficient VLMs (Exploring Training Schemes)

| Model Name | arXiv Date | Paper | Code |
| --- | --- | --- | --- |
| MAGMA | Dec 9 2021 | paper | official |
| VL-Adapter | Dec 13 2021 | paper | official |
| LiMBeR | Sep 30 2022 | paper | official |
| LLaMA-Adapter | Mar 28 2023 | paper | official |
| LLaMA-Adapter-v2 | Apr 28 2023 | paper | official |
| UniAdapter | May 21 2023 | paper | official |
| ImageBind-LLM | Sep 11 2023 | paper | official |
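
These recipes keep the pretrained backbone frozen and train only small inserted modules (and sometimes a modality projection). A minimal bottleneck-adapter sketch, in the spirit of VL-Adapter / LLaMA-Adapter but not their exact designs, is shown below.

```python
# Frozen backbone block + small trainable bottleneck adapter.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Down-project -> nonlinearity -> up-project, added residually."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

backbone_block = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
for p in backbone_block.parameters():
    p.requires_grad = False                      # pretrained weights stay frozen

adapter = Adapter()                              # ~0.1M trainable parameters
x = torch.randn(2, 50, 768)
out = adapter(backbone_block(x))                 # adapter applied after the frozen block
print(sum(p.numel() for p in adapter.parameters()), "trainable adapter parameters")
```
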
  • LLMs as General Interface (New Frontiers in LLM Interface)

| Model Name | arXiv Date | Paper | Code |
| --- | --- | --- | --- |
| Visual Programming | Nov 18 2022 | paper | official |
| ViperGPT | Mar 14 2023 | paper | official |
| MM-React | Mar 20 2023 | paper | official |
| Chameleon | May 24 2023 | paper | official |
| HuggingGPT | May 25 2023 | paper | official |
| IdealGPT | May 24 2023 | paper | official |
| NextGPT | Sep 13 2023 | paper | |
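
In this line of work the LLM never sees pixels directly: it plans which vision tools to call, a runtime executes the calls, and the LLM composes the results into an answer. The toy dispatch below illustrates the pattern; the tool names and the hard-coded plan are hypothetical placeholders, not any project's actual API.

```python
# Toy "LLM as controller" tool dispatch. In practice the plan would be
# generated by the LLM from the user's question.
def detect_objects(image_path):                  # stand-in for an object detector
    return ["dog", "frisbee"]

def caption_image(image_path):                   # stand-in for a captioning model
    return "a dog catching a frisbee in a park"

TOOLS = {"detect_objects": detect_objects, "caption_image": caption_image}

plan = [
    {"tool": "detect_objects", "args": {"image_path": "photo.jpg"}},
    {"tool": "caption_image", "args": {"image_path": "photo.jpg"}},
]

results = [TOOLS[step["tool"]](**step["args"]) for step in plan]
print(results)   # the LLM would then turn these results into a final answer
```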

2. Tasks & Datasets

2.1 Pretraining Datasets

| Dataset | Year | Size | Format | Task | Link |
| --- | --- | --- | --- | --- | --- |
| SBU Captions | 2011 | 1M | image-text pairs | pretraining/image captioning | https://vislang.ai/sbu-explorer |
| YFCC-100M | 2015 | 100M | image-text pairs | pretraining | https://multimediacommons.wordpress.com/yfcc100m-core-dataset/ |
| CC3M | 2018 | 3M | image-text pairs | pretraining/image captioning | https://github.com/google-research-datasets/conceptual-12m |
| LAIT | 2020 | 10M | image-text pairs | pretraining | |
| Localized Narratives | 2020 | 849K | image-text pairs | pretraining | https://google.github.io/localized-narratives/ |
| CC12M | 2021 | 12M | image-text pairs | pretraining | https://github.com/google-research-datasets/conceptual-12m |
| LAION-400M | 2021 | 400M | image-text pairs | pretraining | https://laion.ai/laion-400-open-dataset/ |
| RedCaps | 2021 | 12M | image-text pairs | pretraining | https://redcaps.xyz/ |
| WIT | 2021 | 37.5M | image-text pairs | pretraining | https://github.com/google-research-datasets/wit |
| LAION-5B | 2022 | 5B | image-text pairs | pretraining | https://laion.ai/blog/laion-5b/ |
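
All of these corpora share the same basic unit, an (image, caption) pair. The bare-bones PyTorch Dataset below sketches that format, assuming a hypothetical TSV of `image_path<TAB>caption` lines; each corpus actually ships in its own layout (TSV, TFRecord, webdataset tars, etc.).

```python
# Minimal image-text pair Dataset over a hypothetical "image<TAB>caption" TSV.
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class ImageTextPairs(Dataset):
    def __init__(self, tsv_path, transform=None):
        self.rows = [line.split("\t") for line in Path(tsv_path).read_text().splitlines()]
        self.transform = transform

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        image_path, caption = self.rows[idx]
        image = Image.open(image_path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, caption
```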

2.2 Image Captioning

| Dataset | Year | Size | Format | Task | Link |
| --- | --- | --- | --- | --- | --- |
| Flickr30k | 2014 | 30K | image-text pairs | image captioning | https://arxiv.org/abs/1505.04870 |
| COCO | 2014 | 567K | image-text pairs | image captioning | https://cocodataset.org/#home |
| TextCaps | 2020 | 28K | image-text pairs | image captioning | https://textvqa.org/textcaps/ |
| VizWiz | 2020 | 20K | image-question-answer pairs | VQA | https://vizwiz.org/tasks-and-datasets/vqa/ |

2.3 Visual Question Answering

| Dataset | Year | Size | Format | Task | Link |
| --- | --- | --- | --- | --- | --- |
| Visual Genome | 2017 | 108K | image-question-answer pairs, region descriptions | VQA/pretraining | https://homes.cs.washington.edu/~ranjay/visualgenome/index.html |
| VQA v2 | 2017 | 1.1M | question-answer pairs | VQA | https://visualqa.org/ |
| TextVQA | 2019 | 28K | image-question-answer pairs | VQA | https://textvqa.org/ |
| OCR-VQA | 2019 | 1M | image-question-answer pairs | VQA | https://ocr-vqa.github.io/ |
| ST-VQA | 2019 | 31K | image-question-answer pairs | VQA | https://arxiv.org/abs/1905.13648 |
| OK-VQA | 2019 | 14K | image-question-answer pairs | VQA | https://okvqa.allenai.org/ |
| VizWiz | 2020 | 20K | image-question-answer pairs | VQA | https://vizwiz.org/tasks-and-datasets/vqa/ |
| IconQA | 2021 | 107K | image-question-answer pairs | VQA | https://iconqa.github.io/ |
| ScienceQA | 2022 | 21K | image-question-answer pairs | VQA | https://github.com/lupantech/ScienceQA |

2.4 Visual Reasoning

| Dataset | Year | Size | Format | Task | Link |
| --- | --- | --- | --- | --- | --- |
| NLVR | 2017 | 92K | image-grounded statements | reasoning | https://lil.nlp.cornell.edu/nlvr/ |
| GQA | 2019 | 1M | image-text pairs | visual reasoning/question answering | https://cs.stanford.edu/people/dorarad/gqa/about.html |
| Visual Commonsense Reasoning | 2019 | 110K | image-question-answer pairs | reasoning | https://visualcommonsense.com/ |
| SNLI-VE | 2019 | 530K | image-question-answer pairs | reasoning | https://github.com/necla-ml/SNLI-VE |
| Winoground | 2022 | | image-text pairs | reasoning | https://huggingface.co/datasets/facebook/winoground |

3. Surveys

4. Tutorials and Other Resources