Awesome Vision Language Pretraining Glossary

Table of Contents

1. Models
   1.1 Encoder-Only Models
   1.2 Encoder-Decoder Models
2. Tasks & Datasets
   2.1 Pretraining Datasets
   2.2 Image Captioning
   2.3 Visual Question Answering
   2.4 Visual Reasoning
3. Surveys
4. Tutorials and Other Resources

1. Models

1.1 Encoder-Only Models

  • BERT-like Pretrained Family

| Model Name | arXiv Date | Paper | Code | Resources |
| --- | --- | --- | --- | --- |
| ViLBERT | Aug 6 2019 | paper | official | |
| VisualBERT | Aug 9 2019 | paper | official | huggingface |
| LXMERT | Aug 20 2019 | paper | official | huggingface |
| VL-BERT | Aug 22 2019 | paper | official | |
| UNITER | Sep 25 2019 | paper | official | |
| PixelBERT | Apr 2 2020 | paper | | |
| Oscar | Apr 4 2020 | paper | official | |
| VinVL | Jan 2 2021 | paper | official | |
| ViLT | Feb 5 2021 | paper | official | huggingface |
| CLIP-ViL | Jul 13 2021 | paper | official | |
| METER | Nov 3 2021 | paper | official | |

  • Contrastive Learning Family

| Model Name | arXiv Date | Paper | Code | Comment |
| --- | --- | --- | --- | --- |
| CLIP | Feb 26 2021 | paper | official | Powerful representations learned from large-scale image-text contrastive pairs. |
| ALIGN | Feb 11 2021 | paper | | Impressive image-text retrieval ability. |
| FILIP | Nov 9 2021 | paper | | Finer-grained representations learned through patch-level contrast. |
| LiT | Nov 15 2021 | paper | official | Shows that frozen (locked) image encoders are effective. |
| Florence | Nov 22 2021 | paper | | Large-scale contrastive pretraining, adapted to downstream vision tasks. |
| FLIP | Dec 1 2022 | paper | official | Further scales up negative samples by masking out 95% of image patches. |
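
The core idea shared by this contrastive family is to embed images and texts in one space and pull matched pairs together while pushing apart the other pairs in the batch. A minimal PyTorch sketch of that symmetric InfoNCE objective is below; the random tensors stand in for real encoder outputs, and the temperature value is only illustrative.

```python
# Minimal sketch of a CLIP-style image-text contrastive (InfoNCE) loss.
# Real models produce the embeddings with a ViT/ResNet image tower and a
# transformer text tower; random tensors are used here as stand-ins.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities of paired embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature    # (B, B) similarity matrix
    targets = torch.arange(image_emb.size(0))          # matched pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)        # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)    # text -> image direction
    return (loss_i2t + loss_t2i) / 2

batch, dim = 8, 512
loss = clip_contrastive_loss(torch.randn(batch, dim), torch.randn(batch, dim))
print(loss.item())
```
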
  • Large-scale Representation Learning Family

| Model Name | arXiv Date | Paper | Code | Comment |
| --- | --- | --- | --- | --- |
| MDETR | Apr 26 2021 | paper | official | Impressive visual grounding abilities achieved with DETR and RoBERTa. |
| ALBEF | Jul 16 2021 | paper | official | BLIP's predecessor. Contrastive learning for unimodal representations followed by a multimodal transformer-based encoder. |
| UniT | Aug 3 2021 | paper | official | |
| VLMo | Nov 3 2021 | paper | official | Mixture of unimodal experts before multimodal experts. |
| UFO | Nov 19 2021 | paper | | |
| FLAVA | Dec 8 2021 | paper | official | Multitask training for unimodal and multimodal representations. Can be finetuned for a variety of downstream tasks. |
| BEiT-3 | Aug 8 2022 | paper | official | VLMo scaled up. |
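
VLMo's "mixture of unimodal experts before multimodal experts" (scaled up in BEiT-3) routes every token through shared self-attention but selects the feed-forward expert by the token's modality. The block below is a rough, self-contained sketch of that idea; the dimensions and the three always-available experts are simplifications, not the papers' exact configuration.

```python
# Rough sketch of a mixture-of-modality-experts transformer block:
# shared self-attention, modality-specific feed-forward experts.
import torch
import torch.nn as nn

class ModalityExpertBlock(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # One FFN expert per modality (vision, language, fused vision-language).
        self.experts = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for name in ["vision", "language", "fusion"]
        })

    def forward(self, x, modality):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]    # shared attention
        x = x + self.experts[modality](self.norm2(x))        # modality-specific FFN
        return x

block = ModalityExpertBlock()
image_tokens = torch.randn(2, 197, 768)      # e.g. ViT patch tokens
print(block(image_tokens, modality="vision").shape)
```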

1.2 Encoder-Decoder Models

  • Medium-scaled Encoder-Decoder Family

| Model Name | arXiv Date | Paper | Code | Comment |
| --- | --- | --- | --- | --- |
| VL-T5 | Feb 4 2021 | paper | official | Unifies image-text tasks as text generation; also capable of grounding. |
| SimVLM | Aug 24 2021 | paper | official | Pretrained on large-scale image-text pairs and image-text tasks with a prefix LM objective. |
| UniTab | Nov 23 2021 | paper | official | Unifies text generation with bounding-box outputs. |
| BLIP | Jan 28 2022 | paper | official | CapFilt method for bootstrapping image-text pair generation. Contrastive learning, image-text matching, and LM as objectives. |
| CoCa | May 4 2022 | paper | pytorch | Large-scale image-text contrastive learning combined with text generation (LM). |
| GIT | May 27 2022 | paper | official | GPT-like language model conditioned on visual features extracted by a pretrained ViT (SoTA on image captioning tasks). |
| DaVinci | Jun 15 2022 | paper | official | Output generation conditioned on prefix texts or prefix images; supports both text and image generation. |
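
Most of this family trains a decoder with a language-modeling loss on captions conditioned on visual features (SimVLM's prefix LM, GIT's GPT-like decoder, CoCa's captioning branch). The sketch below shows the shape of that objective with toy stand-in modules; it is not any particular model's architecture.

```python
# Toy image-conditioned captioning loss: visual features act as the condition,
# and the decoder is trained to predict the next caption token.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim = 1000, 256
text_embed = nn.Embedding(vocab, dim)
visual_proj = nn.Linear(768, dim)             # maps vision-encoder features into LM space
layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=2)
lm_head = nn.Linear(dim, vocab)

image_feats = torch.randn(2, 197, 768)        # patch features from a vision encoder
caption = torch.randint(0, vocab, (2, 12))    # tokenised caption

seq_len = caption.size(1)
causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
hidden = decoder(text_embed(caption), memory=visual_proj(image_feats), tgt_mask=causal)
logits = lm_head(hidden)

# Next-token prediction: the logits at position t predict caption token t+1.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab), caption[:, 1:].reshape(-1))
print(loss.item())
```
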
  • Vision Signal Aligned with LLMs (Base Models)

| Model Name | arXiv Date | Paper | Code |
| --- | --- | --- | --- |
| Frozen | Jun 25 2021 | paper | |
| Flamingo | Apr 29 2022 | paper | OpenFlamingo |
| MetaLM | Jun 13 2022 | paper | official |
| PaLI | Sep 14 2022 | paper | |
| BLIP-2 | Jan 30 2023 | paper | official |
| KOSMOS | Feb 27 2023 | paper | official |
| PaLM-E | Mar 6 2023 | paper | |
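
A common thread in this family is keeping both the vision encoder and the LLM frozen and training only a lightweight bridge between them: a linear projection in the Frozen/LiMBeR style, a Q-Former in BLIP-2, gated cross-attention layers in Flamingo. The sketch below shows the simplest of these bridges, a trained projection whose outputs are prepended to the text embeddings; all shapes are illustrative.

```python
# Toy "frozen LLM + trained projection" bridge. Only the projector would
# receive gradients; the vision encoder and the LLM stay frozen.
import torch
import torch.nn as nn

llm_dim, vit_dim = 1024, 768
projector = nn.Linear(vit_dim, llm_dim)          # the only trained parameters here

frozen_vit_feats = torch.randn(1, 197, vit_dim)  # output of a frozen image encoder
text_embeds = torch.randn(1, 20, llm_dim)        # from the frozen LLM's embedding table

visual_tokens = projector(frozen_vit_feats)      # visual "soft prompt" in LLM space
inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
# inputs_embeds would be fed to the frozen LLM in place of token embeddings.
print(inputs_embeds.shape)                       # (1, 197 + 20, 1024)
```
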
  • VLMs Aligned with Human Instructions (SFT, Instruction Tuning)

| Model Name | arXiv Date | Paper | Code |
| --- | --- | --- | --- |
| LLaVA | Apr 17 2023 | paper | official |
| Mini-GPT4 | Apr 20 2023 | paper | official |
| Otter | May 5 2023 | paper | official |
| InstructBLIP | May 11 2023 | paper | official |
| VisionLLM | May 18 2023 | paper | official |
| KOSMOS-2 | Jun 26 2023 | paper | official |
| Emu | Jul 11 2023 | paper | official |
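
These models are tuned on image-grounded instruction-response conversations. Below is an illustrative, made-up sample in the multi-turn format popularised by LLaVA-style data; the field names and file path are assumptions for illustration, not the exact schema of any released dataset.

```python
# Hypothetical visual instruction-tuning sample (illustrative only).
sample = {
    "image": "coco/train2017/000000123456.jpg",   # made-up path
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is unusual about this photo?"},
        {"from": "assistant", "value": "A man is ironing clothes on the roof of a moving taxi."},
        {"from": "human", "value": "Is this safe?"},
        {"from": "assistant", "value": "No, standing on a moving vehicle is dangerous."},
    ],
}
# During SFT the loss is typically computed only on the assistant turns, with the
# image features inserted where the <image> placeholder appears.
```
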
  • Parameter-Efficient VLMs (Exploring Training Schemes)

| Model Name | arXiv Date | Paper | Code |
| --- | --- | --- | --- |
| MAGMA | Dec 9 2021 | paper | official |
| VL-Adapter | Dec 13 2021 | paper | official |
| LiMBeR | Sep 30 2022 | paper | official |
| LLaMA-Adapter | Mar 28 2023 | paper | official |
| LLaMA-Adapter-v2 | Apr 28 2023 | paper | official |
| UniAdapter | May 21 2023 | paper | official |
| ImageBind-LLM | Sep 11 2023 | paper | official |
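
These recipes keep the pretrained backbone frozen and train only small inserted modules (and sometimes a modality projection). A minimal bottleneck-adapter sketch, in the spirit of VL-Adapter / LLaMA-Adapter but not their exact designs, is shown below.

```python
# Frozen backbone block + small trainable bottleneck adapter.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Down-project -> nonlinearity -> up-project, added residually."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

backbone_block = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
for p in backbone_block.parameters():
    p.requires_grad = False                      # pretrained weights stay frozen

adapter = Adapter()                              # ~0.1M trainable parameters
x = torch.randn(2, 50, 768)
out = adapter(backbone_block(x))                 # adapter applied after the frozen block
print(sum(p.numel() for p in adapter.parameters()), "trainable adapter parameters")
```
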
  • LLMs as General Interface (New Frontiers in LLM Interface)

| Model Name | arXiv Date | Paper | Code |
| --- | --- | --- | --- |
| Visual Programming | Nov 18 2022 | paper | official |
| ViperGPT | Mar 14 2023 | paper | official |
| MM-React | Mar 20 2023 | paper | official |
| Chameleon | May 24 2023 | paper | official |
| HuggingGPT | May 25 2023 | paper | official |
| IdealGPT | May 24 2023 | paper | official |
| NextGPT | Sep 13 2023 | paper | |
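
In this line of work the LLM never sees pixels directly: it plans which vision tools to call, a runtime executes the calls, and the LLM composes the results into an answer. The toy dispatch below illustrates the pattern; the tool names and the hard-coded plan are hypothetical placeholders, not any project's actual API.

```python
# Toy "LLM as controller" tool dispatch. In practice the plan would be
# generated by the LLM from the user's question.
def detect_objects(image_path):                  # stand-in for an object detector
    return ["dog", "frisbee"]

def caption_image(image_path):                   # stand-in for a captioning model
    return "a dog catching a frisbee in a park"

TOOLS = {"detect_objects": detect_objects, "caption_image": caption_image}

plan = [
    {"tool": "detect_objects", "args": {"image_path": "photo.jpg"}},
    {"tool": "caption_image", "args": {"image_path": "photo.jpg"}},
]

results = [TOOLS[step["tool"]](**step["args"]) for step in plan]
print(results)   # the LLM would then turn these results into a final answer
```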

2. Tasks & Datasets

2.1 Pretraining Datasets

| Dataset | Year | Size | Format | Task | Link |
| --- | --- | --- | --- | --- | --- |
| SBU Captions | 2011 | 1M | image-text pairs | pretraining/image captioning | https://vislang.ai/sbu-explorer |
| YFCC-100M | 2015 | 100M | image-text pairs | pretraining | https://multimediacommons.wordpress.com/yfcc100m-core-dataset/ |
| CC3M | 2018 | 3M | image-text pairs | pretraining/image captioning | https://github.com/google-research-datasets/conceptual-12m |
| LAIT | 2020 | 10M | image-text pairs | pretraining | |
| Localized Narratives | 2020 | 849K | image-text pairs | pretraining | https://google.github.io/localized-narratives/ |
| CC12M | 2021 | 12M | image-text pairs | pretraining | https://github.com/google-research-datasets/conceptual-12m |
| LAION-400M | 2021 | 400M | image-text pairs | pretraining | https://laion.ai/laion-400-open-dataset/ |
| RedCaps | 2021 | 12M | image-text pairs | pretraining | https://redcaps.xyz/ |
| WIT | 2021 | 37.5M | image-text pairs | pretraining | https://github.com/google-research-datasets/wit |
| LAION-5B | 2022 | 5B | image-text pairs | pretraining | https://laion.ai/blog/laion-5b/ |
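
All of these corpora share the same basic unit, an (image, caption) pair. The bare-bones PyTorch Dataset below sketches that format, assuming a hypothetical TSV of `image_path<TAB>caption` lines; each corpus actually ships in its own layout (TSV, TFRecord, webdataset tars, etc.).

```python
# Minimal image-text pair Dataset over a hypothetical "image<TAB>caption" TSV.
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class ImageTextPairs(Dataset):
    def __init__(self, tsv_path, transform=None):
        self.rows = [line.split("\t") for line in Path(tsv_path).read_text().splitlines()]
        self.transform = transform

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        image_path, caption = self.rows[idx]
        image = Image.open(image_path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, caption
```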

2.2 Image Captioning

| Dataset | Year | Size | Format | Task | Link |
| --- | --- | --- | --- | --- | --- |
| Flickr30k | 2014 | 30K | image-text pairs | image captioning | https://arxiv.org/abs/1505.04870 |
| COCO | 2014 | 567K | image-text pairs | image captioning | https://cocodataset.org/#home |
| TextCaps | 2020 | 28K | image-text pairs | image captioning | https://textvqa.org/textcaps/ |
| VizWiz | 2020 | 20K | image-question-answer pairs | VQA | https://vizwiz.org/tasks-and-datasets/vqa/ |

2.3 Visual Question Answering

| Dataset | Year | Size | Format | Task | Link |
| --- | --- | --- | --- | --- | --- |
| Visual Genome | 2017 | 108K | image-question-answer pairs, region descriptions | VQA/pretraining | https://homes.cs.washington.edu/~ranjay/visualgenome/index.html |
| VQA v2 | 2017 | 1.1M | question-answer pairs | VQA | https://visualqa.org/ |
| TextVQA | 2019 | 28K | image-question-answer pairs | VQA | https://textvqa.org/ |
| OCR-VQA | 2019 | 1M | image-question-answer pairs | VQA | https://ocr-vqa.github.io/ |
| ST-VQA | 2019 | 31K | image-question-answer pairs | VQA | https://arxiv.org/abs/1905.13648 |
| OK-VQA | 2019 | 14K | image-question-answer pairs | VQA | https://okvqa.allenai.org/ |
| VizWiz | 2020 | 20K | image-question-answer pairs | VQA | https://vizwiz.org/tasks-and-datasets/vqa/ |
| IconQA | 2021 | 107K | image-question-answer pairs | VQA | https://iconqa.github.io/ |
| ScienceQA | 2022 | 21K | image-question-answer pairs | VQA | https://github.com/lupantech/ScienceQA |

2.4 Visual Reasoning

| Dataset | Year | Size | Format | Task | Link |
| --- | --- | --- | --- | --- | --- |
| NLVR | 2017 | 92K | image-grounded statements | reasoning | https://lil.nlp.cornell.edu/nlvr/ |
| GQA | 2019 | 1M | image-text pairs | visual reasoning/question answering | https://cs.stanford.edu/people/dorarad/gqa/about.html |
| Visual Commonsense Reasoning | 2019 | 110K | image-question-answer pairs | reasoning | https://visualcommonsense.com/ |
| SNLI-VE | 2019 | 530K | image-question-answer pairs | reasoning | https://github.com/necla-ml/SNLI-VE |
| Winoground | 2022 | | image-text pairs | reasoning | https://huggingface.co/datasets/facebook/winoground |

3. Surveys

4. Tutorials and Other Resources