# Recent Multi-modal Deep Learning Advances (a list of papers and highlights)
There have been many recent advances in using unified models (e.g., the Transformer) to create representations for multiple modalities, and some even fuse multiple modalities so that different modalities can help each other. Here, "multiple modalities" covers not only natural language, vision, and speech, but also formal languages (e.g., code), (semi-)structured knowledge (e.g., tables, knowledge graphs), and biological/chemical compounds (e.g., proteins, molecules). This is a list of recent important papers in this field. Contributions are welcome.
- Introduction
- Resources
- Natural Language
- Vision
- Speech
- Formal Language / Code
- Structured Knowledge
- Biology / Chemistry
- Modality Fusion
## Natural Language

- BERT, RoBERTa, BART, SpanBERT, UniLM, PEGASUS, ELECTRA, T5, GPT-k, FLAN, InstructGPT, etc. (a minimal fill-mask sketch follows this list).
- data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language, arXiv Feb 2022.
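Most of the models above share one recipe: corrupt a token sequence and train a Transformer to reconstruct what was removed. As a concrete illustration, here is a minimal fill-mask sketch, assuming the HuggingFace `transformers` library and the public `bert-base-uncased` checkpoint (both are assumptions of this example, not something the papers prescribe):

```python
# Minimal masked-language-modeling demo. Any fill-mask checkpoint works;
# bert-base-uncased is used here only for illustration.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model predicts a distribution over the vocabulary for [MASK].
for prediction in fill_mask("Multi-modal models align text and [MASK]."):
    print(f"{prediction['token_str']:>12}  p={prediction['score']:.3f}")
```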
## Vision

- DETR: End-to-End Object Detection with Transformers, ECCV 2020.
- ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021.
- DeiT: Training data-efficient image transformers & distillation through attention, arXiv Dec 2020.
- MoCo-V3: An Empirical Study of Training Self-Supervised Vision Transformers, ICCV 2021.
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, arXiv Aug 2021.
- DINO: Emerging Properties in Self-Supervised Vision Transformers, arXiv April 2021.
- BEiT: BERT Pre-Training of Image Transformers, arXiv Jun 2021.
- SimMIM: A Simple Framework for Masked Image Modeling, arXiv Nov 2021.
- MAE: Masked Autoencoders Are Scalable Vision Learners, arXiv Nov 2021 (see the patch-masking sketch after this list).
- data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language, arXiv Feb 2022.
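ViT and the masked-image-modeling papers above (BEiT, SimMIM, MAE) all start from the same move: split the image into fixed-size patches and treat them as tokens. Below is a minimal PyTorch sketch of that front end plus MAE-style random masking; all shapes and ratios are illustrative assumptions, not any paper's reference code:

```python
import torch
import torch.nn as nn

patch, dim = 16, 768
# "An Image is Worth 16x16 Words": a strided convolution cuts the image
# into non-overlapping 16x16 patches and projects each to a token vector.
to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

images = torch.randn(8, 3, 224, 224)                   # batch of RGB images
tokens = to_tokens(images).flatten(2).transpose(1, 2)  # (8, 196, 768)

# MAE-style masking: keep a random 25% of patches and drop the rest; the
# encoder then runs only on the visible tokens, which is a large part of
# why masked autoencoding scales well.
num_keep = int(tokens.size(1) * 0.25)
perm = torch.rand(tokens.size(0), tokens.size(1)).argsort(dim=1)
keep = perm[:, :num_keep]                              # visible-patch indices
visible = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, dim))
print(visible.shape)                                   # torch.Size([8, 49, 768])
```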
## Speech

- wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, arXiv Jun 2020 (see the masked contrastive sketch after this list).
- HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units, arXiv Jun 2021.
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing, arXiv Oct 2021.
- data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language, arXiv Feb 2022.
- wav2vec-U: Unsupervised Speech Recognition, arXiv May 2021.
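wav2vec 2.0 (and, with variations, HuBERT and WavLM) masks spans of latent speech frames and trains a context network to identify the true latent at each masked position against distractors. Here is a heavily simplified sketch of that masked contrastive step, with stand-in modules and shapes rather than the papers' architectures:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, D = 4, 100, 256
latents = torch.randn(B, T, D)  # stand-in for the CNN feature encoder's output
context_net = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True),
    num_layers=2)

# Hide a subset of frames (position 0 is forced on so the loss below is
# valid); the real model replaces masked frames with a learned embedding.
mask = torch.rand(B, T) < 0.5
mask[:, 0] = True
context = context_net(latents.masked_fill(mask.unsqueeze(-1), 0.0))

# Contrastive objective at masked position 0: the context vector should
# score the true latent (index 0) above distractor latents drawn from the
# same utterance. Temperature 0.1 is an arbitrary choice for the sketch.
scores = torch.einsum("bd,btd->bt", context[:, 0], latents) / 0.1
loss = F.cross_entropy(scores, torch.zeros(B, dtype=torch.long))
print(loss.item())
```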
## Formal Language / Code

- CodeBERT: A Pre-Trained Model for Programming and Natural Languages, EMNLP 2020 (Findings) (see the encoding sketch after this list).
- Codex: Evaluating Large Language Models Trained on Code, arXiv Jul 2021.
- GraphCodeBERT: Pre-training Code Representations with Data Flow, ICLR 2021.
- Transformer Embeddings of Irregularly Spaced Events and Their Participants, ICLR 2022.
- AlphaCode: Competition-Level Code Generation with AlphaCode, Feb 2022.
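CodeBERT treats natural language and source code as two segments of one sequence and pre-trains a RoBERTa-style encoder on the pair. A usage sketch, assuming the `transformers` library and the public `microsoft/codebert-base` checkpoint:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

# A (comment, code) pair goes in as one sequence, separated by the
# tokenizer's special tokens, mirroring the bimodal pre-training setup.
inputs = tok("return the maximum of two numbers",
             "def f(a, b): return a if a > b else b",
             return_tensors="pt")
with torch.no_grad():
    embedding = model(**inputs).last_hidden_state[:, 0]  # first-token vector

print(embedding.shape)  # torch.Size([1, 768])
```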
## Structured Knowledge

- UNIFIEDSKG: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models, arXiv Jan 2022.
- TABERT: Pretraining for Joint Understanding of Textual and Tabular Data, ACL 2020.
- GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing, ICLR 2021.
- TAPAS: Weakly Supervised Table Parsing via Pre-training, ACL 2020.
- STRUG: Structure-Grounded Pretraining for Text-to-SQL, NAACL 2021.
- TAPEX: Table Pre-training via Learning a Neural SQL Executor, ICLR 2022.
- TableFormer: Robust Transformer Modeling for Table-Text Encoding, ACL 2022.
- COMET: Commonsense Transformers for Automatic Knowledge Graph Construction, ACL 2019.
- (COMET-)ATOMIC-2020: On Symbolic and Neural Commonsense Knowledge Graphs, arXiv Oct 2020.
- Knowledge is Power: Symbolic Knowledge Distillation, Commonsense Morality, & Multimodal Script Knowledge, WSDM 2022.
- REALM: Retrieval-Augmented Language Model Pre-Training, arXiv Feb 2020.
- MARGE: Pre-training via Paraphrasing, NeurIPS 2020.
- Dense Passage Retrieval for Open-Domain Question Answering, EMNLP 2020 (see the dense-retrieval sketch after this list).
- RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, NeurIPS 2020.
- End-to-End Training of Neural Retrievers for Open-Domain Question Answering, ACL 2021.
- Condenser: a Pre-training Architecture for Dense Retrieval, EMNLP 2021.
- Spider: Learning to Retrieve Passages without Supervision, arXiv Dec 2021.
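The retrieval papers in this section (DPR, RAG, Condenser, Spider) build on the same two-tower pattern: encode queries and passages separately, score them by inner product, and train with in-batch negatives. A minimal sketch with placeholder linear encoders standing in for the papers' BERT encoders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 128
question_enc = nn.Linear(300, D)  # placeholder for a BERT question encoder
passage_enc = nn.Linear(300, D)   # placeholder for a BERT passage encoder

q = question_enc(torch.randn(16, 300))  # a batch of question features
p = passage_enc(torch.randn(16, 300))   # the gold passage for each question

# scores[i, j] compares question i with passage j. Each question's positive
# is on the diagonal; all other passages in the batch act as negatives
# ("in-batch negatives"), so the target is simply arange(batch).
scores = q @ p.t()
loss = F.cross_entropy(scores, torch.arange(16))
print(loss.item())

# At inference time passages are encoded once, indexed (e.g., with FAISS),
# and the top-k passages by inner product are retrieved per question.
```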
## Modality Fusion

- ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, NeurIPS 2019.
- LXMERT: Learning Cross-Modality Encoder Representations from Transformers, EMNLP 2019.
- VisualBERT: A Simple and Performant Baseline for Vision and Language, ACL 2020.
- Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, arXiv Dec 2019.
- UNITER: UNiversal Image-TExt Representation Learning, arXiv July 2020.
- Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, ECCV 2020.
- VILLA: Large-Scale Adversarial Training for Vision-and-Language Representation Learning, NeurIPS 2020.
- ViLBERT-MT: 12-in-1: Multi-Task Vision and Language Representation Learning, CVPR 2020.
- Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers, arXiv April 2020.
- U-VisualBERT: Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions, NAACL 2021.
- M6: A Chinese Multimodal Pretrainer, arXiv March 2021.
- DALL·E: Zero-Shot Text-to-Image Generation, arXiv Feb 2021.
- CLIP: Learning Transferable Visual Models From Natural Language Supervision, arXiv Feb 2021 (see the contrastive-loss sketch after this list).
- UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning, ACL 2021.
- ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision, ICML 2021.
- SimVLM: Simple Visual Language Model Pretraining with Weak Supervision, arXiv Aug 2021.
- ALBEF: Align before Fuse: Vision and Language Representation Learning with Momentum Distillation, arXiv July 2021.
- VinVL: Revisiting Visual Representations in Vision-Language Models, CVPR 2021.
- LAFITE: Towards Language-Free Training for Text-to-Image Generation, arXiv Nov 2021.
- VLMO: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts, arXiv Nov 2021.
- GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, arXiv Dec 2021.
- FLAVA: A Foundational Language And Vision Alignment Model, arXiv Dec 2021.
- Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks, arXiv Dec 2021.
- CM3: A Causal Masked Multimodal Model of the Internet, arXiv Jan 2022.
- OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework, arXiv Feb 2022.
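Among the fusion models above, CLIP is the cleanest to sketch: two encoders trained so that matching image-text pairs out-score all mismatched pairs in the batch, via a symmetric contrastive loss. Below, random vectors stand in for the encoder outputs; only the loss computation is the point:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, D = 32, 512
image_emb = F.normalize(torch.randn(B, D), dim=-1)  # stand-in image features
text_emb = F.normalize(torch.randn(B, D), dim=-1)   # stand-in text features
logit_scale = nn.Parameter(torch.tensor(2.659).exp())  # learnable temperature, ~1/0.07

# logits[i, j] is the scaled similarity of image i and caption j; the
# matching pairs sit on the diagonal.
logits = logit_scale * image_emb @ text_emb.t()
targets = torch.arange(B)

# Symmetric InfoNCE: cross-entropy over rows (image -> text) plus
# cross-entropy over columns (text -> image).
loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.t(), targets)) / 2
print(loss.item())
```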