- UCLA / Baidu [Paper]
- Explain Images with Multimodal Recurrent Neural Networks, arXiv:1410.1090.
- Toronto [Paper]
- Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, arXiv:1411.2539.
- Berkeley [Paper]
- Long-term Recurrent Convolutional Networks for Visual Recognition and Description, arXiv:1411.4389.
- Google [Paper]
- Show and Tell: A Neural Image Caption Generator, arXiv:1411.4555.
- Stanford [Web] [Paper]
- Deep Visual-Semantic Alignments for Generating Image Descriptions, CVPR, 2015.
- UML / UT [Paper]
- Translating Videos to Natural Language Using Deep Recurrent Neural Networks, NAACL-HLT, 2015.
- CMU / Microsoft [Paper-arXiv] [Paper-CVPR]
- Learning a Recurrent Visual Representation for Image Caption Generation, arXiv:1411.5654.
- Mind’s Eye: A Recurrent Visual Representation for Image Caption Generation, CVPR 2015
- Microsoft [Paper]
- From Captions to Visual Concepts and Back, CVPR, 2015.
- Univ. Montreal / Univ. Toronto [Web] [Paper]
- Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, arXiv:1502.03044 / ICML 2015
- Idiap / EPFL / Facebook [Paper]
- Phrase-based Image Captioning, arXiv:1502.03671 / ICML 2015
- UCLA / Baidu [Paper]
- Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images, arXiv:1504.06692
- MS + Berkeley
- Adelaide [Paper]
- Image Captioning with an Intermediate Attributes Layer, arXiv:1506.01144
- Tilburg [Paper]
- Learning language through pictures, arXiv:1506.03694
- Univ. Montreal [Paper]
- Describing Multimedia Content using Attention-based Encoder-Decoder Networks, arXiv:1507.01053
- Cornell [Paper]
- Image Representations and New Domains in Neural Image Captioning, arXiv:1508.02091
- MS + City Univ. of HongKong [Paper]
- Learning Query and Image Similarities with Ranking Canonical Correlation Analysis, ICCV, 2015
- Berkeley [Web] [Paper]
- Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR, 2015.
- UT / UML / Berkeley [Paper]
- Translating Videos to Natural Language Using Deep Recurrent Neural Networks, arXiv:1412.4729.
- Microsoft [Paper]
- Jointly Modeling Embedding and Translation to Bridge Video and Language, arXiv:1505.01861.
- UT / Berkeley / UML [Paper]
- Sequence to Sequence--Video to Text, arXiv:1505.00487.
- Univ. Montreal / Univ. Sherbrooke [Paper]
- Describing Videos by Exploiting Temporal Structure, arXiv:1502.08029
- MPI / Berkeley [Paper]
- The Long-Short Story of Movie Description, arXiv:1506.01698
- Univ. Toronto / MIT [Paper]
- Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books, arXiv:1506.06724
- Univ. Montreal [Paper]
- Describing Multimedia Content using Attention-based Encoder-Decoder Networks, arXiv:1507.01053
- TAU / USC [Paper]
- Temporal Tessellation for Video Annotation and Summarization, arXiv:1612.06950.
- Virginia Tech / MSR [Web] [Paper]
- VQA: Visual Question Answering, CVPR 2015 SUNw: Scene Understanding Workshop.
- Ask Your Neurons: A Neural-based Approach to Answering Questions about Images, arXiv:1505.01121.
- Image Question Answering: A Visual Semantic Embedding Model and a New Dataset, arXiv:1505.02074 / ICML 2015 deep learning workshop.
- Baidu / UCLA [Paper] [Dataset]
- Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering, arXiv:1505.05612.
- POSTECH [Paper] [Project Page]
- Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction, arXiv:1511.05765
- CMU / Microsoft Research [Paper]
- Stacked Attention Networks for Image Question Answering, arXiv:1511.02274.
- MetaMind [Paper]
- Dynamic Memory Networks for Visual and Textual Question Answering, arXiv:1603.01417, 2016.
- SNU + NAVER [Paper]
- Multimodal Residual Learning for Visual QA, arXiv:1606.01455
- UC Berkeley + Sony [Paper]
- Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding, arXiv:1606.01847
- POSTECH [Paper]
- Training Recurrent Answering Units with Joint Loss Minimization for VQA, arXiv:1606.03647
- SNU + NAVER [Paper]
- Hadamard Product for Low-rank Bilinear Pooling, arXiv:1610.04325.