CV-Zoo

3D

OctNet: Learning Deep 3D Representations at High Resolutions
Unsupervised Learning of Long-Term Motion Dynamics for Videos

Scene Related

Scene Graph Generation by Iterative Message Passing

Descriptive Image Related

A Hierarchical Approach for Generating Descriptive Image Paragraphs

Reinforcement Learning Related

Target-driven Visual Navigation in Indoor Scenes using Deep Reinforcement Learning

Ideas:

3D+VQA
GAN+3D or VQA

Datasets:

CLEVR: Compositional Language and Elementary Visual Reasoning