A curated GitHub repository that collects and surveys vision-language model (VLM) papers and models.
Below we compile awesome papers, models, and GitHub repositories covering:
- State-of-the-Art VLMs: a collection of recent VLMs (2022-2024) along with earlier foundational models.
- Evaluation: VLM benchmarks with links to the corresponding works.
- Applications: applications of VLMs in embodied AI, robotics, and beyond.
- Contribute: we welcome contributions of surveys, perspectives, and datasets on the above topics.
Contributions and discussions are welcome!
🤩 Papers marked with a ⭐️ are contributed by the maintainers of this repository. If you find them useful, we would greatly appreciate it if you could give the repository a star or cite our paper.
- 3. Applications
- 3.1. Embodied VLM agents
- 3.2. Generative Visual Media Applications
- 3.3. Robotics and Embodied AI
- 3.3.1. Manipulation
- 3.3.2. Navigation
- 3.3.3. Human-robot Interaction
- 3.3.4. Autonomous Driving
- 3.4. Human-Centered AI
- 3.4.1. Web Agent
- 3.4.2. Accessibility
- 3.4.3. Healthcare
- 3.4.4. Social Goodness
- 5. Challenges
- 5.1. Hallucination
- 5.2. Safety
- 5.3. Fairness
- 5.4. Multi-modality Alignment
- 5.5. Efficient Training and Fine-Tuning
- 5.6. Scarcity of High-Quality Datasets
Model | Year | Architecture | Training Data | Parameters | Vision Encoder/Tokenizer | Pretrained Backbone Model |
---|---|---|---|---|---|---|
VisualBERT | 2019 | Encoder-only | COCO | 110M | Faster R-CNN | Pretrained from scratch |
CLIP | 2021 | Encoder-decoder | 400M image-text pairs | 63M-355M | ViT/ResNet | Pretrained from scratch |
BLIP | 2022 | Encoder-decoder | COCO, Visual Genome | 223M-400M | ViT-B/L/g | Pretrained from scratch |
Flamingo | 2022 | Decoder-only | M3W, ALIGN | 80B | Custom | Chinchilla |
BLIP-2 | 2023 | Encoder-decoder | COCO, Visual Genome | 7B-13B | ViT-g | Open Pretrained Transformer (OPT) |
GPT-4V | 2023 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
Gemini | 2023 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
LLaVA-1.5 | 2023 | Decoder-only | COCO | 13B | CLIP ViT-L/14 | Vicuna |
PaLM-E | 2023 | Decoder-only | All robots, WebLI | 562B | ViT | PaLM |
CogVLM | 2023 | Encoder-decoder | LAION-2B, COYO-700M | 18B | CLIP ViT-L/14 | Vicuna |
InstructBLIP | 2023 | Encoder-decoder | COCO, VQAv2 | 13B | ViT | Flan-T5, Vicuna |
InternVL | 2023 | Encoder-decoder | LAION-en, LAION-multi | 7B/20B | EVA-CLIP ViT-g | QLLaMA |
Claude 3 | 2024 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
Emu3 | 2024 | Decoder-only | Aquila | 7B | MoVQGAN | LLaMA-2 |
NVLM | 2024 | Encoder-decoder | LAION-115M | 8B-24B | Custom ViT | Qwen-2-Instruct |
Qwen2-VL | 2024 | Decoder-only | Undisclosed | 7B-14B | EVA-CLIP ViT-L | Qwen-2 |
Pixtral | 2024 | Decoder-only | Undisclosed | 12B | CLIP ViT-L/14 | Mistral Large 2 |
LLaMA 3.2-vision | 2024 | Decoder-only | Undisclosed | 11B-90B | CLIP | LLaMA-3.1 |
Baichuan Ocean Mini | 2024 | Decoder-only | Image/Video/Audio/Text | 7B | CLIP ViT-L/14 | Baichuan |
TransFusion | 2024 | Encoder-decoder | Undisclosed | 7B | VAE Encoder | Pretrained from scratch on transformer architecture |
DeepSeek-VL2 | 2024 | Decoder-only | WiT, WikiHow | 4.5B x 74 | SigLIP/SAMB | DeepSeekMoE |
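Most open models in the table above expose the same processor-plus-language-model interface. As a quick, hedged illustration, the sketch below runs zero-shot image captioning with BLIP-2 through Hugging Face `transformers`; the checkpoint name and test image URL are just common examples, and closed models (GPT-4V, Gemini, Claude 3) are only reachable through their vendors' APIs.

```python
# Minimal sketch: zero-shot image captioning with BLIP-2 via Hugging Face transformers.
# The checkpoint and image URL are illustrative; other open models in the table
# (e.g., LLaVA-1.5, Qwen2-VL) follow a similar processor + generate() workflow.
import requests
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# For captioning, no text prompt is needed; for VQA, prepend a question such as
# "Question: how many cats are there? Answer:".
inputs = processor(images=image, return_tensors="pt").to(device)
generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```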
Benchmark Dataset | Metric Type | Source | Size (K) | Project |
---|---|---|---|---|
MMTBench | Multiple Choice | AI Experts | 30.1 | Github Repo |
MM-Vet | LLM Eval | Human | 0.2 | Github Repo |
MM-En/CN | Multiple Choice | Human | 3.2 | Github Repo |
GQA | Answer Matching | Seed with Synthetic | 22,000 | Website |
VCR | Multiple Choice | MTurks | 290 | Website |
VQAv2 | Yes/No; Answer Matching | MTurks | 1,100 | Github Repo |
MMMU | Answer Matching; Multiple Choice | College Students | 11.5 | Website |
SEEDBench | Multiple Choice | Synthetic | 19 | Github Repo |
RealWorld QA | Multiple Choice | Human | 0.765 | Huggingface |
MMMU-Pro | Multiple Choice | Human | 3.64 | Website |
DPG-Bench | Semantic Alignment | Synthetic | 1.06 | Website |
MSCOCO-30K | BLEU, Rouge, Similarity | MTurks | 30 | Website |
TextVQA | Answer Matching | CrowdSource | 45 | Github Repo |
DocVQA | Answer Matching | CrowdSource | 50 | Website |
CMMLU | Multiple Choice | College Students | 11.5 | Github Repo |
C-Eval | Multiple Choice | Human | 13.9 | Website |
TextVQA | Answer Matching | Expert Human | 28.6 | Github Repo |
MathVista | Answer Matching; Multiple Choice | Human | 6.15 | Website |
MathVision | Answer Matching; Multiple Choice | College Students | 3.04 | Website |
OCRBench | Answer Matching (ANLS) | Human | 1 | Github Repo |
MME | Yes/No | Human | 2.8 | Github Repo |
InfographicVQA | Answer Matching | CrowdSource | 30 | Website |
AI2D | Answer Matching | CrowdSource | 1 | Website |
ChartQA | Answer Matching | CrowdSource/Synthetic | 32.7 | Github Repo |
GenEval | CLIPScore; GenEval | MTurks | 1.2 | Github Repo |
T2I-CompBench | Multiple Metrics | Synthetic | 6 | Website |
HallusionBench | Yes/No | Human | 1.13 | Github Repo |
POPE | Yes/No | Human | 9 | Github Repo |
MMLU | Multiple Choice | Human | 15.9 | Github Repo |
MMStar | Multiple Choice | Human | 1.5 | Website |
M3GIA | Multiple Choice | Human | 1.8 | Huggingface |
AGIEval | Multiple Choice; Answer Matching | Human | 8.06 | Github Repo |
EgoSchema | Multiple Choice | Synthetic/Human | 5 | Website |
MVBench | Multiple Choice | Synthetic/Human | 4 | Github Repo |
MLVU | Multiple Choice | Synthetic/Human | 2.6 | Github Repo |
VideoMME | Multiple Choice | Experts | 2.7 | Website |
Perception-Test | Multiple Choice | CrowdSource | 11.6 | Github Repo |
VQAScore | Yes/No | AI Expert | 665 | Github Repo |
GenAI-Bench | Human Ratings | Human | 80.0 | Huggingface |
NaturalBench | Yes/No; Multiple Choice | Human | 10.0 | Huggingface |
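Several text-to-image entries above (e.g., GenEval, MSCOCO-30K) rely on CLIP-based image-text similarity. Below is a minimal sketch of a CLIPScore-style metric using the `transformers` CLIP model; the checkpoint and the 2.5 rescaling constant follow the original CLIPScore formulation, and this is an assumption-laden illustration rather than the official evaluation code of any benchmark listed here.

```python
# Minimal sketch of a CLIPScore-style image-text similarity metric:
# score = w * max(cos(image_emb, text_emb), 0), with w = 2.5 as in Hessel et al. (2021).
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_score(image: Image.Image, caption: str) -> float:
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return 2.5 * max((img * txt).sum(dim=-1).item(), 0.0)


url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
print(clip_score(image, "two cats lying on a couch"))
```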
Benchmark | Domain | Type | Project |
---|---|---|---|
Habitat, Habitat 2.0, Habitat 3.0 | Robotics (Navigation) | Simulator + Dataset | Website |
Gibson | Robotics (Navigation) | Simulator + Dataset | Website, Github Repo |
iGibson1.0, iGibson2.0 | Robotics (Navigation) | Simulator + Dataset | Website, Document |
Isaac Gym | Robotics (Navigation) | Simulator | Website, Github Repo |
Isaac Lab | Robotics (Navigation) | Simulator | Website, Github Repo |
VIMA-Bench | Robotics (Manipulation) | Simulator | Website, Github Repo |
VLMbench | Robotics (Manipulation) | Simulator | Github Repo |
CALVIN | Robotics (Manipulation) | Simulator | Website, Github Repo |
GemBench | Robotics (Manipulation) | Simulator | Website, Github Repo |
WebArena | Web Agent | Simulator | Website, Github Repo |
UniSim | Robotics (Manipulation) | Generative Model, World Model | Website |
GAIA-1 | Robotics (Autonomous Driving) | Generative Model, World Model | Website |
LWM | Embodied AI | Generative Model, World Model | Website, Github Repo |
Genesis | Embodied AI | Generative Model, World Model | Github Repo |
- Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI, 2024, [paper].
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding, 2024, [paper].
- ChartLlama: A Multimodal LLM for Chart Understanding and Generation, 2023, [paper].
- SciDoc2Diagrammer-MAF: Towards Generation of Scientific Diagrams from Documents guided by Multi-Aspect Feedback Refinement, 2024, [paper].
- Training a Vision Language Model as Smartphone Assistant, 2024, [paper].
- ScreenAgent: A Vision Language Model-driven Computer Control Agent, 2024, [paper].
- Embodied Vision-Language Programmer from Environmental Feedback, 2024, [paper].
- Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI, 2024, [paper].
- AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation, 2024, [paper], [website].
- SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities, 2024, [paper], [website].
- Vision-language model-driven scene understanding and robotic object manipulation, 2024, [paper].
- Guiding Long-Horizon Task and Motion Planning with Vision Language Models, 2024, [paper], [website].
- AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers, 2023, [paper], [website].
- VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model, 2024, [paper].
- Scalable Multi-Robot Collaboration with Large Language Models: Centralized or Decentralized Systems?, 2023, [paper], [website].
- DART-LLM: Dependency-Aware Multi-Robot Task Decomposition and Execution using Large Language Models, 2024, [paper], [website].
- MotionGPT: Human Motion as a Foreign Language, 2023, [paper], [code].
- Learning Reward for Robot Skills Using Large Language Models via Self-Alignment, 2024, [paper].
- Language to Rewards for Robotic Skill Synthesis, 2023, [paper], [website].
- Eureka: Human-Level Reward Design via Coding Large Language Models, 2023, [paper], [website].
- Integrated Task and Motion Planning, 2020, [paper].
- Jailbreaking LLM-Controlled Robots, 2024, [paper], [website].
- Robots Enact Malignant Stereotypes, 2022, [paper], [website].
- LLM-Driven Robots Risk Enacting Discrimination, Violence, and Unlawful Actions, 2024, [paper].
- Highlighting the Safety Concerns of Deploying LLMs/VLMs in Robotics, 2024, [paper], [website].
- VIMA: General Robot Manipulation with Multimodal Prompts, 2022, [paper], [website].
- Instruct2Act: Mapping Multi-Modality Instructions to Robotic Actions with Large Language Model, 2023, [paper].
- Creative Robot Tool Use with Large Language Models, 2023, [paper], [website].
- RoboVQA: Multimodal Long-Horizon Reasoning for Robotics, 2024, [paper].
- RT-1: Robotics Transformer for Real-World Control at Scale, 2022, [paper], [website].
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control, 2023, [paper], [website].
- Open X-Embodiment: Robotic Learning Datasets and RT-X Models, 2023, [paper], [website].
- ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings, 2022, [paper].
- LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation, 2024, [paper].
- LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action, 2022, [paper], [website].
- NaVILA: Legged Robot Vision-Language-Action Model for Navigation, 2024, [paper], [website].
- VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation, 2024, [paper].
- Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning, 2023, [paper], [website].
- MUTEX: Learning Unified Policies from Multimodal Task Specifications, 2023, [paper], [website].
- LaMI: Large Language Models for Multi-Modal Human-Robot Interaction, 2024, [paper], [website].
- VLM-Social-Nav: Socially Aware Robot Navigation through Scoring using Vision-Language Models, 2024, [paper].
- DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models, 2024, [paper], [website].
- GPT-Driver: Learning to Drive with GPT, 2023, [paper].
- LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving, 2023, [paper], [website].
- Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving, 2023, [paper].
- Referring Multi-Object Tracking, 2023, [paper], [code].
- VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision, 2023, [paper], [code].
- MotionLM: Multi-Agent Motion Forecasting as Language Modeling, 2023, [paper].
- DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models, 2023, [paper], [website].
- VLP: Vision Language Planning for Autonomous Driving, 2024, [paper].
- DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model, 2023, [paper].
- DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis, 2024, [paper], [code].
- LIT: Large Language Model Driven Intention Tracking for Proactive Human-Robot Collaboration -- A Robot Sous-Chef Application, 2024, [paper].
- Pretrained Language Models as Visual Planners for Human Assistance, 2023, [paper].
- Promoting AI Equity in Science: Generalized Domain Prompt Learning for Accessible VLM Research, 2024, [paper].
- Image and Data Mining in Reticular Chemistry Using GPT-4V, 2023, [paper].
- A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis, 2023, [paper].
- CogAgent: A Visual Language Model for GUI Agents, 2023, [paper], [code].
- WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models, 2024, [paper], [code].
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent, 2024, [paper], [code].
- ScreenAgent: A Vision Language Model-driven Computer Control Agent, 2024, [paper], [code].
- X-World: Accessibility, Vision, and Autonomy Meet, 2021, [paper].
- Context-Aware Image Descriptions for Web Accessibility, 2024, [paper].
- Improving VR Accessibility Through Automatic 360 Scene Description Using Multimodal Large Language Models, 2024, [paper].
- VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge, 2024, [paper], [code].
- Multimodal Healthcare AI: Identifying and Designing Clinically Relevant Vision-Language Applications for Radiology, 2024, [paper].
- M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization, 2023, [paper].
- MedCLIP: Contrastive Learning from Unpaired Medical Images and Text, 2022, [paper], [code].
- Med-Flamingo: a Multimodal Medical Few-shot Learner, 2023, [paper], [code].
- Analyzing K-12 AI education: A large language model study of classroom instruction on learning theories, pedagogy, tools, and AI literacy, 2024, [paper].
- Students Rather Than Experts: A New AI For Education Pipeline To Model More Human-Like And Personalised Early Adolescences, 2024, [paper].
- Harnessing Large Vision and Language Models in Agriculture: A Review, 2024, [paper].
- A Vision-Language Model for Predicting Potential Distribution Land of Soybean Double Cropping, 2024, [paper].
- Vision Language Model is NOT All You Need: Augmentation Strategies for Molecule Language Models, 2024, [paper], [code].
- DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students’ Hand-Drawn Math Images, 2024, [paper].
- MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models, 2024, [paper], [code].
- Vision-Language Models Meet Meteorology: Developing Models for Extreme Weather Events Detection with Heatmaps, 2024, [paper], [code].
- He is very intelligent, she is very beautiful? On Mitigating Social Biases in Language Modelling and Generation, 2021, [paper].
- UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Region Profiling, 2024, [paper].
- Object Hallucination in Image Captioning, 2018, [paper].
- Evaluating Object Hallucination in Large Vision-Language Models, 2023, [paper], [code].
- Detecting and Preventing Hallucinations in Large Vision Language Models, 2023, [paper].
- HallE-Control: Controlling Object Hallucination in Large Multimodal Models, 2023, [paper], [code].
- Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs, 2024, [paper], [code].
- BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models, 2024, [paper], [website].
- HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models, 2023, [paper], [code].
- AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models, 2024, [paper], [website].
- Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning, 2023, [paper], [code].
- Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models, 2024, [paper], [code].
- AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation, 2023, [paper], [code].
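POPE, AMBER, and several other entries above probe object hallucination with yes/no questions and report accuracy, precision, recall, F1, and the ratio of "yes" answers. The sketch below shows that bookkeeping, assuming model answers have already been collected; the `Record` format is illustrative, not the benchmarks' official data schema.

```python
# Minimal sketch of POPE-style yes/no hallucination scoring.
# Assumes each record holds a ground-truth label and the model's raw answer;
# the record format is illustrative, the metric set follows the POPE protocol.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Record:
    label: str   # "yes" if the probed object is really in the image, else "no"
    answer: str  # raw model answer, e.g. "Yes, there is a dog."


def pope_metrics(records: List[Record]) -> Dict[str, float]:
    tp = tn = fp = fn = 0
    for r in records:
        pred_yes = r.answer.strip().lower().startswith("yes")
        gold_yes = r.label.strip().lower() == "yes"
        if pred_yes and gold_yes:
            tp += 1
        elif pred_yes and not gold_yes:
            fp += 1  # hallucinated object
        elif not pred_yes and gold_yes:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(records),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(records),
    }


print(pope_metrics([Record("yes", "Yes, there is a cat."), Record("no", "Yes."), Record("no", "No.")]))
```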
- JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models, 2024, [paper], [website].
- Safe-VLN: Collision Avoidance for Vision-and-Language Navigation of Autonomous Robots Operating in Continuous Environments, 2023, [paper].
- SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models, 2024, [paper].
- JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks, 2024, [paper].
- SHIELD: An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models, 2024, [paper], [code].
- Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models, 2024, [paper].
- Jailbreaking Attack against Multimodal Large Language Model, 2024, [paper].
- Hallucination of Multimodal Large Language Models: A Survey, 2024, [paper].
- Bias and Fairness in Large Language Models: A Survey, 2023, [paper].
- Fairness and Bias in Multimodal AI: A Survey, 2024, [paper].
- Multi-Modal Bias: Introducing a Framework for Stereotypical Bias Assessment beyond Gender and Race in Vision–Language Models, 2023, [paper].
- FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks, 2024, [paper].
- FairCLIP: Harnessing Fairness in Vision-Language Learning, 2024, [paper].
- FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models, 2024, [paper].
- Benchmarking Vision Language Models for Cultural Understanding, 2024, [paper].
- Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding, 2024, [paper].
- Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement, 2024, [paper].
- Assessing and Learning Alignment of Unimodal Vision and Language Models, 2024, [paper], [website].
- Extending Multi-modal Contrastive Representations, 2023, [paper], [code].
- OneLLM: One Framework to Align All Modalities with Language, 2023, [paper], [code].
- VILA: On Pre-training for Visual Language Models, 2023, [paper].
- SimVLM: Simple Visual Language Model Pretraining with Weak Supervision, 2021, [paper].
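Many of the alignment methods above build on a symmetric contrastive (InfoNCE) objective that pulls matched image-text pairs together in a shared embedding space, as popularized by CLIP-style pretraining. Here is a minimal PyTorch sketch of that loss, assuming the two encoders already emit fixed-size embeddings; the temperature value is illustrative.

```python
# Minimal sketch of the symmetric InfoNCE loss used for CLIP-style
# image-text alignment; encoders are assumed to output (batch, dim) embeddings.
import torch
import torch.nn.functional as F


def clip_style_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature     # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)         # match each image to its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)     # and each caption to its image
    return (loss_i2t + loss_t2i) / 2


# Toy usage with random embeddings standing in for encoder outputs.
loss = clip_style_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```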
- LoRA: Low-Rank Adaptation of Large Language Models, 2021, [paper], [code].
- QLoRA: Efficient Finetuning of Quantized LLMs, 2023, [paper].
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, 2022, [paper], [code].
- RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback, 2023, [paper].
- SLIP: Self-supervision meets Language-Image Pre-training, 2021, [paper], [code].
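LoRA and QLoRA above cut fine-tuning cost by freezing the pretrained weights and training only small low-rank adapters. A minimal PyTorch sketch of the core idea, wrapping a frozen `nn.Linear`, follows; the rank and alpha values are illustrative, and in practice libraries such as Hugging Face `peft` inject these adapters automatically (QLoRA additionally quantizes the frozen base weights to 4-bit).

```python
# Minimal sketch of a LoRA adapter: y = W x + (alpha / r) * B A x, with W frozen
# and only the low-rank matrices A and B trained. Rank/alpha values are illustrative.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # freeze the pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # A: (r, in)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))        # B: (out, r), zero-init
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_a.t() @ self.lora_b.t())


layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)  # only the low-rank A/B matrices are trainable
```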
- Synthetic Vision: Training Vision-Language Models to Understand Physics, 2024, [paper].
- Synth2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings, 2024, [paper].
- KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data, 2024, [paper].
- Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation, 2024, [paper].
@misc{li2025benchmarkevaluationsapplicationschallenges,
title={Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey},
author={Zongxia Li and Xiyang Wu and Hongyang Du and Huy Nghiem and Guangyao Shi},
year={2025},
eprint={2501.02189},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2501.02189},
}