A curated GitHub repository that collects and surveys vision-language model (VLM) papers and models.
Below we compile awesome papers, models, and GitHub repositories covering:
- State-of-the-Art VLMs: a collection of recent VLMs (2022-2024) along with earlier foundational models.
- Evaluation: VLM benchmarks with links to the corresponding works.
- Applications: applications of VLMs in embodied AI, robotics, and beyond.
- Contribute: we welcome contributions of surveys, perspectives, and datasets on the above topics.
Contributions and discussions are welcome!
🤩 Papers marked with a ⭐️ are contributed by the maintainers of this repository. If you find them useful, we would greatly appreciate it if you could give the repository a star or cite our paper.
- 3. Applications
- 3.1. Embodied VLM agents
- 3.2. Generative Visual Media Applications
- 3.3. Robotics and Embodied AI
- 3.3.1. Manipulation
- 3.3.2. Navigation
- 3.3.3. Human-robot Interaction
- 3.3.4. Autonomous Driving
- 3.4. Human-Centered AI
- 3.4.1. Web Agent
- 3.4.2. Accessibility
- 3.4.3. Healthcare
- 3.4.4. Social Goodness
- 5. Challenges
- 5.1. Hallucination
- 5.2. Safety
- 5.3. Fairness
- 5.4. Multi-modality Alignment
- 5.5. Efficient Training and Fine-Tuning
- 5.6. Scarcity of High-Quality Datasets
Model | Year | Architecture | Training Data | Parameters | Vision Encoder/Tokenizer | Pretrained Backbone Model |
---|---|---|---|---|---|---|
VisualBERT | 2019 | Encoder-only | COCO | 110M | Faster R-CNN | Pretrained from scratch |
CLIP | 2021 | Encoder-decoder | 400M image-text pairs | 63M-355M | ViT/ResNet | Pretrained from scratch |
BLIP | 2022 | Encoder-decoder | COCO, Visual Genome | 223M-400M | ViT-B/L/g | Pretrained from scratch |
Flamingo | 2022 | Decoder-only | M3W, ALIGN | 80B | Custom | Chinchilla |
BLIP-2 | 2023 | Encoder-decoder | COCO, Visual Genome | 7B-13B | ViT-g | Open Pretrained Transformer (OPT) |
GPT-4V | 2023 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
Gemini | 2023 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
LLaVA-1.5 | 2023 | Decoder-only | COCO | 13B | CLIP ViT-L/14 | Vicuna |
PaLM-E | 2023 | Decoder-only | All robots, WebLI | 562B | ViT | PaLM |
CogVLM | 2023 | Encoder-decoder | LAION-2B, COYO-700M | 18B | CLIP ViT-L/14 | Vicuna |
InstructBLIP | 2023 | Encoder-decoder | COCO, VQAv2 | 13B | ViT | Flan-T5, Vicuna |
InternVL | 2023 | Encoder-decoder | LAION-en, LAION-multi | 7B/20B | EVA-CLIP ViT-g | QLLaMA |
Claude 3 | 2024 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
Emu3 | 2024 | Decoder-only | Aquila | 7B | MoVQGAN | LLaMA-2 |
NVLM | 2024 | Encoder-decoder | LAION-115M | 8B-24B | Custom ViT | Qwen-2-Instruct |
Qwen2-VL | 2024 | Decoder-only | Undisclosed | 7B-14B | EVA-CLIP ViT-L | Qwen-2 |
Pixtral | 2024 | Decoder-only | Undisclosed | 12B | CLIP ViT-L/14 | Mistral Large 2 |
LLaMA 3.2-vision | 2024 | Decoder-only | Undisclosed | 11B-90B | CLIP | LLaMA-3.1 |
Baichuan Ocean Mini | 2024 | Decoder-only | Image/Video/Audio/Text | 7B | CLIP ViT-L/14 | Baichuan |
TransFusion | 2024 | Encoder-decoder | Undisclosed | 7B | VAE Encoder | Pretrained from scratch on transformer architecture |
DeepSeek-VL2 | 2024 | Decoder-only | WiT, WikiHow | 4.5B x 74 | SigLIP/SAMB | DeepSeekMoE |
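Most open models in the table above expose the same processor-plus-language-model interface. As a quick, hedged illustration, the sketch below runs zero-shot image captioning with BLIP-2 through Hugging Face `transformers`; the checkpoint name and test image URL are just common examples, and closed models (GPT-4V, Gemini, Claude 3) are only reachable through their vendors' APIs.

```python
# Minimal sketch: zero-shot image captioning with BLIP-2 via Hugging Face transformers.
# The checkpoint and image URL are illustrative; other open models in the table
# (e.g., LLaVA-1.5, Qwen2-VL) follow a similar processor + generate() workflow.
import requests
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# For captioning, no text prompt is needed; for VQA, prepend a question such as
# "Question: how many cats are there? Answer:".
inputs = processor(images=image, return_tensors="pt").to(device)
generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```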
Benchmark Dataset | Metric Type | Source | Size (K) | Project |
---|---|---|---|---|
MMTBench | Multiple Choice | AI Experts | 30.1 | Github Repo |
MM-Vet | LLM Eval | Human | 0.2 | Github Repo |
MM-En/CN | Multiple Choice | Human | 3.2 | Github Repo |
GQA | Answer Matching | Seed with Synthetic | 22,000 | Website |
VCR | Multiple Choice | MTurks | 290 | Website |
VQAv2 | Yes/No; Answer Matching | MTurks | 1,100 | Github Repo |
MMMU | Answer Matching; Multiple Choice | College Students | 11.5 | Website |
SEEDBench | Multiple Choice | Synthetic | 19 | Github Repo |
RealWorld QA | Multiple Choice | Human | 0.765 | Huggingface |
MMMU-Pro | Multiple Choice | Human | 3.64 | Website |
DPG-Bench | Semantic Alignment | Synthetic | 1.06 | Website |
MSCOCO-30K | BLEU, Rouge, Similarity | MTurks | 30 | Website |
TextVQA | Answer Matching | CrowdSource | 45 | Github Repo |
DocVQA | Answer Matching | CrowdSource | 50 | Website |
CMMLU | Multiple Choice | College Students | 11.5 | Github Repo |
C-Eval | Multiple Choice | Human | 13.9 | Website |
TextVQA | Answer Matching | Expert Human | 28.6 | Github Repo |
MathVista | Answer Matching; Multiple Choice | Human | 6.15 | Website |
MathVision | Answer Matching; Multiple Choice | College Students | 3.04 | Website |
OCRBench | Answer Matching (ANLS) | Human | 1 | Github Repo |
MME | Yes/No | Human | 2.8 | Github Repo |
InfographicVQA | Answer Matching | CrowdSource | 30 | Website |
AI2D | Answer Matching | CrowdSource | 1 | Website |
ChartQA | Answer Matching | CrowdSource/Synthetic | 32.7 | Github Repo |
GenEval | CLIPScore; GenEval | MTurks | 1.2 | Github Repo |
T2I-CompBench | Multiple Metrics | Synthetic | 6 | Website |
HallusionBench | Yes/No | Human | 1.13 | Github Repo |
POPE | Yes/No | Human | 9 | Github Repo |
MMLU | Multiple Choice | Human | 15.9 | Github Repo |
MMStar | Multiple Choice | Human | 1.5 | Website |
M3GIA | Multiple Choice | Human | 1.8 | Huggingface |
AGIEval | Multiple Choice; Answer Matching | Human | 8.06 | Github Repo |
EgoSchema | Multiple Choice | Synthetic/Human | 5 | Website |
MVBench | Multiple Choice | Synthetic/Human | 4 | Github Repo |
MLVU | Multiple Choice | Synthetic/Human | 2.6 | Github Repo |
VideoMME | Multiple Choice | Experts | 2.7 | Website |
Perception-Test | Multiple Choice | CrowdSource | 11.6 | Github Repo |
VQAScore | Yes/No | AI Expert | 665 | Github Repo |
GenAI-Bench | Human Ratings | Human | 80.0 | Huggingface |
NaturalBench | Yes/No; Multiple Choice | Human | 10.0 | Huggingface |
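Several text-to-image entries above (e.g., GenEval, MSCOCO-30K) rely on CLIP-based image-text similarity. Below is a minimal sketch of a CLIPScore-style metric using the `transformers` CLIP model; the checkpoint and the 2.5 rescaling constant follow the original CLIPScore formulation, and this is an assumption-laden illustration rather than the official evaluation code of any benchmark listed here.

```python
# Minimal sketch of a CLIPScore-style image-text similarity metric:
# score = w * max(cos(image_emb, text_emb), 0), with w = 2.5 as in Hessel et al. (2021).
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_score(image: Image.Image, caption: str) -> float:
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return 2.5 * max((img * txt).sum(dim=-1).item(), 0.0)


url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
print(clip_score(image, "two cats lying on a couch"))
```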
Benchmark | Domain | Type | Project |
---|---|---|---|
Habitat, Habitat 2.0, Habitat 3.0 | Robotics (Navigation) | Simulator + Dataset | Website |
Gibson | Robotics (Navigation) | Simulator + Dataset | Website, Github Repo |
iGibson1.0, iGibson2.0 | Robotics (Navigation) | Simulator + Dataset | Website, Document |
Isaac Gym | Robotics (Navigation) | Simulator | Website, Github Repo |
Isaac Lab | Robotics (Navigation) | Simulator | Website, Github Repo |
VIMA-Bench | Robotics (Manipulation) | Simulator | Website, Github Repo |
VLMbench | Robotics (Manipulation) | Simulator | Github Repo |
CALVIN | Robotics (Manipulation) | Simulator | Website, Github Repo |
GemBench | Robotics (Manipulation) | Simulator | Website, Github Repo |
WebArena | Web Agent | Simulator | Website, Github Repo |
UniSim | Robotics (Manipulation) | Generative Model, World Model | Website |
GAIA-1 | Robotics (Autonomous Driving) | Generative Model, World Model | Website |
LWM | Embodied AI | Generative Model, World Model | Website, Github Repo |
Genesis | Embodied AI | Generative Model, World Model | Github Repo |
- Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI, 2024, [paper].
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding, 2024, [paper].
- ChartLlama: A Multimodal LLM for Chart Understanding and Generation, 2023, [paper].
- SciDoc2Diagrammer-MAF: Towards Generation of Scientific Diagrams from Documents guided by Multi-Aspect Feedback Refinement, 2024, [paper].
- Training a Vision Language Model as Smartphone Assistant, 2024, [paper].
- ScreenAgent: A Vision Language Model-driven Computer Control Agent, 2024, [paper].
- Embodied Vision-Language Programmer from Environmental Feedback, 2024, [paper].
- Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI, 2024, [paper].
- AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation, 2024, [paper], [website].
- SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities, 2024, [paper], [website].
- Vision-language model-driven scene understanding and robotic object manipulation, 2024, [paper].
- Guiding Long-Horizon Task and Motion Planning with Vision Language Models, 2024, [paper], [website].
- AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers, 2023, [paper], [website].
- VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model, 2024, [paper].
- Scalable Multi-Robot Collaboration with Large Language Models: Centralized or Decentralized Systems?, 2023, [paper], [website].
- DART-LLM: Dependency-Aware Multi-Robot Task Decomposition and Execution using Large Language Models, 2024, [paper], [website].
- MotionGPT: Human Motion as a Foreign Language, 2023, [paper], [code].
- Learning Reward for Robot Skills Using Large Language Models via Self-Alignment, 2024, [paper].
- Language to Rewards for Robotic Skill Synthesis, 2023, [paper], [website].
- Eureka: Human-Level Reward Design via Coding Large Language Models, 2023, [paper], [website].
- Integrated Task and Motion Planning, 2020, [paper].
- Jailbreaking LLM-Controlled Robots, 2024, [paper], [website].
- Robots Enact Malignant Stereotypes, 2022, [paper], [website].
- LLM-Driven Robots Risk Enacting Discrimination, Violence, and Unlawful Actions, 2024, [paper].
- Highlighting the Safety Concerns of Deploying LLMs/VLMs in Robotics, 2024, [paper], [website].
- VIMA: General Robot Manipulation with Multimodal Prompts, 2022, [paper], [website].
- Instruct2Act: Mapping Multi-Modality Instructions to Robotic Actions with Large Language Model, 2023, [paper].
- Creative Robot Tool Use with Large Language Models, 2023, [paper], [website].
- RoboVQA: Multimodal Long-Horizon Reasoning for Robotics, 2024, [paper].
- RT-1: Robotics Transformer for Real-World Control at Scale, 2022, [paper], [website].
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control, 2023, [paper], [website].
- Open X-Embodiment: Robotic Learning Datasets and RT-X Models, 2023, [paper], [website].
- ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings, 2022, [paper].
- LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation, 2024, [paper].
- LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action, 2022, [paper], [website].
- NaVILA: Legged Robot Vision-Language-Action Model for Navigation, 2024, [paper], [website].
- VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation, 2024, [paper].
- Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning, 2023, [paper], [website].
- MUTEX: Learning Unified Policies from Multimodal Task Specifications, 2023, [paper], [website].
- LaMI: Large Language Models for Multi-Modal Human-Robot Interaction, 2024, [paper], [website].
- VLM-Social-Nav: Socially Aware Robot Navigation through Scoring using Vision-Language Models, 2024, [paper].
- DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models, 2024, [paper], [website].
- GPT-Driver: Learning to Drive with GPT, 2023, [paper].
- LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving, 2023, [paper], [website].
- Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving, 2023, [paper].
- Referring Multi-Object Tracking, 2023, [paper], [code].
- VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision, 2023, [paper], [code].
- MotionLM: Multi-Agent Motion Forecasting as Language Modeling, 2023, [paper].
- DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models, 2023, [paper], [website].
- VLP: Vision Language Planning for Autonomous Driving, 2024, [paper].
- DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model, 2023, [paper].
- DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis, 2024, [paper], [code].
- LIT: Large Language Model Driven Intention Tracking for Proactive Human-Robot Collaboration -- A Robot Sous-Chef Application, 2024, [paper].
- Pretrained Language Models as Visual Planners for Human Assistance, 2023, [paper].
- Promoting AI Equity in Science: Generalized Domain Prompt Learning for Accessible VLM Research, 2024, [paper].
- Image and Data Mining in Reticular Chemistry Using GPT-4V, 2023, [paper].
- A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis, 2023, [paper].
- CogAgent: A Visual Language Model for GUI Agents, 2023, [paper], [code].
- WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models, 2024, [paper], [code].
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent, 2024, [paper], [code].
- ScreenAgent: A Vision Language Model-driven Computer Control Agent, 2024, [paper], [code].
- X-World: Accessibility, Vision, and Autonomy Meet, 2021, [paper].
- Context-Aware Image Descriptions for Web Accessibility, 2024, [paper].
- Improving VR Accessibility Through Automatic 360 Scene Description Using Multimodal Large Language Models, 2024, [paper].
- VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge, 2024, [paper], [code].
- Multimodal Healthcare AI: Identifying and Designing Clinically Relevant Vision-Language Applications for Radiology, 2024, [paper].
- M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization, 2023, [paper].
- MedCLIP: Contrastive Learning from Unpaired Medical Images and Text, 2022, [paper], [code].
- Med-Flamingo: a Multimodal Medical Few-shot Learner, 2023, [paper], [code].
- Analyzing K-12 AI education: A large language model study of classroom instruction on learning theories, pedagogy, tools, and AI literacy, 2024, [paper].
- Students Rather Than Experts: A New AI For Education Pipeline To Model More Human-Like And Personalised Early Adolescences, 2024, [paper].
- Harnessing Large Vision and Language Models in Agriculture: A Review, 2024, [paper].
- A Vision-Language Model for Predicting Potential Distribution Land of Soybean Double Cropping, 2024, [paper].
- Vision Language Model is NOT All You Need: Augmentation Strategies for Molecule Language Models, 2024, [paper], [code].
- DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students’ Hand-Drawn Math Images, 2024, [paper].
- MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models, 2024, [paper], [code].
- Vision-Language Models Meet Meteorology: Developing Models for Extreme Weather Events Detection with Heatmaps, 2024, [paper], [code].
- He is very intelligent, she is very beautiful? On Mitigating Social Biases in Language Modelling and Generation, 2021, [paper].
- UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Region Profiling, 2024, [paper].
- Object Hallucination in Image Captioning, 2018, [paper].
- Evaluating Object Hallucination in Large Vision-Language Models, 2023, [paper], [code].
- Detecting and Preventing Hallucinations in Large Vision Language Models, 2023, [paper].
- HallE-Control: Controlling Object Hallucination in Large Multimodal Models, 2023, [paper], [code].
- Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs, 2024, [paper], [code].
- BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models, 2024, [paper], [website].
- HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models, 2023, [paper], [code].
- AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models, 2024, [paper], [website].
- Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning, 2023, [paper], [code].
- Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models, 2024, [paper], [code].
- AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation, 2023, [paper], [code].
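POPE, AMBER, and several other entries above probe object hallucination with yes/no questions and report accuracy, precision, recall, F1, and the ratio of "yes" answers. The sketch below shows that bookkeeping, assuming model answers have already been collected; the `Record` format is illustrative, not the benchmarks' official data schema.

```python
# Minimal sketch of POPE-style yes/no hallucination scoring.
# Assumes each record holds a ground-truth label and the model's raw answer;
# the record format is illustrative, the metric set follows the POPE protocol.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Record:
    label: str   # "yes" if the probed object is really in the image, else "no"
    answer: str  # raw model answer, e.g. "Yes, there is a dog."


def pope_metrics(records: List[Record]) -> Dict[str, float]:
    tp = tn = fp = fn = 0
    for r in records:
        pred_yes = r.answer.strip().lower().startswith("yes")
        gold_yes = r.label.strip().lower() == "yes"
        if pred_yes and gold_yes:
            tp += 1
        elif pred_yes and not gold_yes:
            fp += 1  # hallucinated object
        elif not pred_yes and gold_yes:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(records),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(records),
    }


print(pope_metrics([Record("yes", "Yes, there is a cat."), Record("no", "Yes."), Record("no", "No.")]))
```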
- JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models, 2024, [paper], [website].
- Safe-VLN: Collision Avoidance for Vision-and-Language Navigation of Autonomous Robots Operating in Continuous Environments, 2023, [paper].
- SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models, 2024, [paper].
- JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks, 2024, [paper].
- SHIELD: An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models, 2024, [paper], [code].
- Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models, 2024, [paper].
- Jailbreaking Attack against Multimodal Large Language Model, 2024, [paper].
- Hallucination of Multimodal Large Language Models: A Survey, 2024, [paper].
- Bias and Fairness in Large Language Models: A Survey, 2023, [paper].
- Fairness and Bias in Multimodal AI: A Survey, 2024, [paper].
- Multi-Modal Bias: Introducing a Framework for Stereotypical Bias Assessment beyond Gender and Race in Vision–Language Models, 2023, [paper].
- FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks, 2024, [paper].
- FairCLIP: Harnessing Fairness in Vision-Language Learning, 2024, [paper].
- FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models, 2024, [paper].
- Benchmarking Vision Language Models for Cultural Understanding, 2024, [paper].
- Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding, 2024, [paper].
- Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement, 2024, [paper].
- Assessing and Learning Alignment of Unimodal Vision and Language Models, 2024, [paper], [website].
- Extending Multi-modal Contrastive Representations, 2023, [paper], [code].
- OneLLM: One Framework to Align All Modalities with Language, 2023, [paper], [code].
- VILA: On Pre-training for Visual Language Models, 2023, [paper].
- SimVLM: Simple Visual Language Model Pretraining with Weak Supervision, 2021, [paper].
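Many of the alignment methods above build on a symmetric contrastive (InfoNCE) objective that pulls matched image-text pairs together in a shared embedding space, as popularized by CLIP-style pretraining. Here is a minimal PyTorch sketch of that loss, assuming the two encoders already emit fixed-size embeddings; the temperature value is illustrative.

```python
# Minimal sketch of the symmetric InfoNCE loss used for CLIP-style
# image-text alignment; encoders are assumed to output (batch, dim) embeddings.
import torch
import torch.nn.functional as F


def clip_style_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature     # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)         # match each image to its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)     # and each caption to its image
    return (loss_i2t + loss_t2i) / 2


# Toy usage with random embeddings standing in for encoder outputs.
loss = clip_style_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```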
- LoRA: Low-Rank Adaptation of Large Language Models, 2021, [paper], [code].
- QLoRA: Efficient Finetuning of Quantized LLMs, 2023, [paper].
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, 2022, [paper], [code].
- RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback, 2023, [paper].
- SLIP: Self-supervision meets Language-Image Pre-training, 2021, [paper], [code].
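LoRA and QLoRA above cut fine-tuning cost by freezing the pretrained weights and training only small low-rank adapters. A minimal PyTorch sketch of the core idea, wrapping a frozen `nn.Linear`, follows; the rank and alpha values are illustrative, and in practice libraries such as Hugging Face `peft` inject these adapters automatically (QLoRA additionally quantizes the frozen base weights to 4-bit).

```python
# Minimal sketch of a LoRA adapter: y = W x + (alpha / r) * B A x, with W frozen
# and only the low-rank matrices A and B trained. Rank/alpha values are illustrative.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # freeze the pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # A: (r, in)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))        # B: (out, r), zero-init
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_a.t() @ self.lora_b.t())


layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)  # only the low-rank A/B matrices are trainable
```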
- Synthetic Vision: Training Vision-Language Models to Understand Physics, 2024, [paper].
- Synth2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings, 2024, [paper].
- KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data, 2024, [paper].
- Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation, 2024, [paper].
@misc{li2025benchmarkevaluationsapplicationschallenges,
title={Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey},
author={Zongxia Li and Xiyang Wu and Hongyang Du and Huy Nghiem and Guangyao Shi},
year={2025},
eprint={2501.02189},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2501.02189},
}