Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models

An up-to-date collection and survey of vision-language model papers, models, and GitHub repositories.

Below we compile awesome papers, models, and GitHub repositories covering:

  • State-of-the-Art VLMs: a collection of VLMs from 2019 to 2024.
  • Benchmarks and Evaluation: VLM evaluation benchmarks with links to the corresponding works.
  • Applications: applications of VLMs in embodied AI, robotics, and other domains.
  • Contributions: surveys, perspectives, and datasets on the above topics.

Contributions and discussions are welcome!


🤩 Papers marked with a ⭐️ are contributed by the maintainers of this repository. If you find them useful, we would greatly appreciate it if you could give the repository a star or cite our paper.



1. 📚 SoTA VLMs

| Model | Year | Architecture | Training Data | Parameters | Vision Encoder/Tokenizer | Pretrained Backbone Model |
| --- | --- | --- | --- | --- | --- | --- |
| VisualBERT | 2019 | Encoder-only | COCO | 110M | Faster R-CNN | Pretrained from scratch |
| CLIP | 2021 | Encoder-decoder | 400M image-text pairs | 63M-355M | ViT/ResNet | Pretrained from scratch |
| BLIP | 2022 | Encoder-decoder | COCO, Visual Genome | 223M-400M | ViT-B/L/g | Pretrained from scratch |
| Flamingo | 2022 | Decoder-only | M3W, ALIGN | 80B | Custom | Chinchilla |
| BLIP-2 | 2023 | Encoder-decoder | COCO, Visual Genome | 7B-13B | ViT-g | Open Pretrained Transformer (OPT) |
| GPT-4V | 2023 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| Gemini | 2023 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| LLaVA-1.5 | 2023 | Decoder-only | COCO | 13B | CLIP ViT-L/14 | Vicuna |
| PaLM-E | 2023 | Decoder-only | All robots, WebLI | 562B | ViT | PaLM |
| CogVLM | 2023 | Encoder-decoder | LAION-2B, COYO-700M | 18B | CLIP ViT-L/14 | Vicuna |
| InstructBLIP | 2023 | Encoder-decoder | COCO, VQAv2 | 13B | ViT | Flan-T5, Vicuna |
| InternVL | 2023 | Encoder-decoder | LAION-en, LAION-multi | 7B/20B | EVA CLIP ViT-g | QLLaMA |
| Claude 3 | 2024 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| Emu3 | 2024 | Decoder-only | Aquila | 7B | MoVQGAN | LLaMA-2 |
| NVLM | 2024 | Encoder-decoder | LAION-115M | 8B-24B | Custom ViT | Qwen-2-Instruct |
| Qwen2-VL | 2024 | Decoder-only | Undisclosed | 7B-14B | EVA-CLIP ViT-L | Qwen-2 |
| Pixtral | 2024 | Decoder-only | Undisclosed | 12B | CLIP ViT-L/14 | Mistral Large 2 |
| LLaMA 3.2-Vision | 2024 | Decoder-only | Undisclosed | 11B-90B | CLIP | LLaMA-3.1 |
| Baichuan Ocean Mini | 2024 | Decoder-only | Image/Video/Audio/Text | 7B | CLIP ViT-L/14 | Baichuan |
| TransFusion | 2024 | Encoder-decoder | Undisclosed | 7B | VAE Encoder | Pretrained from scratch on transformer architecture |
| DeepSeek-VL2 | 2024 | Decoder-only | WiT, WikiHow | 4.5B x 74 | SigLIP/SAMB | DeepSeekMoE |
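
As a concrete example of how the contrastively trained encoders in the table above are used, the sketch below queries CLIP for zero-shot image classification through the Hugging Face `transformers` API. The checkpoint name, image path, and candidate labels are illustrative assumptions, not part of this survey.

```python
# Minimal sketch: zero-shot classification with CLIP (a contrastive VLM from
# the table above) via Hugging Face transformers. Checkpoint, image path, and
# labels are placeholders chosen for illustration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                    # any RGB image; path is a placeholder
labels = ["a photo of a cat", "a photo of a dog"]    # candidate captions

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity logits -> probabilities over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

Decoder-only models such as LLaVA-1.5 or Qwen2-VL are instead prompted with an image plus an instruction and generate free-form text.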

2. 🗂️ Benchmarks and Evaluation

2.1. Datasets and Evaluation for VLM

| Benchmark Dataset | Metric Type | Source | Size (K) | Project |
| --- | --- | --- | --- | --- |
| MMTBench | Multiple Choice | AI Experts | 30.1 | GitHub Repo |
| MM-Vet | LLM Eval | Human | 0.2 | GitHub Repo |
| MM-En/CN | Multiple Choice | Human | 3.2 | GitHub Repo |
| GQA | Answer Matching | Seed with Synthetic | 22,000 | Website |
| VCR | Multiple Choice | MTurks | 290 | Website |
| VQAv2 | Yes/No; Answer Matching | MTurks | 1,100 | GitHub Repo |
| MMMU | Answer Matching; Multiple Choice | College Students | 11.5 | Website |
| SEEDBench | Multiple Choice | Synthetic | 19 | GitHub Repo |
| RealWorld QA | Multiple Choice | Human | 0.765 | Huggingface |
| MMMU-Pro | Multiple Choice | Human | 3.64 | Website |
| DPG-Bench | Semantic Alignment | Synthetic | 1.06 | Website |
| MSCOCO-30K | BLEU, ROUGE, Similarity | MTurks | 30 | Website |
| TextVQA | Answer Matching | CrowdSource | 45 | GitHub Repo |
| DocVQA | Answer Matching | CrowdSource | 50 | Website |
| CMMLU | Multiple Choice | College Students | 11.5 | GitHub Repo |
| C-Eval | Multiple Choice | Human | 13.9 | Website |
| TextVQA | Answer Matching | Expert Human | 28.6 | GitHub Repo |
| MathVista | Answer Matching; Multiple Choice | Human | 6.15 | Website |
| MathVision | Answer Matching; Multiple Choice | College Students | 3.04 | Website |
| OCRBench | Answer Matching (ANLS) | Human | 1 | GitHub Repo |
| MME | Yes/No | Human | 2.8 | GitHub Repo |
| InfographicVQA | Answer Matching | CrowdSource | 30 | Website |
| AI2D | Answer Matching | CrowdSource | 1 | Website |
| ChartQA | Answer Matching | CrowdSource/Synthetic | 32.7 | GitHub Repo |
| GenEval | CLIPScore, GenEval | MTurks | 1.2 | GitHub Repo |
| T2I-CompBench | Multiple Metrics | Synthetic | 6 | Website |
| HallusionBench | Yes/No | Human | 1.13 | GitHub Repo |
| POPE | Yes/No | Human | 9 | GitHub Repo |
| MMLU | Multiple Choice | Human | 15.9 | GitHub Repo |
| MMStar | Multiple Choice | Human | 1.5 | Website |
| M3GIA | Multiple Choice | Human | 1.8 | Huggingface |
| InternetAGIEval | Multiple Choice; Answer Matching | Human | 8.06 | GitHub Repo |
| EgoSchema | Multiple Choice | Synthetic/Human | 5 | Website |
| MVBench | Multiple Choice | Synthetic/Human | 4 | GitHub Repo |
| MLVU | Multiple Choice | Synthetic/Human | 2.6 | GitHub Repo |
| VideoMME | Multiple Choice | Experts | 2.7 | Website |
| Perception-Test | Multiple Choice | CrowdSource | 11.6 | GitHub Repo |
| VQAScore | Yes/No | AI Expert | 665 | GitHub Repo |
| GenAI-Bench | Human Ratings | Human | 80.0 | Huggingface |
| NaturalBench | Yes/No; Multiple Choice | Human | 10.0 | Huggingface |
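
Most entries in the table above score models with either multiple-choice accuracy or normalized answer matching. The sketch below shows one simplified way these metric types are commonly computed; the normalization rules are assumptions, and each benchmark's own repository defines the official protocol.

```python
# Simplified scorers for the two most common metric types above:
# multiple-choice accuracy and normalized answer matching.
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def multiple_choice_accuracy(predictions, gold_choices) -> float:
    """Fraction of items where the predicted option letter equals the gold one."""
    correct = sum(p.strip().upper() == g.strip().upper()
                  for p, g in zip(predictions, gold_choices))
    return correct / len(gold_choices)

def answer_match_accuracy(predictions, gold_answers) -> float:
    """Fraction of items where the normalized prediction matches any gold answer."""
    correct = sum(normalize(p) in {normalize(a) for a in gold}
                  for p, gold in zip(predictions, gold_answers))
    return correct / len(gold_answers)

# Toy usage
print(multiple_choice_accuracy(["A", "c"], ["A", "B"]))        # 0.5
print(answer_match_accuracy(["The cat"], [["cat", "a cat"]]))  # 1.0
```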

2.2. Benchmark Datasets, Simulators and Generative Models for Embodied VLM

| Benchmark | Domain | Type | Project |
| --- | --- | --- | --- |
| Habitat, Habitat 2.0, Habitat 3.0 | Robotics (Navigation) | Simulator + Dataset | Website |
| Gibson | Robotics (Navigation) | Simulator + Dataset | Website, GitHub Repo |
| iGibson 1.0, iGibson 2.0 | Robotics (Navigation) | Simulator + Dataset | Website, Document |
| Isaac Gym | Robotics (Navigation) | Simulator | Website, GitHub Repo |
| Isaac Lab | Robotics (Navigation) | Simulator | Website, GitHub Repo |
| VIMA-Bench | Robotics (Manipulation) | Simulator | Website, GitHub Repo |
| VLMbench | Robotics (Manipulation) | Simulator | GitHub Repo |
| CALVIN | Robotics (Manipulation) | Simulator | Website, GitHub Repo |
| GemBench | Robotics (Manipulation) | Simulator | Website, GitHub Repo |
| WebArena | Web Agent | Simulator | Website, GitHub Repo |
| UniSim | Robotics (Manipulation) | Generative Model, World Model | Website |
| GAIA-1 | Robotics (Autonomous Driving) | Generative Model, World Model | Website |
| LWM | Embodied AI | Generative Model, World Model | Website, GitHub Repo |
| Genesis | Embodied AI | Generative Model, World Model | GitHub Repo |
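
The simulators above differ in their APIs, but embodied VLM agents generally run the same observation-to-action loop. The sketch below illustrates that loop using Gymnasium's generic interface and a hypothetical `vlm_choose_action` placeholder; it is not the API of Habitat, Isaac Lab, or any specific benchmark listed here.

```python
# Generic observation -> VLM -> action loop, sketched with Gymnasium.
# The environment id and the policy stub are placeholders for illustration only.
import gymnasium as gym

def vlm_choose_action(observation, instruction, action_space):
    # Placeholder: a real agent would render the observation, prompt a VLM
    # with the image and the instruction, and parse the reply into an action.
    return action_space.sample()

env = gym.make("CartPole-v1")           # stand-in environment, not a VLM benchmark
obs, info = env.reset(seed=0)
instruction = "keep the pole balanced"  # language goal, illustrative only

for _ in range(100):
    action = vlm_choose_action(obs, instruction, env.action_space)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```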

3. ⚒️ Applications

3.1. Embodied VLM agents

  • Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI, 2024, [paper].
  • ScreenAI: A Vision-Language Model for UI and Infographics Understanding, 2024, [paper].
  • ChartLlama: A Multimodal LLM for Chart Understanding and Generation, 2023, [paper].
  • SciDoc2Diagrammer-MAF: Towards Generation of Scientific Diagrams from Documents guided by Multi-Aspect Feedback Refinement, 2024, [paper].
  • Training a Vision Language Model as Smartphone Assistant, 2024, [paper].
  • ScreenAgent: A Vision Language Model-driven Computer Control Agent, 2024, [paper].
  • Embodied Vision-Language Programmer from Environmental Feedback, 2024, [paper].

3.2. Generative Visual Media Applications

3.3. Robotics and Embodied AI

  • Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI, 2024, [paper].
  • AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation, 2024, [paper], [website].
  • SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities, 2024, [paper], [website].
  • Vision-language model-driven scene understanding and robotic object manipulation, 2024, [paper].
  • Guiding Long-Horizon Task and Motion Planning with Vision Language Models, 2024, [paper], [website].
  • AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers, 2023, [paper], [website].
  • VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model, 2024, [paper].
  • Scalable Multi-Robot Collaboration with Large Language Models: Centralized or Decentralized Systems?, 2023, [paper], [website].
  • DART-LLM: Dependency-Aware Multi-Robot Task Decomposition and Execution using Large Language Models, 2024, [paper], [website].
  • MotionGPT: Human Motion as a Foreign Language, 2023, [paper], [code].
  • Learning Reward for Robot Skills Using Large Language Models via Self-Alignment, 2024, [paper].
  • Language to Rewards for Robotic Skill Synthesis, 2023, [paper], [website].
  • Eureka: Human-Level Reward Design via Coding Large Language Models, 2023, [paper], [website].
  • Integrated Task and Motion Planning, 2020, [paper].
  • Jailbreaking LLM-Controlled Robots, 2024, [paper], [website].
  • Robots Enact Malignant Stereotypes, 2022, [paper], [website].
  • LLM-Driven Robots Risk Enacting Discrimination, Violence, and Unlawful Actions, 2024, [paper].
  • Highlighting the Safety Concerns of Deploying LLMs/VLMs in Robotics, 2024, [paper], [website].

3.3.1. Manipulation

  • VIMA: General Robot Manipulation with Multimodal Prompts, 2022, [paper], [website].
  • Instruct2Act: Mapping Multi-Modality Instructions to Robotic Actions with Large Language Model, 2023, [paper].
  • Creative Robot Tool Use with Large Language Models, 2023, [paper], [website].
  • RoboVQA: Multimodal Long-Horizon Reasoning for Robotics, 2024, [paper].
  • RT-1: Robotics Transformer for Real-World Control at Scale, 2022, [paper], [website].
  • RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control, 2023, [paper], [website].
  • Open X-Embodiment: Robotic Learning Datasets and RT-X Models, 2023, [paper], [website].

3.3.2. Navigation

  • ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings, 2022, [paper].
  • LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation, 2024, [paper].
  • LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action, 2022, [paper], [website].
  • NaVILA: Legged Robot Vision-Language-Action Model for Navigation, 2024, [paper], [website].
  • VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation, 2024, [paper].
  • Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning, 2023, [paper], [website].

3.3.3. Human-robot Interaction

  • MUTEX: Learning Unified Policies from Multimodal Task Specifications, 2023, [paper], [website].
  • LaMI: Large Language Models for Multi-Modal Human-Robot Interaction, 2024, [paper], [website].
  • VLM-Social-Nav: Socially Aware Robot Navigation through Scoring using Vision-Language Models, 2024, [paper].

3.3.4. Autonomous Driving

  • DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models, 2024, [paper], [website].
  • GPT-Driver: Learning to Drive with GPT, 2023, [paper].
  • LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving, 2023, [paper], [website].
  • Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving, 2023, [paper].
  • Referring Multi-Object Tracking, 2023, [paper], [code].
  • VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision, 2023, [paper], [code].
  • MotionLM: Multi-Agent Motion Forecasting as Language Modeling, 2023, [paper].
  • DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models, 2023, [paper], [website].
  • VLP: Vision Language Planning for Autonomous Driving, 2024, [paper].
  • DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model, 2023, [paper].

3.4. Human-Centered AI

  • DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis, 2024, [paper], [code].
  • LIT: Large Language Model Driven Intention Tracking for Proactive Human-Robot Collaboration -- A Robot Sous-Chef Application, 2024, [paper].
  • Pretrained Language Models as Visual Planners for Human Assistance, 2023, [paper].
  • Promoting AI Equity in Science: Generalized Domain Prompt Learning for Accessible VLM Research, 2024, [paper].
  • Image and Data Mining in Reticular Chemistry Using GPT-4V, 2023, [paper].

3.4.1. Web Agent

  • A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis, 2023, [paper].
  • CogAgent: A Visual Language Model for GUI Agents, 2023, [paper], [code].
  • WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models, 2024, [paper], [code].
  • ShowUI: One Vision-Language-Action Model for GUI Visual Agent, 2024, [paper], [code].
  • ScreenAgent: A Vision Language Model-driven Computer Control Agent, 2024, [paper], [code].

3.4.2. Accessibility

  • X-World: Accessibility, Vision, and Autonomy Meet, 2021, [paper].
  • Context-Aware Image Descriptions for Web Accessibility, 2024, [paper].
  • Improving VR Accessibility Through Automatic 360 Scene Description Using Multimodal Large Language Models, 2024, [paper].

3.4.3. Healthcare

  • VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge, 2024, [paper], [code].
  • Multimodal Healthcare AI: Identifying and Designing Clinically Relevant Vision-Language Applications for Radiology, 2024, [paper].
  • M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization, 2023, [paper].
  • MedCLIP: Contrastive Learning from Unpaired Medical Images and Text, 2022, [paper], [code].
  • Med-Flamingo: a Multimodal Medical Few-shot Learner, 2023, [paper], [code].

3.4.4. Social Goodness

  • Analyzing K-12 AI education: A large language model study of classroom instruction on learning theories, pedagogy, tools, and AI literacy, 2024, [paper].
  • Students Rather Than Experts: A New AI For Education Pipeline To Model More Human-Like And Personalised Early Adolescences, 2024, [paper].
  • Harnessing Large Vision and Language Models in Agriculture: A Review, 2024, [paper].
  • A Vision-Language Model for Predicting Potential Distribution Land of Soybean Double Cropping, 2024, [paper].
  • Vision Language Model is NOT All You Need: Augmentation Strategies for Molecule Language Models, 2024, [paper], [code].
  • DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students’ Hand-Drawn Math Images, 2024, [paper].
  • MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models, 2024, [paper], [code].
  • Vision-Language Models Meet Meteorology: Developing Models for Extreme Weather Events Detection with Heatmaps, 2024, [paper], [code].
  • He is very intelligent, she is very beautiful? On Mitigating Social Biases in Language Modelling and Generation, 2021, [paper].
  • UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Region Profiling, 2024, [paper].

4. Challenges

4.1 Hallucination

  • Object Hallucination in Image Captioning, 2018, [paper].
  • Evaluating Object Hallucination in Large Vision-Language Models, 2023, [paper], [code].
  • Detecting and Preventing Hallucinations in Large Vision Language Models, 2023, [paper].
  • HallE-Control: Controlling Object Hallucination in Large Multimodal Models, 2023, [paper], [code].
  • Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs, 2024, [paper], [code].
  • BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models, 2024, [paper], [website].
  • HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models, 2023, [paper], [code].
  • AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models, 2024, [paper], [website].
  • Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning, 2023, [paper], [code].
  • Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models, 2024, [paper], [code].
  • AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation, 2023, [paper], [code].
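
Several of the benchmarks above (e.g., POPE and AMBER) cast object hallucination as yes/no questions about object presence. The sketch below shows a simplified POPE-style scorer reporting accuracy, precision, recall, F1, and the yes ratio; the label conventions are assumptions, and the official repositories define the exact protocol.

```python
# Simplified POPE-style yes/no hallucination scoring.
def pope_scores(predictions, labels):
    """predictions, labels: lists of 'yes'/'no' strings of equal length."""
    pred = [p.strip().lower() for p in predictions]
    gold = [g.strip().lower() for g in labels]
    tp = sum(p == "yes" and g == "yes" for p, g in zip(pred, gold))
    fp = sum(p == "yes" and g == "no" for p, g in zip(pred, gold))
    fn = sum(p == "no" and g == "yes" for p, g in zip(pred, gold))
    tn = sum(p == "no" and g == "no" for p, g in zip(pred, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(gold),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(gold),  # high values suggest a "yes" bias
    }

print(pope_scores(["yes", "no", "yes"], ["yes", "no", "no"]))
```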

4.2 Safety

  • JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models, 2024, [paper], [website].
  • Safe-VLN: Collision Avoidance for Vision-and-Language Navigation of Autonomous Robots Operating in Continuous Environments, 2023, [paper].
  • SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models, 2024, [paper].
  • JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks, 2024, [paper].
  • SHIELD: An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models, 2024, [paper], [code].
  • Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models, 2024, [paper].
  • Jailbreaking Attack against Multimodal Large Language Model, 2024, [paper].

4.3 Fairness

  • Hallucination of Multimodal Large Language Models: A Survey, 2024, [paper].
  • Bias and Fairness in Large Language Models: A Survey, 2023, [paper].
  • Fairness and Bias in Multimodal AI: A Survey, 2024, [paper].
  • Multi-Modal Bias: Introducing a Framework for Stereotypical Bias Assessment beyond Gender and Race in Vision–Language Models, 2023, [paper].
  • FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks, 2024, [paper].
  • FairCLIP: Harnessing Fairness in Vision-Language Learning, 2024, [paper].
  • FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models, 2024, [paper].
  • Benchmarking Vision Language Models for Cultural Understanding, 2024, [paper].

4.4 Multi-modality Alignment

  • Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding, 2024, [paper].
  • Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement, 2024, [paper].
  • Assessing and Learning Alignment of Unimodal Vision and Language Models, 2024, [paper], [website].
  • Extending Multi-modal Contrastive Representations, 2023, [paper], [code].
  • OneLLM: One Framework to Align All Modalities with Language, 2023, [paper], [code].
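
A common starting point for the alignment methods above is the symmetric contrastive (InfoNCE) objective that pulls paired image and text embeddings together, as popularized by CLIP. The sketch below is a minimal PyTorch version; the batch shapes and temperature are illustrative assumptions.

```python
# Symmetric contrastive (InfoNCE) loss for image-text embedding alignment.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (batch, dim) tensors of paired embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature            # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # matches on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)                # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)            # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random embeddings
loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```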

4.5 Efficient Training and Fine-Tuning

  • VILA: On Pre-training for Visual Language Models, 2023, [paper].
  • SimVLM: Simple Visual Language Model Pretraining with Weak Supervision, 2021, [paper].
  • LoRA: Low-Rank Adaptation of Large Language Models, 2021, [paper], [code].
  • QLoRA: Efficient Finetuning of Quantized LLMs, 2023, [paper].
  • Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, 2022, [paper], [code].
  • RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback, 2023, [paper].
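
To make the parameter-efficient fine-tuning entries above concrete, the sketch below implements the core LoRA idea: freeze a pretrained linear layer and train only a low-rank update scaled by alpha/r. Layer sizes and hyperparameters are illustrative assumptions; production code typically uses a library such as PEFT.

```python
# Minimal LoRA sketch: frozen base weights plus a trainable low-rank update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # freeze the pretrained weights
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.t() @ self.lora_B.t())

# Toy usage: only the low-rank factors are trainable
layer = LoRALinear(nn.Linear(768, 768))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 2 * 8 * 768 = 12288
```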

4.6 Scarcity of High-Quality Datasets

  • SLIP: Self-supervision meets Language-Image Pre-training, 2021, [paper], [code].
  • Synthetic Vision: Training Vision-Language Models to Understand Physics, 2024, [paper].
  • Synth2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings, 2024, [paper].
  • KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data, 2024, [paper].
  • Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation, 2024, [paper].

5. Citation

@misc{li2025benchmarkevaluationsapplicationschallenges,
      title={Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey}, 
      author={Zongxia Li and Xiyang Wu and Hongyang Du and Huy Nghiem and Guangyao Shi},
      year={2025},
      eprint={2501.02189},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.02189}, 
}
