[ECCV 2024] BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models

    1 Nanyang Technological University, 2 University of Technology Sydney, 3 Northeastern University, 4 Mohamed bin Zayed University of Artificial Intelligence, 5 Zhejiang University
*Equal contribution, †Corresponding Author
If you like our project, please give us a star ⭐ on GitHub to receive the latest updates.

Benchmark Examples

Demo

Note: To keep the presentation simple, the Domestic Robot and Open Game questions shown here have been simplified from their multiple-choice format. Please see our Benchmark for more examples and the detailed questions.

Directory Structure

  • baseline/:

    • Contains LLaVA and InstructBLIP baseline code.
  • evaluate/:

    • Contains the Python code used to evaluate model outputs. The evaluation uses ChatGPT to compare the model's answers with the ground-truth answers.
  • evaluate_results/:

    • This directory contains the evaluation results of the baseline models.
  • jsonl/:

    • This directory contains all JSONL files. Each record holds the question, the relative image path, and the ground-truth answer.

    • Sample JSONL format:

      {
        "question_id": "bottle_test_broken_large_000_001", 
        "image": "bottle_test_broken_large_000.png", 
        "text": "Is there any defect in the object in this image? Answer the question using a single word or phrase.", 
        "answer": "Yes"
      }

      Here, image is the image path relative to the corresponding style's image folder, text is the question, and answer is the ground-truth answer.

  • imgs/:

    • This directory contains the images used on this page. However, they are not our benchmark images.
  • results/:

    • This directory contains the inference results of the baseline models.
  • scripts/:

    • Contains the scripts to run the baseline and evaluate the results.
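The JSONL record format described under jsonl/ can be read with a few lines of Python. The following is a minimal sketch, not code from this repository: the function name load_questions and the image_root argument are hypothetical, and the location of each style's image folder is an assumption that you will need to adapt to your own layout.

```python
import json
from pathlib import Path


def load_questions(jsonl_path, image_root):
    """Read one benchmark JSONL file and resolve each image path.

    image_root is the folder holding the images for that style. Its
    location is an assumption; this README does not specify it.
    """
    records = []
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:  # tolerate blank lines between records
                continue
            record = json.loads(line)
            # "image" is stored relative to the style's image folder.
            record["image_path"] = str(Path(image_root) / record["image"])
            records.append(record)
    return records
```

Each returned record keeps the original question_id, image, text, and answer keys and gains an image_path key that can be passed to your model's image loader.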

Evaluate on our Benchmark

git clone git@github.com:AIFEG/BenchLMM.git
cd BenchLMM
mkdir evaluate_results
  • Prepare your model output
    Prepare your results in the following format, where the key "prompt" is the input to the model. We recommend storing your results in a JSONL file.
{
  "question_id": 110, 
  "prompt": "Is there any defect in the object in this image? Answer the question using a single word or phrase.", 
  "model_output": "Yes"
}
  • Rename your Jsonl file
    Rename your JSONL file to xxxx_StyleName.jsonl, as in the following project tree. The style suffix must match the example exactly.
.
├── answers_Benchmark_AD.jsonl
├── xxxxxxxx_CT.jsonl
├── xxxxxxxx_MRI.jsonl
├── xxxxxxxx_Med-X-RAY.jsonl
├── xxxxxxxx_RS.jsonl
├── xxxxxxxx_Robots.jsonl
├── xxxxxxxx_defect_detection.jsonl
├── xxxxxxxx_game.jsonl
├── xxxxxxxx_infrard.jsonl
├── xxxxxxxx_style_cartoon.jsonl
├── xxxxxxxx_style_handmake.jsonl
├── xxxxxxxx_style_painting.jsonl
├── xxxxxxxx_style_sketch.jsonl
├── xxxxxxxx_style_tattoo.jsonl
├── xxxxxxxx_xray.jsonl
bash scripts/evaluate.sh

Note: The scores will be saved in the file results. The Robots and game scores are written to evaluate_results/Robots.jsonl and evaluate_results/game.jsonl, respectively.
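The output format and naming rule above can be combined into one helper. This is a sketch under stated assumptions rather than part of the repository: write_results and STYLE_SUFFIXES are hypothetical names, and the suffix list is read off the project tree above (including AD, inferred from answers_Benchmark_AD.jsonl), so check it against what scripts/evaluate.sh actually expects.

```python
import json
from pathlib import Path

# Style suffixes copied from the project tree above (an assumption about
# what the evaluation script accepts; spellings are kept exactly as shown).
STYLE_SUFFIXES = [
    "AD", "CT", "MRI", "Med-X-RAY", "RS", "Robots", "defect_detection",
    "game", "infrard", "style_cartoon", "style_handmake", "style_painting",
    "style_sketch", "style_tattoo", "xray",
]


def write_results(rows, out_dir, prefix, style):
    """Write (question_id, prompt, model_output) rows to <prefix>_<style>.jsonl."""
    if style not in STYLE_SUFFIXES:
        raise ValueError(f"unknown style suffix: {style!r}")
    out_path = Path(out_dir) / f"{prefix}_{style}.jsonl"
    with open(out_path, "w", encoding="utf-8") as handle:
        for question_id, prompt, model_output in rows:
            record = {
                "question_id": question_id,
                "prompt": prompt,
                "model_output": model_output,
            }
            # One JSON object per line, as JSONL requires.
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_path
```

Rejecting unknown suffixes before writing catches naming mistakes early, before the evaluation script is run.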

Baseline

Model              VRAM required
InstructBLIP-7B    30GB
InstructBLIP-13B   65GB
LLaVA-1.5-7B       <24GB
LLaVA-1.5-13B      30GB

LLaVA

  • Install
  1. Clone the LLaVA repository and navigate to the LLaVA folder
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
  2. Install the package
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  3. Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
  4. LLaVA weights
    Please check out the LLaVA Model Zoo for all public LLaVA checkpoints and instructions on how to use the weights.
  • Run and evaluate LLaVA on our Benchmark
  1. Add the file BenchLMM_LLaVA_model_vqa.py to the path LLaVA/llava/eval/

  2. Modify the file path and run the script scripts/LLaVA.sh

bash scripts/LLaVA.sh
  3. Evaluate the results
bash scripts/evaluate.sh

Note: The scores will be saved in the file results.

InstructBLIP

  • Install
git clone https://github.com/salesforce/LAVIS.git  
cd LAVIS  
pip install -e .  
  • Prepare Vicuna Weights
    InstructBLIP uses frozen Vicuna 7B and 13B models. First, follow the instructions to prepare the Vicuna v1.1 weights.
    Then set llm_model in the Model Config to the path of the folder that contains the Vicuna weights.

  • Run InstructBLIP on our Benchmark

Modify the file path and run the script BenchLMM/scripts/InstructBLIP.sh

bash BenchLMM/scripts/InstructBLIP.sh
bash BenchLMM/scripts/evaluate.sh

Note: The scores will be saved in the file results.


Cite our work

@article{cai2023benchlmm,
  title={BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models},
  author={Cai, Rizhao and Song, Zirui and Guan, Dayan and Chen, Zhenhao and Luo, Xing and Yi, Chenyu and Kot, Alex},
  journal={arXiv preprint arXiv:2312.02896},
  year={2023}
}

Contact

If you have any questions about or issues with our project, please contact Dayan Guan: [email protected]

Acknowledgement

This research is supported in part by the Rapid-Rich Object Search (ROSE) Lab of Nanyang Technological University and the NTU-PKU Joint Research Institute (a collaboration between NTU and Peking University that is sponsored by a donation from the Ng Teng Fong Charitable Foundation). We are deeply grateful to Yaohang Li from the University of Technology Sydney for his invaluable assistance in conducting the experiments, and to Jingpu Yang, Helin Wang, Zihui Cui, Yushan Jiang, Fengxian Ji, and Yuxiao Hang from NLULab@NEUQ (Northeastern University at Qinhuangdao, China) for their meticulous efforts in annotating the dataset. We would also like to thank Prof. Miao Fang (PI of NLULab@NEUQ) for his supervision and insightful suggestions during discussions of this project.

Related project