This repository contains the PyTorch code and model weights of INF-LLaVA, a novel MLLM designed for high-resolution image perception and reasoning.
INF-LLaVA has the following features to process high-resolution images:
- Dual-perspective Cropping Module (DCM): Integrates both global and local perspectives when cropping high-resolution images into sub-images. This enhances the model’s ability to capture detailed and contextual information.
- Dual-perspective Enhancement Module (DEM): An effective and efficient module that fuses the dual-perspective features, producing dual-enhanced features that improve performance.
- Strong Performance: INF-LLaVA outperforms existing models on multiple benchmarks, demonstrating the effectiveness of our approach. Check out our model zoo.
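To make the idea of two cropping perspectives concrete, here is a minimal sketch. It is an illustration only and not the repository's DCM implementation: it assumes that the local perspective can be approximated by contiguous tiles and the global perspective by strided (interleaved) sampling, and the function names `local_crops` and `global_crops` and the `grid` parameter are hypothetical. It works on plain nested lists so that it runs without extra dependencies.

```python
from typing import List

Image = List[List[int]]  # a 2-D image stored as rows of pixel values


def local_crops(image: Image, grid: int) -> List[Image]:
    """Split the image into grid x grid contiguous tiles (local view).

    Each tile keeps neighbouring pixels together, so fine detail is preserved.
    Trailing rows/columns that do not fill a whole tile are dropped.
    """
    h, w = len(image), len(image[0])
    th, tw = h // grid, w // grid
    return [
        [row[c * tw:(c + 1) * tw] for row in image[r * th:(r + 1) * th]]
        for r in range(grid)
        for c in range(grid)
    ]


def global_crops(image: Image, grid: int) -> List[Image]:
    """Split the image by taking every grid-th pixel (global view).

    Each crop is a low-resolution sample of the whole image, so it keeps
    the overall layout. This strided scheme is an assumption of the sketch.
    """
    h, w = len(image), len(image[0])
    h, w = h - h % grid, w - w % grid  # trim so all crops share one shape
    return [
        [row[c:w:grid] for row in image[r:h:grid]]
        for r in range(grid)
        for c in range(grid)
    ]


image = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
print(local_crops(image, 2)[0])   # [[0, 1], [4, 5]]
print(global_crops(image, 2)[0])  # [[0, 2], [8, 10]]
```

In this toy example the first local crop is the top-left corner of the image, while the first global crop spans all four quadrants at half resolution.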
- 🔥[2024-7-19] Released the INF-LLaVA model checkpoints on Hugging Face.
- 🔥[2024-7-16] Released the INF-LLaVA code.
- Release INF-LLaVA model based on Llama 3.1
- Release INF-LLaVA Strong Models.
- Release INF-LLaVA training code.
- Clone this repository and navigate to the INF-LLaVA folder
git clone https://github.com/WeihuangLin/INF-LLaVA.git
cd INF-LLaVA
- Install the package
conda create -n inf-llava python=3.10 -y
conda activate inf-llava
pip install --upgrade pip # enable PEP 660 support
pip install -e .
- Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation --no-cache-dir
- Pre-train
cd INF-LLaVA
bash INF-LLava_pretrain.sh
Note: Before running, replace data_path and image_folder in INF-LLava_pretrain.sh with the paths to your own data.
- Finetune
cd INF-LLaVA
bash INF-LLava_finetune.sh
Note: Before running, replace data_path and image_folder in INF-LLava_finetune.sh with the paths to your own data.
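Since both scripts depend on data_path and image_folder being set correctly, a quick check before launching a long job can save time. The helper below is a hypothetical sketch and not part of this repository. It assumes only that data_path points to a file and image_folder to a directory, and the example paths are placeholders.

```python
from pathlib import Path
from typing import List


def check_training_paths(data_path: str, image_folder: str) -> List[str]:
    """Return a list of problems; an empty list means both paths look usable."""
    problems = []
    if not Path(data_path).is_file():
        problems.append(f"data_path is not an existing file: {data_path}")
    if not Path(image_folder).is_dir():
        problems.append(f"image_folder is not an existing directory: {image_folder}")
    return problems


if __name__ == "__main__":
    # Placeholder paths: substitute the values you put into the training script.
    for problem in check_training_paths("path/to/data.json", "path/to/images"):
        print(problem)
```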
You can download our pretrained weights from the Model Zoo.
We follow lmms-eval to conduct evaluations. Please refer to lmms-eval for help. We provide the same script for testing.
Version | Checkpoint |
---|---|
INF-LLaVA | 🤗WeihuangLin/INF-LLaVA-sft |
INF_star-LLaVA | 🤗WeihuangLin/INF_star-LLaVA-sft |
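
The checkpoints listed above can also be fetched programmatically. The following is a minimal sketch, assuming the `huggingface_hub` package is installed (`pip install huggingface_hub`) and that the two repository IDs in the table are public. The helper name `download_checkpoint` is hypothetical.

```python
from typing import Optional

# Repository IDs copied from the table above.
REPO_IDS = [
    "WeihuangLin/INF-LLaVA-sft",
    "WeihuangLin/INF_star-LLaVA-sft",
]


def download_checkpoint(repo_id: str, local_dir: Optional[str] = None) -> str:
    """Download every file of a Hugging Face model repo and return the local path."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling `download_checkpoint(REPO_IDS[0])` stores the files in the default Hugging Face cache. Pass `local_dir` to place them in a specific folder instead.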
This project is released under the Apache 2.0 license.
If you find this project useful in your research, please consider citing:
@misc{ma2024infllava,
title={INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model},
author={Yiwei Ma and Zhibin Wang and Xiaoshuai Sun and Weihuang Lin and Qiang Zhou and Jiayi Ji and Rongrong Ji},
journal={arXiv preprint arXiv:2407.16198},
year={2024}
}
We are thankful to LLaVA, lmms-eval, and Llama 3 for releasing their models and code as open-source contributions.
If you encounter any issues or have any questions, please feel free to create an issue.