LLaVA-KD: A Framework of Distilling Multimodal Large Language Models

Yuxuan Cai¹*, Jiangning Zhang²,³*, Haoyang He², Xinwei He⁴, Ao Tong¹, Zhenye Gan³, Chengjie Wang³, Xiang Bai¹

¹Huazhong University of Science and Technology, ²Zhejiang University, ³Youtu Lab, Tencent, ⁴Huazhong Agricultural University

[Paper]

Abstract

The success of Large Language Models (LLMs) has led researchers to explore Multimodal Large Language Models (MLLMs) for unified visual and linguistic understanding. However, the increasing model size and computational complexity of MLLMs limit their use in resource-constrained environments. Small-scale MLLMs ($s$-MLLM) aim to retain the capabilities of the large-scale model ($l$-MLLM) while reducing computational demands, but this typically causes a significant decline in performance. To address this issue, we propose a novel LLaVA-KD framework to transfer knowledge from $l$-MLLM to $s$-MLLM. Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of $l$-MLLM and $s$-MLLM, and Relation Distillation (RDist) to transfer $l$-MLLM's ability to model correlations between visual features. Additionally, we propose a three-stage training scheme to fully exploit the potential of $s$-MLLM: (1) Distilled Pre-Training to align visual-textual representations, (2) Supervised Fine-Tuning to equip the model with multimodal understanding, and (3) Distilled Fine-Tuning to further transfer $l$-MLLM capabilities. Our approach significantly improves performance without altering the small model's architecture. Extensive experiments and ablation studies validate the effectiveness of each proposed component.
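
For intuition, the sketch below shows one plausible way to express the two distillation objectives in PyTorch. It is a minimal illustration under assumptions (the KL-divergence form of MDist, cosine-similarity matching for RDist, and all function names and tensor shapes are ours), not the repository's actual training code.

    import torch.nn.functional as F

    def mdist_loss(student_logits, teacher_logits, temperature=1.0):
        # MDist (assumed form): KL divergence between the teacher's and student's
        # output distributions over visual-textual tokens; logits: (batch, seq, vocab).
        t = temperature
        teacher_probs = F.softmax(teacher_logits / t, dim=-1)
        student_log_probs = F.log_softmax(student_logits / t, dim=-1)
        return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

    def rdist_loss(student_vis, teacher_vis):
        # RDist (assumed form): match the pairwise correlation structure of visual
        # features; cosine similarity keeps this well-defined if feature dims differ.
        def relation(x):                     # x: (batch, num_visual_tokens, dim)
            x = F.normalize(x, dim=-1)
            return x @ x.transpose(-1, -2)   # (batch, tokens, tokens)
        return F.mse_loss(relation(student_vis), relation(teacher_vis))

In the three-stage scheme, losses of this kind would be applied during Distilled Pre-Training and Distilled Fine-Tuning alongside the standard language-modeling objective.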


Overview

(Figure: accuracy overview.)


📜 Main Results on 10 Popular Benchmarks

Benchmark results compared with SoTA MLLMs. Our method achieves highly competitive results compared with current small-scale MLLMs. AVG: the average over the nine benchmarks (excluding MMMU) for comprehensive comparison. †: results reproduced using the official code. (Figure: benchmark comparison.)


🛠️ Installation

  • Install torch and torchvision
    pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu118
  • Prepare the environment
    python3 -m pip install --no-cache-dir --upgrade -r requirements.txt
    python3 -m pip install numpy==1.26.2
    python3 -m pip install urllib3==1.26.6
    pip install ptflops
  • Install Flash Attention
    git clone https://github.com/Dao-AILab/flash-attention.git
    cd ./flash-attention
    python3 -m pip install wheel==0.41.3
    python3 setup.py install
  • Install bitsandbytes
    git clone https://github.com/bitsandbytes-foundation/bitsandbytes.git
    cd ./bitsandbytes
    pip install -e .
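
After completing the steps above, a quick sanity check can confirm that the key packages import correctly. This snippet is a convenience suggestion, not part of the repository:

    import torch

    print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

    for pkg in ("torchvision", "flash_attn", "bitsandbytes", "ptflops"):
        try:
            module = __import__(pkg)
            print(pkg, "OK", getattr(module, "__version__", ""))
        except ImportError as err:
            print(pkg, "MISSING:", err)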

LLaVA-KD Weights

Model         Vision Encoder              LLM                 CKPTs
LLaVA-KD-1B   siglip-so400m-patch14-384   Qwen/Qwen1.5-0.5B   LLaVA-KD-Base-siglip-Qwen1.5-0.5B
LLaVA-KD-2B   siglip-so400m-patch14-384   Qwen/Qwen1.5-1.8B   LLaVA-KD-Base-siglip-Qwen1.5-1.8B
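
The checkpoints above can be fetched with huggingface_hub, for example. The repo ID of the LLaVA-KD checkpoint and the directory layout under ./pretrained_ckpt expected by quick_inference.py are assumptions here; adjust them to match the links in the table:

    from huggingface_hub import snapshot_download

    # Vision encoder and LLM IDs follow the table above; the LLaVA-KD checkpoint
    # ID is a placeholder -- substitute the actual Hugging Face path.
    repos = [
        "google/siglip-so400m-patch14-384",
        "Qwen/Qwen1.5-0.5B",
        # "<org>/LLaVA-KD-Base-siglip-Qwen1.5-0.5B",
    ]
    for repo_id in repos:
        snapshot_download(repo_id=repo_id,
                          local_dir=f"./pretrained_ckpt/{repo_id.split('/')[-1]}")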

💻 Evaluation

Please evaluate the model according to Evaluation.md.

Quickstart

Download the pre-trained visual encoder, LLM, and LLaVA-KD weights to ./pretrained_ckpt, and then run:

python quick_inference.py --model_path ./pretrained_ckpt/LLaVAKD_Model_Path --image_file ./image_test/img_test_1.jpg  --query "What is that orange thing behind the girl?"
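
To run the same command over several image/question pairs, a small wrapper such as the following can help; it only re-invokes the CLI shown above, and the paths are placeholders:

    import subprocess

    MODEL_PATH = "./pretrained_ckpt/LLaVAKD_Model_Path"   # replace with the real checkpoint dir
    samples = [
        ("./image_test/img_test_1.jpg", "What is that orange thing behind the girl?"),
        # add more (image_path, question) pairs here
    ]

    for image_file, query in samples:
        subprocess.run(
            ["python", "quick_inference.py",
             "--model_path", MODEL_PATH,
             "--image_file", image_file,
             "--query", query],
            check=True,
        )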


☑️ TODO List

  • Release the training code

💫 Citation

If you find this code useful, don't forget to star the repo and cite the paper.

@article{cai2024llava,
  title={LLaVA-KD: A Framework of Distilling Multimodal Large Language Models},
  author={Cai, Yuxuan and Zhang, Jiangning and He, Haoyang and He, Xinwei and Tong, Ao and Gan, Zhenye and Wang, Chengjie and Bai, Xiang},
  journal={arXiv preprint arXiv:2410.16236},
  year={2024}
}

💘 Acknowledgements

We thank the authors of the great works TinyLLaVA and LLaVA, which provided valuable assistance for our research.
