- [2025/01/08] We released the full training code.
- [2025/01/02] We discovered that when testing with the AI2D benchmark, we were using AI2D_TEST_NO_MASK, while the VLMEvalKit utilizes AI2D_TEST. We previously overlooked the distinction between the two, and we sincerely apologize for this oversight. We will make the necessary corrections.
- [2024/11/28] We've released the dataset: https://huggingface.co/datasets/Xkev/LLaVA-CoT-100k
- [2024/11/25] We've released the code for dataset generation: dataset_generation/generate.py
- [2024/11/23] We've released the Gradio App: https://huggingface.co/spaces/Xkev/Llama-3.2V-11B-cot
- [2024/11/21] LLaVA-o1 is renamed to LLaVA-CoT https://arxiv.org/abs/2411.10440v2.
- [2024/11/20] We've released the pretrained weights: https://huggingface.co/Xkev/Llama-3.2V-11B-cot
- [2024/11/18] We've released our paper: https://arxiv.org/abs/2411.10440
- [2024/11/18] Welcome to watch 👀 this repository for the latest updates.
LLaVA-CoT is a visual language model capable of spontaneous, systematic reasoning.
Our 11B model outperforms Gemini-1.5-pro,GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct on six challenging multimodal benchmarks.
LLaVA-CoT begins by outlining the problem, interprets relevant information from the image, proceeds step-by-step through reasoning, and ultimately reaches a well-supported conclusion.
You can download the pretrained weights from the Huggingface: Xkev/Llama-3.2V-11B-cot.
You can download the dataset from the Huggingface: Xkev/LLaVA-CoT-100k.
You can use the same code as Llama-3.2-11B-Vision-Instruct to load the model and perform inference.
If you want to use perform inference time scaling, you can use code provided in inference_demo/inference_demo.py.
In order to run the demo, you need to replace this file with the original inference code for Llama-3.2-11B-Vision-Instruct in VLMEvalKit.
Additionally, you need to replace the processing_mllama.py
file in the transformers library with the one provided in inference_demo/processing_mllama.py.
You may use any repository that supports Llama-3.2-11B-Vision-Instruct for finetuning.
We recommend using llama-recipes.
To reproduce our results, you can use the following command:
cd train
pip install llama-recipes
torchrun --nnodes 1 --nproc_per_node 8 --master_port 29500 finetuning.py --enable_fsdp --lr 1e-5 --num_epochs 3 --batch_size_training 4 --model_name meta-llama/Llama-3.2-11B-Vision-Instruct --dist_checkpoint_root_folder ./finetuned_model --dist_checkpoint_folder LLaVA-CoT --use_fast_kernels --dataset "custom_dataset" --custom_dataset.test_split "test" --custom_dataset.file "datasets/cot_dataset.py" --run_validation False --batching_strategy padding
Remember to modify the data_path
and image_base_path
in train/cot_dataset.py
to your own path (the path to the training dataset).
If you find this paper useful, please consider staring 🌟 this repo and citing 📑 our paper:
@misc{xu2024llavacot,
title={LLaVA-CoT: Let Vision Language Models Reason Step-by-Step},
author={Guowei Xu and Peng Jin and Hao Li and Yibing Song and Lichao Sun and Li Yuan},
year={2024},
eprint={2411.10440},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2411.10440},
}
- The majority of this project is released under the Apache 2.0 license as found in the LICENSE file.
- The service is a research preview intended for non-commercial use only, subject to LLAMA 3.2 COMMUNITY LICENSE AGREEMENT, and Terms of Use of the data generated by OpenAI. Please contact us if you find any potential violations.
- The template is modified from Chat-Univi and LLaVA.