Skip to content

LLaVA-CoT, a visual language model capable of spontaneous, systematic reasoning

License

Notifications You must be signed in to change notification settings

PKU-YuanGroup/LLaVA-CoT

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

24 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

This project was previously named LLaVA-o1. We have renamed it to LLaVA-CoT.

If you like our project, please give us a star ⭐ on GitHub for the latest update.

Demo hf arXiv X (formerly Twitter) URL License Hits Replicate Wisemodel

本项目希望构建开源慢思考大模型,由北大-兔展AIGC联合实验室共同发起。

📣 News

🔥 Highlights

LLaVA-CoT is a visual language model capable of spontaneous, systematic reasoning.

Our 11B model outperforms Gemini-1.5-pro,GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct on six challenging multimodal benchmarks.

🚀 Demos

LLaVA-CoT begins by outlining the problem, interprets relevant information from the image, proceeds step-by-step through reasoning, and ultimately reaches a well-supported conclusion.

Reasoning Problems

Question
Subtract all tiny shiny balls. Subtract all purple objects. How many objects are left? Options: A. 4, B. 8, C. 2, D. 6
Llama-3.2-11B-Vision-Instruct To determine the correct answer, we need to follow the instructions provided.
The first instruction is to subtract all tiny shiny balls.
Since there are three tiny shiny balls in the image, we would be left with no shiny balls after performing this action.
The second instruction is to subtract all purple objects. However, upon closer inspection, it becomes apparent that there are no shiny purple objects present in the image. Therefore, we cannot subtract any purple objects.
Given that we have already removed all the tiny shiny balls and there were no purple objects to begin with, the correct answer is C. 2, as initially, there were 2 purple objects, but they are not to be subtracted according to the instruction.
LLaVA-CoT Summary stage: What's the problem? What should I do?
<SUMMARY> I will examine the image to identify and count all the objects, focusing on the tiny shiny balls and purple objects. Then, I will subtract these objects from the total count and determine the number of remaining objects. </SUMMARY>
Caption stage: What can I know from the image?
<CAPTION> The image shows various 3D geometric shapes on a flat surface, including spheres, cubes, and cylinders in different colors. Notably, there are small shiny green spheres and a purple cylinder. </CAPTION>
Reasoning stage: How to solve the problem step-by-step?
<REASONING> First, I will count all the objects in the image: there are 10 objects in total. Next, I will identify and subtract the tiny shiny balls. There is one small shiny green sphere. Then, I will identify and subtract the purple objects. There is one purple cylinder. By subtracting the two objects (the shiny ball and the purple cylinder) from the total, I find that 8 objects remain. </REASONING>
Conclusion stage: What is the final answer?
<CONCLUSION> B </CONCLUSION>

Science Problems

Question
An acetal polymer block is fixed to the rigid plates at its top and bottom surfaces. If the top plate displaces 2 mm horizontally when it is subjected to a horizontal force ( P = 2 , \text{kN} ), determine the shear modulus of the polymer. The width of the block is 100 mm. Assume that the polymer is linearly elastic and use small angle analysis.
Options: A: 15 MPa, B: 5 MPa, C: 20 MPa, D: 2 MPa
Llama-3.2-11B-Vision-Instruct To determine the shear modulus of the polymer, we can use the formula for shear modulus in a small angle analysis:
( G = \frac{F}{A \gamma} )
where ( G ) is the shear modulus, ( F ) is the applied force, ( A ) is the area of the block, and ( \gamma ) is the shear strain.
Given that the top plate of the block displaces horizontally by 2 mm when subjected to a horizontal force of 2 kN, we can calculate the shear strain ( \gamma ). Since the displacement is horizontal and the original length of the block is 400 mm, the shear strain ( \gamma ) is:
[ \gamma = \frac{\Delta x}{L} = \frac{2 , \text{mm}}{400 , \text{mm}} = 0.005 ]
Next, we calculate the area ( A ) of the block. The width of the block is given as 100 mm, and assuming the height of the block is also 200 mm (as given in the image analysis), the area ( A ) is:
[ A = 100 , \text{mm} \times 200 , \text{mm} = 20000 , \text{mm}^2 ]
Now, we can substitute these values into the formula for shear modulus:
[ G = \frac{F}{A \gamma} = \frac{2000 , \text{N}}{20000 , \text{mm}^2 \times 0.005} = \frac{2000}{100} = 20 , \text{MPa} ]
Therefore, the correct answer is C: 20 MPa.
LLaVA-CoT Summary stage: What's the problem? What should I do?
<SUMMARY> To solve the problem, I will determine the shear modulus of the acetal polymer block using the relationship between the applied force, displacement, and dimensions of the block. I will use the formula for shear modulus and calculate the result. </SUMMARY>
Caption stage: What can I know from the image?
<CAPTION> The image shows a rectangular block of acetal polymer fixed between two rigid plates. The top plate moves horizontally by 2 mm when a force ( P = 2 , \text{kN} ) is applied. The block's dimensions are given as 400 mm in length, 200 mm in height, and 100 mm in width. </CAPTION>
Reasoning stage: How to solve the problem step-by-step?
<REASONING> The shear modulus ( G ) of the polymer can be calculated using the formula:
[ G = \frac{F}{A \cdot \frac{\Delta x}{L}} ]
where:
- ( F ) is the applied force (2 kN),
- ( A ) is the area of the block's top surface (100 mm x 400 mm = 40,000 mm²),
- ( \Delta x ) is the horizontal displacement (2 mm),
- ( L ) is the height of the block (200 mm).
First, convert the force to Newtons: ( F = 2 , \text{kN} = 2000 , \text{N} ).
Next, calculate the shear modulus:
[ G = \frac{2000}{40,000 \cdot \frac{2}{200}} ]
[ G = \frac{2000}{40,000 \cdot 0.01} ]
[ G = \frac{2000}{400} ]
[ G = 5 , \text{MPa} ]
</REASONING>
Conclusion stage: What is the final answer?
<CONCLUSION> B: 5 MPa </CONCLUSION>

🛠️ Usage

Pretrained weights

You can download the pretrained weights from the Huggingface: Xkev/Llama-3.2V-11B-cot.

Dataset

You can download the dataset from the Huggingface: Xkev/LLaVA-CoT-100k.

Inference

You can use the same code as Llama-3.2-11B-Vision-Instruct to load the model and perform inference.

If you want to use perform inference time scaling, you can use code provided in inference_demo/inference_demo.py.

In order to run the demo, you need to replace this file with the original inference code for Llama-3.2-11B-Vision-Instruct in VLMEvalKit.

Additionally, you need to replace the processing_mllama.py file in the transformers library with the one provided in inference_demo/processing_mllama.py.

Finetuning

You may use any repository that supports Llama-3.2-11B-Vision-Instruct for finetuning.

We recommend using llama-recipes.

To reproduce our results, you can use the following command:

cd train
pip install llama-recipes
torchrun --nnodes 1 --nproc_per_node 8 --master_port 29500 finetuning.py --enable_fsdp --lr 1e-5  --num_epochs 3 --batch_size_training 4 --model_name meta-llama/Llama-3.2-11B-Vision-Instruct --dist_checkpoint_root_folder ./finetuned_model --dist_checkpoint_folder LLaVA-CoT --use_fast_kernels --dataset "custom_dataset" --custom_dataset.test_split "test" --custom_dataset.file "datasets/cot_dataset.py"  --run_validation False --batching_strategy padding

Remember to modify the data_path and image_base_path in train/cot_dataset.py to your own path (the path to the training dataset).

📝 Citation

If you find this paper useful, please consider staring 🌟 this repo and citing 📑 our paper:

@misc{xu2024llavacot,
      title={LLaVA-CoT: Let Vision Language Models Reason Step-by-Step},
      author={Guowei Xu and Peng Jin and Hao Li and Yibing Song and Lichao Sun and Li Yuan},
      year={2024},
      eprint={2411.10440},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.10440},
}

🙏 Acknowledgement

  • The majority of this project is released under the Apache 2.0 license as found in the LICENSE file.
  • The service is a research preview intended for non-commercial use only, subject to LLAMA 3.2 COMMUNITY LICENSE AGREEMENT, and Terms of Use of the data generated by OpenAI. Please contact us if you find any potential violations.
  • The template is modified from Chat-Univi and LLaVA.

About

LLaVA-CoT, a visual language model capable of spontaneous, systematic reasoning

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •  

Languages