Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Question] LLaVA Pretraining with Mixtral 8×7B #1417

Open
ShawnAn-WHU opened this issue Apr 17, 2024 · 11 comments
Open

[Question] LLaVA Pretraining with Mixtral 8×7B #1417

ShawnAn-WHU opened this issue Apr 17, 2024 · 11 comments

Comments

@ShawnAn-WHU
Copy link

Question

Does anyone have carried out the pretraining with Mixtral 8×7B? When I run the petraining script, one problem occured like the figure shown below. I just add a llava_mixtral.py to the llava/model/language_model and some necessary supplementary code.
image

@martinakaduc
Copy link

I have not faced this issue. Can you give me the reproducing command.

@ShawnAn-WHU
Copy link
Author

@martinakaduc Thank you very much for your prompt reply! Below is my pretraining script. The --model_name_or_path is the model I downloaded from HF mistralai/Mixtral-8x7B-v0.1. Despite the warnings, running this script will produce a mm_projector.bin file. When pretraining, the loss decreases from ~15 to ~6 and does not decrease any more. Can you figure out the problem?

image

@martinakaduc
Copy link

martinakaduc commented Apr 18, 2024

Have you merged my pull request about adding mixtral? If not, you can use my modified repo here: https://github.com/martinakaduc/LLaVA

My pretraining script:
deepspeed llava/train/train_mem.py --deepspeed ./scripts/zero3_offload.json --model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 --version plain --data_path ./playground/data/LLaVA-Pretrain/blip_laion_cc_sbu_558k.json --image_folder ./playground/data/LLaVA-Pretrain/images --vision_tower openai/clip-vit-large-patch14-336 --mm_projector_type mlp2x_gelu --tune_mm_mlp_adapter True --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --bf16 True --output_dir ./checkpoints/Mixtral-pt --num_train_epochs 1 --per_device_train_batch_size 32 --per_device_eval_batch_size 4 --gradient_accumulation_steps 2 --evaluation_strategy "no" --save_strategy "steps" --save_steps 200 --save_total_limit 2 --learning_rate 1e-3 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --tf32 True --model_max_length 32768 --gradient_checkpointing True --dataloader_num_workers 4 --lazy_preprocess True --report_to neptune

And fine-tuning script:
deepspeed llava/train/train_mem.py --deepspeed ./scripts/zero3_offload.json --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 --model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 --version mistral_instruct --data_path ./playground/data/llava_v1_5_mix665k.json --image_folder ./playground/data --vision_tower openai/clip-vit-large-patch14-336 --pretrain_mm_mlp_adapter ./checkpoints/Mixtral-pt/checkpoint-400/mm_projector.bin --mm_projector_type mlp2x_gelu --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --image_aspect_ratio pad --group_by_modality_length True --bf16 True --output_dir ./checkpoints/Mixtral-sft --num_train_epochs 1 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 8 --evaluation_strategy "no" --save_strategy "steps" --save_steps 200 --save_total_limit 2 --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --tf32 True --model_max_length 32768 --gradient_checkpointing True --dataloader_num_workers 4 --lazy_preprocess True --report_to neptune

@ShawnAn-WHU
Copy link
Author

@martinakaduc Thank you! I will try it now and find out the problem!

@accupham
Copy link

accupham commented Apr 18, 2024

Interesting, would pretraining on mixtral-8x-22b also be possible?

@martinakaduc
Copy link

I think it is possible. However I have not tested yet.

@ShawnAn-WHU
Copy link
Author

@martinakaduc Hi, I'm using your pretrained MixSUraV model downloaded from HF to finetune on my own dataset. The script I use is like Figure 1, is it correct? If correct, I found it infeasible when using 8 3090 GPUs (24G) even with 4-bit quantification (set --bits 4, like the red rectangle in the figure). The code for model loading is like Figure 2. However, when I use the code shown in Figure 3, only 3 GPUs are more than enough (may be 1 is ok). Is there any difference between these two codes? And could you please tell me your finetuning script and computational resources needed if you have done this? Thank tou so much!
image
image
image

@fisher75
Copy link

Hi, how do you know the training was effecitve? Did you use the default training setting? I LoRA with default parameters and basically no improvement.

@ShawnAn-WHU
Copy link
Author

Hi, how do you know the training was effecitve? Did you use the default training setting? I LoRA with default parameters and basically no improvement.

@fisher75 I have LoRA finetuned with my own dataset using LLaVA-v1.5 and the qualitative results are better than the original LLaVA-v1.5.

@fisher75
Copy link

ShawnAn-WHU

Hi @ShawnAn-WHU thanks for your reply. I am also working on this, may I ask is the improvement is very obvious? May I see the training and inference scripts(mostly I am curious about the parameter settings), btw if possible, may I add your WeChat? Could be very helpful to share some details.

@ShawnAn-WHU
Copy link
Author

@fisher75 Sure, e-mail me your WeChat ID is ok.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants