Phi3.5-vision-instruct fine-tuning best practices. (Latex OCR Fine-tuning) #1809

Closed
Jintao-Huang opened this issue Aug 23, 2024 · 2 comments
Labels
good first issue Good for newcomers

Comments

@Jintao-Huang
Collaborator

Jintao-Huang commented Aug 23, 2024

Huggingface Model: https://huggingface.co/microsoft/Phi-3.5-vision-instruct

Fine-tuned Dataset: https://huggingface.co/datasets/linxy/LaTeX_OCR

Fine-tuning a multimodal large model usually means training on a custom dataset. This issue walks through a runnable demo.

Before fine-tuning, make sure your environment is set up:

git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e '.[llm]'

Inference

# ModelScope
CUDA_VISIBLE_DEVICES=0 swift infer \
  --model_type phi3_5-vision-instruct \
  --use_flash_attn false

# HuggingFace
USE_HF=1 CUDA_VISIBLE_DEVICES=0 swift infer \
  --model_type phi3_5-vision-instruct \
  --model_id_or_path microsoft/Phi-3.5-vision-instruct \
  --use_flash_attn false

Results

<<< who are you
I am Phi, an AI developed by Microsoft to assist with providing information, answering questions, and helping users find solutions to their queries. How can I assist you today?
--------------------------------------------------
<<< <image>please describe the image.
Input an image path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
The image features a close-up of a kitten with striking blue eyes and a white and grey striped coat. The kitten's fur is soft and fluffy, and it appears to be looking directly at the camera with a curious and innocent expression. The background is blurred, which puts the focus entirely on the kitten's face.
--------------------------------------------------
<<<  <image>What is the result of the calculation?
Input an image path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png
The result of the calculation 1452 + 45304 is 46756.

GPU Memory:

[Screenshot: GPU memory usage during inference]
@Jintao-Huang
Collaborator Author

Jintao-Huang commented Aug 23, 2024

Fine-tuning

A custom dataset is a JSONL file with one sample per line. The three lines below show a single-image sample, a multi-image sample, and a text-only sample:

{"query": "<image>55555", "response": "66666", "images": ["image_path"]}
{"query": "eeeee<image>eeeee<image>eeeee", "response": "fffff", "history": [], "images": ["image_path1", "image_path2"]}
{"query": "EEEEE", "response": "FFFFF", "history": [["query1", "response1"], ["query2", "response2"]], "images": []}
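
A minimal Python sketch of how such a file might be produced. The helper names `make_record` and `write_jsonl`, the image paths, and the sample texts are hypothetical and not part of ms-swift. The check that the number of `<image>` tags equals the number of image paths is an assumption drawn from the sample lines above, not a documented requirement.

```python
import json
import tempfile
from pathlib import Path

IMAGE_TAG = "<image>"

def make_record(query, response, images=(), history=()):
    """Build one sample in the query/response/history/images layout shown above."""
    images = list(images)
    history = [list(turn) for turn in history]
    # Assumption: every <image> tag (in the query or in earlier history queries)
    # corresponds to exactly one entry in `images`.
    n_tags = query.count(IMAGE_TAG) + sum(q.count(IMAGE_TAG) for q, _ in history)
    if n_tags != len(images):
        raise ValueError(f"{n_tags} <image> tag(s) but {len(images)} image path(s)")
    return {"query": query, "response": response, "history": history, "images": images}

def write_jsonl(records, path):
    """Write one JSON object per line; ensure_ascii=False keeps non-ASCII text readable."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Hypothetical samples: single image, two images, and text-only with history.
records = [
    make_record("<image>Convert the formula to LaTeX.", r"\frac{a}{b}", ["formula_0001.png"]),
    make_record("Compare <image> and <image>.", "They are identical.", ["a.png", "b.png"]),
    make_record("What is LaTeX?", "A typesetting system.", history=[("Hi", "Hello!")]),
]

# Written to a temporary directory so the sketch leaves the working directory untouched.
out_path = Path(tempfile.mkdtemp()) / "train.jsonl"
write_jsonl(records, out_path)
```

Failing early on a tag/path mismatch is a design choice for this sketch: it surfaces malformed samples before a multi-GPU job is launched.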

Fine-tuning script:

# To modify num_crops, you can use the environment variable: `NUM_CROPS=16` (default is 4).
# ModelScope
CUDA_VISIBLE_DEVICES=0,1,2,3 NPROC_PER_NODE=4 swift sft \
  --model_type phi3_5-vision-instruct \
  --sft_type lora \
  --dataset latex-ocr-print#20000 \
  --deepspeed default-zero2 \
  --output_dir output \
  --num_train_epochs 5 \
  --use_flash_attn false

# HuggingFace
USE_HF=1 CUDA_VISIBLE_DEVICES=0,1,2,3 NPROC_PER_NODE=4 swift sft \
  --model_type phi3_5-vision-instruct \
  --model_id_or_path microsoft/Phi-3.5-vision-instruct \
  --sft_type lora \
  --dataset latex-ocr-print#20000 \
  --deepspeed default-zero2 \
  --output_dir output \
  --num_train_epochs 5 \
  --use_flash_attn false

To train on a custom dataset, pass your own files instead:

  --dataset train.jsonl \
  --val_dataset val.jsonl \
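
If all samples live in a single JSONL file, the two files passed above could be produced with a split like the following. This is a minimal Python sketch: `split_train_val`, `split_jsonl`, the hold-out ratio, and the seed are hypothetical choices, not ms-swift defaults.

```python
import random

def split_train_val(samples, val_ratio=0.01, seed=42):
    """Return (train, val) after a seeded shuffle, holding out val_ratio of the samples."""
    if not 0.0 < val_ratio < 1.0:
        raise ValueError("val_ratio must be strictly between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_val = max(1, int(len(shuffled) * val_ratio))  # keep at least one validation sample
    return shuffled[n_val:], shuffled[:n_val]

def split_jsonl(src, train_path="train.jsonl", val_path="val.jsonl", **kwargs):
    """Split a JSONL file line by line; blank lines are skipped."""
    with open(src, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]
    train, val = split_train_val(lines, **kwargs)
    for path, part in ((train_path, train), (val_path, val)):
        with open(path, "w", encoding="utf-8") as f:
            f.writelines(line + "\n" for line in part)
    return len(train), len(val)

# In-memory demonstration with placeholder samples.
train, val = split_train_val([f"sample-{i}" for i in range(200)], val_ratio=0.05)
```

Fixing the seed keeps the validation set stable across runs, so checkpoints evaluated at different times see the same held-out samples.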

One of the data samples:

[Screenshot: one data sample]

Number of trainable parameters:

[Screenshot: number of trainable parameters]

GPU Memory

[Screenshot: GPU memory usage during fine-tuning]

Training process

[Screenshot: training log]

Train Loss (Due to time constraints, we only fine-tuned for 1000 steps):

[Figure: train_loss curve]

Here is the inference script for the fine-tuned model. It runs inference on the validation set that was split off automatically during training:

# To run a full test, please set: `--show_dataset_sample -1`
# If using HuggingFace, please add: `USE_HF=1`
# inference only
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir output/phi3_5-vision-instruct/vx-xxx/checkpoint-xxx \
    --load_dataset_config true --use_flash_attn false

# merge-lora & inference
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir output/phi3_5-vision-instruct/vx-xxx/checkpoint-xxx \
    --load_dataset_config true --merge_lora true \
    --safe_serialization false --use_flash_attn false

Results of the fine-tuned model on the validation set (due to time constraints, we only fine-tuned for 1000 steps):

[Screenshot: inference results on the validation set]
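
To put a number on such results, predictions can be compared against the reference LaTeX. The sketch below, in Python, computes exact match and a character-level error rate based on Levenshtein distance. The function names and the prediction/label pairs are made up for illustration; they are not outputs of the run above, and how pairs are read from the saved inference results is left out because that format is not shown here. Whitespace is collapsed before comparison, which is an assumption about what should count as equal.

```python
def normalize(latex):
    """Collapse runs of whitespace to single spaces and trim both ends."""
    return " ".join(latex.split())

def edit_distance(a, b):
    """Levenshtein distance between two strings, using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

def score(pairs):
    """Return exact-match rate and character error rate over (prediction, label) pairs."""
    if not pairs:
        raise ValueError("need at least one (prediction, label) pair")
    exact = total_dist = total_len = 0
    for pred, label in pairs:
        p, l = normalize(pred), normalize(label)
        exact += p == l
        total_dist += edit_distance(p, l)
        total_len += max(len(p), len(l))
    return {
        "exact_match": exact / len(pairs),
        "char_error_rate": total_dist / total_len if total_len else 0.0,
    }

# Illustrative pairs only, not real model outputs.
pairs = [
    (r"\frac { a } { b }", r"\frac  { a } { b }"),  # differs only in spacing
    (r"x ^ { 2 }", r"x ^ { 3 }"),                   # one wrong character
]
result = score(pairs)
```

Exact match is strict for LaTeX, since differently written formulas can render identically, so the character error rate gives a softer second view.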

Jintao-Huang added the "good first issue" label Aug 25, 2024
@praymich

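Hello, I have a question: if I fine-tune this way on roughly 30,000 Chinese image-text samples, will the model be able to reply in Chinese and also learn the information in the data?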
