Implement LLaVA using candle.
The code is based on https://github.com/haotian-liu/LLaVA, so the llava-hf version of the config may behave differently.
The llava-hf models include tokenizer.json, so if you want a pure-Rust experience, I suggest using the llava-hf version.
So far I have tested liuhaotian/llava-v1.6-vicuna-7b and llava-hf/llava-v1.6-vicuna-7b-hf. Memory usage likely has room for optimization.
cargo run # default args: model liuhaotian/llava-v1.6-vicuna-7b, image image/llava_logo.png, prompt "is this a cat?"
cargo run -- --image-file "images/llava_v1_5_radar.jpg" --prompt "what does this picture show?"
cargo run -- --model-path "llava-hf/llava-v1.6-vicuna-7b-hf" # use llava-hf model
- Download the corresponding weights from Hugging Face
- Load the model weights and configs
  - general llava config (need to rethink what is necessary)
  - Vision tower (CLIP)
  - image processor (partial; the format of 'size' and 'crop size' is not fully compatible with Python transformers)
  - LLM
    - llama/vicuna
    - mistral
- image preprocess
  - clip image processor
  - 'anyres' image preprocess
  - 'pad' image preprocess
- conv template (partial; only conv_llava_v1 and conv_chatml_direct are implemented, which is enough for LLaVA v1.6)
- Model structure implementation
  - Vision tower
  - LLM
    - modify llama code
      - output embedding result
      - generate from embed tensors
- model forward
  - Vision tower
    - feature select
  - LLM
  - process multiple images
    - read multiple images
    - multiple-image patch processing
  - concatenate image features and text features
  - truncate the concatenated features
- main process
  - load model
  - load image
  - load text
  - tokenize text
  - forward
    - single image
  - output
  - KV cache
  - conversation mode
  - (long term) web?
- quantization
  - 4-bit
  - 8-bit
- (long term) Expand candle operators, including:
  - split
  - nonzero
  - where
- (top priority) migrate to support llava-hf series models
  - determine whether it is a llava-hf model
  - translate config
  - translate model
  - take care of constants such as image_token_index
  - modify image processor config
- LoRA
- contribution to other projects
- memory optimization for LLaVA 1.6
- (long term) model training
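The 'anyres' image preprocess item in the task list first has to choose a target resolution for each input image. A minimal sketch of that selection, modeled on the `select_best_resolution` logic in the upstream Python LLaVA repo, might look like the following. The function signature, the `LLAVA_V16_GRID` constant (assumed to match the `image_grid_pinpoints` of the v1.6 vicuna-7b config, as width and height pairs), and the use of integer arithmetic in place of a floating-point scale factor are all assumptions made for illustration; this is not necessarily the code used in this crate.

```rust
/// Candidate (width, height) resolutions. Assumed to mirror the
/// `image_grid_pinpoints` of LLaVA v1.6 vicuna-7b; illustrative only.
const LLAVA_V16_GRID: [(u64, u64); 5] =
    [(336, 672), (672, 336), (672, 672), (1008, 336), (336, 1008)];

/// Pick the candidate that keeps the most image detail and, on ties,
/// wastes the least canvas area. Returns `None` for an empty candidate
/// list or a degenerate (zero-sized) image.
fn select_best_resolution(
    original: (u64, u64),
    candidates: &[(u64, u64)],
) -> Option<(u64, u64)> {
    let (orig_w, orig_h) = original;
    if orig_w == 0 || orig_h == 0 {
        return None;
    }

    let mut best: Option<(u64, u64)> = None;
    let mut max_effective = 0u64;
    let mut min_wasted = u64::MAX;

    for &(w, h) in candidates {
        // Fit the image inside the candidate while keeping its aspect ratio.
        // Cross-multiplication replaces the floating-point scale factor.
        let (fit_w, fit_h) = if w * orig_h <= h * orig_w {
            (w, orig_h * w / orig_w) // width is the limiting side
        } else {
            (orig_w * h / orig_h, h) // height is the limiting side
        };

        // Upscaling adds no information, so cap at the original pixel count.
        let effective = (fit_w * fit_h).min(orig_w * orig_h);
        // Canvas area that does not carry image information.
        let wasted = w * h - effective;

        let better = best.is_none()
            || effective > max_effective
            || (effective == max_effective && wasted < min_wasted);
        if better {
            best = Some((w, h));
            max_effective = effective;
            min_wasted = wasted;
        }
    }
    best
}
```

With these assumed candidates, `select_best_resolution((1000, 500), &LLAVA_V16_GRID)` returns `Some((672, 336))`: the landscape image fills that canvas exactly, so no area is wasted. Earlier candidates win ties because the comparisons are strict.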
conda create -n llava python=3.10
conda activate llava
pip install transformers protobuf
pip install -U huggingface_hub
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download liuhaotian/llava-v1.6-vicuna-7b
- Tested only on liuhaotian/llava-v1.6-vicuna-7b and llava-hf/llava-v1.6-vicuna-7b-hf