Merge pull request #34 from Efficient-Large-Model/vila1.5

vila1.5 release

Showing 257 changed files with 22,376 additions and 2,607 deletions.

@@ -0,0 +1,121 @@

# Run VILA demo on an x86_64 machine

## Build TensorRT-LLM
The first step to build TensorRT-LLM is to fetch the sources:
```bash
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs
git lfs install

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git checkout 66ef1df492f7bc9c8eeb01d7e14db01838e3f0bd
git submodule update --init --recursive
git lfs pull
```
Create the TensorRT-LLM Docker image; building the image requires approximately 63 GB of disk space:
```bash
make -C docker release_build
```
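
The steps below assume you are working inside the container. If you have not launched it yet, the TensorRT-LLM repository provides a run target in its docker Makefile; the target name below is an assumption based on the TensorRT-LLM docs, so check `docker/Makefile` in your checkout if it differs:
```bash
# Launch the container built in the previous step (target name assumed;
# verify against docker/Makefile in your TensorRT-LLM checkout).
make -C docker release_run
```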

After launching the Docker container, install the following dependencies inside it:
```bash
pip install git+https://github.com/bfshi/scaling_on_scales.git
pip install git+https://github.com/huggingface/[email protected]
```
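
A quick sanity check of the environment before building engines (this check is an addition, not part of the original instructions):
```bash
# Both packages should import cleanly and report a version inside the container.
python3 -c "import tensorrt_llm, transformers; print(tensorrt_llm.__version__, transformers.__version__)"
```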

## Build the TensorRT engine of the VILA model

### For VILA 1.0

Please refer to the [documentation from TRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal#llava-and-vila) to deploy the model.

### For VILA 1.5

1. Setup
```bash
# clone VILA
git clone https://github.com/Efficient-Large-Model/VILA.git

# enter the demo folder
cd <VILA-repo>/demo_trt_llm

# apply patch to /usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py for vila1.5
sh apply_patch.sh

# download the VILA checkpoint
export MODEL_NAME="vila1.5-2.7b"
git clone https://huggingface.co/Efficient-Large-Model/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```
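
The Hugging Face checkpoint is stored with git-lfs, so it is worth confirming that the weight files were actually fetched rather than left as LFS pointer stubs; this check is an addition to the original steps:
```bash
# Ensure LFS objects were downloaded; a multi-GB directory size indicates real weights.
git -C tmp/hf_models/${MODEL_NAME} lfs pull
du -sh tmp/hf_models/${MODEL_NAME}
```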

2. Build the TensorRT engine using `FP16` and run inference

Build the TensorRT engine for the LLaMA part of VILA from the HF checkpoint using `FP16`:
```bash
python convert_checkpoint.py \
    --model_dir tmp/hf_models/${MODEL_NAME} \
    --output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --dtype float16

trtllm-build \
    --checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --output_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
    --gemm_plugin float16 \
    --use_fused_mlp \
    --max_batch_size 1 \
    --max_input_len 2048 \
    --max_output_len 512 \
    --max_multimodal_len 4096
```
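
If the build succeeds, the serialized engine and its config are written to the `--output_dir` given above; a quick check (path taken from the command, exact file names depend on the TensorRT-LLM version):
```bash
ls trt_engines/${MODEL_NAME}/fp16/1-gpu   # expect an engine file (e.g. rank0.engine) plus config.json
```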

3. Build TensorRT engines for the visual components
```bash
# --vila_path ../ points at the VILA repo root, since step 1 changed into <VILA-repo>/demo_trt_llm
python build_visual_engine.py --model_path tmp/hf_models/${MODEL_NAME} --model_type vila --vila_path ../
```

4. Run the example script
```bash
python run.py \
    --max_new_tokens 100 \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --visual_engine_dir visual_engines/${MODEL_NAME} \
    --llm_engine_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
    --image_file=av.png,https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png \
    --input_text="<image>\n<image>\n Please elaborate what you see in the images?" \
    --run_profiling

# example output:
...
[Q] <image>\n<image>\n Please elaborate what you see in the images?
[04/30/2024-21:32:11] [TRT-LLM] [I]
[A] ['The first image shows a busy street scene with a car driving through a crosswalk. There are several people walking on the sidewalk, and a cyclist is also visible. The second image captures a beautiful sunset with the iconic Merlion statue spouting water into the water body in the foreground. The Merlion statue is a famous landmark in Singapore, and the water spout is a popular feature of the statue.']
...
```
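
`--image_file` accepts a comma-separated mix of local paths and URLs, and `--input_text` should carry one `<image>` tag per image. As an illustration, here is a single-image variant reusing only the flags already shown above (the prompt text is an assumption):
```bash
python run.py \
    --max_new_tokens 100 \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --visual_engine_dir visual_engines/${MODEL_NAME} \
    --llm_engine_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
    --image_file=av.png \
    --input_text="<image>\n Please describe this image."
```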

5. (Optional) VILA can also be used with the other quantization options that LLaMA supports, such as SmoothQuant and INT4 AWQ. The instructions in the LLaMA README for enabling SmoothQuant and INT4 AWQ can be reused to generate quantized TRT engines for the LLM component of VILA.
```bash
python quantization/quantize.py \
    --model_dir tmp/hf_models/${MODEL_NAME} \
    --output_dir tmp/trt_models/${MODEL_NAME}/int4_awq/1-gpu \
    --dtype float16 \
    --qformat int4_awq \
    --calib_size 32

trtllm-build \
    --checkpoint_dir tmp/trt_models/${MODEL_NAME}/int4_awq/1-gpu \
    --output_dir trt_engines/${MODEL_NAME}/int4_awq/1-gpu \
    --gemm_plugin float16 \
    --max_batch_size 1 \
    --max_input_len 2048 \
    --max_output_len 512 \
    --max_multimodal_len 4096

python run.py \
    --max_new_tokens 100 \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --visual_engine_dir visual_engines/${MODEL_NAME} \
    --llm_engine_dir trt_engines/${MODEL_NAME}/int4_awq/1-gpu \
    --image_file=av.png,https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png \
    --input_text="<image>\n<image>\n Please elaborate what you see in the images?" \
    --run_profiling
```
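
The commands above cover the INT4 AWQ path. For SmoothQuant, TensorRT-LLM's `quantize.py` takes a different `--qformat`; the `int8_sq` value below comes from TensorRT-LLM's LLaMA examples and is an assumption for this demo, so cross-check the LLaMA README in your checkout:
```bash
# SmoothQuant calibration sketch (qformat value assumed from TRT-LLM's LLaMA examples)
python quantization/quantize.py \
    --model_dir tmp/hf_models/${MODEL_NAME} \
    --output_dir tmp/trt_models/${MODEL_NAME}/int8_sq/1-gpu \
    --dtype float16 \
    --qformat int8_sq \
    --calib_size 32

# The trtllm-build and run.py invocations then mirror the INT4 AWQ commands above,
# with int4_awq replaced by int8_sq in the directory names.
```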

@@ -0,0 +1,14 @@

#!/bin/bash

# Define the file to be modified
FILE_PATH="/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py"

# Back up the original file before modification
cp "$FILE_PATH" "${FILE_PATH}.bak"

# Replace the strings: comment out the VILA-specific registration block in convert.py
# sed -i ':a;N;$!ba;s|hf_config = LlavaConfig.from_pretrained(hf_model).text_config|hf_config = LlavaConfig.from_pretrained(hf_model).text_config\n if hf_config.model_type == "llava_llama":\n hf_config.llm_cfg["architecture"] = hf_config.llm_cfg["architectures"]\n hf_config.llm_cfg["dtype"] = hf_config.llm_cfg["torch_dtype"]\n hf_config = PretrainedConfig.from_dict(hf_config.llm_cfg)|g' $FILE_PATH
sed -i ':a;N;$!ba;s|if "vila" in model_dir:\n sys.path.append(model_dir + "/../VILA")\n from llava.model import LlavaConfig, LlavaLlamaForCausalLM\n AutoConfig.register("llava_llama", LlavaConfig)\n AutoModelForCausalLM.register(LlavaConfig, LlavaLlamaForCausalLM)|# if "vila" in model_dir:\n# sys.path.append(model_dir + "/../VILA")\n# from llava.model import LlavaConfig, LlavaLlamaForCausalLM\n# AutoConfig.register("llava_llama", LlavaConfig)\n# AutoModelForCausalLM.register(LlavaConfig, LlavaLlamaForCausalLM)|g' "$FILE_PATH"

# Inform the user
echo "Replacement done. Original file backed up as ${FILE_PATH}.bak"