To train VILA, we used the following datasets:

| Stage | Datasets |
| --- | --- |
| 1. Initialize projector | CC3M |
| 2. Pre-training | MMC4-core, COYO-700M subset |
| 3. SFT | LLaVA-1.5, VFLAN, ShareGPT, TextFLAN |
We use LLaVA-CC3M-Pretrain-595K to train the visual language projector:
```bash
mkdir -p ./playground/data/LLaVA-Pretrain
cd ./playground/data/LLaVA-Pretrain

# Download chat.json and rename it
huggingface-cli download liuhaotian/LLaVA-CC3M-Pretrain-595K chat.json --repo-type dataset --local-dir . --local-dir-use-symlinks False
mv chat.json LLaVA-CC3M-Pretrain-595K.json

# Download images.zip and extract it
huggingface-cli download liuhaotian/LLaVA-CC3M-Pretrain-595K images.zip --repo-type dataset --local-dir . --local-dir-use-symlinks False
unzip images.zip -d images
```
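For a quick sanity check, each record in the downloaded JSON pairs an image under `images/` with a short caption-style conversation. A minimal sketch, assuming the standard LLaVA pretraining record format (`image` plus a `conversations` list):

```python
import json

# Load the projector-initialization annotations downloaded above.
with open("./playground/data/LLaVA-Pretrain/LLaVA-CC3M-Pretrain-595K.json") as f:
    records = json.load(f)

print(len(records))  # expect roughly 595K entries

# Each record pairs an image file under images/ with a two-turn conversation:
# a human prompt containing the <image> placeholder and a caption response.
sample = records[0]
print(sample["image"])  # relative path under images/
for turn in sample["conversations"]:
    print(turn["from"], ":", turn["value"][:80])
```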
Due to compute constraints, we pre-train VILA on the smaller MMC4-core subset instead of the full MMC4 set.
- First, download the annotations of the MMC4-core dataset from https://github.com/allenai/mmc4. We used the non-fewer-face split, and you may need to request access to it.
- Modify the input and output paths in `mmc4_downloader.py`, then run the following script to crawl the MMC4 images:
```bash
cd mmc4
python mmc4_downloader.py
```
Note that some image URLs have expired, so you may end up with only a subset of the entire corpus. Crawling may take a long time; optionally, you can shard the workload over multiple concurrent jobs/machines to speed up the process:
```bash
# Provide the start and end index of the jsonl shards. There are 23098 - 14 shards in total.
# python mmc4_downloader.py <start_idx> <end_idx>
python mmc4_downloader.py 0 1000     # worker 1
python mmc4_downloader.py 1000 2000  # worker 2
```
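If you prefer to drive the sharding from a single machine, a minimal sketch along the following lines works. The worker count and shard arithmetic here are illustrative, not part of the original scripts:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 23098   # total jsonl shards (see the comment above)
NUM_WORKERS = 8      # illustrative; tune to your bandwidth and CPU
CHUNK = -(-NUM_SHARDS // NUM_WORKERS)  # ceiling division

def run_worker(start: int) -> None:
    end = min(start + CHUNK, NUM_SHARDS)
    # Each worker crawls a contiguous range of jsonl shards.
    subprocess.run(["python", "mmc4_downloader.py", str(start), str(end)], check=True)

with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    list(pool.map(run_worker, range(0, NUM_SHARDS, CHUNK)))
```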
- Filter out invalid samples in MMC4:

```bash
python mmc4_filter_and_counter.py
```
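Conceptually, this step drops documents whose images could not be fetched during crawling. A rough, hypothetical sketch of the idea, using the `image_info` field from the MMC4 annotation format (the actual rules live in `mmc4_filter_and_counter.py` and may differ):

```python
import json
import os

# Hypothetical sketch: keep only documents with at least one downloaded image.
kept, total = [], 0
with open("docs_shard_0_v2.jsonl") as f:  # illustrative shard filename
    for line in f:
        total += 1
        doc = json.loads(line)
        images = [img for img in doc["image_info"]
                  if os.path.exists(os.path.join("images", img["image_name"]))]
        if images:  # drop documents with no usable image
            doc["image_info"] = images
            kept.append(doc)

print(f"kept {len(kept)} / {total} documents")
```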
- Merge images and text into a unified pickle file for each shard:

```bash
python mmc4_merger.py
```
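The merge packs each shard's interleaved text and downloaded images into a single pickle so training does not have to open thousands of small files. A hypothetical sketch of that packing; the record layout and filenames are illustrative, and the real ones are defined by `mmc4_merger.py`:

```python
import json
import pickle

# Hypothetical sketch: bundle one shard's text and image bytes into a pickle.
records = []
with open("docs_shard_0_v2_filtered.jsonl") as f:  # illustrative filename
    for line in f:
        doc = json.loads(line)
        images = []
        for img in doc["image_info"]:
            with open(f"images/{img['image_name']}", "rb") as imf:
                images.append(imf.read())  # raw bytes, decoded at training time
        records.append({"text_list": doc["text_list"], "images": images})

with open("shard_0.pkl", "wb") as out:
    pickle.dump(records, out)
```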
- Download the metadata of COYO-700M:

```bash
huggingface-cli download kakaobrain/coyo-700m --repo-type dataset --local-dir coyo-700m --local-dir-use-symlinks False
```
- Crawl the COYO images. Note that we only keep the 20% of samples in each shard with the highest CLIP similarity, to balance compute budget and data quality; a sketch of this selection follows the script below. There are 128 annotation shards in total; download each one with the following script:
```bash
cd coyo
for SHARD in {0..127}; do
    python coyo_downloader.py $SHARD
done
```
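The top-20% selection can be done directly on the released metadata, which ships per-sample CLIP scores. A minimal sketch, assuming the parquet layout of the downloaded `coyo-700m` metadata and its `clip_similarity_vitb32` column (the actual selection is handled inside `coyo_downloader.py`):

```python
import glob

import pandas as pd

# Minimal sketch: keep the top 20% of each metadata shard by CLIP similarity.
for path in sorted(glob.glob("coyo-700m/data/*.parquet")):
    df = pd.read_parquet(path, columns=["id", "url", "text", "clip_similarity_vitb32"])
    cutoff = df["clip_similarity_vitb32"].quantile(0.8)
    top = df[df["clip_similarity_vitb32"] >= cutoff]
    print(path, f"kept {len(top)} / {len(df)} rows")
    # top[["url", "text"]] is the subset the downloader then crawls.
```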
- Split the downloaded COYO data into multiple shards:

```bash
python coyo_splitter.py
```
We use `llava_v1_5_mix665k.json` in our experiments. Please download this dataset from the LLaVA authors:
```bash
mkdir -p ./playground/data/LLaVA-Pretrain
cd ./playground/data/LLaVA-Pretrain
huggingface-cli download liuhaotian/LLaVA-Instruct-150K llava_v1_5_mix665k.json --repo-type dataset --local-dir . --local-dir-use-symlinks False
```
- Download the FLAN dataset:

```bash
huggingface-cli download Open-Orca/FLAN --repo-type dataset --local-dir FLAN --local-dir-use-symlinks False
```
- Preprocess the FLAN dataset (sample 1M examples from the 378M total):

```bash
cd sft
python preprocess_flan.py
```
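Sampling 1M of 378M examples does not require materializing the whole dataset; a single streaming pass with a fixed acceptance rate is enough. A minimal sketch, assuming the `datasets` library can stream the Open-Orca/FLAN mirror directly (the actual sampling and formatting are defined in `preprocess_flan.py`):

```python
import random

from datasets import load_dataset

random.seed(0)
TARGET, TOTAL = 1_000_000, 378_000_000
rate = TARGET / TOTAL  # accept each example with probability ~1/378

# Stream so the 378M examples never need to fit in memory.
stream = load_dataset("Open-Orca/FLAN", split="train", streaming=True)

sampled = []
for example in stream:
    if random.random() < rate:
        sampled.append(example)
    if len(sampled) >= TARGET:
        break

print(f"collected {len(sampled)} examples")
```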
- Download the M3IT dataset:

```bash
huggingface-cli download MMInstruction/M3IT --repo-type dataset --local-dir M3IT --local-dir-use-symlinks False
```
- Preprocess the M3IT dataset:

```bash
python preprocess_m3it.py
```
- (Optional) Split FLAN+M3IT into multiple chunks to reduce CPU memory pressure during training:

```bash
python split_vflan.py
```
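The chunking itself is straightforward: write the merged sample list back out as fixed-size pieces so each dataloader worker only holds one piece in memory at a time. A minimal sketch; the chunk size and filenames are illustrative, and the real ones are set in `split_vflan.py`:

```python
import json

CHUNK_SIZE = 100_000  # illustrative; pick a size that fits your CPU memory

# Hypothetical input: the merged FLAN+M3IT sample list produced above.
with open("vflan_merged.json") as f:
    samples = json.load(f)

for i in range(0, len(samples), CHUNK_SIZE):
    chunk = samples[i : i + CHUNK_SIZE]
    with open(f"vflan_chunk_{i // CHUNK_SIZE:04d}.json", "w") as out:
        json.dump(chunk, out)
```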
The ShareGPT data can be obtained from mit-han-lab/ShareGPT4V. Note that the original ShareGPT4V dataset contains some samples with bad file IDs (sa_XXXX) and repetitive responses. We filtered out those examples, reducing the data from 100K to 96K samples (for captioning) and from 1.2M to 1.17M samples (for pre-training), and then re-combined them into a single file:
```bash
huggingface-cli download mit-han-lab/ShareGPT4V --repo-type dataset --local-dir ShareGPT4V --local-dir-use-symlinks False
```
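The filtering described above is easy to reproduce in outline: drop entries whose image ID starts with `sa_`, drop duplicated responses, and merge the cleaned caption and pre-training files. A rough sketch of that idea, assuming LLaVA-style `image`/`conversations` fields and illustrative filenames (the released file already has this filtering applied):

```python
import json

def clean(path: str) -> list:
    """Drop sa_XXXX samples and repetitive responses from one ShareGPT4V file."""
    with open(path) as f:
        samples = json.load(f)
    seen, kept = set(), []
    for s in samples:
        if "sa_" in s.get("image", ""):
            continue  # skip the bad file IDs noted above
        response = s["conversations"][-1]["value"]
        if response in seen:
            continue  # skip repetitive responses
        seen.add(response)
        kept.append(s)
    return kept

# Illustrative filenames; re-combine the cleaned sets into a single file.
merged = clean("sharegpt4v_caption_100k.json") + clean("sharegpt4v_pretrain_1200k.json")
with open("sharegpt4v_combined.json", "w") as out:
    json.dump(merged, out)
```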