To train VILA, we used the following datasets:

| Stage | Datasets |
| --- | --- |
| 1. Initialize projector | CC3M |
| 2. Pre-training | MMC4-core, COYO-700M subset |
| 3. SFT | LLaVA-1.5, VFLAN, ShareGPT, TextFLAN |
We use LLaVA-CC3M-Pretrain-595K to train the visual language projector:
```bash
mkdir -p ./playground/data/LLaVA-Pretrain
cd ./playground/data/LLaVA-Pretrain

# Download chat.json and rename it
huggingface-cli download liuhaotian/LLaVA-CC3M-Pretrain-595K chat.json --repo-type dataset --local-dir . --local-dir-use-symlinks False
mv chat.json LLaVA-CC3M-Pretrain-595K.json

# Download images.zip and extract it
huggingface-cli download liuhaotian/LLaVA-CC3M-Pretrain-595K images.zip --repo-type dataset --local-dir . --local-dir-use-symlinks False
unzip images.zip -d images
```
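For a quick sanity check, each record in the downloaded JSON pairs an image under `images/` with a short caption-style conversation. A minimal sketch, assuming the standard LLaVA pretraining record format (`image` plus a `conversations` list):

```python
import json

# Load the projector-initialization annotations downloaded above.
with open("./playground/data/LLaVA-Pretrain/LLaVA-CC3M-Pretrain-595K.json") as f:
    records = json.load(f)

print(len(records))  # expect roughly 595K entries

# Each record pairs an image file under images/ with a two-turn conversation:
# a human prompt containing the <image> placeholder and a caption response.
sample = records[0]
print(sample["image"])  # relative path under images/
for turn in sample["conversations"]:
    print(turn["from"], ":", turn["value"][:80])
```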
Due to compute constraints, we pre-train VILA on the smaller MMC4-core subset instead of the full MMC4 set.
- First, download the annotations of the MMC4-core dataset from https://github.com/allenai/mmc4. We used the non-fewer-face split, and you may need to request access to it.
- Modify the input and output paths in `mmc4_downloader.py`, then run the following script to crawl the MMC4 images:
```bash
cd mmc4
python mmc4_downloader.py
```
Note that some image URLs have expired, so you may end up with only a subset of the entire corpus. Crawling may take a long time; optionally, you can shard the workload over multiple concurrent jobs/machines to speed up the process:
```bash
# Provide the start and end index of the jsonl shards. There are 23098 - 14 shards in total.
# python mmc4_downloader.py <start_idx> <end_idx>
python mmc4_downloader.py 0 1000     # worker 1
python mmc4_downloader.py 1000 2000  # worker 2
```
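If you prefer to drive the sharding from a single machine, a minimal sketch along the following lines works. The worker count and shard arithmetic here are illustrative, not part of the original scripts:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 23098   # total jsonl shards (see the comment above)
NUM_WORKERS = 8      # illustrative; tune to your bandwidth and CPU
CHUNK = -(-NUM_SHARDS // NUM_WORKERS)  # ceiling division

def run_worker(start: int) -> None:
    end = min(start + CHUNK, NUM_SHARDS)
    # Each worker crawls a contiguous range of jsonl shards.
    subprocess.run(["python", "mmc4_downloader.py", str(start), str(end)], check=True)

with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    list(pool.map(run_worker, range(0, NUM_SHARDS, CHUNK)))
```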
- Filter out invalid samples in MMC4:

```bash
python mmc4_filter_and_counter.py
```
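Conceptually, this step drops documents whose images could not be fetched during crawling. A rough, hypothetical sketch of the idea, using the `image_info` field from the MMC4 annotation format (the actual rules live in `mmc4_filter_and_counter.py` and may differ):

```python
import json
import os

# Hypothetical sketch: keep only documents with at least one downloaded image.
kept, total = [], 0
with open("docs_shard_0_v2.jsonl") as f:  # illustrative shard filename
    for line in f:
        total += 1
        doc = json.loads(line)
        images = [img for img in doc["image_info"]
                  if os.path.exists(os.path.join("images", img["image_name"]))]
        if images:  # drop documents with no usable image
            doc["image_info"] = images
            kept.append(doc)

print(f"kept {len(kept)} / {total} documents")
```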
- Merge images and text into a unified pickle file for each shard:

```bash
python mmc4_merger.py
```
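The merge packs each shard's interleaved text and downloaded images into a single pickle so training does not have to open thousands of small files. A hypothetical sketch of that packing; the record layout and filenames are illustrative, and the real ones are defined by `mmc4_merger.py`:

```python
import json
import pickle

# Hypothetical sketch: bundle one shard's text and image bytes into a pickle.
records = []
with open("docs_shard_0_v2_filtered.jsonl") as f:  # illustrative filename
    for line in f:
        doc = json.loads(line)
        images = []
        for img in doc["image_info"]:
            with open(f"images/{img['image_name']}", "rb") as imf:
                images.append(imf.read())  # raw bytes, decoded at training time
        records.append({"text_list": doc["text_list"], "images": images})

with open("shard_0.pkl", "wb") as out:
    pickle.dump(records, out)
```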
- Download the metadata of COYO-700M:

```bash
huggingface-cli download kakaobrain/coyo-700m --repo-type dataset --local-dir coyo-700m --local-dir-use-symlinks False
```
- Crawl the COYO images. Note that we only keep the 20% of samples in each shard with the highest CLIP similarity, to balance compute budget and data quality; a sketch of this selection follows the script below. There are 128 annotation shards in total; download each one with the following script:
```bash
cd coyo
for SHARD in {0..127}; do
    python coyo_downloader.py $SHARD
done
```
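The top-20% selection can be done directly on the released metadata, which ships per-sample CLIP scores. A minimal sketch, assuming the parquet layout of the downloaded `coyo-700m` metadata and its `clip_similarity_vitb32` column (the actual selection is handled inside `coyo_downloader.py`):

```python
import glob

import pandas as pd

# Minimal sketch: keep the top 20% of each metadata shard by CLIP similarity.
for path in sorted(glob.glob("coyo-700m/data/*.parquet")):
    df = pd.read_parquet(path, columns=["id", "url", "text", "clip_similarity_vitb32"])
    cutoff = df["clip_similarity_vitb32"].quantile(0.8)
    top = df[df["clip_similarity_vitb32"] >= cutoff]
    print(path, f"kept {len(top)} / {len(df)} rows")
    # top[["url", "text"]] is the subset the downloader then crawls.
```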
- Split the downloaded COYO data into multiple shards:

```bash
python coyo_splitter.py
```
We use `llava_v1_5_mix665k.json` in our experiments. Please download this dataset from the LLaVA authors:
```bash
mkdir -p ./playground/data/LLaVA-Pretrain
cd ./playground/data/LLaVA-Pretrain
huggingface-cli download liuhaotian/LLaVA-Instruct-150K llava_v1_5_mix665k.json --repo-type dataset --local-dir . --local-dir-use-symlinks False
```
- Download the FLAN dataset:

```bash
huggingface-cli download Open-Orca/FLAN --repo-type dataset --local-dir FLAN --local-dir-use-symlinks False
```
- Preprocess the FLAN dataset (sample 1M examples from the 378M total):

```bash
cd sft
python preprocess_flan.py
```
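Sampling 1M of 378M examples does not require materializing the whole dataset; a single streaming pass with a fixed acceptance rate is enough. A minimal sketch, assuming the `datasets` library can stream the Open-Orca/FLAN mirror directly (the actual sampling and formatting are defined in `preprocess_flan.py`):

```python
import random

from datasets import load_dataset

random.seed(0)
TARGET, TOTAL = 1_000_000, 378_000_000
rate = TARGET / TOTAL  # accept each example with probability ~1/378

# Stream so the 378M examples never need to fit in memory.
stream = load_dataset("Open-Orca/FLAN", split="train", streaming=True)

sampled = []
for example in stream:
    if random.random() < rate:
        sampled.append(example)
    if len(sampled) >= TARGET:
        break

print(f"collected {len(sampled)} examples")
```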
- Download the M3IT dataset:

```bash
huggingface-cli download MMInstruction/M3IT --repo-type dataset --local-dir M3IT --local-dir-use-symlinks False
```
- Preprocess the M3IT dataset:

```bash
python preprocess_m3it.py
```
- (Optional) Split FLAN+M3IT into multiple chunks to reduce CPU memory pressure during training:

```bash
python split_vflan.py
```
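The chunking itself is straightforward: write the merged sample list back out as fixed-size pieces so each dataloader worker only holds one piece in memory at a time. A minimal sketch; the chunk size and filenames are illustrative, and the real ones are set in `split_vflan.py`:

```python
import json

CHUNK_SIZE = 100_000  # illustrative; pick a size that fits your CPU memory

# Hypothetical input: the merged FLAN+M3IT sample list produced above.
with open("vflan_merged.json") as f:
    samples = json.load(f)

for i in range(0, len(samples), CHUNK_SIZE):
    chunk = samples[i : i + CHUNK_SIZE]
    with open(f"vflan_chunk_{i // CHUNK_SIZE:04d}.json", "w") as out:
        json.dump(chunk, out)
```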
The ShareGPT data can be obtained from mit-han-lab/ShareGPT4V. Note that the original ShareGPT4V dataset contains some samples with bad file IDs (sa_XXXX) and repetitive responses. We filtered out those examples, reducing the data from 100K to 96K samples (for captioning) and from 1.2M to 1.17M samples (for pre-training), and then re-combined them into a single file:
```bash
huggingface-cli download mit-han-lab/ShareGPT4V --repo-type dataset --local-dir ShareGPT4V --local-dir-use-symlinks False
```
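The filtering described above is easy to reproduce in outline: drop entries whose image ID starts with `sa_`, drop duplicated responses, and merge the cleaned caption and pre-training files. A rough sketch of that idea, assuming LLaVA-style `image`/`conversations` fields and illustrative filenames (the released file already has this filtering applied):

```python
import json

def clean(path: str) -> list:
    """Drop sa_XXXX samples and repetitive responses from one ShareGPT4V file."""
    with open(path) as f:
        samples = json.load(f)
    seen, kept = set(), []
    for s in samples:
        if "sa_" in s.get("image", ""):
            continue  # skip the bad file IDs noted above
        response = s["conversations"][-1]["value"]
        if response in seen:
            continue  # skip repetitive responses
        seen.add(response)
        kept.append(s)
    return kept

# Illustrative filenames; re-combine the cleaned sets into a single file.
merged = clean("sharegpt4v_caption_100k.json") + clean("sharegpt4v_pretrain_1200k.json")
with open("sharegpt4v_combined.json", "w") as out:
    json.dump(merged, out)
```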