First version of the code has been released.
This is the official implementation for TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models
by Tingyu Qu, Mingxiao Li, Tinne Tuytelaars, Marie-Francine Moens.
We explore various visual token compression strategies. Our TS-LLaVA achieves state-of-the-art performance among training-free video LLMs.
Ranked #10 in average accuracy for multiple-choice questions on MLVU-test.
To create the conda environment, please run:
conda env create -n llava --file llava.yml
conda activate llava
- Two packages, i.e. llava and flash-attention, are commented out in the yml file, as installing them directly can cause problems. Please refer to the original LLaVA repo for installing them.
- One can also directly follow the installation process as recorded in the original LLaVA repo.
The checkpoints for LLaVA-v1.6 can be found here:
git lfs clone https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b .ckpt/llava-v1.6-vicuna-7b
git lfs clone https://huggingface.co/liuhaotian/llava-v1.6-34b .ckpt/llava-v1.6-34b
- After downloading, the checkpoints should be stored in the ckpt folder.
[Optional] To enable GPT evaluation for open-ended video QA, please do the following:
export OPENAI_API_KEY=$YOUR_OPENAI_API_KEY
- We prepare the ground-truth question and answer files based on IG-VLM and SF-LLaVA, and put them under playground/gt_qa_files.
  - NExT-QA: Download the NExT_QA.csv from here
  - EgoSchema: Download the EgoSchema.csv from here
  - IntentQA: Download the IntentQA.csv from here
  If you want to run our model for Open-Ended VideoQA and video-based Text Generation, please download the datasets as follows:
  - MSVD-QA: Download the MSVD_QA.csv from here
  - MSRVTT-QA: Download the MSRVTT_QA.csv from here
  - TGIF-QA: Download the TGIF_FrameQA.csv from here
  - Activitynet-QA: Download the Activitynet_QA.csv from here
  - VCGBench
    - Download all files under text_generation_benchmark
    - Reformat the files by running
      python scripts/data/prepare_vcgbench_qa_file.py --qa_folder $TEXT_GENERATION_BENCHMARK
- Reformatting the files:
  - After getting the csv files, please reformat the files (apart from VCGBench) by running
    python scripts/data/prepare_{DATASET}_file.py --qa_file $PATH_TO_CSV_FILE
    - Replace DATASET with the name of the dataset. Check scripts/data to make sure the name is correct.
- Download the raw videos from the official websites.
  - Multiple Choice VideoQA
  - Open-Ended VideoQA & video-based Text Generation:
    - [Recommended] Option 1: Follow the instructions in Video-LLaVA to download raw videos.
    - Option 2: Download videos from the data owners.
- Store the videos in a directory of your choice (BASE_VIDEO_DIR), and replace BASE_VIDEO_DIR in the scripts when needed.
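The reformatting step above depends on the exact script names under scripts/data. As a minimal sketch, assuming the scripts follow the prepare_{DATASET}_file.py pattern quoted above, the helper below lists the DATASET names found in a directory and builds the matching command. The function names `available_datasets` and `reformat_command` are hypothetical and not part of this repo.

```python
import re
from pathlib import Path

# Pattern taken from the README text: scripts/data/prepare_{DATASET}_file.py
_SCRIPT_RE = re.compile(r"^prepare_(?P<dataset>.+)_file\.py$")


def available_datasets(script_dir):
    """Return the sorted DATASET names found in script_dir (hypothetical helper)."""
    names = []
    for path in Path(script_dir).iterdir():
        match = _SCRIPT_RE.match(path.name)
        if match:
            names.append(match.group("dataset"))
    return sorted(names)


def reformat_command(dataset, csv_path, script_dir="scripts/data"):
    """Build the reformat command for one dataset as an argument list."""
    script = Path(script_dir) / f"prepare_{dataset}_file.py"
    return ["python", script.as_posix(), "--qa_file", str(csv_path)]
```

Note that prepare_vcgbench_qa_file.py also matches this pattern (as `vcgbench_qa`) but takes --qa_folder instead of --qa_file, so it should still be run as shown in the VCGBench item.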
- By default, we use all the visible GPUs on the node for model inference. To manually select GPUs, please modify CUDA_VISIBLE_DEVICES in the scripts accordingly.
- Please note that model inference with TS-LLaVA-34B requires GPUs with at least 80 GB of memory.
- In each script, change CKPT_NAME and model_path accordingly.
cd scripts/infer_videos
bash run_qa_{DATASET_NAME}.sh {AGGREGATION_METHOD} {NUM_FRAMES} {NUM_SAMPLED_TOKENS} {PROMPT_VERSION} {IMAGE_ASPECT_RATIO}
The evaluation is automatically done after inference
- Replace DATASET_NAME with one of {nextqa, egoschema, intentqa}.
- AGGREGATION_METHOD refers to the visual token compression method of choice. The default for TS-LLaVA is V2. You can select from:
  - X1, X2, X3: only use the thumbnail image.
  - Z1, Z2, Z3: use multiple thumbnail images (remember to set the total number of frames to be divisible by the number of frames per thumbnail image).
  - Y1, Y2, Y3: use both the thumbnail image and sampled visual tokens, and prepend the thumbnail image tokens to the sampled visual tokens.
  - V1, V2, V3: similar to Y1, Y2, Y3, but the sampled tokens are prepended to the thumbnail image tokens.
  - W1, W2, W3 & U1, U2, U3: use multiple thumbnail images with sampled visual tokens (for ablation studies; remember to set the number of sampled tokens accordingly).
  - Here 1, 2 and 3 correspond to using 4, 6, and 8 frames per thumbnail image, respectively.
  - For details, please refer to llava_arch.py
- NUM_FRAMES refers to the total number of frames used. The default for TS-LLaVA is 50.
- NUM_SAMPLED_TOKENS refers to the number of sampled tokens. The default for TS-LLaVA is 2880.
- PROMPT_VERSION refers to the textual prompt version used. The default for TS-LLaVA is v4. Please refer to get_prompt.py for more information.
- IMAGE_ASPECT_RATIO refers to the type of image aspect ratio. The default for TS-LLaVA is resize, which resizes each frame to 336$\times$336.
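To make the argument conventions above concrete, here is a minimal sketch that validates the five positional arguments and assembles the inference command. The values (method families, 4/6/8 frames per thumbnail, dataset names, defaults) are copied from the list above. The helper names are hypothetical and not part of this repo, and the divisibility check is applied only to the Z family because that is the only case the list states it for.

```python
# Values copied from the README; helper names are hypothetical.
FRAMES_PER_THUMBNAIL = {"1": 4, "2": 6, "3": 8}
METHOD_FAMILIES = ("X", "Z", "Y", "V", "W", "U")
MCQA_DATASETS = ("nextqa", "egoschema", "intentqa")


def frames_per_thumbnail(method):
    """Map an AGGREGATION_METHOD such as 'V2' to its frames per thumbnail image."""
    if (len(method) != 2 or method[0] not in METHOD_FAMILIES
            or method[1] not in FRAMES_PER_THUMBNAIL):
        raise ValueError(f"unknown AGGREGATION_METHOD: {method!r}")
    return FRAMES_PER_THUMBNAIL[method[1]]


def build_mcqa_command(dataset, method="V2", num_frames=50,
                       num_sampled_tokens=2880, prompt_version="v4",
                       image_aspect_ratio="resize"):
    """Build the run_qa_{DATASET_NAME}.sh call with the TS-LLaVA defaults."""
    if dataset not in MCQA_DATASETS:
        raise ValueError(f"unknown DATASET_NAME: {dataset!r}")
    per_thumbnail = frames_per_thumbnail(method)
    # The README asks for divisibility when using multiple thumbnails (Z family).
    if method[0] == "Z" and num_frames % per_thumbnail != 0:
        raise ValueError(
            f"NUM_FRAMES={num_frames} is not divisible by {per_thumbnail}"
        )
    return ["bash", f"run_qa_{dataset}.sh", method, str(num_frames),
            str(num_sampled_tokens), prompt_version, image_aspect_ratio]
```

The returned list can be passed to `subprocess.run(..., check=True)` from inside scripts/infer_videos, or joined with spaces to print the equivalent shell command.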
The default arguments AGGREGATION_METHOD, NUM_FRAMES, NUM_SAMPLED_TOKENS, PROMPT_VERSION and IMAGE_ASPECT_RATIO are the same as Multiple Choice VideoQA.
cd scripts/infer_videos
bash run_qa_mlvu_mcqa.sh V2 50 2880 v4 resize
Submit the resulting json file to the official evaluation server (https://github.com/JUNJIE99/MLVU) for evaluation
cd scripts/infer_videos
bash run_qa_mlvu_mcqa.sh V2 50 2880 v4 resize {INPUT_FORMAT}
The evaluation is automatically done after inference
- In the script, change video_dir, gt_file_qa and output_dir accordingly for different subtasks.
- The sixth argument INPUT_FORMAT refers to the input format of visual contents, which corresponds to the subtask of choice. It should be either video or image.
The default value for PROMPT_VERSION is v3. The rest are the same as Multiple Choice VideoQA.
cd scripts/infer_videos
bash run_qa_{DATASET_NAME}.sh V2 50 2880 v3 resize
- Same as Multiple Choice VideoQA. Replace DATASET_NAME with one of {msvd, msrvtt, anet, tgif}.
cd scripts/eval
bash eval_{DATASET_NAME}.sh V2 50 2880 v3 resize {API_KEY}
- Use your own OpenAI API key for API_KEY.
For VCGBench (Video ChatGPT), the inference and evaluation procedures are similar. Please refer to run_gen_qa_{TASK_TYPE}.sh and eval_gen_qa.sh
We extend our gratitude to the following awesome projects: LLaVA, FreeVA, IG-VLM and SF-LLaVA.
@article{qu2024tsllava,
title={TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models},
author={Tingyu Qu and Mingxiao Li and Tinne Tuytelaars and Marie-Francine Moens},
year={2024},
journal={arXiv preprint arXiv:2411.11066},
}