Recent advances in large-scale video-language models demonstrate remarkable capabilities in real-time planning and interaction with real-world environments, yet their training is constrained by high computational costs and limited annotated datasets. Traditional methods, such as video compression and sliding-window techniques, often compromise critical visual information or disrupt semantic flow. In addition, current predesigned QA benchmarks fail to adequately assess long video understanding because of inherent biases from static image features and the base LLM. To address these issues, we introduce VideoLLaMB, a framework that uses Memory Bridge Layers with recurrent memory tokens to encode entire video content without discarding vital information. We also propose the SceneTilling algorithm, which splits a video into semantic units to preserve semantic flow. Finally, we present the "Needle in a Video Haystack" benchmark to comprehensively evaluate long video understanding with needles of different modalities.
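To make the idea of splitting a video into semantic units concrete, here is a minimal sketch of a TextTiling-style segmentation over per-frame feature vectors. It is an illustration under stated assumptions, not the SceneTilling implementation shipped in this repository: the function names, the depth-score definition, and the `alpha` threshold parameter are all hypothetical, and the frame features are assumed to come from some visual encoder.

```python
import math
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def depth_scores(sims):
    """How far each adjacent-frame similarity dips below the highest
    similarity found on its left and on its right (a simplification of
    the TextTiling depth score)."""
    depths = []
    for i, s in enumerate(sims):
        left = max(sims[: i + 1])
        right = max(sims[i:])
        depths.append((left - s) + (right - s))
    return depths


def split_into_scenes(features, alpha=0.5):
    """Return (start, end) frame index ranges, end exclusive.

    `alpha` is a hypothetical knob: a boundary is placed wherever the
    depth score exceeds mean + alpha * standard deviation.
    """
    n = len(features)
    if n < 2:
        return [(0, n)] if n else []
    sims = [cosine(features[i], features[i + 1]) for i in range(n - 1)]
    depths = depth_scores(sims)
    threshold = mean(depths) + alpha * pstdev(depths)
    # depths[i] describes the gap between frame i and frame i + 1.
    boundaries = [i + 1 for i, d in enumerate(depths) if d > threshold]
    return list(zip([0] + boundaries, boundaries + [n]))
```

For example, six frames whose features switch from `[1, 0]` to `[0, 1]` halfway through are split into two segments, `(0, 3)` and `(3, 6)`, while a sequence of identical frames stays in a single segment.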
Table of Contents
- Install
- Quick Start with CLI
- Streaming Caption with CLI
- Demo
- Train
- Evaluate
- Model Zoo
- Citation
- Acknowledgement
- Clone this repository and navigate to the VideoLLaMB folder
git clone https://github.com/bigai-nlco/VideoLLaMB.git
cd VideoLLaMB
- Install Package
conda create -n videollamb python=3.10 -y
conda activate videollamb
pip install --upgrade pip
pip install -e .
conda install ffmpeg
- Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
pip install flash-attn --no-build-isolation --no-cache-dir
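After installation, a quick sanity check can confirm that the pieces installed above are visible from the active environment. The helper below is a small sketch, not part of the repository: `check_environment` is a hypothetical name, `llava` is the package name used by the `python -m llava.serve...` commands in this README, and `flash_attn` is the import name of the `flash-attn` package (only needed for training).

```python
import importlib.util
import shutil


def check_environment(modules=("llava", "flash_attn"), binaries=("ffmpeg",)):
    """Return the names of Python modules and executables that cannot be found."""
    # find_spec returns None for a top-level module that is not installed.
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    # shutil.which returns None when the executable is not on PATH.
    missing += [b for b in binaries if shutil.which(b) is None]
    return missing
```

Calling `check_environment()` inside the `videollamb` conda environment returns an empty list when everything is in place; otherwise it lists what still needs to be installed.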
To chat with the model from the command line, download the checkpoint, place it in the checkpoints directory, then run the following command:
python -m llava.serve.cli --model-path checkpoints/videollamb-llava-1.5-7b --video-file XXX.mp4
For streaming captioning from the command line, download the checkpoint, place it in the checkpoints directory, then run the following command:
python -m llava.serve.cli_streaming --model_path checkpoints/videollamb-llava-1.5-7b
To launch the Gradio demo, download the checkpoint, place it in the checkpoints directory, then run the following command:
python -m llava.serve.gradio_demo
- Prepare data
We combine the video instruction data from PLLaVA and the image instruction data from LLaVA for training. Please check DATA for details.
- Prepare model weights for initialization
Our model is initialized from LLaVA: download llava-v1.5-7b and put it in checkpoints/llava-v1.5-7b. For the visual encoders, we use LanguageBind: download LanguageBind_Image and LanguageBind_Video_merge, and put them in checkpoints/LanguageBind_Image and checkpoints/LanguageBind_Video_merge.
- Start Training
Training takes 23 hours for LLaVA-1.5-7B on 4 A800-80G GPUs.
bash scripts/finetune_video_image.slurm # bash
sbatch scripts/finetune_video_image.slurm # slurm cluster
We also provide a script to backpropagate the LLM loss to the bridge for each recurrent iteration.
bash scripts/finetune_video_image_loss.slurm # bash
sbatch scripts/finetune_video_image_loss.slurm # slurm cluster
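Because a training run is long, it can help to verify the checkpoint layout from the model-weights step before submitting a job. The following is a minimal sketch under the assumption that the three directories are laid out exactly as described in that step; `missing_checkpoints` is a hypothetical helper, not a script provided by the repository.

```python
from pathlib import Path

# Directory names taken from the model-weights step in this README.
REQUIRED_DIRS = ("llava-v1.5-7b", "LanguageBind_Image", "LanguageBind_Video_merge")


def missing_checkpoints(root="checkpoints"):
    """Return the required sub-directories of `root` that are absent or empty."""
    root = Path(root)
    missing = []
    for name in REQUIRED_DIRS:
        path = root / name
        # Treat an empty directory as missing: the download likely did not finish.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing
```

Running `missing_checkpoints()` from the repository root returns an empty list when all three weight directories are present and non-empty. The check only looks at directory names, so it does not validate the files inside them.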
- Prepare data
We provide evaluation pipelines for EgoSchema, NExTQA, EgoPlan, and MVBench. Please check DATA for details.
- Start Evaluating
a. Traditional Benchmark
bash scripts/eval/egoschema.sh # egoschema
bash scripts/eval/nextqa.sh # nextqa
bash scripts/eval/egoplan.sh # egoplan
bash scripts/eval/mvbench.sh # mvbench
b. MM-NIAVH
See our benchmark, Needle In A Video Haystack (NIAVH).
Model | Base Model | Training Data | Download Link |
---|---|---|---|
VideoLLaMB-7B | llava-v1.5-7b | magic_json, LLaVA | 🤗videollamb-llava-1.5-7b |
VideoLLaMB-7B-Mem (MM-NIAVH) | llava-v1.5-7b | magic_json, LLaVA | 🤗videollamb-mem-llava-1.5-7b |
Model:
Data:
Demo:
@misc{mm-niavh,
title={MLLM Pressure Test: Needle In A Video Haystack},
author={Wang, Yuxuan and Xie, Cihang and Liu, Yang and Zheng, Zilong},
publisher={github},
url={https://github.com/bigai-nlco/NeedleInAVideoHaystack},
year={2024}
}
@article{videollamb,
title={VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges},
author={Wang, Yuxuan and Xie, Cihang and Liu, Yang and Zheng, Zilong},
journal={arxiv},
year={2024}
}