💡 Some other multimodal-LLM projects from our team may interest you ✨.
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Zesen Cheng*, Sicong Leng*, Hang Zhang*, Yifei Xin*, Xin Li*, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Boqiang Zhang, Long Li, Xin Li, Deli Zhao, Wenqiao Zhang, Yueting Zhuang, Jianke Zhu, Lidong Bing
VCD: Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
Sicong Leng*, Hang Zhang*, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, Lidong Bing
The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio
Sicong Leng*, Yun Xing*, Zesen Cheng*, Yang Zhou, Hang Zhang, Xin Li, Deli Zhao, Shijian Lu, Chunyan Miao, Lidong Bing
Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss
Zesen Cheng*, Hang Zhang*, Kehan Li*, Sicong Leng, Zhiqiang Hu, Fei Wu, Deli Zhao, Xin Li, Lidong Bing
- [2025.01.21] Released models and inference code of VideoLLaMA 3.
VideoLLaMA 3 is a series of multimodal foundation models with frontier image and video understanding capabilities.
Basic Dependencies:
- Python >= 3.10
- PyTorch >= 2.4.0
- CUDA Version >= 11.8
- transformers >= 4.46.3
Install required packages:
[Inference-only]
```bash
pip install torch==2.4.0 torchvision==0.19.0 --extra-index-url https://download.pytorch.org/whl/cu118
pip install flash-attn --no-build-isolation
pip install transformers==4.46.3 accelerate==1.0.1
pip install decord ffmpeg-python imageio opencv-python
```
[Training]
```bash
git clone https://github.com/DAMO-NLP-SG/VideoLLaMA3
cd VideoLLaMA3
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
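After installation, a quick sanity check such as the sketch below (an optional snippet of ours, not part of the official setup) can confirm that the pinned versions are picked up and that CUDA and flash-attn are available:

```python
# Optional sanity check (not part of the official setup): confirm that the
# core dependencies are importable and that CUDA is visible to PyTorch.
import torch
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)

try:
    import flash_attn  # required for attn_implementation="flash_attention_2"
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not found; use the default attention implementation instead.")
```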
| Model | Base Model | HF Link |
| --- | --- | --- |
| VideoLLaMA3-7B | Qwen2.5-7B | DAMO-NLP-SG/VideoLLaMA3-7B |
| VideoLLaMA3-2B | Qwen2.5-1.5B | DAMO-NLP-SG/VideoLLaMA3-2B |
| VideoLLaMA3-7B-Image | Qwen2.5-7B | DAMO-NLP-SG/VideoLLaMA3-7B-Image |
| VideoLLaMA3-2B-Image | Qwen2.5-1.5B | DAMO-NLP-SG/VideoLLaMA3-2B-Image |
We also release the fine-tuned vision encoder of VideoLLaMA3-7B for broader use:
| Model | Base Model | HF Link |
| --- | --- | --- |
| VideoLLaMA3-7B Vision Encoder | siglip-so400m-patch14-384 | DAMO-NLP-SG/VL3-SigLIP-NaViT |
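As a rough illustration, the encoder can presumably be loaded with the standard `transformers` remote-code pattern, as in the sketch below; the exact preprocessing and forward interface is documented on the Hugging Face model card, so treat the class choices here as assumptions.

```python
# Minimal sketch (assumptions: the encoder is exposed through AutoModel /
# AutoImageProcessor with custom remote code). Check the Hugging Face model
# card for the exact interface before relying on this.
import torch
from transformers import AutoModel, AutoImageProcessor

encoder_path = "DAMO-NLP-SG/VL3-SigLIP-NaViT"
image_processor = AutoImageProcessor.from_pretrained(encoder_path, trust_remote_code=True)
vision_encoder = AutoModel.from_pretrained(
    encoder_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval()
```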
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
device = "cuda:0"
model_path = "DAMO-NLP-SG/VideoLLaMA3-7B"
model = AutoModelForCausalLM.from_pretrained(
model_path,
trust_remote_code=True,
device_map={"": device},
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
conversation = [
{"role": "system", "content": "You are a helpful assistant."},
{
"role": "user",
"content": [
{"type": "video", "video": {"video_path": "./assets/cat_and_chicken.mp4", "fps": 1, "max_frames": 128}},
{"type": "text", "text": "What is the cat doing?"},
]
},
]
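# Preprocess the conversation: tokenize the text and sample/encode the video frames.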
inputs = processor(
conversation=conversation,
add_system_prompt=True,
add_generation_prompt=True,
return_tensors="pt"
)
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
if "pixel_values" in inputs:
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
output_ids = model.generate(**inputs, max_new_tokens=128)
response = processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(response)
```
For more cases, please refer to the examples. Check out the inference notebooks, which demonstrate how to use VideoLLaMA3 in various applications such as single-image understanding, multi-image understanding, visual referring and grounding, and video understanding. As a quick illustration of the image case, see the sketch below.
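A single-image query can likely be issued by reusing the model and processor from the quick-start snippet above and swapping the video message for an image one; the `{"type": "image", ...}` message layout and the file path below are our assumptions, so follow the official examples for the exact schema.

```python
# Hypothetical single-image query, adapted from the video quick-start above.
# The {"type": "image", "image": {"image_path": ...}} layout and the example
# path are assumptions; see the official examples/notebooks for the exact schema.
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": {"image_path": "./assets/example_image.jpg"}},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    },
]

inputs = processor(conversation=conversation, add_system_prompt=True, add_generation_prompt=True, return_tensors="pt")
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```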
It is highly recommended to try our online demo first. Alternatively, you can launch a Gradio app locally:
```bash
python inference/launch_gradio_demo.py --model-path DAMO-NLP-SG/VideoLLaMA3-7B
```
```
options:
  --model-path MODEL_PATH, --model_path MODEL_PATH
                        Path or Hugging Face model ID of the checkpoint to load.
  --server-port SERVER_PORT, --server_port SERVER_PORT
                        Optional. Port of the model server.
  --interface-port INTERFACE_PORT, --interface_port INTERFACE_PORT
                        Optional. Port of the Gradio interface.
  --nproc NPROC         Optional. Number of model processes.
```
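For example, an invocation combining the options above (the port and process-count values here are arbitrary illustration choices):

```bash
# Illustrative invocation; the specific ports and process count are arbitrary.
python inference/launch_gradio_demo.py \
    --model-path DAMO-NLP-SG/VideoLLaMA3-2B \
    --server-port 12345 \
    --interface-port 7860 \
    --nproc 2
```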
Coming soon...
If you find VideoLLaMA useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{damonlpsg2025videollama3,
  title   = {VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding},
  author  = {Zhang, Boqiang and Li, Kehan and Cheng, Zesen and Hu, Zhiqiang and Yuan, Yuqian and Chen, Guanzheng and Leng, Sicong and Jiang, Yuming and Zhang, Hang and Li, Xin and Jin, Peng and Zhang, Wenqi and Wang, Fan and Bing, Lidong and Zhao, Deli},
  journal = {arXiv preprint arXiv:2501.13106},
  year    = {2025},
  url     = {https://arxiv.org/abs/2501.13106}
}

@article{damonlpsg2024videollama2,
  title   = {VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
  author  = {Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
  journal = {arXiv preprint arXiv:2406.07476},
  year    = {2024},
  url     = {https://arxiv.org/abs/2406.07476}
}

@article{damonlpsg2023videollama,
  title   = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  author  = {Zhang, Hang and Li, Xin and Bing, Lidong},
  journal = {arXiv preprint arXiv:2306.02858},
  year    = {2023},
  url     = {https://arxiv.org/abs/2306.02858}
}
```
Our VideoLLaMA3 is built on top of SigLIP and Qwen2.5. We also learned a lot from the implementations of LLaVA-OneVision, InternVL2, and Qwen2-VL. In addition, VideoLLaMA3 benefits from many other open-source efforts; we sincerely appreciate them and compile a list in ACKNOWLEDGEMENT.md to express our gratitude. If your work is used in VideoLLaMA3 but not mentioned in either this repo or the technical report, feel free to let us know ❤️.
This project is released under the Apache 2.0 license as found in the LICENSE file. The service is a research preview intended for non-commercial use ONLY, subject to the model license of Qwen, the Terms of Use of data generated by OpenAI and Gemini, and the Privacy Practices of ShareGPT. Please get in touch with us if you find any potential violations.