VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding


If our project helps you, please give us a star ⭐ on GitHub to support us. 🙏🙏


💡 Some other multimodal-LLM projects from our team may interest you ✨.

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Zesen Cheng*, Sicong Leng*, Hang Zhang*, Yifei Xin*, Xin Li*, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Boqiang Zhang, Long Li, Xin Li, Deli Zhao, Wenqiao Zhang, Yueting Zhuang, Jianke Zhu, Lidong Bing

VCD: Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
Sicong Leng*, Hang Zhang*, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, Lidong Bing

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio
Sicong Leng*, Yun Xing*, Zesen Cheng*, Yang Zhou, Hang Zhang, Xin Li, Deli Zhao, Shijian Lu, Chunyan Miao, Lidong Bing

Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss
Zesen Cheng*, Hang Zhang*, Kehan Li*, Sicong Leng, Zhiqiang Hu, Fei Wu, Deli Zhao, Xin Li, Lidong Bing

📰 News

  • [2025.01.21] Released the models and inference code of VideoLLaMA 3.

🌟 Introduction

VideoLLaMA 3 is a series of multimodal foundation models with frontier image and video understanding capabilities.

💡 Click here to show detailed performance on video benchmarks
💡 Click here to show detailed performance on image benchmarks

🛠️ Requirements and Installation

Basic Dependencies:

  • Python >= 3.10
  • PyTorch >= 2.4.0
  • CUDA Version >= 11.8
  • transformers >= 4.46.3

Install required packages:

[Inference-only]

pip install torch==2.4.0 torchvision==0.19.0 --extra-index-url https://download.pytorch.org/whl/cu118

pip install flash-attn --no-build-isolation
pip install transformers==4.46.3 accelerate==1.0.1
pip install decord ffmpeg-python imageio opencv-python
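
After installing, a quick sanity check can confirm that the pinned versions and the CUDA build are picked up. This snippet is not part of the official instructions, just a minimal sketch:

# Optional sanity check of the inference environment (not from the official docs).
import torch
import transformers

print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("transformers:", transformers.__version__)  # expected >= 4.46.3
print("CUDA available:", torch.cuda.is_available())

try:
    import flash_attn  # confirms flash-attn compiled and imports correctly
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn missing; fall back to the default attention implementation.")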

[Training]

git clone https://github.com/DAMO-NLP-SG/VideoLLaMA3
cd VideoLLaMA3
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

🌎 Model Zoo

| Model | Base Model | HF Link |
|:--|:--|:--|
| VideoLLaMA3-7B | Qwen2.5-7B | DAMO-NLP-SG/VideoLLaMA3-7B |
| VideoLLaMA3-2B | Qwen2.5-1.5B | DAMO-NLP-SG/VideoLLaMA3-2B |
| VideoLLaMA3-7B-Image | Qwen2.5-7B | DAMO-NLP-SG/VideoLLaMA3-7B-Image |
| VideoLLaMA3-2B-Image | Qwen2.5-1.5B | DAMO-NLP-SG/VideoLLaMA3-2B-Image |

We also release the fine-tuned vision encoder of VideoLLaMA3-7B for broader use:

| Model | Base Model | HF Link |
|:--|:--|:--|
| VideoLLaMA3-7B Vision Encoder | siglip-so400m-patch14-384 | DAMO-NLP-SG/VL3-SigLIP-NaViT |
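
The encoder ships with custom code on the Hub, so it can presumably be loaded on its own via the same remote-code path used for the full models. The snippet below is only a rough sketch; the AutoModel entry point is an assumption, and the exact preprocessing and feature-extraction API should be taken from the model card:

# Hedged sketch: load the released vision encoder standalone.
# The AutoModel + trust_remote_code path is an assumption modeled on how the
# full VideoLLaMA3 checkpoints are loaded below; see the model card on the Hub
# for the authoritative preprocessing and usage.
import torch
from transformers import AutoModel

encoder = AutoModel.from_pretrained(
    "DAMO-NLP-SG/VL3-SigLIP-NaViT",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to("cuda:0").eval()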

🤖 Inference

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda:0"
model_path = "DAMO-NLP-SG/VideoLLaMA3-7B"

# Load the model in bfloat16 with FlashAttention-2 enabled; trust_remote_code
# pulls the custom VideoLLaMA3 modeling/processing code from the checkpoint repo.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map={"": device},
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# A chat-style conversation; the video is sampled at 1 fps, up to 128 frames.
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "video", "video": {"video_path": "./assets/cat_and_chicken.mp4", "fps": 1, "max_frames": 128}},
            {"type": "text", "text": "What is the cat doing?"},
        ]
    },
]

inputs = processor(
    conversation=conversation,
    add_system_prompt=True,
    add_generation_prompt=True,
    return_tensors="pt"
)
# Move tensors to the target device and cast visual inputs to bfloat16 to match the model.
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
output_ids = model.generate(**inputs, max_new_tokens=128)
response = processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(response)

For more cases, please refer to examples.
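
As one such case, a single-image query can reuse the loaded model and processor from the snippet above. Treat the following as a sketch only: the image content schema and the image path are assumptions mirroring the video entry, so consult the official examples for the exact format.

# Hedged sketch: single-image inference, reusing `model`, `processor`, and `device`.
# The {"type": "image", "image": {"image_path": ...}} schema and the placeholder
# path are assumptions mirroring the video entry; see the official examples.
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": {"image_path": "./assets/example_image.jpg"}},
            {"type": "text", "text": "Describe this image in detail."},
        ]
    },
]

inputs = processor(
    conversation=conversation,
    add_system_prompt=True,
    add_generation_prompt=True,
    return_tensors="pt"
)
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip())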

CookBook

Check out the inference notebooks that demonstrate how to use VideoLLaMA3 in various applications, such as single-image understanding, multi-image understanding, visual referring and grounding, and video understanding.

🤗 Demo

It is highly recommended to try our online demo first.

Otherwise, you can launch a Gradio app locally:

python inference/launch_gradio_demo.py --model-path DAMO-NLP-SG/VideoLLaMA3-7B

options:
  --model-path MODEL_PATH, --model_path MODEL_PATH
  	Path to or Hugging Face model ID of the VideoLLaMA3 checkpoint.
  --server-port SERVER_PORT, --server_port SERVER_PORT
  	Optional. Port of the model server.
  --interface-port INTERFACE_PORT, --interface_port INTERFACE_PORT
  	Optional. Port of the gradio interface.
  --nproc NPROC
  	Optional. Number of model processes.
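
For example, to run a single model process and pin both ports explicitly (the port numbers below are arbitrary placeholders):

python inference/launch_gradio_demo.py --model-path DAMO-NLP-SG/VideoLLaMA3-7B --server-port 10000 --interface-port 7860 --nproc 1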

🏗️ Training & Evaluation

Coming soon...

📑 Citation

If you find VideoLLaMA useful for your research and applications, please cite it using the following BibTeX:

@article{damonlpsg2025videollama3,
  title={VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding},
  author={Zhang, Boqiang and Li, Kehan and Cheng, Zesen and Hu, Zhiqiang and Yuan, Yuqian and Chen, Guanzheng and Leng, Sicong and Jiang, Yuming and Zhang, Hang and Li, Xin and Jin, Peng and Zhang, Wenqi and Wang, Fan and Bing, Lidong and Zhao, Deli},
  journal={arXiv preprint arXiv:2501.13106},
  year={2025},
  url = {https://arxiv.org/abs/2501.13106}
}

@article{damonlpsg2024videollama2,
  title={VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
  author={Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
  journal={arXiv preprint arXiv:2406.07476},
  year={2024},
  url = {https://arxiv.org/abs/2406.07476}
}

@article{damonlpsg2023videollama,
  title = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  author = {Zhang, Hang and Li, Xin and Bing, Lidong},
  journal = {arXiv preprint arXiv:2306.02858},
  year = {2023},
  url = {https://arxiv.org/abs/2306.02858}
}

👍 Acknowledgement

Our VideoLLaMA3 is built on top of SigLIP and Qwen2.5. We also learned a lot from the implementations of LLaVA-OneVision, InternVL2, and Qwen2-VL. In addition, VideoLLaMA3 benefits from many other open-source efforts; we sincerely appreciate them and compile a list in ACKNOWLEDGEMENT.md to express our gratitude. If your work is used in VideoLLaMA3 but not mentioned in either this repo or the technical report, feel free to let us know ❤️.

🔒 License

This project is released under the Apache 2.0 license, as found in the LICENSE file. The service is a research preview intended for non-commercial use ONLY, subject to the model license of Qwen, the terms of use of data generated by OpenAI and Gemini, and the privacy practices of ShareGPT. Please get in touch with us if you find any potential violations.
