VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding


If our project helps you, please give us a star ⭐ on GitHub to support us. 🙏🙏


💡 Some other multimodal-LLM projects from our team may interest you ✨.

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Zesen Cheng*, Sicong Leng*, Hang Zhang*, Yifei Xin*, Xin Li*, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Boqiang Zhang, Long Li, Xin Li, Deli Zhao, Wenqiao Zhang, Yueting Zhuang, Jianke Zhu, Lidong Bing

VCD: Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
Sicong Leng*, Hang Zhang*, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, Lidong Bing

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio
Sicong Leng*, Yun Xing*, Zesen Cheng*, Yang Zhou, Hang Zhang, Xin Li, Deli Zhao, Shijian Lu, Chunyan Miao, Lidong Bing

Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss
Zesen Cheng*, Hang Zhang*, Kehan Li*, Sicong Leng, Zhiqiang Hu, Fei Wu, Deli Zhao, Xin Li, Lidong Bing

📰 News

  • [2025.01.21] Released the models and inference code of VideoLLaMA 3.

🌟 Introduction

VideoLLaMA 3 is a series of multimodal foundation models with frontier image and video understanding capabilities.

💡 Click here to show detailed performance on video benchmarks
💡 Click here to show detailed performance on image benchmarks

🛠️ Requirements and Installation

Basic Dependencies:

  • Python >= 3.10
  • PyTorch >= 2.4.0
  • CUDA Version >= 11.8
  • transformers >= 4.46.3

Install required packages:

[Inference-only]

pip install torch==2.4.0 torchvision==0.19.0 --extra-index-url https://download.pytorch.org/whl/cu118

pip install flash-attn --no-build-isolation
pip install transformers==4.46.3 accelerate==1.0.1
pip install decord ffmpeg-python imageio opencv-python
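
After installing, a quick sanity check can confirm that the pinned versions and the CUDA build are picked up. This snippet is not part of the official instructions, just a minimal sketch:

# Optional sanity check of the inference environment (not from the official docs).
import torch
import transformers

print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("transformers:", transformers.__version__)  # expected >= 4.46.3
print("CUDA available:", torch.cuda.is_available())

try:
    import flash_attn  # confirms flash-attn compiled and imports correctly
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn missing; fall back to the default attention implementation.")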

[Training]

git clone https://github.com/DAMO-NLP-SG/VideoLLaMA3
cd VideoLLaMA3
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

🌎 Model Zoo

| Model | Base Model | HF Link |
|:--|:--|:--|
| VideoLLaMA3-7B | Qwen2.5-7B | DAMO-NLP-SG/VideoLLaMA3-7B |
| VideoLLaMA3-2B | Qwen2.5-1.5B | DAMO-NLP-SG/VideoLLaMA3-2B |
| VideoLLaMA3-7B-Image | Qwen2.5-7B | DAMO-NLP-SG/VideoLLaMA3-7B-Image |
| VideoLLaMA3-2B-Image | Qwen2.5-1.5B | DAMO-NLP-SG/VideoLLaMA3-2B-Image |

We also release the fine-tuned vision encoder of VideoLLaMA3-7B for broader use:

| Model | Base Model | HF Link |
|:--|:--|:--|
| VideoLLaMA3-7B Vision Encoder | siglip-so400m-patch14-384 | DAMO-NLP-SG/VL3-SigLIP-NaViT |
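
The encoder ships with custom code on the Hub, so it can presumably be loaded on its own via the same remote-code path used for the full models. The snippet below is only a rough sketch; the AutoModel entry point is an assumption, and the exact preprocessing and feature-extraction API should be taken from the model card:

# Hedged sketch: load the released vision encoder standalone.
# The AutoModel + trust_remote_code path is an assumption modeled on how the
# full VideoLLaMA3 checkpoints are loaded below; see the model card on the Hub
# for the authoritative preprocessing and usage.
import torch
from transformers import AutoModel

encoder = AutoModel.from_pretrained(
    "DAMO-NLP-SG/VL3-SigLIP-NaViT",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to("cuda:0").eval()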

🤖 Inference

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda:0"
model_path = "DAMO-NLP-SG/VideoLLaMA3-7B"

# Load the model in bfloat16 with FlashAttention-2 enabled; trust_remote_code
# pulls the custom VideoLLaMA3 modeling/processing code from the checkpoint repo.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map={"": device},
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# A chat-style conversation; the video is sampled at 1 fps, up to 128 frames.
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "video", "video": {"video_path": "./assets/cat_and_chicken.mp4", "fps": 1, "max_frames": 128}},
            {"type": "text", "text": "What is the cat doing?"},
        ]
    },
]

inputs = processor(
    conversation=conversation,
    add_system_prompt=True,
    add_generation_prompt=True,
    return_tensors="pt"
)
# Move tensors to the target device and cast visual inputs to bfloat16 to match the model.
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
output_ids = model.generate(**inputs, max_new_tokens=128)
response = processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(response)

For more cases, please refer to examples.
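
As one such case, a single-image query can reuse the loaded model and processor from the snippet above. Treat the following as a sketch only: the image content schema and the image path are assumptions mirroring the video entry, so consult the official examples for the exact format.

# Hedged sketch: single-image inference, reusing `model`, `processor`, and `device`.
# The {"type": "image", "image": {"image_path": ...}} schema and the placeholder
# path are assumptions mirroring the video entry; see the official examples.
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": {"image_path": "./assets/example_image.jpg"}},
            {"type": "text", "text": "Describe this image in detail."},
        ]
    },
]

inputs = processor(
    conversation=conversation,
    add_system_prompt=True,
    add_generation_prompt=True,
    return_tensors="pt"
)
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip())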

CookBook

Check out the inference notebooks that demonstrate how to use VideoLLaMA3 in various applications, such as single-image understanding, multi-image understanding, visual referring and grounding, and video understanding.

🤗 Demo

It is highly recommended to try our online demo first.

Otherwise, you can launch a Gradio app locally:

python inference/launch_gradio_demo.py --model-path DAMO-NLP-SG/VideoLLaMA3-7B

options:
  --model-path MODEL_PATH, --model_path MODEL_PATH
  	Path to or Hugging Face model ID of the VideoLLaMA3 checkpoint.
  --server-port SERVER_PORT, --server_port SERVER_PORT
  	Optional. Port of the model server.
  --interface-port INTERFACE_PORT, --interface_port INTERFACE_PORT
  	Optional. Port of the gradio interface.
  --nproc NPROC
  	Optional. Number of model processes.
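
For example, to run a single model process and pin both ports explicitly (the port numbers below are arbitrary placeholders):

python inference/launch_gradio_demo.py --model-path DAMO-NLP-SG/VideoLLaMA3-7B --server-port 10000 --interface-port 7860 --nproc 1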

🏗️ Training & Evaluation

Coming soon...

📑 Citation

If you find VideoLLaMA useful for your research and applications, please cite it using the following BibTeX:

@article{damonlpsg2025videollama3,
  title={VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding},
  author={Zhang, Boqiang and Li, Kehan and Cheng, Zesen and Hu, Zhiqiang and Yuan, Yuqian and Chen, Guanzheng and Leng, Sicong and Jiang, Yuming and Zhang, Hang and Li, Xin and Jin, Peng and Zhang, Wenqi and Wang, Fan and Bing, Lidong and Zhao, Deli},
  journal={arXiv preprint arXiv:2501.13106},
  year={2025},
  url = {https://arxiv.org/abs/2501.13106}
}

@article{damonlpsg2024videollama2,
  title={VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
  author={Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
  journal={arXiv preprint arXiv:2406.07476},
  year={2024},
  url = {https://arxiv.org/abs/2406.07476}
}

@article{damonlpsg2023videollama,
  title = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  author = {Zhang, Hang and Li, Xin and Bing, Lidong},
  journal = {arXiv preprint arXiv:2306.02858},
  year = {2023},
  url = {https://arxiv.org/abs/2306.02858}
}

👍 Acknowledgement

Our VideoLLaMA3 is built on top of SigLIP and Qwen2.5. We also learned a lot from the implementations of LLaVA-OneVision, InternVL2, and Qwen2-VL. In addition, VideoLLaMA3 benefits from many other open-source efforts; we sincerely appreciate them and compile a list in ACKNOWLEDGEMENT.md to express our gratitude. If your work is used in VideoLLaMA3 but not mentioned in either this repo or the technical report, feel free to let us know ❤️.

🔒 License

This project is released under the Apache 2.0 license, as found in the LICENSE file. The service is a research preview intended for non-commercial use ONLY, subject to the model license of Qwen, the terms of use of data generated by OpenAI and Gemini, and the privacy practices of ShareGPT. Please get in touch with us if you find any potential violations.
