[Project Page] [Paper]
This repository contains the official code for FlexAttention for Efficient High-Resolution Vision-Language Models.
- Jan 2025: Training documentation released.
- July 2024: Open-source codebase and evaluation.
- July 2024: Accepted by ECCV'2024!
conda create -n flexattention python=3.9
conda activate flexattention
pip install -e .
pip install -e ".[train]"
pip install -e ./transformers
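Optionally, you can sanity-check the environment afterwards; a minimal sketch (it only assumes the commands above installed torch and the bundled transformers fork):

```python
# Quick sanity check that the environment resolves the intended packages.
import torch
import transformers

print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
print(f"Visible GPUs: {torch.cuda.device_count()}")
# The editable install of ./transformers should be the copy resolved here.
print(f"transformers {transformers.__version__} loaded from {transformers.__file__}")
```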
You can download our 7B model checkpoint from Hugging Face and put it into the checkpoints folder.
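As an illustration, the checkpoint could be fetched with huggingface_hub; this is only a sketch, and the repo id below is a placeholder for the actual checkpoint repo linked above:

```python
# Hypothetical download helper; replace repo_id with the actual checkpoint repo.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<hf-user>/llava-v1.5-7b-flexattn",  # placeholder, not the real repo id
    local_dir="checkpoints/llava-v1.5-7b-flexattn",
)
```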
- Follow this instruction to download the TextVQA evaluation images and annotations, and extract them to datasets/eval/textvqa.
- Run the multi-GPU inference:
torchrun --nproc_per_node 3 scripts/evaluation/eval_textvqa.py --dist --model-path checkpoints/llava-v1.5-7b-flexattn --id llava-v1.5-7b-flexattn
It will generate a file similar to answer_textvqa_llava-v1.5-7b-flexattn_xxx.jsonl in the root folder.
- Run the evaluation script:
bash scripts/evaluation/get_textvqa_score.sh ANSWER_FILE
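If you want to peek at the generated answers before scoring, here is a minimal sketch; it only assumes each line of the .jsonl is a JSON object, and the exact field names depend on the eval script:

```python
# Preview a generated answer file before scoring.
import json
import sys

answer_file = sys.argv[1]  # e.g. answer_textvqa_llava-v1.5-7b-flexattn_xxx.jsonl
with open(answer_file) as f:
    rows = [json.loads(line) for line in f]

print(f"{len(rows)} answers in {answer_file}")
for row in rows[:3]:
    print(row)  # field names depend on the eval script's output format
```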
- Download the dataset from Hugging Face:
git lfs install
git clone https://huggingface.co/datasets/craigwu/vstar_bench
- Run the multi-GPU inference:
# Attribute
torchrun --nproc_per_node 3 scripts/evaluation/eval_vbench.py --dist --model-path checkpoints/llava-v1.5-7b-flexattn --id llava-v1.5-7b-flexattn --subset direct_attributes
# Spatial
torchrun --nproc_per_node 3 scripts/evaluation/eval_vbench.py --dist --model-path checkpoints/llava-v1.5-7b-flexattn --id llava-v1.5-7b-flexattn --subset relative_position
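To run both subsets back to back, the two commands above can also be wrapped in a small script, for example:

```python
# Run V* Bench inference for both subsets sequentially.
import subprocess

for subset in ["direct_attributes", "relative_position"]:
    subprocess.run(
        [
            "torchrun", "--nproc_per_node", "3",
            "scripts/evaluation/eval_vbench.py", "--dist",
            "--model-path", "checkpoints/llava-v1.5-7b-flexattn",
            "--id", "llava-v1.5-7b-flexattn",
            "--subset", subset,
        ],
        check=True,
    )
```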
- Download the dataset from here, and extract it to datasets/eval/.
- Run the multi-GPU inference:
torchrun --nproc_per_node 3 scripts/evaluation/eval_magnifier.py --dist --model-path checkpoints/llava-v1.5-7b-flexattn --id llava-v1.5-7b-flexattn
First, follow LLaVA's instructions to prepare the images and annotations. The overall folder structure should look like this:
playground
├── llava_v1_5_mix665k
├── llava_v1_5_mix665k.json
├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2
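Before cleaning and training, you may want to verify that the expected folders are in place; a minimal sketch based on the layout above:

```python
# Check that the folders from the layout above exist before cleaning/training.
from pathlib import Path

root = Path("playground")
expected = [
    "llava_v1_5_mix665k",
    "llava_v1_5_mix665k.json",
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
]
for rel in expected:
    status = "ok" if (root / rel).exists() else "MISSING"
    print(f"{status:>7}  {root / rel}")
```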
Then, run the data cleaning script:
python tools/prepare_data.py
In this script, we perform the following actions:
- Insert a placeholder <image> tag for samples containing only text.
- Correct any incorrect image file extensions found in the original data.
- Remove samples that use non-existent images.
You can directly download the prepared file here.
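For reference, here is an illustrative sketch of the kind of cleaning described above; it assumes the LLaVA mix665k format (a list of samples with optional "image" and "conversations" fields) and is not the actual logic of tools/prepare_data.py:

```python
# Illustrative sketch only; see tools/prepare_data.py for the actual implementation.
import json
import os

DATA_ROOT = "playground"  # assumed image root, matching the layout above

with open(os.path.join(DATA_ROOT, "llava_v1_5_mix665k.json")) as f:
    samples = json.load(f)

cleaned = []
for sample in samples:
    image = sample.get("image")
    if image is None:
        # Text-only sample: insert a placeholder <image> tag (assumed convention).
        first_turn = sample["conversations"][0]
        if "<image>" not in first_turn["value"]:
            first_turn["value"] = "<image>\n" + first_turn["value"]
        cleaned.append(sample)
        continue
    path = os.path.join(DATA_ROOT, image)
    if not os.path.exists(path):
        # Try to correct a wrong file extension before giving up on the sample.
        stem, _ = os.path.splitext(image)
        for ext in (".jpg", ".jpeg", ".png", ".gif"):
            if os.path.exists(os.path.join(DATA_ROOT, stem + ext)):
                sample["image"] = stem + ext
                path = os.path.join(DATA_ROOT, stem + ext)
                break
    if os.path.exists(path):
        cleaned.append(sample)  # samples with missing images are dropped

# Hypothetical output path; the training script may expect a different name.
with open(os.path.join(DATA_ROOT, "llava_v1_5_mix665k_cleaned.json"), "w") as f:
    json.dump(cleaned, f)
```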
Finally, run the training script:
bash scripts/train/llava-v1.5-7b-flexattn.sh
LLaVA: the codebase that our project is built on. Thanks for their amazing code and model.
If our work is useful or relevant to your research, please recognize our contributions by citing our paper:
@inproceedings{li2025flexattention,
title={Flexattention for efficient high-resolution vision-language models},
author={Li, Junyan and Chen, Delin and Cai, Tianle and Chen, Peihao and Hong, Yining and Chen, Zhenfang and Shen, Yikang and Gan, Chuang},
booktitle={European Conference on Computer Vision},
pages={286--302},
year={2025},
organization={Springer}
}