Model Zoo

We recommend using the largest clip-flant5-xxl (11B) for optimal performance.

CLIP-FlanT5 (Stage-2 VQA)

Version	Size	Checkpoint	Winoground	EqBen-Mini	DrawBench	EditBench	COCO-T2I	TIFA160	Pick-a-Pic	GenAI-Bench-Image	GenAI-Bench-Video
CLIP-FlanT5-XXL	11B	zhiqiulin/clip-flant5-xxl	46.0	47.9	85.3	77.0	85.0	71.2	84.0	64.1	63.2
CLIP-FlanT5-XL	3B	zhiqiulin/clip-flant5-xl	34.8	39.3	82.8	74.5	80.7	68.8	84.0	61.8	60.2

Projector weights (Stage-1 Captioning)

These are projector weights we have pretrained.

NOTE: When you use our pretrained projectors for training on VQA data, it is very important to use the same base LLM (FlanT5) and vision encoder as the one we used for pretraining the projector. Otherwise, the performance will be very poor.

Base LLM	Vision Encoder	Projection	Pretrain Data	Pretraining schedule	Download
FlanT5-11B	CLIP-L-336px	MLP-2x	LCS-558K	1e	projector
FlanT5-3B	CLIP-L-336px	MLP-2x	LCS-558K	1e	projector

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MODEL_ZOO.md

MODEL_ZOO.md

Model Zoo

CLIP-FlanT5 (Stage-2 VQA)

Projector weights (Stage-1 Captioning)

Files

MODEL_ZOO.md

Latest commit

History

MODEL_ZOO.md

File metadata and controls

Model Zoo

CLIP-FlanT5 (Stage-2 VQA)

Projector weights (Stage-1 Captioning)