license: cc-by-4.0
task_categories:
- text-to-image
- zero-shot-classification
language:
- en
size_categories:
- 10M<n<100M
Raw/Long/Short Caption | Hugging Face Dataset |
---|---|
CC3M+YFCC15M+CC12M | Link |
- Homepage: DreamLIP homepage
- Repository: DreamLIP repository
- Paper: DreamLIP: Language-Image Pre-training with Long Captions
DreamLIP-Long-Captions is a dataset of ~30M image annotations in the form of detailed long captions. In contrast with the curated style of other synthetic image caption annotations, DreamLIP-30M uses pre-trained multimodal large language models (MLLMs) to obtain detailed descriptions with an average length of 247. More precisely, the detailed descriptions are generated by asking ShareGPT4V/InstructBLIP/LLaVA-1.5 the question "Describe the image in detail". We also provide short captions generated with the prompt "Describe the image in one sentence". The wording of the long-caption question has little impact on the diversity of the answers, so we can obtain comprehensive captions for each image.
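The two-prompt captioning scheme described above can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: `ask_model` is a hypothetical stand-in for a call to ShareGPT4V, InstructBLIP, or LLaVA-1.5, the record field names (`url`, `long_caption`, `short_caption`) are assumptions rather than the dataset's actual schema, and `mean_caption_length` counts whitespace-separated words, which may not be the unit behind the reported average of 247.

```python
from typing import Callable, Dict, Iterable, List

# Prompts quoted from the dataset description.
LONG_PROMPT = "Describe the image in detail"
SHORT_PROMPT = "Describe the image in one sentence"


def build_caption_records(
    image_urls: Iterable[str],
    ask_model: Callable[[str, str], str],
) -> List[Dict[str, str]]:
    """Query a captioner once per prompt and collect one record per image.

    `ask_model(image_url, prompt)` is a hypothetical placeholder for an MLLM
    call; it is not a real library API.
    """
    records = []
    for url in image_urls:
        records.append(
            {
                "url": url,  # field names are illustrative assumptions
                "long_caption": ask_model(url, LONG_PROMPT),
                "short_caption": ask_model(url, SHORT_PROMPT),
            }
        )
    return records


def mean_caption_length(
    records: List[Dict[str, str]], field: str = "long_caption"
) -> float:
    """Average number of whitespace-separated words in a caption field.

    Returns 0.0 for an empty list. Word counting is an assumption; the
    dataset card does not state the unit of its reported average length.
    """
    lengths = [len(record[field].split()) for record in records]
    return sum(lengths) / len(lengths) if lengths else 0.0


def dummy_model(image_url: str, prompt: str) -> str:
    """Toy captioner used only to make the sketch runnable."""
    if prompt == LONG_PROMPT:
        return "a long and detailed description of the image"
    return "a short caption"


if __name__ == "__main__":
    demo = build_caption_records(["https://example.com/cat.jpg"], dummy_model)
    print(demo[0]["short_caption"])
    print(mean_caption_length(demo))
```

Passing the captioner in as a callable keeps the sketch independent of any particular model, so the same loop could wrap whichever MLLM is available.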
Kecheng Zheng, Yifei Zhang, Wei Wu, Fan Lu, Shuailei Ma, Xin Jin, Wei Chen and Yujun Shen.
We distribute the image URLs with long captions under a standard Creative Commons CC-BY-4.0 license. The individual images are under their own copyrights.
@inproceedings{DreamLIP,
title={DreamLIP: Language-Image Pre-training with Long Captions},
author={Zheng, Kecheng and Zhang, Yifei and Wu, Wei and Lu, Fan and Ma, Shuailei and Jin, Xin and Chen, Wei and Shen, Yujun},
booktitle={ECCV},
year={2024}
}
This dataset is based on CC3M; we thank its authors for the nice work. We also thank InstructBLIP, ShareGPT4V, and LLaVA for the pretrained models.