This repository is for the paper PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM (under review).
- We updated the arXiv paper with PosterGen as a supplement for PosterLLaVa.
- The preprocessed dataset (saliency maps and inpainted background images) is provided to save repetitive effort.
We are now moving beyond layout generation toward full graphic poster design. A text-to-poster pipeline, PosterGen, will soon be available online to support real-world applications of PosterLLaVA. Here are some preview examples:
Note that we only authorize the use of the proposed dataset for scientific research. It must NOT be used for commercial purposes without our authorization (please refer to the CC-BY-NC license).
[2024.03.26] Release online demo and pre-trained model on Hugging Face 🤗.
[2024.06.05] Release arXiv paper.
[2024.07.04] Release QB-Poster dataset. (The raw files contain the original poster images and JSON annotations; inpainting and saliency detection are needed to obtain the background images and saliency maps. Our paper used lama for inpainting and basenet for saliency detection.)
[2024.07.04] Release User-Constrained dataset. (It only includes the user-constraint annotation files; please refer to the CGL-dataset and PosterLayout dataset for the poster images and bounding box annotations.)
[2024.07.04] Release data pre-processing, training, and inference code.
[2024.08.29] An automatic text-to-poster system, PosterGen (with PosterLLaVA as the backbone), will soon be open-sourced to supplement this work.
[2024.11.26] Updated the arXiv paper to include the PosterGen section.
[2024.11.26] Uploaded the preprocessed saliency maps and inpainted background images of QB-Poster. Please refer to the inpainted background images used for training (with an extra 0.5x of randomly selected regions inpainted to avoid overfitting) and for evaluation, as well as the saliency maps extracted with basenet.
[Coming Soon] Release evaluation code.
[Coming Soon] Release notebook demo for PosterGen.
Run the following commands to set up the environment (a small sanity-check sketch follows).
pip install --upgrade pip # enable PEP 660 support
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
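If you want to confirm the installation before moving on, a minimal sanity check like the one below can help; it only assumes that the pip commands above installed PyTorch and (optionally) flash-attn:

```python
# sanity_check.py: verify that the environment installed above imports correctly
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

try:
    import flash_attn  # only present if the optional flash-attn install succeeded
    print("flash-attn import OK")
except ImportError:
    print("flash-attn is not installed")
```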
Download the dataset files and arrange them as follows (using QB-Poster as an example). Run the saliency detection method to obtain 'saliency_map' and the inpainting method to obtain 'inpainted_1x' and 'inpainted_1d5x' (used for inference and training, respectively; note that we randomly inpaint 0.5x more regions beyond the ground-truth bounding box areas to avoid overfitting). A small consistency-check sketch follows the tree below.
├── data
│   ├── prompt_template.txt
│   └── qbposter   <--
│       ├── get_prompt.py
│       └── raw
│           ├── original_poster
│           ├── saliency_map
│           ├── inpainted_1x
│           ├── inpainted_1d5x
│           └── annotation.json
...
└── README.md
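Before running the preprocessing script, it can be useful to confirm that the raw folders and the annotation file are in place. The sketch below follows the folder names in the tree above; the annotation schema is not documented here, so the script only checks that the JSON parses:

```python
# check_raw_data.py: quick look at the raw QB-Poster folders before preprocessing (illustrative sketch)
import json
from pathlib import Path

raw = Path("data/qbposter/raw")  # folder names taken from the tree above

for sub in ["original_poster", "saliency_map", "inpainted_1x", "inpainted_1d5x"]:
    files = list((raw / sub).glob("*"))
    print(f"{sub}: {len(files)} files")

# The annotation schema is not documented here; we only confirm the file parses.
with open(raw / "annotation.json") as f:
    annotations = json.load(f)
print("annotation entries:", len(annotations))
```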
Run the data preprocessing script.
python data/qbposter/get_prompt.py
You will end up with two processed JSON files (each containing instruction-answer pairs), arranged as follows; a quick inspection sketch follows the tree.
├── data
│   ├── prompt_template.txt
│   └── qbposter
│       ├── get_prompt.py
│       ├── qbposter_train_instruct.json   <--
│       └── qbposter_val_instruct.json   <--
...
└── README.md
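To sanity-check the preprocessing output, you can print the first instruction-answer pair. The 'conversations' key below follows the common LLaVA instruction-tuning format and is an assumption about this repository's exact schema:

```python
# inspect_instruct.py: print the first preprocessed sample (field names are an assumption)
import json

with open("data/qbposter/qbposter_train_instruct.json") as f:
    data = json.load(f)

print("training samples:", len(data))
sample = data[0]
# LLaVA-style instruction data typically stores a list of conversation turns under
# "conversations"; if this repository uses a different key, just dump the record.
for turn in sample.get("conversations", [sample]):
    print(turn)
```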
Please download the LLaVA-v1.5 pre-trained checkpoint and the CLIP vision encoder first and put them in the 'huggingface' subfolder (a hedged download sketch follows the tree below).
├── data
├── huggingface   <--
│   ├── llava-v1.5-7b
│   └── clip-vit-large-patch14-336
├── scripts
│   └── qbposter
│       ├── finetune.sh   <--
│       └── inference.sh
...
└── README.md
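One way to fetch both checkpoints is huggingface_hub's snapshot_download. The repo ids below are the standard Hugging Face ones for LLaVA-v1.5-7B and the CLIP encoder (an assumption, not taken from this repository), and the target folders match the tree above:

```python
# download_base_models.py: fetch the base checkpoints into ./huggingface
# (repo ids below are the standard Hugging Face ones and are an assumption here)
from huggingface_hub import snapshot_download

snapshot_download(repo_id="liuhaotian/llava-v1.5-7b",
                  local_dir="huggingface/llava-v1.5-7b")
snapshot_download(repo_id="openai/clip-vit-large-patch14-336",
                  local_dir="huggingface/clip-vit-large-patch14-336")
```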
Then run the following script.
bash scripts/qbposter/finetune.sh
Please download the pre-trained PosterLLaVa_v0 checkpoint, which is initialized from the LLaVA-v1.5 checkpoint and fine-tuned on the following combined datasets:
- 7k banner layouts from the Ad Banner dataset.
- 60k commercial poster layouts from the CGL-dataset and PosterLayout with text constraints.
- 4k social media poster layouts from the QB-Poster dataset.
Put it in the 'pretrained_model' subfolder (a hedged download sketch follows the tree below).
├── data
├── huggingface
├── pretrained_model   <--
│   └── posterllava_v0
├── scripts
│   └── qbposter
│       ├── finetune.sh
│       └── inference.sh   <--
...
└── README.md
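If the PosterLLaVa_v0 checkpoint is hosted on Hugging Face, it can be fetched the same way; the repo id below is a placeholder, so substitute the link given in the release notes above:

```python
# download_posterllava.py: fetch the fine-tuned checkpoint
# (repo id is a placeholder; use the Hugging Face link from the release notes above)
from huggingface_hub import snapshot_download

snapshot_download(repo_id="<posterllava_v0-hf-repo-id>",
                  local_dir="pretrained_model/posterllava_v0")
```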
Then run the following script to generate layouts in JSON format.
bash scripts/qbposter/inference.sh
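The script writes the generated layout as JSON. As a quick visual check you can draw the predicted boxes on an inpainted background; the sketch below assumes each element carries a 'label' and a pixel-space 'box' of [x1, y1, x2, y2], which may differ from the actual output schema:

```python
# visualize_layout.py: draw generated boxes on a background image (output schema is an assumption)
import json
from PIL import Image, ImageDraw

layout = json.load(open("output/example_layout.json"))  # illustrative output path
canvas = Image.open("data/qbposter/raw/inpainted_1x/example.png").convert("RGB")  # illustrative image
draw = ImageDraw.Draw(canvas)

for element in layout:
    x1, y1, x2, y2 = element["box"]  # assumed pixel-space [x1, y1, x2, y2]
    draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
    draw.text((x1, max(y1 - 12, 0)), element.get("label", ""), fill="red")

canvas.save("layout_preview.png")
```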
Coming Soon...
If you find this project/paper useful, please give us a star/citation.
@misc{yang2024posterllava,
      title={PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM},
      author={Tao Yang and Yingmin Luo and Zhongang Qi and Yang Wu and Ying Shan and Chang Wen Chen},
      year={2024},
      eprint={2406.02884},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2406.02884},
}