[CVPR 2023] ConZIC: Controllable Zero-shot Image Captioning by Sampling-Based Polishing
Zequn Zeng, Hao Zhang, Zhengjue Wang, Ruiying Lu, Dongsheng Wang, Bo Chen
- [2023/4] Added demos on Hugging Face Space and Colab!
- [2023/3] ConZIC is publicly released!
Please download CLIP and BERT from the Hugging Face Hub.
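If your environment has network access, the default checkpoints can also be fetched and cached automatically on first use. A minimal sketch using the transformers library (the model IDs below match the defaults passed to demo.py):

```python
# Minimal sketch: fetch and cache the default backbones from the Hugging Face Hub.
from transformers import BertForMaskedLM, BertTokenizer, CLIPModel, CLIPProcessor

lm = BertForMaskedLM.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
```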
The SketchyCOCOcaption benchmark used in our work is available here.
Environment setup:
pip install -r requirements.txt
ConZIC supports arbitrary generation orders via the --order argument. You can increase alpha for more fluency and beta for more image content; notably, there is a trade-off between fluency and image-matching degree (see the sketch after the examples below for how the two weights interact).
Sequential: update tokens in the classical left-to-right order. At each iteration, the whole sentence is updated.
python demo.py --run_type "caption" --order "sequential" --sentence_len 10 --caption_img_path "./examples/girl.jpg" --samples_num 1
--lm_model "bert-base-uncased" --match_model "openai/clip-vit-base-patch32"
--alpha 0.02 --beta 2.0
Shuffled: update tokens in a randomly shuffled generation order; different orders produce different captions.
python demo.py --run_type "caption" --order "shuffle" --sentence_len 10 --caption_img_path "./examples/girl.jpg" --samples_num 3
--lm_model "bert-base-uncased" --match_model "openai/clip-vit-base-patch32"
--alpha 0.02 --beta 2.0
Random: at each iteration, randomly select a single position and update only that token; this yields high diversity due to the added randomness.
python demo.py --run_type "caption" --order "random" --sentence_len 10 --caption_img_path "./examples/girl.jpg" --samples_num 3
--lm_model "bert-base-uncased" --match_model "openai/clip-vit-base-patch32"
--alpha 0.02 --beta 2.0
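To make the roles of alpha and beta concrete, here is a heavily simplified, hypothetical sketch of one polishing step in the spirit of the paper's Gibbs-BERT procedure. The function names, the top-k proposal set, and the argmax selection are illustrative assumptions, not the repository's exact code:

```python
import torch
from PIL import Image
from transformers import BertForMaskedLM, BertTokenizer, CLIPModel, CLIPProcessor

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
lm = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def polish_position(token_ids, pos, image, alpha=0.02, beta=2.0, top_k=100):
    """Re-sample the token at `pos`: mask it, take BERT's top-k proposals,
    then pick the proposal maximizing
    alpha * LM probability + beta * CLIP image-text similarity."""
    masked = token_ids.clone()
    masked[0, pos] = tokenizer.mask_token_id
    lm_probs = lm(masked).logits[0, pos].softmax(-1)
    cand_probs, cand_ids = lm_probs.topk(top_k)

    # Decode each candidate sentence and score it against the image with CLIP.
    sents = []
    for cid in cand_ids:
        ids = token_ids.clone()
        ids[0, pos] = cid
        sents.append(tokenizer.decode(ids[0], skip_special_tokens=True))
    inputs = clip_proc(text=sents, images=image, return_tensors="pt", padding=True)
    clip_scores = clip(**inputs).logits_per_image[0].softmax(-1)

    token_ids[0, pos] = cand_ids[(alpha * cand_probs + beta * clip_scores).argmax()]
    return token_ids

# Usage: image = Image.open("./examples/girl.jpg"), then sweep `pos` over the
# sentence in the chosen order (sequential, shuffled, or random) for a few rounds.
```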
ConZIC also supports several text-related control signals. For example:
Sentiment (positive/negative): you can increase gamma for a stronger control signal; here, too, there is a trade-off (a hedged scoring sketch follows the command below).
python demo.py \
    --run_type "controllable" --control_type "sentiment" --sentiment_type "positive" \
    --order "sequential" --sentence_len 10 --caption_img_path "./examples/girl.jpg" --samples_num 1 \
    --lm_model "bert-base-uncased" --match_model "openai/clip-vit-base-patch32" \
    --alpha 0.02 --beta 2.0 --gamma 5.0
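As an illustration of where gamma enters, a sentiment term can be added to the candidate score from the sketch above. The classifier below (a DistilBERT SST-2 model via the transformers pipeline) is an assumption chosen for demonstration, not necessarily the discriminator demo.py uses:

```python
import torch
from transformers import pipeline

# Assumed classifier for illustration; the repo's sentiment discriminator may differ.
sentiment = pipeline("sentiment-analysis",
                     model="distilbert-base-uncased-finetuned-sst-2-english")

def positivity(sentences):
    """Probability that each candidate sentence reads as positive."""
    results = sentiment(sentences)
    return torch.tensor([r["score"] if r["label"] == "POSITIVE" else 1.0 - r["score"]
                         for r in results])

# Candidate selection then becomes, schematically:
#   score = alpha * cand_probs + beta * clip_scores + gamma * positivity(sents)
```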
Part-of-speech (POS): the generated caption will match the predefined POS template as closely as possible.
python demo.py \
    --run_type "controllable" --control_type "pos" --order "sequential" \
    --pos_type "your predefined POS template" \
    --sentence_len 10 --caption_img_path "./examples/girl.jpg" --samples_num 1 \
    --lm_model "bert-base-uncased" --match_model "openai/clip-vit-base-patch32" \
    --alpha 0.02 --beta 2.0 --gamma 5.0
Length: change --sentence_len.
We highly recommend using the following WebUI demo in your browser at the local URL http://127.0.0.1:7860.
pip install gradio
python app.py --lm_model "bert-base-uncased" --match_model "openai/clip-vit-base-patch32"
You can also create a public link, usable by anyone to access the demo from their browser, by passing share=True to demo.launch().
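A minimal, self-contained illustration of what that looks like (a toy interface, not the repo's app.py):

```python
import gradio as gr

# Toy interface for illustration; app.py builds the real captioning UI.
demo = gr.Interface(fn=lambda text: text, inputs="text", outputs="text")

# share=True asks Gradio for a temporary public URL in addition to
# the local http://127.0.0.1:7860.
demo.launch(share=True)
```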
Please cite our work if you use it in your research:
@article{zeng2023conzic,
  title={ConZIC: Controllable Zero-shot Image Captioning by Sampling-Based Polishing},
  author={Zeng, Zequn and Zhang, Hao and Wang, Zhengjue and Lu, Ruiying and Wang, Dongsheng and Chen, Bo},
  journal={arXiv preprint arXiv:2303.02437},
  year={2023}
}
If you have any questions, please contact [email protected] or [email protected].
This code is based on bert-gen and MAGIC.
Thanks to Jiaqing Jiang for providing the Hugging Face and Colab demos.