A Survey of Instruction-Guided Image and Media Editing in LLM Era
A collection of academic articles, published methods, and datasets on the subject of instruction-guided image and media editing.
A sortable version is available here: https://awesome-instruction-editing.github.io/
📌 We are actively tracking the latest research and welcome contributions to our repository and survey paper. If your work is relevant, please feel free to open an issue or submit a pull request.
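For readers new to the area, here is a minimal sketch of what instruction-guided image editing looks like in practice, using the Hugging Face diffusers implementation of InstructPix2Pix (one of the surveyed methods). The input path, instruction text, and sampling parameters are illustrative assumptions, not fixed choices:

```python
# A minimal sketch of instruction-guided editing with the InstructPix2Pix
# pipeline from Hugging Face diffusers; assumes the public
# "timbrooks/instruct-pix2pix" checkpoint and a CUDA-capable GPU.
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("input.jpg").convert("RGB")  # hypothetical input path

# The text instruction drives the edit; image_guidance_scale trades off
# faithfulness to the input image against adherence to the instruction.
edited = pipe(
    "make the sky look like a sunset",
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,
).images[0]
edited.save("edited.jpg")
```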
📰 2024-11-15: Our paper Instruction-Guided Editing Controls for Images and Multimedia: A Survey in LLM Era has been updated to version 1 with new methods and discussions.
If you find this work helpful in your research, please consider citing the paper and giving the repo a ⭐.
Please read and cite our paper:
Nguyen, T.T., Ren, Z., Pham, T., Huynh, T.T., Nguyen, P.L., Yin, H., and Nguyen, Q.V.H., 2024. Instruction-Guided Editing Controls for Images and Multimedia: A Survey in LLM Era. arXiv preprint arXiv:2411.09955.
@article{nguyen2024instruction,
title={Instruction-Guided Editing Controls for Images and Multimedia: A Survey in LLM era},
author={Thanh Tam Nguyen and Zhao Ren and Trinh Pham and Thanh Trung Huynh and Phi Le Nguyen and Hongzhi Yin and Quoc Viet Hung Nguyen},
journal={arXiv preprint arXiv:2411.09955},
year={2024}
}
Paper Title | Venue | Year | Focus |
---|---|---|---|
A Survey of Multimodal Composite Editing and Retrieval | arXiv | 2024 | Media Retrieval |
INFOBENCH: Evaluating Instruction Following Ability in Large Language Models | arXiv | 2024 | Text Editing |
Multimodal Image Synthesis and Editing: The Generative AI Era | TPAMI | 2023 | X-to-Image Generation |
LLM-driven Instruction Following: Progresses and Concerns | EMNLP | 2023 | Text Editing |
Dataset | #Items | #Papers Used | Link |
---|---|---|---|
Reason-Edit | 12.4M+ | 1 | Link |
MagicBrush | 10K | 1 | Link |
InstructPix2Pix | 500K | 1 | Link |
EditBench | 240 | 1 | Link |
Conceptual Captions | 3.3M | 1 | Link |
CoSaL | 22K+ | 1 | Link |
ReferIt | 19K+ | 1 | Link |
Oxford-102 Flowers | 8K+ | 1 | Link |
LAION-5B | 5.85B+ | 1 | Link |
MS-COCO | 330K | 2 | Link |
DeepFashion | 800K | 2 | Link |
Fashion-IQ | 77K+ | 1 | Link |
Fashion200k | 200K | 1 | Link |
MIT-States | 63K+ | 1 | Link |
CIRR | 36K+ | 1 | Link |
CoDraw | 58K+ | 1 | Link |
i-CLEVR | 70K+ | 1 | Link |
ADE20K | 27K+ | 1 | Link |
Oxford-IIIT Pets | 7K+ | 1 | Link |
NYUv2 | 408K+ | 1 | Link |
LAION-Aesthetics V2 | 2.4B+ | 1 | Link |
CelebA-Dialog | 202K+ | 1 | Link |
Flickr-Faces-HQ | 70K | 2 | Link |
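Several of the datasets above are mirrored on the Hugging Face Hub, so a quick way to inspect one is the datasets library. The hub ID "osunlp/MagicBrush" below is our assumption of where MagicBrush is hosted and should be checked against the dataset's official link; field names also vary across datasets:

```python
# A minimal sketch of loading one of the editing datasets above via the
# Hugging Face datasets library; the hub ID is an assumption, verify it
# against the dataset's own page before relying on it.
from datasets import load_dataset

ds = load_dataset("osunlp/MagicBrush", split="train")
example = ds[0]

# Each MagicBrush example pairs a source image, an edit instruction,
# and the target (edited) image; exact field names may differ by version.
print(example.keys())
```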
Category | Evaluation Metric | Description |
---|---|---|
Perceptual Quality | Learned Perceptual Image Patch Similarity (LPIPS) | Measures perceptual similarity between images; lower scores indicate higher similarity. |
 | Structural Similarity Index (SSIM) | Measures visual similarity based on luminance, contrast, and structure. |
 | Fréchet Inception Distance (FID) | Measures the distance between the feature distributions of real and generated images. |
 | Inception Score (IS) | Evaluates image quality and diversity based on label distribution consistency. |
Structural Integrity | Peak Signal-to-Noise Ratio (PSNR) | Measures image quality via pixel-wise error; higher values indicate better quality. |
 | Mean Intersection over Union (mIoU) | Assesses segmentation accuracy by comparing predicted and ground-truth masks. |
 | Mask Accuracy | Evaluates the accuracy of generated masks. |
 | Boundary Adherence | Measures how well edits preserve object boundaries. |
Semantic Alignment | Edit Consistency | Measures the consistency of edits across similar prompts. |
 | Target Grounding Accuracy | Evaluates how well edits align with the targets specified in the prompt. |
 | Embedding-Space Similarity | Measures similarity between the edited and reference images in feature space. |
 | Decomposed Requirements Following Ratio (DRFR) | Assesses how closely the model follows decomposed instructions. |
User-Based Metrics | User Study Ratings | Captures user feedback through ratings of image quality. |
 | Human Visual Turing Test (HVTT) | Measures users' ability to distinguish between real and generated images. |
 | Click-Through Rate (CTR) | Tracks user engagement by measuring image clicks. |
Diversity and Fidelity | Edit Diversity | Measures the variability of generated images. |
 | GAN Discriminator Score | Assesses the authenticity of generated images using a GAN discriminator. |
 | Reconstruction Error | Measures the error between the original and generated images. |
 | Edit Success Rate | Quantifies the success of applied edits. |
Consistency and Cohesion | Scene Consistency | Measures how well edits maintain the overall scene structure. |
 | Color Consistency | Measures color preservation between edited and original regions. |
 | Shape Consistency | Quantifies how well shapes are preserved during edits. |
 | Pose Matching Score | Assesses pose consistency between original and edited images. |
Robustness | Noise Robustness | Evaluates model robustness to noise. |
 | Perceptual Quality | A subjective quality metric based on human judgment. |
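As a pointer for how several of the reference-based metrics above are computed in practice, here is a minimal sketch using the torchmetrics library (a tooling assumption on our part, not necessarily what the surveyed papers used). Random tensors stand in for the edited and reference images:

```python
# A minimal sketch computing PSNR, SSIM, and LPIPS with torchmetrics;
# LPIPS additionally requires the lpips backbone weights to be downloaded.
import torch
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

pred = torch.rand(1, 3, 256, 256)    # edited image, values in [0, 1]
target = torch.rand(1, 3, 256, 256)  # reference image, values in [0, 1]

psnr = PeakSignalNoiseRatio(data_range=1.0)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)

print(f"PSNR:  {psnr(pred, target).item():.2f} dB  (higher is better)")
print(f"SSIM:  {ssim(pred, target).item():.4f}     (higher is better)")
print(f"LPIPS: {lpips(pred, target).item():.4f}    (lower is better)")
```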
Disclaimer
Feel free to contact us if you have any queries or exciting news. We welcome all researchers to contribute to this repository and to the body of knowledge in this field.
If you have other related references, please feel free to create a GitHub issue with the paper information. We will gladly update the repository according to your suggestions. (You can also open a pull request, but it may take some time for us to merge it.)