Datasets and evaluation metrics used for quality benchmarking
- Introduction
- Human Preference Dataset v2 (HPD v2)
- Human Preference Score v2 (HPS v2)
- Results and Observations
OneDiff is an out-of-the-box library for accelerating diffusion models, so it is important to evaluate how well image quality is preserved after its acceleration techniques are applied.
OneDiffGenMetrics currently uses the test set of the Human Preference Dataset v2 (HPD v2), which contains 3,200 prompts, to compute several evaluation metrics: the Human Preference Score v2 (HPS v2), CLIP Score, Aesthetic Score, and Inception Score (IS).
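As a rough illustration of this evaluation flow (a minimal sketch, not the repository's actual pipeline), the loop below generates SDXL images for the test prompts and scores them with the hpsv2 package; the prompt file name is hypothetical, and the exact behavior of hpsv2.score() is an assumption noted in the comments:

```python
# Minimal sketch of the evaluation flow described above, NOT the repository's
# actual code. Assumes the `diffusers` SDXL pipeline, the `hpsv2` scoring
# package, and a local JSON file of HPD v2 test prompts (file name is hypothetical).
import json
import torch
from diffusers import StableDiffusionXLPipeline
import hpsv2

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

with open("hpd_v2_test_prompts.json") as f:   # hypothetical prompt file
    prompts_by_style = json.load(f)           # e.g. {"anime": [...], "photo": [...], ...}

scores = []
for style, prompts in prompts_by_style.items():
    for prompt in prompts:
        image = pipe(prompt, num_inference_steps=30).images[0]
        # hpsv2.score is assumed to take a PIL image (or image path) plus the
        # prompt and to return a list with one score per image.
        scores.append(hpsv2.score(image, prompt, hps_version="v2.1")[0])

print(f"mean HPS v2 over {len(scores)} prompts: {sum(scores) / len(scores):.2f}")
```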
HPD v2 is a large-scale text-to-image preference dataset that captures human choices over images produced by a variety of generative sources. Its test split covers four styles, anime, concept-art, paintings, and photo, which also serve as the categories in the results below.
HPS v2 is a scoring model obtained by fine-tuning CLIP on the training split of HPD v2; it predicts human preferences over text-generated images more accurately than the original CLIP. Concretely, each training instance consists of a pair of images {x1, x2} and a prompt p. If x1 is preferred over x2 it is labeled y = [1, 0], otherwise y = [0, 1]. The CLIP model can be viewed as a scoring function s that measures the similarity between the prompt p and an image x as the temperature-scaled dot product of their embeddings:

$$ s_\theta(p, x) = \frac{\mathrm{enc}_{\mathrm{txt}}(p) \cdot \mathrm{enc}_{\mathrm{img}}(x)}{\tau} $$

Here τ is a temperature scalar learned by CLIP and θ denotes the CLIP parameters. The predicted preference for each image is obtained by a softmax over the two scores:

$$ \hat{y}_i = \frac{\exp\left(s_\theta(p, x_i)\right)}{\sum_{j=1}^{2} \exp\left(s_\theta(p, x_j)\right)} $$

θ is then optimized by minimizing the KL divergence between the human label y and the prediction ŷ:

$$ \mathcal{L}(\theta) = \sum_{i=1}^{2} y_i \left( \log y_i - \log \hat{y}_i \right) $$
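To make the objective concrete, here is a small PyTorch sketch of this preference loss for a single training instance (variable names are ours, not taken from the HPS v2 code):

```python
# Illustrative PyTorch sketch of the preference loss above; not the HPS v2
# training code itself.
import torch
import torch.nn.functional as F

def preference_loss(txt_emb, img_emb_pair, label, tau):
    """KL-divergence preference loss for one (prompt, image pair) instance.

    txt_emb:      (d,)   L2-normalized CLIP text embedding of prompt p
    img_emb_pair: (2, d) L2-normalized CLIP image embeddings of {x1, x2}
    label:        (2,)   y = [1, 0] if x1 is preferred, else [0, 1]
    tau:          temperature scalar learned by CLIP
    """
    # Scoring function s(p, x): temperature-scaled dot product of embeddings.
    scores = img_emb_pair @ txt_emb / tau      # shape (2,)
    log_pred = F.log_softmax(scores, dim=0)    # log of the predicted preference y_hat
    # With a one-hot label, sum_i y_i * log(y_i) = 0, so KL(y || y_hat)
    # reduces to the cross-entropy -sum_i y_i * log(y_hat_i).
    return -(label * log_pred).sum()

# Toy usage with random embeddings.
d = 1024
txt = F.normalize(torch.randn(d), dim=0)
imgs = F.normalize(torch.randn(2, d), dim=1)
y = torch.tensor([1.0, 0.0])
print(preference_loss(txt, imgs, y, tau=0.07))
```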
Finally, an OpenCLIP ViT-H/14 model is fine-tuned on HPD v2 for 4,000 steps, updating only the last 20 layers of the CLIP image encoder and the last 11 layers of the CLIP text encoder.
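A sketch of that partial fine-tuning setup with the open_clip library might look like the following; the module paths follow open_clip's ViT-H-14 implementation and "layers" is interpreted as transformer blocks, both of which are assumptions rather than the exact HPS v2 training code:

```python
# Illustrative sketch of the partial fine-tuning setup described above.
# The attribute paths (visual.transformer.resblocks, transformer.resblocks)
# follow open_clip's CLIP implementation; verify them against the installed
# open_clip version before relying on this.
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k"
)

# Freeze every parameter first.
for p in model.parameters():
    p.requires_grad = False

# Unfreeze the last 20 transformer blocks of the image encoder ...
for block in model.visual.transformer.resblocks[-20:]:
    for p in block.parameters():
        p.requires_grad = True

# ... and the last 11 transformer blocks of the text encoder.
for block in model.transformer.resblocks[-11:]:
    for p in block.parameters():
        p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable / 1e6:.1f}M")
```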
Using HPS v2, the quality of OneDiff's Compile, DeepCache, and quantization techniques (CE denotes the OneDiff Community Edition and EE the Enterprise Edition) is assessed as follows:
| Optimization Technique | Paintings | Photo | Concept-Art | Anime | Average Score |
|---|---|---|---|---|---|
| OneDiff Quant + OneDiff DeepCache (EE) | 28.51 ± 0.4962 | 26.91 ± 0.4605 | 28.42 ± 0.3953 | 30.50 ± 0.3470 | 28.58 |
| OneDiff Quant (EE) | 30.05 ± 0.3897 | 28.26 ± 0.4339 | 30.04 ± 0.3807 | 31.79 ± 0.3224 | 30.04 |
| OneDiff DeepCache (CE) | 28.45 ± 0.3816 | 27.03 ± 0.3348 | 28.56 ± 0.3517 | 30.49 ± 0.3626 | 28.63 |
| OneDiff Compile (CE) | 30.07 ± 0.3789 | 28.42 ± 0.2491 | 30.17 ± 0.2834 | 31.73 ± 0.3485 | 30.10 |
| PyTorch | 30.07 ± 0.3887 | 28.43 ± 0.2726 | 30.16 ± 0.2686 | 31.74 ± 0.3691 | 30.10 |
It can be observed that OneDiff Compile is essentially lossless, showing no change in the average score compared to PyTorch. Quantization and DeepCache lower the scores slightly, but in exchange they offer faster inference.
The training split of HPD v2 has not been released yet. However, based on the HPD v2 test data, OneDiffGenMetrics also calculates the CLIP Score, Aesthetic Score, and IS to give a more comprehensive view of the quality of images produced after acceleration with OneDiff.
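For reference, CLIP Score and IS can be computed with off-the-shelf implementations; the sketch below uses torchmetrics on a dummy batch of images, which is our choice of library for illustration and not necessarily what OneDiffGenMetrics uses internally:

```python
# Sketch of computing CLIP Score and Inception Score with torchmetrics on a
# dummy batch; illustrative only, not the OneDiffGenMetrics implementation.
import torch
from torchmetrics.multimodal.clip_score import CLIPScore
from torchmetrics.image.inception import InceptionScore

# Stand-in for a batch of generated images: uint8 tensors of shape (N, C, H, W).
images = torch.randint(0, 256, (8, 3, 512, 512), dtype=torch.uint8)
prompts = ["a watercolor painting of a lighthouse at dusk"] * 8  # placeholder prompts

clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
print("CLIP Score:", clip_score(images, prompts).item())

inception = InceptionScore()
inception.update(images)
is_mean, is_std = inception.compute()
print(f"Inception Score: {is_mean.item():.2f} ± {is_std.item():.2f}")
```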
The comparison results on these metrics can be found at https://github.com/siliconflow/OneDiffGenMetrics/blob/main/README.md#sdxl. On these metrics as well, the results from OneDiff Compile remain consistent with those from PyTorch.
Details about these metrics can be found at: