Datasets and evaluation metrics used for quality benchmarking

  1. Introduction
  2. Human Preference Dataset v2 (HPD v2)
  3. Human Preference Score v2 (HPS v2)
  4. Results and Observations

Introduction

OneDiff is an out-of-the-box library for accelerating diffusion models, so it is important to evaluate how its acceleration techniques affect the quality of the generated images.

OneDiffGenMetrics currently uses the test data of the Human Preference Dataset v2 (HPD v2), which contains 3,200 prompts, to calculate several evaluation metrics: the Human Preference Score v2 (HPS v2), CLIP Score, Aesthetic Score, and Inception Score (IS).
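
As a sketch of how such an evaluation can be wired up, the snippet below generates images for the HPD v2 benchmark prompts and scores them with the public `hpsv2` package. The helper names (`benchmark_prompts`, `evaluate`) and the SDXL checkpoint are taken from that package's documentation rather than from OneDiffGenMetrics itself, so treat this as an illustrative sketch, not the exact benchmarking code.

```python
# Minimal sketch: generate images for the HPD v2 benchmark prompts and score
# them with the public `hpsv2` package (pip install hpsv2). The hpsv2 API and
# the SDXL checkpoint below are assumptions based on the package docs.
import os

import hpsv2
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0"
).to("cuda")
# ... apply OneDiff compilation / quantization to `pipe` here ...

# Benchmark prompts grouped by style: anime, concept-art, paintings, photo.
all_prompts = hpsv2.benchmark_prompts("all")

for style, prompts in all_prompts.items():
    os.makedirs(f"outputs/{style}", exist_ok=True)
    for idx, prompt in enumerate(prompts):
        image = pipe(prompt=prompt).images[0]
        image.save(f"outputs/{style}/{idx:05d}.jpg")

# Walks the output directory and reports per-style HPS v2 scores.
hpsv2.evaluate("outputs/", hps_version="v2.1")
```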

Human Preference Dataset v2 (HPD v2)

HPD v2 is a large-scale text-to-image dataset capturing human preferences across images from various sources. The dataset is structured as follows:

(Figure: structure of the HPD v2 dataset.)

Human Preference Score v2 (HPS v2)

HPS v2 is a scoring model obtained by fine-tuning CLIP on the training split of HPD v2; it predicts human preferences on text-generated images more accurately than the base model. Concretely, each training instance consists of a pair of images $\{x_1, x_2\}$ and a prompt $p$. If image $x_1$ is preferred over $x_2$, the label is $y = [1, 0]$; otherwise $y = [0, 1]$. The CLIP model can be viewed as a scoring function $s$ that measures the similarity between the prompt $p$ and an image $x$:

$$s_\theta(p, x) = \frac{\mathrm{enc}_{\text{txt}}(p) \cdot \mathrm{enc}_{\text{img}}(x)}{\tau}$$

In this model, τ is a temperature scalar learned by the CLIP model, and θ represents the parameters within CLIP. The predicted preferences $\hat{y}_i$ are computed as follows:

$$\hat{y}_i = \frac{\exp\left(s_\theta(p, x_i)\right)}{\sum_{j=1}^{2} \exp\left(s_\theta(p, x_j)\right)}$$

θ is optimized by minimizing the KL divergence between the label $y$ and the prediction $\hat{y}$:

$$\mathcal{L}(\theta) = \sum_{i} y_i \left( \log y_i - \log \hat{y}_i \right)$$

In practice, a ViT-H/14 model trained with OpenCLIP is fine-tuned on HPD v2 for 4,000 steps, updating only the last 20 layers of the CLIP image encoder and the last 11 layers of the CLIP text encoder.
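
For intuition, the snippet below is a simplified re-implementation of the objective above using open_clip: it computes the pairwise scores $s_\theta(p, x_i)$, the softmax preferences $\hat{y}_i$, and the KL loss. The checkpoint name and the single-pair batch are assumptions made for brevity; this is not the authors' training code.

```python
# Simplified sketch of the HPS v2 preference objective described above,
# written against open_clip. Illustrative only: checkpoint name and the
# single-pair batch handling are assumptions.
import torch
import torch.nn.functional as F
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-H-14")


def preference_loss(prompt: str, img_pair: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """img_pair: (2, 3, H, W) preprocessed images; y: (2,) one-hot preference label."""
    text = tokenizer([prompt])
    img_feat = F.normalize(model.encode_image(img_pair), dim=-1)  # (2, d)
    txt_feat = F.normalize(model.encode_text(text), dim=-1)       # (1, d)
    # s(p, x_i): cosine similarity divided by the learned temperature tau
    # (open_clip's logit_scale.exp() equals 1 / tau).
    s = (img_feat @ txt_feat.T).squeeze(-1) * model.logit_scale.exp()
    y_hat = F.softmax(s, dim=0)  # predicted preference distribution over the pair
    # KL(y || y_hat); with a one-hot y this reduces to -log y_hat[preferred image]
    return F.kl_div(y_hat.log(), y, reduction="sum")
```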

Results and Observations

Using HPS v2, the quality assessments of OneDiff's Compile, DeepCache, and quantization techniques are as follows:

| Optimization Technique | Paintings | Photo | Concept-Art | Anime | Average Score |
| --- | --- | --- | --- | --- | --- |
| OneDiff Quant + OneDiff DeepCache (EE) | 28.51 ± 0.4962 | 26.91 ± 0.4605 | 28.42 ± 0.3953 | 30.50 ± 0.3470 | 28.58 |
| OneDiff Quant (EE) | 30.05 ± 0.3897 | 28.26 ± 0.4339 | 30.04 ± 0.3807 | 31.79 ± 0.3224 | 30.04 |
| OneDiff DeepCache (CE) | 28.45 ± 0.3816 | 27.03 ± 0.3348 | 28.56 ± 0.3517 | 30.49 ± 0.3626 | 28.63 |
| OneDiff Compile (CE) | 30.07 ± 0.3789 | 28.42 ± 0.2491 | 30.17 ± 0.2834 | 31.73 ± 0.3485 | 30.10 |
| PyTorch | 30.07 ± 0.3887 | 28.43 ± 0.2726 | 30.16 ± 0.2686 | 31.74 ± 0.3691 | 30.10 |

OneDiff Compile is effectively lossless: its Average Score matches that of PyTorch. Quantization and DeepCache lower the scores slightly, but they offer faster inference in return.

The training data for HPD v2 is not yet published. However, based on the HPD v2 test data, OneDiffGenMetrics also calculates CLIP Score, Aesthetic Score, and IS to more comprehensively assess the quality of images produced after acceleration with OneDiff.
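
The snippet below sketches how two of these complementary metrics, the CLIP Score and the Inception Score, can be computed with torchmetrics over a folder of generated images. The folder layout, the CLIP checkpoint, and the reuse of `hpsv2.benchmark_prompts` are assumptions made for illustration; they are not necessarily how OneDiffGenMetrics computes these scores internally.

```python
# Sketch: CLIP Score and Inception Score over a folder of generated images,
# using torchmetrics. Folder layout, CLIP checkpoint, and prompt source are
# illustrative assumptions.
from pathlib import Path

import hpsv2
from PIL import Image
from torchvision.transforms.functional import pil_to_tensor
from torchmetrics.multimodal.clip_score import CLIPScore
from torchmetrics.image.inception import InceptionScore

clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
inception = InceptionScore()

image_dir = Path("outputs/photo")  # images generated from the HPD v2 "photo" prompts
prompts = hpsv2.benchmark_prompts("photo")  # assumed to match the saved image order

for img_path, prompt in zip(sorted(image_dir.glob("*.jpg")), prompts):
    img = pil_to_tensor(Image.open(img_path).convert("RGB"))  # uint8, (3, H, W)
    clip_score.update(img, prompt)
    inception.update(img.unsqueeze(0))  # Inception Score expects a batch dimension

print("CLIP Score:", clip_score.compute().item())
is_mean, is_std = inception.compute()
print("Inception Score:", is_mean.item(), "±", is_std.item())
```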

The comparison results on these metrics can be found at https://github.com/siliconflow/OneDiffGenMetrics/blob/main/README.md#sdxl. The results from OneDiff Compile remain consistent with those from PyTorch across these metrics.

Details about these metrics can be found at: