Datasets and evaluation metrics used for quality benchmarking

  1. Introduction
  2. Human Preference Dataset v2 (HPD v2)
  3. Human Preference Score v2 (HPS v2)
  4. Results and Observations

Introduction

OneDiff is an out-of-the-box library for accelerating diffusion models, so it is important to evaluate how its acceleration techniques affect the quality of the generated images.

OneDiffGenMetrics currently uses the test data of the Human Preference Dataset v2 (HPD v2), which contains 3,200 prompts, to calculate several evaluation metrics: the Human Preference Score v2 (HPS v2), CLIP Score, Aesthetic Score, and Inception Score (IS).
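
As a sketch of how such an evaluation can be wired up, the snippet below generates images for the HPD v2 benchmark prompts and scores them with the public `hpsv2` package. The helper names (`benchmark_prompts`, `evaluate`) and the SDXL checkpoint are taken from that package's documentation rather than from OneDiffGenMetrics itself, so treat this as an illustrative sketch, not the exact benchmarking code.

```python
# Minimal sketch: generate images for the HPD v2 benchmark prompts and score
# them with the public `hpsv2` package (pip install hpsv2). The hpsv2 API and
# the SDXL checkpoint below are assumptions based on the package docs.
import os

import hpsv2
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0"
).to("cuda")
# ... apply OneDiff compilation / quantization to `pipe` here ...

# Benchmark prompts grouped by style: anime, concept-art, paintings, photo.
all_prompts = hpsv2.benchmark_prompts("all")

for style, prompts in all_prompts.items():
    os.makedirs(f"outputs/{style}", exist_ok=True)
    for idx, prompt in enumerate(prompts):
        image = pipe(prompt=prompt).images[0]
        image.save(f"outputs/{style}/{idx:05d}.jpg")

# Walks the output directory and reports per-style HPS v2 scores.
hpsv2.evaluate("outputs/", hps_version="v2.1")
```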

Human Preference Dataset v2 (HPD v2)

HPD v2 is a large-scale text-to-image dataset capturing human preferences across images from various sources. The dataset is structured as follows:

(Figure: structure of the HPD v2 dataset.)

Human Preference Score v2 (HPS v2)

HPS v2 is a scoring model obtained by fine-tuning CLIP on the training split of HPD v2; it predicts human preferences on text-generated images more accurately than the base model. Concretely, each training instance consists of a pair of images $\{x_1, x_2\}$ and a prompt $p$. If image $x_1$ is preferred over $x_2$, the label is $y = [1, 0]$; otherwise $y = [0, 1]$. The CLIP model can be viewed as a scoring function $s$ that measures the similarity between the prompt $p$ and an image $x$:

$$s_\theta(p, x) = \frac{\mathrm{enc}_{\text{txt}}(p) \cdot \mathrm{enc}_{\text{img}}(x)}{\tau}$$

In this model, τ is a temperature scalar learned by the CLIP model, and θ represents the parameters within CLIP. The predicted preferences $\hat{y}_i$ are computed as follows:

$$\hat{y}_i = \frac{\exp\left(s_\theta(p, x_i)\right)}{\sum_{j=1}^{2} \exp\left(s_\theta(p, x_j)\right)}$$

θ is optimized by minimizing the KL divergence between the label $y$ and the prediction $\hat{y}$:

$$\mathcal{L}(\theta) = \sum_{i} y_i \left( \log y_i - \log \hat{y}_i \right)$$

In practice, a ViT-H/14 model trained with OpenCLIP is fine-tuned on HPD v2 for 4,000 steps, updating only the last 20 layers of the CLIP image encoder and the last 11 layers of the CLIP text encoder.
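
For intuition, the snippet below is a simplified re-implementation of the objective above using open_clip: it computes the pairwise scores $s_\theta(p, x_i)$, the softmax preferences $\hat{y}_i$, and the KL loss. The checkpoint name and the single-pair batch are assumptions made for brevity; this is not the authors' training code.

```python
# Simplified sketch of the HPS v2 preference objective described above,
# written against open_clip. Illustrative only: checkpoint name and the
# single-pair batch handling are assumptions.
import torch
import torch.nn.functional as F
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-H-14")


def preference_loss(prompt: str, img_pair: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """img_pair: (2, 3, H, W) preprocessed images; y: (2,) one-hot preference label."""
    text = tokenizer([prompt])
    img_feat = F.normalize(model.encode_image(img_pair), dim=-1)  # (2, d)
    txt_feat = F.normalize(model.encode_text(text), dim=-1)       # (1, d)
    # s(p, x_i): cosine similarity divided by the learned temperature tau
    # (open_clip's logit_scale.exp() equals 1 / tau).
    s = (img_feat @ txt_feat.T).squeeze(-1) * model.logit_scale.exp()
    y_hat = F.softmax(s, dim=0)  # predicted preference distribution over the pair
    # KL(y || y_hat); with a one-hot y this reduces to -log y_hat[preferred image]
    return F.kl_div(y_hat.log(), y, reduction="sum")
```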

Results and Observations

Using HPS v2, the quality assessments of OneDiff's Compile, DeepCache, and quantization techniques are as follows:

| Optimization Technique | Paintings | Photo | Concept-Art | Anime | Average Score |
| --- | --- | --- | --- | --- | --- |
| OneDiff Quant + OneDiff DeepCache (EE) | 28.51 ± 0.4962 | 26.91 ± 0.4605 | 28.42 ± 0.3953 | 30.50 ± 0.3470 | 28.58 |
| OneDiff Quant (EE) | 30.05 ± 0.3897 | 28.26 ± 0.4339 | 30.04 ± 0.3807 | 31.79 ± 0.3224 | 30.04 |
| OneDiff DeepCache (CE) | 28.45 ± 0.3816 | 27.03 ± 0.3348 | 28.56 ± 0.3517 | 30.49 ± 0.3626 | 28.63 |
| OneDiff Compile (CE) | 30.07 ± 0.3789 | 28.42 ± 0.2491 | 30.17 ± 0.2834 | 31.73 ± 0.3485 | 30.10 |
| PyTorch | 30.07 ± 0.3887 | 28.43 ± 0.2726 | 30.16 ± 0.2686 | 31.74 ± 0.3691 | 30.10 |

OneDiff Compile is effectively lossless: its Average Score matches that of PyTorch. Quantization and DeepCache lower the scores slightly, but they offer faster inference in return.

The training data for HPD v2 is not yet published. However, based on the HPD v2 test data, OneDiffGenMetrics also calculates CLIP Score, Aesthetic Score, and IS to more comprehensively assess the quality of images produced after acceleration with OneDiff.
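
The snippet below sketches how two of these complementary metrics, the CLIP Score and the Inception Score, can be computed with torchmetrics over a folder of generated images. The folder layout, the CLIP checkpoint, and the reuse of `hpsv2.benchmark_prompts` are assumptions made for illustration; they are not necessarily how OneDiffGenMetrics computes these scores internally.

```python
# Sketch: CLIP Score and Inception Score over a folder of generated images,
# using torchmetrics. Folder layout, CLIP checkpoint, and prompt source are
# illustrative assumptions.
from pathlib import Path

import hpsv2
from PIL import Image
from torchvision.transforms.functional import pil_to_tensor
from torchmetrics.multimodal.clip_score import CLIPScore
from torchmetrics.image.inception import InceptionScore

clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
inception = InceptionScore()

image_dir = Path("outputs/photo")  # images generated from the HPD v2 "photo" prompts
prompts = hpsv2.benchmark_prompts("photo")  # assumed to match the saved image order

for img_path, prompt in zip(sorted(image_dir.glob("*.jpg")), prompts):
    img = pil_to_tensor(Image.open(img_path).convert("RGB"))  # uint8, (3, H, W)
    clip_score.update(img, prompt)
    inception.update(img.unsqueeze(0))  # Inception Score expects a batch dimension

print("CLIP Score:", clip_score.compute().item())
is_mean, is_std = inception.compute()
print("Inception Score:", is_mean.item(), "±", is_std.item())
```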

The comparison results on these metrics can be found at https://github.com/siliconflow/OneDiffGenMetrics/blob/main/README.md#sdxl. The results from OneDiff Compile remain consistent with those from PyTorch across these metrics.

Details about these metrics can be found at: