
# Training Diffusion Model with Multi-Objective Reinforcement Learning

View our Poster for comprehensive information about our project.

Our repository's code is adapted from DDPO and MORL_Baseline.

## Motivation

Human preference is complex and is often best described by a multi-objective reward, for example 30% compressibility and 70% aesthetic quality. Furthermore, considering multi-objective rewards can help mitigate the over-optimization problem (Black et al., 2024), since it constrains each reward in the preference.

## Installation

Requires Python 3.10 or newer. Install the dependencies in your virtual environment:

```bash
pip install -r requires.txt
```

## Usage

Set up the accelerate configuration: choose your device and configure distributed training.

```bash
accelerate config
```

Training:

```bash
python -m accelerate.commands.launch mo_train.py
```

## Method

Our method combines DDPO and PGMORL. The algorithm consists of a warm-up stage and an evolutionary stage. The goal of PGMORL is to approximate the Pareto front, which consists of policies that represent optimal trade-offs among the objectives.

1. Warm-up stage
   - Generate the task set $\mathcal{T}=\{(\pi_i,\omega_i)\}_{i=1}^n$ from initial policies and evenly distributed weight vectors. Here each $\pi_i$ is a set of LoRA layers.
   - Each agent updates its own policy with the weighted DDPO gradient below, where $r(\mathbf{x}_0,\mathbf{c})$ is the multi-dimensional reward (a minimal loss sketch follows this list):

     $$\nabla_{\theta} \mathcal{J}_{\text{DDRL}}(\omega_i) = \mathbb{E} \left[ \sum_{t=0}^{T} \frac{p_{\theta}(\mathbf{x}_{t-1} \mid \mathbf{x}_{t}, \mathbf{c})}{p_{\theta_{\text{old}}}(\mathbf{x}_{t-1} \mid \mathbf{x}_{t}, \mathbf{c})} \nabla_{\theta} \log p_{\theta}(\mathbf{x}_{t-1} \mid \mathbf{x}_{t}, \mathbf{c}) \, \omega_i^{\top} r(\mathbf{x}_{0}, \mathbf{c}) \right]$$

   - Each agent samples and updates independently; only the LoRA layers are updated while the pretrained UNet is shared. We store $(\mathrm{F}(\pi_i), \mathrm{F}(\pi_i'), \omega_i)$ in the history $\mathcal{R}$ for later prediction, where $\mathrm{F}(\pi)$ denotes the evaluated objective vector of policy $\pi$.
2. Evolutionary stage
   - Fit an improvement prediction model for each policy from the history data $\mathcal{R}$.
   - Sample $K$ candidate weights. Given the $K \times N$ candidate points in the objective space, we select $n$ of them that maximize the hypervolume and minimize the sparsity (a sketch of these two metrics follows this list).
   - We iteratively update the population and keep interacting with the environment using the new task set $\mathcal{T}=\{(\pi_i,\omega_i)\}_{i=1}^n$.
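The per-agent update in the warm-up stage amounts to scalarizing the multi-dimensional reward with the agent's weight vector and applying the DDPO importance-sampling policy gradient. Below is a minimal PyTorch sketch of that loss, assuming per-timestep log-probabilities have already been collected during sampling; the tensor names, the baseline, and the clip value are illustrative assumptions rather than the exact code in `mo_train.py`.

```python
import torch

def weighted_ddpo_loss(log_probs, old_log_probs, rewards, weight, clip_range=1e-4):
    """Scalarized DDPO surrogate loss for one agent (pi_i, omega_i).

    log_probs, old_log_probs: (batch, T) log p_theta(x_{t-1} | x_t, c) under the
        current and the sampling-time policies.
    rewards: (batch, num_objectives) multi-dimensional reward r(x_0, c).
    weight:  (num_objectives,) weight vector omega_i.
    """
    # Scalarize the multi-objective reward with the agent's preference weights.
    scalar_reward = rewards @ weight                     # (batch,)
    advantages = scalar_reward - scalar_reward.mean()    # simple mean baseline

    # Importance ratio between the current policy and the one that sampled the data.
    ratio = torch.exp(log_probs - old_log_probs)         # (batch, T)

    # Clipped surrogate in the style of DDPO's importance-sampling estimator;
    # without clipping, its gradient matches the expression above.
    unclipped = -advantages[:, None] * ratio
    clipped = -advantages[:, None] * torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range)
    return torch.maximum(unclipped, clipped).mean()
```

For the evolutionary stage, candidate (policy, weight) pairs are scored by how much their predicted objective vectors improve the approximated Pareto front. With our two-objective setting, the two selection metrics can be computed as in the following simplified 2-D sketch (again an illustration, not the exact selection code).

```python
import numpy as np

def hypervolume_2d(points, ref):
    """Hypervolume (maximization) of 2-D objective vectors w.r.t. a reference
    point that is dominated by every candidate."""
    pts = np.asarray(points, dtype=float)
    pts = pts[np.argsort(-pts[:, 0])]      # sort by objective 0, descending
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:                      # skip points that add no new area
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

def sparsity(points):
    """PGMORL-style sparsity: mean squared gap between neighbouring solutions
    along each objective (lower means a denser, more even front)."""
    pts = np.asarray(points, dtype=float)
    if len(pts) < 2:
        return 0.0
    gaps = 0.0
    for j in range(pts.shape[1]):
        col = np.sort(pts[:, j])
        gaps += np.sum((col[1:] - col[:-1]) ** 2)
    return gaps / (len(pts) - 1)
```

The new task set is then formed by greedily keeping the $n$ candidates whose predicted objective vectors give the largest hypervolume gain while keeping the sparsity low.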

## Experiments

Here we experiment with two reward combinations: Aesthetic Score + Compressibility and Aesthetic Score + Incompressibility.
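Following the DDPO setup, compressibility and incompressibility are measured through the JPEG-compressed file size of the generated image; a minimal sketch is given below. The function names, the `quality` setting, and the kB scaling are illustrative choices and may differ from the reward code used for training.

```python
import io
from PIL import Image

def jpeg_compressibility(image: Image.Image, quality: int = 95) -> float:
    """Reward = negative JPEG size in kB, so images that compress well score higher."""
    buf = io.BytesIO()
    image.convert("RGB").save(buf, format="JPEG", quality=quality)
    return -buf.tell() / 1024.0

def jpeg_incompressibility(image: Image.Image, quality: int = 95) -> float:
    """Reward = JPEG size in kB, so images that resist compression score higher."""
    return -jpeg_compressibility(image, quality)
```

The aesthetic score is produced by a learned aesthetic predictor, and the objectives are traded off through the weight vectors described in the Method section.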

### Aesthetic Score + Compressibility

Pareto Front:

HyperVolume:

### Aesthetic Score + Incompressibility

Pareto Front:

HyperVolume:

## Limitations and Future Work

- **Computation Efficiency**: PGMORL is an evolutionary algorithm that needs a large population to search the objective space. Since diffusion models are computationally costly, it is hard to fit even a few agents into GPU memory, and this limit also prevents us from training agents in parallel. In a single epoch we must train the $n$ agents sequentially, which increases training time.
- **Sample Efficiency**: DDPO requires a large number of samples to reach a high reward. Is it the best way to fine-tune a powerful pretrained DDPM?