View our Poster for a comprehensive overview of the project.
Our repository's code is adapted from DDPO and MORL_Baseline.
Human preferences are complex and usually multi-objective, e.g., 30% compressibility and 70% aesthetic quality. Furthermore, considering multi-objective rewards can mitigate the over-optimization problem (Black et al., 2024), since each reward in the preference acts as a constraint on the others.
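As a toy illustration (the numbers and function name below are made up, not part of this repo), a 30%/70% preference corresponds to a linear scalarization $\omega^\top r(x_0,c)$ of the reward vector:

```python
import numpy as np

# Hypothetical example: scalarize a 2-dimensional reward vector
# r(x_0, c) = [compressibility, aesthetic_quality] with a fixed
# preference weight omega = [0.3, 0.7].
def scalarize(reward_vec: np.ndarray, omega: np.ndarray) -> float:
    """Linear scalarization omega^T r used to turn a multi-objective
    reward into a single training signal."""
    assert reward_vec.shape == omega.shape
    return float(omega @ reward_vec)

rewards = np.array([0.4, 6.2])    # toy values for the two objectives
omega = np.array([0.3, 0.7])      # 30% compressibility, 70% aesthetics
print(scalarize(rewards, omega))  # 0.3*0.4 + 0.7*6.2 ≈ 4.46
```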
Requires Python 3.10 or newer. Install the dependencies in your virtual environment:
pip install -r requires.txt
Set up the accelerate configuration: choose your device(s) and whether to use distributed training.
accelerate config
Training:
python -m accelerate.commands.launch mo_train.py
Our method combines DDPO and PGMORL. The algorithm consists of a warm-up stage and an evolutionary stage. The goal of PGMORL is to approximate the Pareto front, which consists of policies representing optimal trade-offs among the objectives.
- Warm-up stage
  - Generate a task set $\mathcal{T}=\{(\pi_i,\omega_i)\}_{i=1}^n$ from initial policies and evenly distributed weight vectors, where each $\pi_i$ is a set of LoRA layers.
  - Every agent updates on its own with a DDPO-style policy gradient on its scalarized reward, $\nabla_\theta \mathcal{J}(\pi_i) = \mathbb{E}\left[\sum_{t=0}^{T} \nabla_\theta \log p_\theta(x_{t-1}\mid x_t, c)\,\omega_i^\top r(x_0,c)\right]$, where $r(x_0,c)$ is the multi-dimensional reward (see the warm-up sketch after this list).
  - Every agent samples and updates independently. We only update the LoRA layers and share the pretrained UNet.
  - Store $(\text{F}(\pi_i), \text{F}(\pi_i^{'}), \omega_i)$ in the history $\mathcal{R}$ for later prediction.
- Evolutionary stage
  - Fit an improvement prediction model for each policy from the history data $\mathcal{R}$.
  - Sample $K$ candidate weights in the objective space. Given the $K \times N$ candidate points, we select $n$ of them that maximize hypervolume and minimize sparsity (see the selection sketch below).
  - We iteratively update the population and keep interacting with the environment using the new task set $\mathcal{T}=\{(\pi_i,\omega_i)\}_{i=1}^n$.
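A minimal sketch of the warm-up loop under the assumptions above: each agent owns its own LoRA adapter on the shared pretrained UNet and is updated on its scalarized reward. `evaluate_objectives`, `ddpo_update`, and `warmup_stage` are hypothetical stand-ins, not the actual functions in `mo_train.py`:

```python
import numpy as np

rng = np.random.default_rng(0)


def evaluate_objectives(policy) -> np.ndarray:
    """Stand-in for F(pi): roll out the policy and score the generated
    images on each objective (e.g. [compressibility, aesthetic score])."""
    return rng.normal(size=2)  # dummy scores for illustration


def ddpo_update(policy, omega: np.ndarray):
    """Stand-in for one DDPO-style update of the agent's LoRA weights on
    the scalarized reward omega^T r(x_0, c). Returns the updated policy."""
    return policy  # the real update only touches the agent's LoRA layers


def warmup_stage(n_agents: int = 4, updates_per_agent: int = 3):
    # Evenly distributed preference weights over two objectives.
    omegas = [np.array([w, 1.0 - w]) for w in np.linspace(0.1, 0.9, n_agents)]
    policies = [f"lora_adapter_{i}" for i in range(n_agents)]  # placeholders
    history = []  # R: tuples (F(pi_i), F(pi_i'), omega_i) for the predictors

    for i, (pi, omega) in enumerate(zip(policies, omegas)):
        f_before = evaluate_objectives(pi)
        for _ in range(updates_per_agent):  # each agent trains independently
            pi = ddpo_update(pi, omega)
        f_after = evaluate_objectives(pi)
        history.append((f_before, f_after, omega))
        policies[i] = pi
    return policies, history


if __name__ == "__main__":
    _, history = warmup_stage()
    print(f"recorded {len(history)} entries in R")
```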
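For the candidate-selection step, here is a toy 2-objective sketch (not the repository code) of greedily choosing $n$ of the $K \times N$ predicted candidate points so as to maximize hypervolume while keeping sparsity low; `select_tasks`, `alpha`, and the reference point are illustrative assumptions:

```python
import numpy as np


def hypervolume_2d(points: np.ndarray, ref: np.ndarray) -> float:
    """Area dominated by `points` w.r.t. the reference point `ref`,
    assuming both objectives are maximized (2-D only, for illustration)."""
    pts = points[np.all(points > ref, axis=1)]
    if len(pts) == 0:
        return 0.0
    pts = pts[np.argsort(-pts[:, 0])]  # sweep from largest first objective
    hv, best_y = 0.0, ref[1]
    for x, y in pts:
        if y > best_y:
            hv += (x - ref[0]) * (y - best_y)
            best_y = y
    return hv


def sparsity(points: np.ndarray) -> float:
    """Average squared gap between neighbouring points along each objective
    (lower means a denser, more evenly spread front)."""
    if len(points) < 2:
        return 0.0
    gaps = sum(np.sum(np.diff(np.sort(points[:, j])) ** 2)
               for j in range(points.shape[1]))
    return gaps / (len(points) - 1)


def select_tasks(candidates: np.ndarray, n: int, ref: np.ndarray,
                 alpha: float = 0.1) -> list[int]:
    """Greedily pick n of the K*N candidate points, trading off hypervolume
    (maximize) against sparsity (minimize)."""
    chosen: list[int] = []
    remaining = list(range(len(candidates)))
    for _ in range(n):
        scores = [hypervolume_2d(candidates[chosen + [i]], ref)
                  - alpha * sparsity(candidates[chosen + [i]])
                  for i in remaining]
        best = remaining[int(np.argmax(scores))]
        chosen.append(best)
        remaining.remove(best)
    return chosen


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy predicted objective values for K*N candidate (policy, weight) pairs.
    candidates = rng.uniform(size=(24, 2))
    print("selected candidates:",
          select_tasks(candidates, n=4, ref=np.array([0.0, 0.0])))
```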
Here we experiment with two reward combinations: Aesthetic Score + Compressibility, and Aesthetic Score + Incompressibility.
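For reference, the compressibility objectives can be scored from the JPEG file size of the generated image, as in the DDPO reward functions we build on; the sketch below is a simplified stand-in (the `quality=95` setting and the function names are assumptions), and the aesthetic score, which comes from a learned aesthetic predictor, is omitted here:

```python
import io

import numpy as np
from PIL import Image


def jpeg_compressibility(image: Image.Image, quality: int = 95) -> float:
    """Reward images that compress well: negative JPEG size in kilobytes."""
    buf = io.BytesIO()
    image.save(buf, format="JPEG", quality=quality)
    return -buf.tell() / 1024.0


def jpeg_incompressibility(image: Image.Image, quality: int = 95) -> float:
    """The opposite objective: reward images that resist JPEG compression."""
    return -jpeg_compressibility(image, quality)


if __name__ == "__main__":
    noise = Image.fromarray(
        np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8))
    print(jpeg_compressibility(noise), jpeg_incompressibility(noise))
```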
- Computation Efficiency: PGMORL is an evolutionary algorithm that needs a large population to search the objective space. Since diffusion models are computationally expensive, it is difficult to fit even a few agents into GPU memory, and the limited memory also prevents us from training agents in parallel. In a single epoch we therefore train the $n$ agents sequentially, which increases training time.
- Sample Efficiency: DDPO requires a large number of samples to reach a high reward. Is it the best way to fine-tune a powerful pretrained DDPM?