This repository contains the code for our paper "COMAL: A Convergent Meta-Algorithm for Aligning LLMs with General Preferences".
Please run `pip install -r requirements.txt` to install the required packages.
To run COMAL, you will likely need at least 4 GPUs with 48 GB of memory each. The code was tested on a machine with 8 NVIDIA A6000 Ada GPUs.
To run COMAL, please use the following command: `bash comal.sh`.
The script `comal.sh` performs iterative preference optimization with the following steps (sketched in the snippet after this list):
- Sampling candidate outputs from the LLM.
- Scoring the candidate outputs using the preference model.
- Data processing and precomputing the log probabilities of the output pairs.
- Training: updating the LLM using INPO.
- Evaluating the LLM.
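For orientation, below is a minimal Python sketch of one round of this loop. The script names are the ones listed later in this README, but the actual arguments, environment setup, and per-iteration bookkeeping live in `comal.sh`, so treat this as an outline rather than the real driver:

```python
# Minimal sketch of one COMAL round; see comal.sh for the actual
# arguments and any bookkeeping this sketch omits.
import subprocess

ROUND_STEPS = [
    ["python", "sampling.py"],         # sample candidate outputs from the LLM
    ["python", "scoring.py"],          # score output pairs with the preference model
    ["python", "data_processing.py"],  # turn annotations into training data
    ["python", "get_logprobs.py"],     # precompute log-probs of the output pairs
    ["python", "inpo.py"],             # update the LLM with INPO
    ["python", "eval.py"],             # evaluate the updated LLM
]

for step in ROUND_STEPS:
    subprocess.run(step, check=True)  # abort the round if any stage fails
```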
Below are the model checkpoints we trained with different iterative preference optimization methods: Iter-IPO is iterative IPO, INPO-Small is INPO with a small regularization coefficient, INPO-Large is INPO with a large regularization coefficient, and COMAL is our method.
The best checkpoint produced by each algorithm is listed below, along with the corresponding training round.
| Method | Checkpoint | Round |
| --- | --- | --- |
| COMAL | `yale-nlp/comal-qwen2-1.5b` | 7 |
| INPO-Small | `yale-nlp/comal-qwen2-1.5b-inpo-small` | 5 |
| INPO-Large | `yale-nlp/comal-qwen2-1.5b-inpo-large` | 4 |
The checkpoints produced at the end of each training round are provided below. Each training round consists of 6 training iterations. Please refer to the paper for more details.
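For convenience, the snippet below shows one way to load a released checkpoint, assuming it is a standard Hugging Face causal-LM checkpoint; the repository path comes from the table above, while the prompt is just an illustration:

```python
# Sketch: load a COMAL checkpoint from the Hugging Face Hub, assuming
# it is a standard causal-LM checkpoint with a bundled tokenizer.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "yale-nlp/comal-qwen2-1.5b"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "What is the capital of France?"  # illustrative prompt only
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```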
The repository contains the following files:

- `comal.sh`: Script for running COMAL.
- `data_processing.py`: Post-processes the preference model annotations into training data for COMAL.
- `data_utils.py`: Utility functions for loading training data.
- `eval.py`: Evaluation script for COMAL.
- `get_logprobs.py`: Extracts log probabilities from an LLM/policy.
- `losses.py`: Loss functions for training COMAL (an example loss is sketched after this list).
- `dpo.py`: DPO training.
- `ipo.py`: IPO training.
- `mle.py`: MLE training.
- `inpo.py`: INPO training.
- `sampling.py`: Samples candidate outputs from an LLM.
- `scoring.py`: Scores output pairs using a preference model.
- `utils.py`: Utility functions.
- `vllm_model.py`: vLLM model definition.
- `fsdp_config.yaml`: Configuration file for FSDP.
- `data/prompts`: The prompts used for training and evaluation. The prompts are from the UltraFeedback dataset; please cite their work if you use these prompts.
- `exps`: Results of the experiments. A new directory is created for each experiment, with the name specified in `comal.sh`.
- `prompts`: The prompt used for the preference model.
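As a concrete illustration of the kind of preference objectives implemented in `losses.py`, here is a generic sketch of the IPO loss (Azar et al.). This is written for this README and is not the repository's exact implementation; in particular, the INPO objective used by COMAL differs (see the paper), and the value of `tau` below is only an example:

```python
# Sketch of the IPO loss; generic implementation, not the repo's exact code.
import torch

def ipo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             tau: float = 0.1) -> torch.Tensor:
    """Squared-error IPO objective; tau is the regularization coefficient."""
    # Log-ratio margin between the chosen and rejected responses,
    # measured relative to the reference policy.
    margin = (policy_chosen_logps - ref_chosen_logps) - (
        policy_rejected_logps - ref_rejected_logps)
    # IPO regresses this margin toward 1 / (2 * tau).
    return ((margin - 1.0 / (2.0 * tau)) ** 2).mean()
```

The inputs are per-example sequence log-probabilities, such as those precomputed by `get_logprobs.py`.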