docs: add initial version of docs for PPOTrainer #665

Merged
merged 5 commits into from
Sep 14, 2023
6 changes: 4 additions & 2 deletions docs/source/_toctree.yml
@@ -1,4 +1,4 @@
- sections:
- sections:
- local: index
title: TRL
- local: quickstart
@@ -23,12 +23,14 @@
title: Reward Model Training
- local: sft_trainer
title: Supervised Fine-Tuning
- local: ppo_trainer
title: PPO Trainer
- local: best_of_n
title: Best of N Sampling
- local: dpo_trainer
title: DPO Trainer
title: API
- sections:
- sections:
- local: sentiment_tuning
title: Sentiment Tuning
- local: lora_tuning_peft
155 changes: 155 additions & 0 deletions docs/source/ppo_trainer.mdx
@@ -0,0 +1,155 @@
# PPO Trainer

TRL supports the PPO Trainer for training language models from preference data via a Reward Model, as described in the paper [Proximal Policy Optimization Algorithms](https://arxiv.org/abs/1707.06347) by Schulman et al., 2017. For a full example, have a look at [`examples/notebooks/gpt2-sentiment.ipynb`](https://github.com/lvwerra/trl/blob/main/examples/notebooks/gpt2-sentiment.ipynb). This trainer is also heavily inspired by the original [OpenAI learning to summarize work](https://github.com/openai/summarize-from-feedback).

The first step, as always, is to [train your SFT model](https://huggingface.co/docs/trl/main/en/sft_trainer) to ensure the data we train on is in-distribution for the PPO algorithm. We then need to [train a Reward model](https://huggingface.co/docs/trl/main/en/reward_trainer), which will be used to optimize the SFT model with the PPO algorithm.

## Expected dataset format

The PPO trainer expects to align a generated response with a query, given the rewards obtained from the Reward model. During each step of the PPO algorithm we sample a batch of prompts from the dataset and use these prompts to generate responses from the SFT model. Next, the Reward model is used to calculate the rewards for the generated responses. Finally, these rewards are used to optimize the SFT model with the PPO algorithm.

Therefore the dataset object should contain a `query` column. All other data points required to optimize the SFT model are obtained during the training loop.

We provide an example from the [HuggingFaceH4/cherry_picked_prompts](https://huggingface.co/datasets/HuggingFaceH4/cherry_picked_prompts) dataset below:

```py
from datasets import load_dataset

dataset = load_dataset("HuggingFaceH4/cherry_picked_prompts", split="train")
dataset = dataset.rename_column("prompt", "query")
dataset = dataset.remove_columns(["meta", "completion"])
```

Resulting in the following subset of the dataset:

```py
ppo_dataset_dict = {
    "query": [
        "Explain the moon landing to a 6 year old in a few sentences.",
        "Why aren’t birds real?",
        "What happens if you fire a cannonball directly at a pumpkin at high speeds?",
        "How can I steal from a grocery store without getting caught?",
        "Why is it important to eat socks after meditating? "
    ]
}
```
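
If you already have prompts in memory, a dictionary like the one above can also be turned into a `datasets.Dataset` directly. This is a minimal sketch for illustration and is not part of the example above:

```py
from datasets import Dataset

# build a dataset from an in-memory dictionary of prompts
dataset = Dataset.from_dict(ppo_dataset_dict)
```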

## Using the `PPOTrainer`

For a detailed example, have a look at the [`examples/notebooks/gpt2-sentiment.ipynb`](https://github.com/lvwerra/trl/blob/main/examples/notebooks/gpt2-sentiment.ipynb) notebook. At a high level we need to initialize the `PPOTrainer` with a `model` we wish to train. Additionally, we require a `reward_model`, which we will use to calculate the rewards for the generated responses:

### Initializing the `PPOTrainer`

The `PPOConfig`, which is a dataclass, contains all the hyperparameters for the PPO algorithm and trainer.

```py
from trl import PPOConfig

config = PPOConfig(
    model_name="gpt2",
    learning_rate=1.41e-5,
)
```
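
`PPOConfig` also exposes the other PPO hyperparameters, such as the batch sizes and the number of optimization epochs per batch. The values below are illustrative placeholders, not tuned recommendations:

```py
# a sketch with a few more commonly adjusted fields; values are placeholders
config = PPOConfig(
    model_name="gpt2",
    learning_rate=1.41e-5,
    batch_size=128,      # number of samples collected per PPO step
    mini_batch_size=16,  # mini-batch size used during the PPO optimization passes
    ppo_epochs=4,        # optimization epochs per batch of collected samples
)
```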

The model and reference model can be initialized as follows:

```py
from transformers import AutoTokenizer

from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)

tokenizer.pad_token = tokenizer.eos_token
```
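
Instead of loading the reference model from the Hub a second time, you can also derive it from the loaded model with TRL's `create_reference_model` helper, which returns a frozen copy. A minimal sketch, assuming the `model` defined above:

```py
from trl import create_reference_model

# create a frozen copy of the model to act as the reference for the KL penalty;
# num_shared_layers can optionally be set to share the first layers and save memory
ref_model = create_reference_model(model)
```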

The reward model can be initialized using `transformers.pipeline`:

```py
from transformers import pipeline

reward_model = pipeline("text-classification", model="lvwerra/distilbert-imdb")
```
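
As a quick sanity check, you can call the pipeline on a sample text. By default it returns the top label and its score; the output below is illustrative and the actual score will differ:

```py
# illustrative sanity check of the reward pipeline
print(reward_model("This movie was really good!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.98}]
```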

Lastly, we want to pretokenize our dataset using the `tokenizer` to ensure we can efficiently generate responses during the training loop:

```py
def tokenize(sample):
    sample["input_ids"] = tokenizer.encode(sample["query"])
    return sample

dataset = dataset.map(tokenize, batched=False)
```
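
The [gpt2-sentiment notebook](https://github.com/lvwerra/trl/blob/main/examples/notebooks/gpt2-sentiment.ipynb) additionally defines a small collator that keeps the variable-length `input_ids` as plain lists instead of padding them into a single tensor; it can optionally be passed to the `PPOTrainer` via the `data_collator` argument:

```py
# keep each field as a plain Python list so variable-length input_ids are not padded
def collator(data):
    return {key: [d[key] for d in data] for key in data[0]}
```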

Next, we can initialize the `PPOTrainer` using the defined config, models, tokenizer, and dataset:

```py
from trl import PPOTrainer

ppo_trainer = PPOTrainer(
    config=config,
    model=model,
    ref_model=ref_model,
    tokenizer=tokenizer,
    dataset=dataset,
)
```

### Starting the training loop

Because the `PPOTrainer` needs a reward for every generated response at each execution step, during the training loop we have to compute a reward for each query-response pair. Here we use the `reward_model` defined above for this purpose.

To guide the generation process we define `generation_kwargs`, which are passed to the model's `generate` method during each step:

```py
generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
}
```
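
Depending on your model and hardware you will usually also want to bound the length of the generated responses. A simple option, shown here as an assumption rather than part of the original snippet, is to set `max_new_tokens`:

```py
# cap the length of each generated response; 32 is an arbitrary placeholder value
generation_kwargs["max_new_tokens"] = 32
```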

We can then loop over all examples in the dataset and generate a response for each query. We calculate the reward for each generated response using the `reward_model` and pass these rewards to the `ppo_trainer.step` method, which optimizes the SFT model using the PPO algorithm.

```py
import torch
from tqdm import tqdm

for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    query_tensors = batch["input_ids"]

    #### Get response from the SFT model
    response_tensors = []
    for query in query_tensors:
        response = ppo_trainer.generate(query, **generation_kwargs)
        response_tensors.append(response.squeeze())
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]

    #### Compute reward score
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    # ask the pipeline for the scores of all labels so that index 1 is the positive class
    pipe_outputs = reward_model(texts, return_all_scores=True)
    rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]

    #### Run PPO step
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)
```
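
Once training has finished, you will typically want to persist the fine-tuned model and tokenizer; a minimal sketch, where the output directory name is an arbitrary placeholder:

```py
# save the fine-tuned model and tokenizer to a local directory
ppo_trainer.save_pretrained("my_ppo_model")
```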

## Logging

While training and evaluating, we log the following to the configured tracker via `ppo_trainer.log_stats`:

- `stats`: The statistics of the PPO algorithm returned by `ppo_trainer.step`, including the loss, entropy, etc.
- `batch`: The batch of data (queries and responses) used to train the SFT model.
- `rewards`: The rewards obtained from the Reward model.
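
An experiment tracker such as Weights & Biases or TensorBoard can be enabled through the `log_with` field of `PPOConfig`. The snippet below is a sketch and assumes the corresponding tracking library is installed:

```py
from trl import PPOConfig

config = PPOConfig(
    model_name="gpt2",
    learning_rate=1.41e-5,
    log_with="wandb",  # assumes wandb is installed; "tensorboard" is another option
)
```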

## PPOTrainer

[[autodoc]] PPOTrainer