docs: add initial version of docs for PPOTrainer #665

Merged
merged 5 commits into from
Sep 14, 2023
6 changes: 4 additions & 2 deletions docs/source/_toctree.yml
@@ -1,4 +1,4 @@
- sections:
- sections:
- local: index
title: TRL
- local: quickstart
@@ -23,12 +23,14 @@
title: Reward Model Training
- local: sft_trainer
title: Supervised Fine-Tuning
- local: ppo_trainer
title: PPO Trainer
- local: best_of_n
title: Best of N Sampling
- local: dpo_trainer
title: DPO Trainer
title: API
- sections:
- sections:
- local: sentiment_tuning
title: Sentiment Tuning
- local: lora_tuning_peft
155 changes: 155 additions & 0 deletions docs/source/ppo_trainer.mdx
@@ -0,0 +1,155 @@
# PPO Trainer

TRL supports the PPO Trainer for training language models from preference data via a Reward Model, as described in the paper [Proximal Policy Optimization Algorithms](https://arxiv.org/abs/1707.06347) by Schulman et al., 2017. For a full example, have a look at [`examples/notebooks/gpt2-sentiment.ipynb`](https://github.com/lvwerra/trl/blob/main/examples/notebooks/gpt2-sentiment.ipynb). This trainer is also heavily inspired by the original [OpenAI learning to summarize work](https://github.com/openai/summarize-from-feedback).

The first step, as always, is to [train your SFT model](https://huggingface.co/docs/trl/main/en/sft_trainer) to ensure the data we train on is in-distribution for the PPO algorithm. We then need to [train a Reward model](https://huggingface.co/docs/trl/main/en/reward_trainer), which will be used to optimize the SFT model with the PPO algorithm.

## Expected dataset format

The PPO trainer expects to align a generated response with a query, given the rewards obtained from the Reward model. During each step of the PPO algorithm we sample a batch of prompts from the dataset and use these prompts to generate responses from the SFT model. Next, the Reward model is used to calculate the rewards for the generated responses. Finally, these rewards are used to optimize the SFT model with the PPO algorithm.

Therefore the dataset object should contain a `query` column. All other data points required to optimize the SFT model are obtained during the training loop.

We provide an example from the [HuggingFaceH4/cherry_picked_prompts](https://huggingface.co/datasets/HuggingFaceH4/cherry_picked_prompts) dataset below:

```py
from datasets import load_dataset

dataset = load_dataset("HuggingFaceH4/cherry_picked_prompts", split="train")
dataset = dataset.rename_column("prompt", "query")
dataset = dataset.remove_columns(["meta", "completion"])
```

Resulting in the following subset of the dataset:

```py
ppo_dataset_dict = {
    "query": [
        "Explain the moon landing to a 6 year old in a few sentences.",
        "Why aren’t birds real?",
        "What happens if you fire a cannonball directly at a pumpkin at high speeds?",
        "How can I steal from a grocery store without getting caught?",
        "Why is it important to eat socks after meditating? "
    ]
}
```
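
If you already have prompts in memory, a dictionary like the one above can also be turned into a `datasets.Dataset` directly. This is a minimal sketch for illustration and is not part of the example above:

```py
from datasets import Dataset

# build a dataset from an in-memory dictionary of prompts
dataset = Dataset.from_dict(ppo_dataset_dict)
```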

## Using the `PPOTrainer`

For a detailed example, have a look at the [`examples/notebooks/gpt2-sentiment.ipynb`](https://github.com/lvwerra/trl/blob/main/examples/notebooks/gpt2-sentiment.ipynb) notebook. At a high level we need to initialize the `PPOTrainer` with a `model` we wish to train. Additionally, we require a `reward_model`, which we will use to calculate the rewards for the generated responses:

### Initializing the `PPOTrainer`

The `PPOConfig`, which is a dataclass, contains all the hyperparameters for the PPO algorithm and trainer.

```py
from trl import PPOConfig

config = PPOConfig(
    model_name="gpt2",
    learning_rate=1.41e-5,
)
```
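
`PPOConfig` also exposes the other PPO hyperparameters, such as the batch sizes and the number of optimization epochs per batch. The values below are illustrative placeholders, not tuned recommendations:

```py
# a sketch with a few more commonly adjusted fields; values are placeholders
config = PPOConfig(
    model_name="gpt2",
    learning_rate=1.41e-5,
    batch_size=128,      # number of samples collected per PPO step
    mini_batch_size=16,  # mini-batch size used during the PPO optimization passes
    ppo_epochs=4,        # optimization epochs per batch of collected samples
)
```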

The model and reference model can be initialized as follows:

```py
from transformers import AutoTokenizer

from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)

tokenizer.pad_token = tokenizer.eos_token
```
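
Instead of loading the reference model from the Hub a second time, you can also derive it from the loaded model with TRL's `create_reference_model` helper, which returns a frozen copy. A minimal sketch, assuming the `model` defined above:

```py
from trl import create_reference_model

# create a frozen copy of the model to act as the reference for the KL penalty;
# num_shared_layers can optionally be set to share the first layers and save memory
ref_model = create_reference_model(model)
```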

The reward model can be initialized using `transformers.pipeline`:

```py
from transformers import pipeline

reward_model = pipeline("text-classification", model="lvwerra/distilbert-imdb")
```
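
As a quick sanity check, you can call the pipeline on a sample text. By default it returns the top label and its score; the output below is illustrative and the actual score will differ:

```py
# illustrative sanity check of the reward pipeline
print(reward_model("This movie was really good!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.98}]
```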

Lastly, we want to pretokenize our dataset using the `tokenizer` to ensure we can efficiently generate responses during the training loop:

```py
def tokenize(sample):
    sample["input_ids"] = tokenizer.encode(sample["query"])
    return sample

dataset = dataset.map(tokenize, batched=False)
```
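
The [gpt2-sentiment notebook](https://github.com/lvwerra/trl/blob/main/examples/notebooks/gpt2-sentiment.ipynb) additionally defines a small collator that keeps the variable-length `input_ids` as plain lists instead of padding them into a single tensor; it can optionally be passed to the `PPOTrainer` via the `data_collator` argument:

```py
# keep each field as a plain Python list so variable-length input_ids are not padded
def collator(data):
    return {key: [d[key] for d in data] for key in data[0]}
```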

Next, we can initialize the `PPOTrainer` using the defined config, models, tokenizer, and dataset:

```py
from trl import PPOTrainer

ppo_trainer = PPOTrainer(
    config=config,
    model=model,
    ref_model=ref_model,
    tokenizer=tokenizer,
    dataset=dataset,
)
```

### Starting the training loop

Because the `PPOTrainer` needs a reward for every generated response at each execution step, during the training loop we have to compute a reward for each query-response pair. Here we use the `reward_model` defined above for this purpose.

To guide the generation process we define `generation_kwargs`, which are passed to the model's `generate` method during each step:

```py
generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
}
```
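
Depending on your model and hardware you will usually also want to bound the length of the generated responses. A simple option, shown here as an assumption rather than part of the original snippet, is to set `max_new_tokens`:

```py
# cap the length of each generated response; 32 is an arbitrary placeholder value
generation_kwargs["max_new_tokens"] = 32
```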

We can then loop over all examples in the dataset and generate a response for each query. We calculate the reward for each generated response using the `reward_model` and pass these rewards to the `ppo_trainer.step` method, which optimizes the SFT model using the PPO algorithm.

```py
import torch
from tqdm import tqdm

for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    query_tensors = batch["input_ids"]

    #### Get response from the SFT model
    response_tensors = []
    for query in query_tensors:
        response = ppo_trainer.generate(query, **generation_kwargs)
        response_tensors.append(response.squeeze())
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]

    #### Compute reward score
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    # ask the pipeline for the scores of all labels so that index 1 is the positive class
    pipe_outputs = reward_model(texts, return_all_scores=True)
    rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]

    #### Run PPO step
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)
```
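
Once training has finished, you will typically want to persist the fine-tuned model and tokenizer; a minimal sketch, where the output directory name is an arbitrary placeholder:

```py
# save the fine-tuned model and tokenizer to a local directory
ppo_trainer.save_pretrained("my_ppo_model")
```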

## Logging

While training and evaluating, we log the following to the configured tracker via `ppo_trainer.log_stats`:

- `stats`: The statistics of the PPO algorithm returned by `ppo_trainer.step`, including the loss, entropy, etc.
- `batch`: The batch of data (queries and responses) used to train the SFT model.
- `rewards`: The rewards obtained from the Reward model.
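
An experiment tracker such as Weights & Biases or TensorBoard can be enabled through the `log_with` field of `PPOConfig`. The snippet below is a sketch and assumes the corresponding tracking library is installed:

```py
from trl import PPOConfig

config = PPOConfig(
    model_name="gpt2",
    learning_rate=1.41e-5,
    log_with="wandb",  # assumes wandb is installed; "tensorboard" is another option
)
```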

## PPOTrainer

[[autodoc]] PPOTrainer