From 8506d5e5972685df1a9fca85703b35149ce83ff6 Mon Sep 17 00:00:00 2001
From: Nelson Alves
Date: Sun, 1 Nov 2020 18:01:11 -0300
Subject: [PATCH] :poop: Add PPO notebook

---
 .vscode/settings.json                         |   3 +
 .../Actor Critic/PPO/PPO.ipynb"               | 438 ++++++++++++++++++
 2 files changed, 441 insertions(+)
 create mode 100644 .vscode/settings.json
 create mode 100644 "Aprendizado por Refor\303\247o Profundo/Actor Critic/PPO/PPO.ipynb"

diff --git a/.vscode/settings.json b/.vscode/settings.json
new file mode 100644
index 0000000..ccbee1c
--- /dev/null
+++ b/.vscode/settings.json
@@ -0,0 +1,3 @@
+{
+    "python.pythonPath": "/home/nelson/anaconda3/envs/torch/bin/python"
+}
\ No newline at end of file
diff --git "a/Aprendizado por Refor\303\247o Profundo/Actor Critic/PPO/PPO.ipynb" "b/Aprendizado por Refor\303\247o Profundo/Actor Critic/PPO/PPO.ipynb"
new file mode 100644
index 0000000..250a364
--- /dev/null
+++ "b/Aprendizado por Refor\303\247o Profundo/Actor Critic/PPO/PPO.ipynb"
@@ -0,0 +1,438 @@
+{
+ "metadata": {
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5-final"
+  },
+  "orig_nbformat": 2,
+  "kernelspec": {
+   "name": "Python 3.8.5 64-bit ('torch': conda)",
+   "display_name": "Python 3.8.5 64-bit ('torch': conda)",
+   "metadata": {
+    "interpreter": {
+     "hash": "a5cd74ba85a3b6a037c59ac3f3634fcdd9437555c9fe253dd51f04000fcd493e"
+    }
+   }
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2,
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Proximal Policy Optimization (PPO)\n",
+    "\n",
+    "As we saw in the A2C lecture, a commonly used objective function is:\n",
+    "\n",
+    "$$\n",
+    "    J(\\theta) = \\mathbb{E}_{s,a\\sim\\pi_\\theta} [A^{\\pi_\\theta}_w(s,a)], \\qquad\n",
+    "    \\nabla_\\theta J(\\theta) = \\mathbb{E}_{s,a\\sim\\pi_\\theta} [\\nabla_\\theta \\log \\pi_\\theta(a|s)\\cdot A^{\\pi_\\theta}_w(s,a)].\n",
+    "$$\n",
+    "\n",
+    "The indices on the _advantage_ function $A$ indicate that $A$ depends both on the weights $w$ used to estimate the value of each state and on the policy $\\pi_\\theta$, which determines which trajectories the agent follows in the environment.\n",
+    "\n",
+    "> Note: it can be shown that this formulation is equivalent to the one that uses sums over time:\n",
+    "$$\n",
+    "    J(\\theta) = \\mathbb{E}_{(s_0,a_0,\\dots)\\sim\\pi_\\theta} \\left[\\sum_{t=0}^\\infty \\gamma^t A^{\\pi_\\theta}_w(s_t,a_t)\\right], \\qquad\n",
+    "    \\nabla_\\theta J(\\theta) = \\mathbb{E}_{(s_0,a_0,\\dots)\\sim\\pi_\\theta} \\left[\\sum_{t=0}^\\infty \\nabla_\\theta \\log \\pi_\\theta(a_t|s_t)\\cdot A^{\\pi_\\theta}_w(s_t,a_t)\\right].\n",
+    "$$\n",
+    "\n",
+    "Note that a small change in parameter space ($\\Delta\\theta = \\alpha\\nabla_\\theta J$) can cause a large change in policy space. This means that, in general, the learning rate $\\alpha$ cannot be very high; otherwise, we risk ending up with a new policy that does not work. Consequently, the sample efficiency of A2C is also limited.\n",
+    "\n",
+    "\n",
+    "## Trust Region Policy Optimization (TRPO)\n",
+    "\n",
+    "One way to address this problem is to limit how much the policy can change. To do so, we use the KL divergence $KL(\\pi_1 || \\pi_2)$, which can be loosely viewed as a measure of the difference between two policies (or, more generally, between two probability distributions).\n",
+    "\n",
+    "TRPO defines a trust region to ensure that the new policy does not move too far from the old one:\n",
+    "$$E_{s\\sim\\pi_{\\theta_{\\mathrm{old}}}}\\bigl[KL\\bigl(\\pi_{\\theta_{\\mathrm{old}}}(\\cdot|s)\\,||\\,\\pi_\\theta(\\cdot|s)\\bigr)\\bigr] \\le \\delta.$$\n",
+    "\n",
+    "However, maximizing the A2C objective subject to this constraint is somewhat complicated. So we use an approximation of the A2C objective:\n",
+    "\n",
+    "$$L(\\theta_{\\mathrm{old}},\\theta) = E_{s,a\\sim\\pi_{\\theta_{\\mathrm{old}}}} \\left[\\frac{\\pi_\\theta(a|s)}{\\pi_{\\theta_{\\mathrm{old}}}(a|s)} A^{\\pi_{\\theta_{\\mathrm{old}}}}(s,a)\\right].$$\n",
+    "\n",
+    "In other words, TRPO consists of:\n",
+    "$$\\text{maximize } E_{s,a\\sim\\pi_{\\theta_{\\mathrm{old}}}} \\left[\\frac{\\pi_\\theta(a|s)}{\\pi_{\\theta_{\\mathrm{old}}}(a|s)} A^{\\pi_{\\theta_{\\mathrm{old}}}}(s,a)\\right] \\text{ subject to } E_{s\\sim\\pi_{\\theta_{\\mathrm{old}}}}\\bigl[KL\\bigl(\\pi_{\\theta_{\\mathrm{old}}}(\\cdot|s)\\,||\\,\\pi_\\theta(\\cdot|s)\\bigr)\\bigr] \\le \\delta.$$\n",
+    "\n",
+    "> To see why $L(\\theta_{\\mathrm{old}},\\theta)$ is an approximation of $J(\\theta)$, we can write:\n",
+    "\\begin{align*}\n",
+    "J(\\theta) &= E_{\\pi_\\theta}[A^{\\pi_\\theta}(s,a)] \\\\\n",
+    "          &\\approx E_{\\pi_\\theta}[A^{\\pi_{\\theta_{\\mathrm{old}}}}(s,a)] \\\\\n",
+    "          &= \\sum_{s,a} \\rho_{\\pi_\\theta}(s)\\cdot \\pi_\\theta(a|s) \\cdot A^{\\pi_{\\theta_{\\mathrm{old}}}}(s,a) \\\\\n",
+    "          &= \\sum_{s,a} \\rho_{\\pi_\\theta}(s)\\cdot \\pi_{\\theta_{\\mathrm{old}}}(a|s) \\cdot \\frac{\\pi_\\theta(a|s)}{\\pi_{\\theta_{\\mathrm{old}}}(a|s)}A^{\\pi_{\\theta_{\\mathrm{old}}}}(s,a) \\\\\n",
+    "          &\\approx \\sum_{s,a} \\rho_{\\pi_{\\theta_{\\mathrm{old}}}}(s)\\cdot \\pi_{\\theta_{\\mathrm{old}}}(a|s) \\cdot \\frac{\\pi_\\theta(a|s)}{\\pi_{\\theta_{\\mathrm{old}}}(a|s)}A^{\\pi_{\\theta_{\\mathrm{old}}}}(s,a) \\\\\n",
+    "          &= E_{\\pi_{\\theta_{\\mathrm{old}}}} \\left[\\frac{\\pi_\\theta(a|s)}{\\pi_{\\theta_{\\mathrm{old}}}(a|s)} A^{\\pi_{\\theta_{\\mathrm{old}}}}(s,a)\\right]\n",
+    "\\end{align*}\n",
+    "\n",
+    "\n",
+    "## Proximal Policy Optimization (PPO)\n",
+    "\n",
+    "As already mentioned, the constraint ($KL \\le \\delta$) imposed in TRPO makes the algorithm relatively complicated. PPO is an attempt to simplify it. Instead of using trust regions, PPO modifies the objective function directly:\n",
+    "\n",
+    "$$\n",
+    "    L(\\theta_{\\mathrm{old}},\\theta) = E_{s,a\\sim\\pi_{\\theta_{\\mathrm{old}}}} \\Bigl[\\min\\left(r A^{\\pi_{\\theta_{\\mathrm{old}}}}(s,a),\\, \\operatorname{clip}(r,1-\\varepsilon,1+\\varepsilon) A^{\\pi_{\\theta_{\\mathrm{old}}}}(s,a)\\right)\\Bigr],\n",
+    "    \\quad\n",
+    "    r = \\frac{\\pi_\\theta(a|s)}{\\pi_{\\theta_{\\mathrm{old}}}(a|s)}.\n",
+    "$$\n",
+    "This function can be rewritten as:\n",
+    "$$\n",
+    "    L(\\theta_{\\mathrm{old}},\\theta) = E_{s,a\\sim\\pi_{\\theta_{\\mathrm{old}}}} \\Bigl[\\min\\left(r A^{\\pi_{\\theta_{\\mathrm{old}}}}(s,a),\\, g(\\varepsilon, A^{\\pi_{\\theta_{\\mathrm{old}}}}(s,a))\\right)\\Bigr],\n",
+    "    \\quad\n",
+    "    g(\\varepsilon, A) = \\begin{cases}\n",
+    "        (1+\\varepsilon) A, & A \\ge 0 \\\\\n",
+    "        (1-\\varepsilon) A, & A < 0.\n",
+    "    \\end{cases}\n",
+    "$$\n",
+    "\n",
+    "Note that:\n",
+    "- When the advantage is positive, $L$ increases as $r$ increases. However, this benefit is limited by the clip: once $r > 1+\\varepsilon$, there is no further benefit from increasing $r$.\n",
+    "- When the advantage is negative, $L$ increases as $r$ decreases. However, this benefit is limited by the clip: once $r < 1-\\varepsilon$, there is no further benefit from decreasing $r$."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Split Network"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "import torch.nn as nn\n",
+    "import torch.nn.functional as F\n",
+    "from torch.distributions import Categorical\n",
+    "class ActorCritic(nn.Module):\n",
+    "    def __init__(self, observation_shape, action_shape):\n",
+    "        super(ActorCritic, self).__init__()\n",
+    "        self.policy1 = nn.Linear(observation_shape, 64)  # policy (actor) layers\n",
+    "        self.policy2 = nn.Linear(64, 64)\n",
+    "        self.policy3 = nn.Linear(64, action_shape)\n",
+    "        \n",
+    "        self.value1 = nn.Linear(observation_shape, 64)  # value (critic) layers\n",
+    "        self.value2 = nn.Linear(64, 64)\n",
+    "        self.value3 = nn.Linear(64, 1)\n",
+    "\n",
+    "    def forward(self, state):\n",
+    "        dists = torch.tanh(self.policy1(state))\n",
+    "        dists = torch.tanh(self.policy2(dists))\n",
+    "        dists = F.softmax(self.policy3(dists), dim=-1)\n",
+    "        probs = Categorical(dists)  # distribution over discrete actions\n",
+    "        \n",
+    "        v = torch.tanh(self.value1(state))\n",
+    "        v = torch.tanh(self.value2(v))\n",
+    "        v = self.value3(v)\n",
+    "        return probs, v"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Experience Replay"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "\n",
+    "class ExperienceReplay:\n",
+    "    \"\"\"On-policy Experience Replay buffer used by the PPO agent.\"\"\"\n",
+    "    def __init__(self, max_length, observation_space):\n",
+    "        \"\"\"Create a Replay Buffer.\n",
+    "\n",
+    "        Parameters\n",
+    "        ----------\n",
+    "        max_length: int\n",
+    "            Maximum size of the Replay Buffer.\n",
+    "        observation_space: int\n",
+    "            Size of the observation space.\n",
+    "        \"\"\"\n",
+    "        self.length = 0\n",
+    "        self.max_length = max_length\n",
+    "\n",
+    "        self.states = np.zeros((max_length, observation_space), dtype=np.float32)\n",
+    "        self.actions = np.zeros((max_length), dtype=np.int32)\n",
+    "        self.rewards = np.zeros((max_length), dtype=np.float32)\n",
+    "        self.next_states = np.zeros((max_length, observation_space), dtype=np.float32)\n",
+    "        self.dones = np.zeros((max_length), dtype=np.float32)\n",
+    "        self.logp = np.zeros((max_length), dtype=np.float32)\n",
+    "\n",
+    "    def update(self, states, actions, rewards, next_states, dones, logp):\n",
+    "        \"\"\"Add one experience to the Replay Buffer.\n",
+    "\n",
+    "        Parameters\n",
+    "        ----------\n",
+    "        states: np.array\n",
+    "            State of the transition.\n",
+    "        actions: int\n",
+    "            Action taken.\n",
+    "        rewards: float\n",
+    "            Reward received.\n",
+    "        next_states: np.array\n",
+    "            Next state.\n",
+    "        dones: int\n",
+    "            Flag indicating whether the episode has ended.\n",
+    "        logp: float\n",
+    "            Log-probability of the action under the policy that took it.\n",
+    "        \"\"\"\n",
+    "        self.states[self.length] = states\n",
+    "        self.actions[self.length] = actions\n",
+    "        self.rewards[self.length] = rewards\n",
+    "        self.next_states[self.length] = next_states\n",
+    "        self.dones[self.length] = dones\n",
+    "        self.logp[self.length] = logp\n",
+    "        self.length += 1\n",
+    "\n",
+    "    def sample(self):\n",
+    "        \"\"\"Return the whole buffer as one batch of experiences and reset it.\n",
+    "        \n",
+    "        Returns\n",
+    "        -------\n",
+    "        states: np.array\n",
+    "            Batch of states.\n",
+    "        actions: np.array\n",
+    "            Batch of actions.\n",
+    "        rewards: np.array\n",
+    "            Batch of rewards.\n",
+    "        next_states: np.array\n",
+    "            Batch of next states.\n",
+    "        dones: np.array\n",
+    "            Batch of flags indicating whether the episode has ended.\n",
+    "        logp: np.array\n",
+    "            Batch of log-probabilities of the actions taken.\n",
+    "        \"\"\"\n",
+    "        # Reset the write index so the buffer is refilled from the start.\n",
+    "        self.length = 0\n",
+    "\n",
+    "        return (self.states, self.actions, self.rewards, self.next_states, self.dones, self.logp)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "import torch.optim as optim\n",
+    "\n",
+    "class PPO:\n",
+    "    def __init__(self, observation_space, action_space, lr=7e-4, gamma=0.99, lam=0.95, vf_coef=0.5, entropy_coef=0.005, clip_param=0.2, epochs=10, n_steps=5):\n",
+    "        self.device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
+    "\n",
+    "        self.gamma = gamma\n",
+    "        self.lam = lam\n",
+    "        self.vf_coef = vf_coef\n",
+    "        self.entropy_coef = entropy_coef\n",
+    "        self.clip_param = clip_param\n",
+    "        self.epochs = epochs\n",
+    "\n",
+    "        self.n_steps = n_steps\n",
+    "        self.memory = ExperienceReplay(n_steps, observation_space.shape[0])\n",
+    "\n",
+    "        self.actorcritic = ActorCritic(observation_space.shape[0], action_space.n).to(self.device)\n",
+    "        self.actorcritic_optimizer = optim.Adam(self.actorcritic.parameters(), lr=lr)\n",
+    "\n",
+    "    def act(self, state):\n",
+    "        state = torch.FloatTensor(state).to(self.device).unsqueeze(0)\n",
+    "        probs, _ = self.actorcritic.forward(state)\n",
+    "        action = probs.sample()\n",
+    "        log_prob = probs.log_prob(action)\n",
+    "        return action.cpu().detach().item(), log_prob\n",
+    "\n",
+    "    def remember(self, state, action, reward, next_state, done, logp):\n",
+    "        self.memory.update(state, action, reward, next_state, done, logp)\n",
+    "\n",
+    "    def compute_gae(self, rewards, dones, v, v2):\n",
+    "        T = len(rewards)\n",
+    "\n",
+    "        returns = torch.zeros_like(rewards)\n",
+    "        gaes = torch.zeros_like(rewards)\n",
+    "        \n",
+    "        future_gae = torch.zeros(1, dtype=rewards.dtype, device=rewards.device)\n",
+    "        next_return = v2[-1].detach().clone()  # bootstrap from the last next-state value\n",
+    "\n",
+    "        not_dones = 1 - dones\n",
+    "        deltas = rewards + not_dones * self.gamma * v2 - v.detach()  # TD errors, no gradient\n",
+    "\n",
+    "        for t in reversed(range(T)):\n",
+    "            returns[t] = next_return = rewards[t] + self.gamma * not_dones[t] * next_return\n",
+    "            gaes[t] = future_gae = deltas[t] + self.gamma * self.lam * not_dones[t] * future_gae\n",
+    "\n",
+    "        gaes = (gaes - gaes.mean()) / (gaes.std() + 1e-8)  # Normalization\n",
+    "\n",
+    "        return gaes, returns\n",
+    "\n",
+    "    def train(self):\n",
+    "        if self.memory.length < self.n_steps:\n",
+    "            return\n",
+    "\n",
+    "        (states, actions, rewards, next_states, dones, old_logp) = self.memory.sample()\n",
+    "\n",
+    "        states = torch.FloatTensor(states).to(self.device)\n",
+    "        actions = torch.as_tensor(actions, dtype=torch.long, device=self.device)  # discrete action indices\n",
+    "        rewards = torch.FloatTensor(rewards).unsqueeze(-1).to(self.device)\n",
+    "        next_states = torch.FloatTensor(next_states).to(self.device)\n",
+    "        dones = torch.FloatTensor(dones).unsqueeze(-1).to(self.device)\n",
+    "        old_logp = torch.FloatTensor(old_logp).to(self.device)\n",
+    "\n",
+    "        for epoch in range(self.epochs):\n",
+    "            probs, v = self.actorcritic.forward(states)\n",
+    "            with torch.no_grad():\n",
+    "                _, v2 = self.actorcritic.forward(next_states)\n",
+    "\n",
+    "            new_logp = probs.log_prob(actions)\n",
+    "\n",
+    "            advantages, returns = self.compute_gae(rewards, dones, v, v2)\n",
+    "\n",
+    "            ratio = (new_logp.unsqueeze(-1) - old_logp.unsqueeze(-1)).exp()  # pi_theta / pi_theta_old\n",
+    "            surr1 = ratio * advantages.detach()\n",
+    "            surr2 = torch.clamp(ratio, 1.0 - self.clip_param, 1.0 + self.clip_param) * advantages.detach()\n",
+    "\n",
+    "            entropy = probs.entropy().mean()\n",
+    "\n",
+    "            policy_loss = -torch.min(surr1, surr2).mean()  # clipped surrogate objective\n",
+    "            value_loss = self.vf_coef * F.mse_loss(v, returns.detach())\n",
+    "            entropy_loss = -self.entropy_coef * entropy\n",
+    "\n",
+    "            self.actorcritic_optimizer.zero_grad()\n",
+    "            (policy_loss + entropy_loss + value_loss).backward()\n",
+    "            self.actorcritic_optimizer.step()\n",
+    "\n",
+    "        return (policy_loss + entropy_loss + value_loss).item()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Training"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import math\n",
+    "from collections import deque\n",
+    "\n",
+    "def train(agent, env, total_timesteps):\n",
+    "    total_reward = 0\n",
+    "    episode_returns = deque(maxlen=20)\n",
+    "    avg_returns = []\n",
+    "\n",
+    "    state = env.reset()\n",
+    "    timestep = 0\n",
+    "    episode = 0\n",
+    "\n",
+    "    while timestep < total_timesteps:\n",
+    "        action, log_prob = agent.act(state)\n",
+    "        next_state, reward, done, _ = env.step(action)\n",
+    "        agent.remember(state, action, reward, next_state, done, log_prob.item())\n",
+    "        loss = agent.train()\n",
+    "        timestep += 1\n",
+    "\n",
+    "        total_reward += reward\n",
+    "\n",
+    "        if done:\n",
+    "            episode_returns.append(total_reward)\n",
+    "            episode += 1\n",
+    "            next_state = env.reset()\n",
+    "\n",
+    "        if episode_returns:\n",
+    "            avg_returns.append(np.mean(episode_returns))\n",
+    "\n",
+    "        total_reward *= 1 - done\n",
+    "        state = next_state\n",
+    "\n",
+    "        ratio = math.ceil(100 * timestep / total_timesteps)\n",
+    "\n",
+    "        avg_return = avg_returns[-1] if avg_returns else np.nan\n",
+    "        \n",
+    "        print(f\"\\r[{ratio:3d}%] timestep = {timestep}/{total_timesteps}, episode = {episode:3d}, avg_return = {avg_return:10.4f}\", end=\"\")\n",
+    "\n",
+    "    return avg_returns"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "output_type": "stream",
+     "name": "stdout",
+     "text": [
+      "[100%] timestep = 75000/75000, episode = 480, avg_return = 265.2000"
+     ]
+    }
+   ],
+   "source": [
+    "import gym\n",
+    "\n",
+    "env = gym.make(\"CartPole-v1\")\n",
+    "agente = PPO(env.observation_space, env.action_space)\n",
+    "returns = train(agente, env, 75000)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "output_type": "display_data",
+     "data": {
+      "text/plain": "",
+      "image/svg+xml": "<!-- Matplotlib v3.3.2 figure, created 2020-11-01T17:56:04.930085, https://matplotlib.org/ -->",
+      "image/png": "\n"
+     },
+     "metadata": {
+      "needs_background": "light"
+     }
+    }
+   ],
+   "source": [
+    "import matplotlib.pyplot as plt\n",
+    "\n",
+    "plt.plot(returns, 'r')\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "👷 Something is probably wrong here 👷"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ]
+}
\ No newline at end of file
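
A minimal sketch, in plain Python, of the per-sample clipped objective described in the notebook's PPO section. It is an illustration only and not part of the patch: the function names `clipped_objective`, `rewritten_objective` and `g`, and the sample ratios and advantages, are hypothetical. It lets one check numerically that the `min`/`clip` form and the `g(ε, A)` form agree, and that the objective stops improving once `r` leaves `[1 - ε, 1 + ε]` in the favorable direction.

```python
def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(x, hi))


def clipped_objective(r, adv, eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    return min(r * adv, clip(r, 1.0 - eps, 1.0 + eps) * adv)


def g(eps, adv):
    """Cap used in the rewritten form: (1 + eps) * A if A >= 0, else (1 - eps) * A."""
    return (1.0 + eps) * adv if adv >= 0 else (1.0 - eps) * adv


def rewritten_objective(r, adv, eps=0.2):
    """Equivalent per-sample form: min(r * A, g(eps, A))."""
    return min(r * adv, g(eps, adv))


if __name__ == "__main__":
    # Hypothetical probability ratios and advantages, chosen only for illustration.
    for adv in (1.0, -1.0):
        for r in (0.5, 0.8, 1.0, 1.2, 1.5):
            print(f"A={adv:+.1f} r={r:.1f} L={clipped_objective(r, adv):+.2f}")
```

With `eps = 0.2`, the value is capped at `(1 + eps) * A` for a positive advantage and at `(1 - eps) * A` for a negative one, which is the behavior the two bullet points in the notebook describe.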