reinforcement-learning/PolicyGradient at master · j-min/reinforcement-learning


Policy Gradient Methods

Learning Goals

Understand the difference between value-based and policy-based Reinforcement Learning
Understand the REINFORCE Algorithm (Monte Carlo Policy Gradient)
Understand Actor-Critic (AC) algorithms
Understand Advantage Functions
Understand Deterministic Policy Gradients (Optional)
Understand how to scale up Policy Gradient methods using asynchronous actor critic and Neural Networks (Optional)

Summary

Idea: Instead of parameterizing the value function and doing greedy policy improvement, we parameterize the policy and do gradient ascent on an objective, moving the parameters in a direction that improves the policy.
Sometimes the policy is easier to approximate than the value function. Also, we need a parameterized policy to deal with continuous action spaces and environments where we need to act stochastically.
Policy Objective Function J(theta): Intuitively, it measures how good our policy is. For example, we can use the average value or average reward under a policy as our objective.
Common choices for the policy function: a softmax over action preferences for discrete actions, a Gaussian with parameterized mean (and optionally variance) for continuous actions.
Policy Gradient Theorem: grad(J(theta)) = E[grad(log(pi(s, a))) * Q(s, a)], where the expectation is taken over states and actions encountered while following the policy. Basically, we move our policy in the direction of more reward.
REINFORCE (Monte Carlo Policy Gradient): We substitute a sampled return g_t from an episode for Q(s, a) to make an update. The resulting gradient estimate is unbiased but has high variance.
Baseline: Instead of measuring the absolute goodness of an action, we want to know how much better than "average" it is to take that action in a given state. E.g. some states are naturally bad and always give negative reward. This quantity is called the advantage and is defined as Q(s, a) - V(s). We use it for our policy update, e.g. g_t - V(s) for REINFORCE.
Actor Critic: Instead of waiting until the end of an episode as in REINFORCE, we use bootstrapping and make an update at each step. To do that we also train a Critic that approximates the value function, with its own parameters separate from the policy's. Now we have two function approximators: one for the policy (the actor), one for the critic. This is basically TD, but for Policy Gradients.
A good estimate of the advantage function in the Actor Critic algorithm is the TD error. Our update then becomes grad(J(theta)) = E[grad(log(pi(s, a))) * td_error].
Policy gradients can also be combined with TD(lambda), eligibility traces, and so on.
Deterministic Policy Gradients: Useful for high-dimensional continuous action spaces where stochastic policy gradients are expensive to compute. The idea is to update the policy in the direction of the gradient of the action-value function. To ensure exploration we can use an off-policy actor critic algorithm with added noise in action selection.
Deep Deterministic Policy Gradients: Apply tricks from DQN to Deterministic Policy Gradients ;)
Asynchronous Advantage Actor Critic (A3C): Instead of using an experience replay buffer as in DQN, use multiple agents on different threads to explore the state space and make decorrelated updates to the actor and the critic.
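
The REINFORCE-with-baseline and Actor Critic updates summarized above can be sketched in a few lines for the tabular case. This is a minimal illustration, not the code from the solution notebooks: it assumes a softmax policy over a table of action preferences `theta[state][action]` and a tabular state-value baseline `v[state]`, and every function name and default step size below is hypothetical.

```python
import math

def softmax_policy(theta, state):
    """Return pi(a | state) for every action from the preference table theta."""
    prefs = theta[state]
    max_pref = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - max_pref) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def policy_ascent_step(theta, state, action, weight, alpha):
    """In place: theta += alpha * weight * grad(log(pi(state, action))).

    For a tabular softmax, d log pi(action | state) / d theta[state][b]
    equals 1[b == action] - pi(b | state).
    """
    probs = softmax_policy(theta, state)
    for b, p in enumerate(probs):
        indicator = 1.0 if b == action else 0.0
        theta[state][b] += alpha * weight * (indicator - p)

def reinforce_with_baseline(theta, v, episode,
                            alpha_theta=0.1, alpha_v=0.1, gamma=1.0):
    """Monte Carlo update from a finished episode of (state, action, reward) tuples."""
    # Compute the return g_t for every step by scanning the episode backwards.
    g = 0.0
    returns = []
    for _, _, reward in reversed(episode):
        g = reward + gamma * g
        returns.append(g)
    returns.reverse()
    for (state, action, _), g_t in zip(episode, returns):
        advantage = g_t - v[state]       # g_t - V(s)
        v[state] += alpha_v * advantage  # move the baseline towards the return
        policy_ascent_step(theta, state, action, advantage, alpha_theta)

def actor_critic_step(theta, v, state, action, reward, next_state, done,
                      alpha_theta=0.1, alpha_v=0.1, gamma=1.0):
    """One-step actor-critic update; returns the TD error used as the advantage estimate."""
    bootstrap = 0.0 if done else gamma * v[next_state]
    td_error = reward + bootstrap - v[state]
    v[state] += alpha_v * td_error                                   # critic update
    policy_ascent_step(theta, state, action, td_error, alpha_theta)  # actor update
    return td_error
```

Two simplifications are worth noting. The critic here is a state-value table V(s) rather than an action-value function, which is enough to form the TD error. The REINFORCE update also omits the extra gamma^t factor that some textbook formulations include, which makes no difference at the default `gamma=1.0`.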

Lectures & Readings

Required:

David Silver's RL Course Lecture 7 - Policy Gradient Methods (video, slides)

Optional:

Reinforcement Learning: An Introduction - Chapter 11: Policy Gradient Methods (Under Construction)
Deterministic Policy Gradient Algorithms
Deterministic Policy Gradient Algorithms (Talk)
Continuous control with deep reinforcement learning
Deep Deterministic Policy Gradients in TensorFlow
Asynchronous Methods for Deep Reinforcement Learning
Deep Reinforcement Learning: A Tutorial (Policy Gradient Section)

Exercises

REINFORCE with Baseline
- Exercise
- [Solution](CliffWalk REINFORCE with Baseline Solution.ipynb)
Actor Critic with Baseline
- Exercise
- [Solution](CliffWalk Actor Critic Solution.ipynb)
Actor Critic with Baseline for Continuous Action Spaces
- Exercise
- [Solution](Continuous MountainCar Actor Critic Solution.ipynb)
Deterministic Policy Gradients for Continuous Action Spaces (WIP)
Deep Deterministic Policy Gradients (WIP)
Asynchronous Advantage Actor Critic (A3C) (WIP)
