- Understand the difference between Prediction and Control
- Know how to use the MC method for predicting state values and state-action values
- Understand the on-policy first-visit MC control algorithm
- Understand off-policy MC control algorithms
- Understand Weighted Importance Sampling
- Understand the benefits of MC algorithms over the Dynamic Programming approach
- Dynamic Programming approaches assume complete knowledge of the environment (the MDP). In practice we often don't have full knowledge of how the world works.
- Monte Carlo (MC) methods can learn directly from experience collected by interacting with the environment. An episode of experience is a series of
(State, Action, Reward, Next State)
tuples. - MC methods work based on episodes. We sample episodes of experience and make updates to our estimates at the end of each episode. MC methods have high variance (due to lots of random decisions within an episode) but are unbiased.
- MC Policy Evaluation: Given a policy, we want to estimate the state-value function V(s). Sample episodes of experience and estimate V(s) to be the reward received from that state onwards averaged across all of your experience. The same technique works for the action-value function Q(s, a). Given enough samples this is proven to converge.
- MC Control: Idea is the same as for Dynamic Programming. Use MC Policy Evaluation to evaluate the current policy then improve the policy greedily. The Problem: How do we ensure that we explore all states if we don't know the full environment?
- Solution to exploration problem: Use epsilon-greedy policies instead of full greedy policies. When making a decision act randomly with probability epsilon. This will learn the optimal epsilon-greedy policy.
- Off-Policy Learning: How can we learn about the actual optimal (greedy) policy while following an exploratory (epsilon greedy) policy? We can use importance sampling, which weighs returns by their probability of occuring under the policy we want to learn about.
Required:
- Reinforcement Learning: An Introduction - Chapter 5: Monte Carlo Methods
Optional:
- David Silver's RL Course Lecture 4 - Model-Free Prediction (video, slides)
- David Silver's RL Course Lecture 5 - Model-Free Control (video, slides)
- [Get familar with the Blackjack environment (Blackjack-v0)](Blackjack Playground.ipynb)
- Implement the Monte Carlo Prediction to estimate state-action values
- [Exercise](MC Prediction.ipynb)
- [Solution](MC Prediction Solution.ipynb)
- Implement the on-policy first-visit Monte Carlo Control algorithm
- [Exercise](MC Control with Epsilon-Greedy Policies.ipynb)
- [Solution](MC Control with Epsilon-Greedy Policies Solution.ipynb)
- Implement the off-policy every-visit Monte Carlo Control using Weighted Important Sampliing algorithm
- [Exercise](Off-Policy MC Control with Weighted Importance Sampling.ipynb)
- [Solution](Off-Policy MC Control with Weighted Importance Sampling Solution.ipynb)