diff --git a/_posts/2016-11-16-Deep-Learning-Research-Review-Week-2:-Reinforcement-Learning.html b/_posts/2016-11-16-Deep-Learning-Research-Review-Week-2:-Reinforcement-Learning.html index 4563fe437c21c..96f999f808c61 100644 --- a/_posts/2016-11-16-Deep-Learning-Research-Review-Week-2:-Reinforcement-Learning.html +++ b/_posts/2016-11-16-Deep-Learning-Research-Review-Week-2:-Reinforcement-Learning.html @@ -20,7 +20,7 @@
The first category, supervised learning, is the one you may be most familiar with. It relies on the idea of creating a function or model based on a set of training data, which contains inputs and their corresponding labels. Convolutional Neural Networks are a great example of this, as the images are the inputs and the outputs are the classifications of the images (dog, cat, etc).
Unsupervised learning seeks to find some sort of structure within data through methods of cluster analysis. One of the most well-known ML clustering algorithms, K-Means, is an example of unsupervised learning.
-Reinforcement learning is the task of learning what actions to take, given a certain situation/environment, so as to maximize a reward signal. The interesting difference between supervised and reinforcement is that this reward signal simply tells you whether the action (or input) that the agent takes is good or bad. It doesn’t tell you anything about what the best action is. Contrast this to CNNs where the corresponding label for each image input is a definite instruction of what the output should be for each input. Another unique component of RL is that an agent’s actions will affect the subsequent data that it receives. For example, an agent’s action of moving left instead of right means that the agent will receive different input from the environment at the next time step. Let’s look at an example to start off.
+Reinforcement learning is the task of learning what actions to take, given a certain situation/environment, so as to maximize a reward signal. The interesting difference between supervised and reinforcement learning is that this reward signal simply tells you whether the action (or input) that the agent takes is good or bad. It doesn’t tell you anything about what the best action is. Contrast this to CNNs where the corresponding label for each image input is a definite instruction of what the output should be for each input. Another unique component of RL is that an agent’s actions will affect the subsequent data it receives. For example, an agent’s action of moving left instead of right means that the agent will receive different input from the environment at the next time step. Let’s look at an example to start off.
The RL Problem
So, let’s first think about what have in a reinforcement learning problem. Let’s imagine a tiny robot in a small room. We haven’t programmed this robot to move or walk or take any action. It’s just standing there. This robot is our agent.
@@ -34,7 +34,7 @@Now that we have all the components, what do we do with this MDP? Well, we want to solve it, of course. By solving an MDP, you’ll be able to find the optimal behavior (policy) that maximizes the amount of reward the agent can expected to get from any state in the environment.
+Now that we have all the components, what do we do with this MDP? Well, we want to solve it, of course. By solving an MDP, you’ll be able to find the optimal behavior (policy) that maximizes the amount of reward the agent can expect to get from any state in the environment.
Solving the MDP
We can solve an MDP and get the optimum policy through the use of dynamic programming and specifically through the use of policy iteration (there is another technique called value iteration, but won’t go into that right now). The idea is that we take some initial policy π1 and evaluate the state value function for that policy. The way we do this is through the Bellman expectation equation.
@@ -47,14 +47,14 @@Now, we’re going to go through the same process of policy evaluation and policy improvement, except we replace our state value function V with our action value function Q. Now, I’m going to skip over the details of what changes with the evaluation/improvement steps. To understand MDP free evaluation and improvement methods, topics such as Monte Carlo Learning, Temporal Difference Learning, and SARSA would require whole blogs just themselves (If you are interested, though, please take a listen to David Silver’s Lecture 4 and Lecture 5). Right now, however, I’m going to jump ahead to value function approximation and the methods discussed in the AlphaGo and Atari Papers, and hopefully that should give a taste of modern RL techniques. The main takeaway is that we want to find the optimal policy π* that maximizes our action value function Q.
Value Function Approximation
-So, if you think about everything we’ve learned up until this point, we’ve treated our problem in a relatively simplistic way. Look at the above Q equation. We’re taking a specific state S and action A, and then computing a number that basically tells us what the expected return is. Now let’s imagine that our agent moves 1 millimeter to the right. This means we have a whole new state S’, and now we’re going to have to compute a Q value for that. In real world RL problems, there are millions and millions of states so it’s important that our value functions understand generalization in that we don’t have to store a completely separate value function for every possible state. The solution is to use a Q value function approximation that is able to generalize to unknown states.
+So, if you think about everything we’ve learned up until this point, we’ve treated our problem in a relatively simplistic way. Look at the above Q equation. We’re taking in a specific state S and action A, and then computing a number that basically tells us what the expected return is. Now let’s imagine that our agent moves 1 millimeter to the right. This means we have a whole new state S’, and now we’re going to have to compute a Q value for that. In real world RL problems, there are millions and millions of states so it’s important that our value functions understand generalization in that we don’t have to store a completely separate value function for every possible state. The solution is to use a Q value function approximation that is able to generalize to unknown states.
So, what we want is some function, let’s call is Qhat, that gives a rough approximation of the Q value given some state S and some action A.
-This function is going to take in S, A, and a good old weight vector W (Once you see that W, you already know we’re bringing in some gradient descent). It is going to compute the dot product between x (which is just a feature vector that represents S and A) and W. The way we’re going to improve this function is by calculating the loss between the true Q value (let’s just assume that it’s given to us for now) and the output of the approximate function.
+This function is going to take in S, A, and a good old weight vector W (Once you see that W, you already know we’re bringing in some gradient descent ). It is going to compute the dot product between x (which is just a feature vector that represents S and A) and W. The way we’re going to improve this function is by calculating the loss between the true Q value (let’s just assume that it’s given to us for now) and the output of the approximate function.
After we compute the loss, we use gradient descent to find the minimum value, at which point we will have our optimal W vector. This idea of function approximation is going to be very key when taking a look at the papers a little later.
Just One More Thing
-Before getting to the papers, just wanted to touch on one last thing. An interesting discussion with the topic of reinforcement learning is that of exploration vs exploitation. Exploitation is the agent’s process of taking what it already knows, and then making the actions that it knows will produce the maximum reward. This sounds great, right? The agent will always be making the best action based on its current knowledge. However, there is a key phrase in that statement. It’s current knowledge. If the agent hasn’t explored enough of the state space, it can’t possibly know whether it is really taking the best possible action. This idea of taking actions with the main purpose of exploring the state space is called exploration.
+Before getting to the papers, just wanted to touch on one last thing. An interesting discussion with the topic of reinforcement learning is that of exploration vs exploitation. Exploitation is the agent’s process of taking what it already knows, and then making the actions that it knows will produce the maximum reward. This sounds great, right? The agent will always be making the best action based on its current knowledge. However, there is a key phrase in that statement. Current knowledge. If the agent hasn’t explored enough of the state space, it can’t possibly know whether it is really taking the best possible action. This idea of taking actions with the main purpose of exploring the state space is called exploration.
This idea can be easily related to a real world example. Let’s say you have a choice of what restaurant to eat at tonight. You (acting as the agent) know that you like Mexican food, so in RL terms, going to a Mexican restaurant will be the action that maximizes your reward, or happiness/satisfaction in this case. However, there is also a choice of Italian food, which you’ve never had before. There’s a possibility that it could be better than Mexican food, or could be a lot worse. This tradeoff between whether to exploit an agent’s past knowledge vs trying something new in hope of discovering a greater reward is one of the major challenges in reinforcement learning (and in our daily lives tbh).
Other Resources for Learning RL
Phew. That was a lot of info. By no means, however, was that a comprehensive overview of the field. If you’d like a more in-depth overview of RL, I’d strongly recommend these resources.