A bug in the implementation #16
Comments
Thanks for noticing this issue. I've kept repeating this mistake ever since I misread "Human-level control through deep reinforcement learning".
The problem here was that we clipped the error in such a way that the gradient would be zero everywhere outside of the clip region. This is obviously not desired, as pointed out by @karpathy, and was fixed by switching to the Huber loss. For details, see https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b and devsisters/DQN-tensorflow#16
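In symbols (a standard calculus check, not taken from the thread): outside the clip region the clipped-then-squared error has zero gradient, whereas the Huber loss keeps a bounded, non-zero gradient everywhere:

```latex
% clipped-then-squared error vs. Huber loss, clip range [-1, 1]
\frac{\partial}{\partial \delta}\Big[\tfrac{1}{2}\,\mathrm{clip}(\delta,-1,1)^{2}\Big] = 0
  \quad \text{for } |\delta| > 1,
\qquad
\frac{\partial}{\partial \delta}\, L_{\mathrm{Huber}}(\delta) = \mathrm{clip}(\delta,-1,1)
  \quad \text{for all } \delta .
```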
Here's a code snippet from the DeepMind implementation of DQN (NeuralQLearner.lua), available at https://sites.google.com/a/deepmind.com/dqn/
Doesn't this mean that the original implementation has the same issue? Maybe this should be mentioned as a difference from the original implementation?
@vuoristo The original implementation directly clips the error term that is used for the gradient, as described in the paper.
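In TensorFlow terms, "clipping the error term for the gradient" can be expressed as a pseudo-loss whose gradient is the clipped TD error. A minimal sketch, assuming placeholder tensors `q_pred` and `q_target` (my illustration of the behaviour, not the DeepMind code):

```python
import tensorflow as tf

def error_clipped_pseudo_loss(q_pred, q_target, clip=1.0):
    # TD error; the target is treated as a constant.
    delta = tf.stop_gradient(q_target) - q_pred
    # Clip the error itself, then feed the clipped value backward:
    # d(loss)/d(q_pred) = -clip(delta), i.e. the clipped error (not the raw
    # error) is what flows back through the network, mirroring the behaviour
    # described in the comment above.
    clipped = tf.stop_gradient(tf.clip_by_value(delta, -clip, clip))
    return tf.reduce_mean(-clipped * q_pred)
```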
Hi, I believe that the linear range of the Huber function should be 2. That results in:
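Presumably this refers to the standard Huber loss with the linear-range parameter set to δ = 2:

```latex
L_{\delta=2}(x) =
  \begin{cases}
    \tfrac{1}{2} x^{2} & |x| \le 2 \\[2pt]
    2\,(|x| - 1) & |x| > 2
  \end{cases}
```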
Explanation
You don't want to do large updates, because this is an estimate anyway. This could also be the only source of large updates, because you clip the reward to [-1, 1]. But to qualify as correct Q-learning, you want to compute an expectation over possible future states s'. This is what you would get with MSE, or with a Huber loss with delta = 2, which behaves as MSE over the [-2, 2] range. Note: some additional learning rate tuning might be necessary. Discussion welcome. Opinion from @karpathy welcome.
The reward is clipped, but not the Q value. The distance between Q values can be larger than 2.
@ppwwyyxx I assume you mean that the error in Q can be larger than 2. That's true, and that's where the Huber loss helps: it avoids large updates. When it converges, the error is 0, however. What I was saying is that because the maximal distance between possible rewards is 2, the delta argument of the Huber loss function should also be 2. With this loss function, Q-learning should converge to the same solution as with MSE, giving proper Q value estimations (means):
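Presumably the fixed point meant here is the standard Bellman optimality equation:

```latex
Q(s, a) \;=\; \mathbb{E}_{s'}\!\left[\, r + \gamma \max_{a'} Q(s', a') \,\right]
```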
That's how Q-learning is defined. With delta = 1, it converges to a different solution, where the Q values are different.
Also see the corresponding reddit discussion.
Hello, I spotted what I believe might be a bug in the DQN implementation on line 291 here:
https://github.com/devsisters/DQN-tensorflow/blob/master/dqn/agent.py#L291
The code tries to clip the `self.delta` with `tf.clip_by_value`, I assume with the intention of being robust when the discrepancy in Q is above a threshold:
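The line in question has roughly this shape (a paraphrase of agent.py line 291; `self.loss` and `tf.reduce_mean` are my guesses, not quoted from the repo):

```python
# (inside the agent's graph-construction code; tf is tensorflow)
# Clip the TD error itself, then square it for the loss. Outside
# [min_delta, max_delta] the clipped value is constant, so its gradient is 0.
self.clipped_delta = tf.clip_by_value(self.delta, self.min_delta, self.max_delta,
                                      name='clipped_delta')
self.loss = tf.reduce_mean(tf.square(self.clipped_delta))
```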
However, the `clip_by_value` function's local gradient outside of the `[min_delta, max_delta]` range is zero. Therefore, with the current code, whenever the discrepancy is above min/max delta, the gradient becomes exactly zero in backprop. This might not be what you intend, and is certainly not standard, I believe.

I think you probably want to clip the gradient here, not the raw Q. In that case you would have to use the Huber loss:
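For example, a minimal sketch of such a Huber-style helper (the name `clipped_error` is illustrative; `tf.where` is the modern equivalent of the older `tf.select`):

```python
import tensorflow as tf

def clipped_error(x):
    # Huber loss with delta = 1: quadratic for |x| < 1, linear beyond,
    # so the gradient magnitude is bounded by 1 instead of being zeroed out.
    return tf.where(tf.abs(x) < 1.0, 0.5 * tf.square(x), tf.abs(x) - 0.5)
```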
and use this on `self.delta` instead of `tf.square`. This would have the desired effect of increased robustness to outliers.