Authors: [DeepMind] Hado van Hasselt, Arthur Guez, David Silver
Year: 2015
Algorithm: Double DQN
Links: [arxiv]
- Action value overestimation in Deep Q-learning
- Double Q-learning
- The Q-learning algorithm is known to overestimate action values under certain conditions.
- Such overestimations can harm performance and cause instabilities in learning.
- Q-learning
- Double Q-learning: two value functions are learned by randomly assigning each experience to update one of them; for each update, one value function determines the greedy action and the other determines its value (see the tabular update sketch below).
- Estimation errors of any kind can induce an upward bias, regardless of whether these errors are due to environmental noise, function approximation, non-stationarity, or any other source.
- Theorem 1 (restated below): if the value estimates at a state are unbiased on average but contain any error, the single max-estimator overestimates the maximal true value; under the same conditions, the lower bound on the absolute error of the Double Q-learning estimate is zero.
- In the paper's illustration with random estimation errors, Q-learning's overoptimism grows with the number of actions, whereas Double Q-learning remains essentially unbiased.
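A rough restatement of Theorem 1 (my paraphrase, worth checking against the paper): consider a state s where all true optimal values equal V*(s), there are m ≥ 2 actions, and the value estimates Q_t(s, ·) are unbiased on average but have mean squared error C > 0.

```latex
% Paraphrase of Theorem 1 (not verbatim from the paper).
% If   (1/m) * sum_a ( Q_t(s,a) - V_*(s) )   = 0   (unbiased on average)
% and  (1/m) * sum_a ( Q_t(s,a) - V_*(s) )^2 = C   (some error is present),
% then the max operator overestimates by at least sqrt(C / (m - 1)):
\[
  \max_a Q_t(s,a) \;\ge\; V_*(s) + \sqrt{\frac{C}{m-1}},
\]
% while under the same conditions the lower bound on the absolute error of the
% Double Q-learning estimate is zero.
```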
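And a minimal sketch of the tabular Double Q-learning update described above; the function name, table layout, and hyperparameters are illustrative choices, not from the paper:

```python
import numpy as np

def double_q_update(QA, QB, s, a, r, s_next, done,
                    alpha=0.1, gamma=0.99, rng=np.random):
    """One tabular Double Q-learning step (illustrative sketch).

    Each experience randomly updates one of the two tables; the table being
    updated selects the greedy next action, the other table supplies its value.
    """
    if rng.random() < 0.5:
        Q_upd, Q_eval = QA, QB   # update QA, evaluate with QB
    else:
        Q_upd, Q_eval = QB, QA   # update QB, evaluate with QA

    a_star = np.argmax(Q_upd[s_next])                                # action selection
    target = r + (0.0 if done else gamma * Q_eval[s_next, a_star])   # action evaluation
    Q_upd[s, a] += alpha * (target - Q_upd[s, a])

# Example usage with an assumed 5-state, 3-action problem:
# QA = np.zeros((5, 3)); QB = np.zeros((5, 3))
# double_q_update(QA, QB, s=0, a=1, r=1.0, s_next=2, done=False)
```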
- Double DQN
- The idea of Double Q-learning is to reduce overestimations by decomposing the max operation in the target into action selection and action evaluation. The paper proposes to evaluate the greedy policy according to the online network (parameters θ), but to use the target network (parameters θ⁻) to estimate its value (see the target sketch at the end of these notes).
- Empirical results show that double Q-learning can be used at scale to successfully reduce overoptimism, resulting in more stable and reliable learning.
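A minimal sketch of the Double DQN target computation described above, assuming `q_online` and `q_target` are callables that map a batch of next states to per-action value arrays (these names and shapes are my own, not the paper's):

```python
import numpy as np

def double_dqn_targets(q_online, q_target, rewards, next_states, dones, gamma=0.99):
    """Double DQN target: select the greedy next action with the online network,
    evaluate it with the target network:
        Y_t = r_{t+1} + gamma * Q(s_{t+1}, argmax_a Q(s_{t+1}, a; theta); theta_minus)
    `rewards` and `dones` are assumed to be NumPy arrays of shape (batch,).
    """
    q_next_online = q_online(next_states)    # (batch, num_actions), online net
    q_next_target = q_target(next_states)    # (batch, num_actions), target net

    best_actions = np.argmax(q_next_online, axis=1)                          # selection
    best_values = q_next_target[np.arange(len(best_actions)), best_actions]  # evaluation

    # No bootstrapping from terminal transitions.
    return rewards + gamma * (1.0 - dones.astype(np.float32)) * best_values
```

The only change from the standard DQN target is that the argmax is taken over the online network's values instead of the target network's, which is what decouples action selection from action evaluation.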