
Implementation of Double DQN #52

Open
wants to merge 5 commits into master

Conversation

corywalker

I was interested in implementing Double DQN in this source code, so here are my changes. Feel free to pull these into the main codebase. I didn't change much, since the Double DQN algorithm is not much different from that described in the Nature paper. I couldn't get the original tests to pass, so I was not able to add a test for Double DQN. I did test everything though, by running experiments with Breakout. Here is the performance over time:

[image: Breakout performance over time, DQN vs. Double DQN]

Of course, the differences here are negligible, and Breakout was cited in the Double DQN paper as a game that does not change much under Double DQN. If I had more computing resources, I could test on the games for which Double DQN makes a significant difference. Here is perhaps a more useful plot, which shows how Double DQN seems to reduce value overestimates:

[image: value estimates over time, showing Double DQN reducing overestimation]

And here is the change required for Double DQN:

[image: code diff for the Double DQN change]
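Conceptually, the change only touches how the bootstrap target is built. Here is a rough NumPy sketch of the two targets for a single transition, just as an illustration with made-up names rather than the actual deep_q_rl code:

```python
import numpy as np

def dqn_target(r, q_target_next, gamma=0.99, terminal=False):
    # Standard DQN: the target network both selects and evaluates
    # the action at the next state.
    return r + (0.0 if terminal else gamma * np.max(q_target_next))

def double_dqn_target(r, q_online_next, q_target_next, gamma=0.99, terminal=False):
    # Double DQN: the online network selects the action,
    # the target network evaluates it.
    a_star = int(np.argmax(q_online_next))
    return r + (0.0 if terminal else gamma * q_target_next[a_star])
```

Here q_online_next and q_target_next stand for the Q-value vectors produced at s_{t+1} by the online and frozen networks, respectively.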

If you don't have the time to look over the changes or to test them yourself, I understand. At least this PR will allow others to use it easily if need be.

References:

van Hasselt, H., Guez, A., & Silver, D. (2015). Deep Reinforcement Learning with Double Q-learning. arXiv preprint arXiv:1509.06461.

@moscow25

Awesome! I've been hoping someone would implement Double DQN since the paper came out. Thanks!


@spragunr
Owner

Thanks for the PR. I'm behind on reviewing, but I'm hoping to get caught up in late December / early January. It looks like the changes aren't very disruptive so there shouldn't be an issue merging.

@alito

alito commented Nov 25, 2015

Excellent. I'm starting a test run on Space Invaders since it's one where they saw a big increase. I'll let you know how it goes in a couple of days.

@alito

alito commented Nov 30, 2015

Plot from Space Invaders:
[image: spaceinvadersdoubleq]

This, while not up to scratch with DeepMind's results, is, I think, much better than any result I've seen with the deep_q_rl implementation.

It's very slow to learn but seems very stable. I might try again with a higher learning rate.

@moscow25

Very nice! This is with the Double DQN? It's just switching the network every X steps, right?

I'm also impressed. My performance with deep_q_rl never came close to their reported results either...

Best,
N


@corywalker
Author

@alito Thanks for running this. Would you mind sharing the results.csv, and perhaps the results.csv files from any other Space Invaders models that you have trained?

Also, here is a newer paper from DeepMind that claims better performance than Double DQN: http://arxiv.org/abs/1511.06581

Could be interesting to implement.
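For reference, the core of that paper is a dueling head, which only changes how the Q-values are assembled from a state-value stream and an advantage stream; a minimal NumPy sketch of the aggregation, with made-up head outputs:

```python
import numpy as np

# Hypothetical outputs of the two streams for one state:
# a scalar state value V(s) and per-action advantages A(s, a).
V = 1.7
A = np.array([0.2, -0.1, 0.5, 0.0])

# Dueling aggregation from the paper:
# Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a)).
# Subtracting the mean advantage keeps V and A identifiable.
Q = V + (A - A.mean())
```

Everything upstream of this (the convolutional layers and the training loop) would stay the same as in the current network.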

@alito

alito commented Dec 1, 2015

Here is results.csv for this run (note the extra column in there):
http://organicrobot.com/deepqrl/results-doubleq.csv

I don't seem to have, or at least to have kept, a recent results.csv. I've got a few from June that didn't learn at all, and a few from the NIPS era. I've put up one from May, which seems to be the best I've got, but I don't think it makes for a good comparison.

http://organicrobot.com/deepqrl/results-20150527.csv

I'm running a plain version now, but it will take a while to see what's going on.

Also, there's this: http://arxiv.org/abs/1511.05952 from last week, which, aside from doing better, has plots of epoch vs. reward for all 57 games. From those, it seems like even their non-double-Q implementation is very stable, or at least more stable than deep_q_rl seems to be at the moment.
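For anyone curious, the proportional sampling scheme in that paper is easy to sketch; here is a minimal NumPy illustration (the priorities and the alpha/beta values are placeholders, not anything from deep_q_rl):

```python
import numpy as np

# Hypothetical TD-error magnitudes for a tiny replay buffer.
td_errors = np.array([0.5, 0.1, 2.0, 0.05, 1.2])
alpha, beta, eps = 0.6, 0.4, 1e-6

# Proportional prioritization: P(i) = p_i**alpha / sum_j p_j**alpha.
priorities = (np.abs(td_errors) + eps) ** alpha
probs = priorities / priorities.sum()

# Sample a batch, then compute importance-sampling weights
# w_i = (N * P(i))**(-beta), normalized by the max weight.
batch = np.random.choice(len(td_errors), size=3, p=probs)
weights = (len(td_errors) * probs[batch]) ** (-beta)
weights /= weights.max()
```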

Minor change to update the citation for Double DQN.
@moscow25

moscow25 commented Dec 2, 2015

Thanks Alejandro. I, for one, am curious to see how this comparison shakes out for you. When I ran deep_q_rl the first time with Theano, it didn't really learn for me either.

The Prioritized Replay paper that you mentioned has been sitting on my desk, as it may also apply to my poker AI problems. Choosing the best replay batch set is a pain once you have a lot of so-so data... and I think others who got better learning results from deep_q_rl talked a lot about it starting to forget parts of the game as it got better at others...

I have always suspected that they sample the game data in a more clever way than the original paper gets into. Sometimes it's just easier to say you did the simple thing. So I'm curious to see if they have now come clean :-)

Best,
Nikolai


@alito

alito commented Dec 4, 2015

The run without double-q hasn't finished, but it's not going to go anywhere from its current state. I've put the results up:
http://organicrobot.com/deepqrl/results-20151201.csv

Here's the plot:
[image: spaceinvadersstandardnature]

It does better than I expected. Looks stable if nothing else. Double-Q looks like a substantial improvement in this case.

@moscow25 They've released their code, so I suspect they are not cheating in any way they haven't mentioned. I haven't tested their code myself, but it wouldn't be hard to find out if they aren't doing as well as they claim in their papers.

@moscow25

moscow25 commented Dec 4, 2015

Awesome!

I meant that tongue in cheek. And yes, they released code, so it happened :-)

Just saying that it's always hard to specify a tech system precisely, especially in 7 pages. And that presumes that the people who wrote the system remember every decision explored and taken.

Glad to see the Double DQN working so well. It kept starting OK but then diverging into NaN territory when I ran the (Lasagne) version of this when it came out. Seeing it converge more steadily now is great. The idea from that paper is simple, and I'm glad it just works.

Over-optimism is a huge problem for my high-variance poker AI problems, so I'm optimistic about trying this version now. Thanks again for running the baseline.

Best,
Nikolai


@stokasto

There seems to be a bug in your implementation: as far as I can see, you are calculating maxaction based on q_vals (which contains the Q-values for s_t and NOT s_{t+1}).
To fix this, you have to do a second forward pass through the current Q-network using the next state.
That would look like this:
```python
q_vals = lasagne.layers.get_output(self.l_out, states / input_scale)

if self.freeze_interval > 0:
    next_q_vals = lasagne.layers.get_output(self.next_l_out,
                                            next_states / input_scale)
else:
    next_q_vals = lasagne.layers.get_output(self.l_out,
                                            next_states / input_scale)
    next_q_vals = theano.gradient.disconnected_grad(next_q_vals)

if self.use_double:
    # Run the *online* network on next_states to select the greedy action...
    q_vals_next_current = lasagne.layers.get_output(self.l_out,
                                                    next_states / input_scale)
    maxaction = T.argmax(q_vals_next_current, axis=1, keepdims=False)
    # ...but evaluate that action with next_q_vals (the frozen target network).
    temptargets = next_q_vals[T.arange(batch_size), maxaction].reshape((-1, 1))
    target = (rewards +
              (T.ones_like(terminals) - terminals) *
              self.discount * temptargets)
```
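For what it's worth, the point of that split (as I read the paper) is that the online network picks argmax_a Q(s_{t+1}, a) while the frozen target network supplies the value of that chosen action, which is what breaks the upward bias of taking a max over noisy estimates. Computing the argmax from q_vals would be selecting actions for s_t rather than s_{t+1}.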

@alito

alito commented Dec 30, 2016

The note by @stokasto sounds right. I'll do some testing.
