Commit f40aa51 (1 parent: 3ff08fa), showing 1 changed file with 36 additions and 71 deletions.

GGP - AlphaZero like learning
==============================

[General Game Playing](https://en.wikipedia.org/wiki/General_game_playing) experiments with
reinforcement learning related to
[AlphaZero](https://www.nature.com/articles/nature24270.epdf?author_access_token=VJXbVjaSHxFoctQQ4p2k4tRgN0jAjWel9jnR3ZoTv0PVW4gB86EEpGqTRDtpIz-2rmo8-KG06gqVobU5NSCFeHILHcVFUeMsbvwS-lxjqQGg98faovwjxeTUgZAUMnRQ)
and [Thinking Fast And Slow](https://arxiv.org/abs/1705.08439v4).

Everything here is a WIP; it has only just started, and there is a lot of hand holding going on.

Based on [GGPLib](https://github.com/richemslie/ggplib).


Breakthrough games
------------------
The first attempt was to try learning the game Breakthrough.

After 10 generations, with bugs and misunderstandings in the implementation along the way, it can
now beat a vanilla MCTS player doing ~250k playouts per turn. The trained network based player, in
the same time frame, does only ~400 'iterations' (running on CPU). That is roughly 3 orders of
magnitude difference in playouts.

It does, however, play similarly to the MCTS players (and the MCS player it was initially trained
on): very kamikaze-like, throwing away pawns for no reason. I am hoping that after a bunch more
training generations it finally figures out that this is not the greatest idea.

What is interesting in these games is that, even after throwing away 60% of its pawns, at the end
of the game it always seems to have a nice base to protect itself, whereas the MCTS player is wide
open.

Here are the last four games (with gen8, and a corrupted gen9 network with bad PUCT values):

* [game 1](http://www.ggp.org/view/all/matches/a8468283b34055be6e315951499d57d7af21fa67/)
* [game 2](http://www.ggp.org/view/all/matches/7fb8f051dd46d51bd491bcb28f66d7629344e1fd/)
* [game 3](http://www.ggp.org/view/all/matches/db0f4ce99613445ddcf89b12b160d1e58686975e/)
* [game 4](http://www.ggp.org/view/all/matches/a67ea10203cf74b367f7f4e16dfaa3c7923a21af/)

This is the third attempt, where I started completely over from fresh. The first run was riddled
with bugs and misunderstandings, and the second was completely unstable due to a bad selection of
moves for training, so it is third time lucky, I hope. This time the entire process is 95%
automated and distributed, with little to no hand holding. After 24 hours of training using 1 GPU
and 6 cores, 25 generations were complete; that is 155k samples to train the network on. I pitched
the trained network against Gurgeh, a very fast MCTS player doing 1.2 million tree playouts per
move. The matches gave the players 15 seconds of thinking time per move; the network based PUCT
player (set at 800 iterations) only uses 1-3 seconds of it. It wins quite comfortably:

* [game 1](http://www.ggp.org/view/all/matches/82745815a8ab7ea9a80be4c03626c04d7608eebb/)
* [game 2](http://www.ggp.org/view/all/matches/3097bd5b1a64df66d611e612357f7ddf0a802988/)

Note the game is notoriously bad for MCTS players. Most players in the GGP community generally add
heuristics on top of MCTS play and play significantly stronger than the above. Gurgeh doesn't
employ any special heuristics, so this is a raw MCTS player. Next up is against galvanise.


Self play
---------
These are some self play games, played using only the policy part of the network (greedily taking
the most probable move): gen 20 versus gen 25. Each generation here is basically 5k new games, and
each generation takes approximately 1 hour to generate and train.

* [gen25 v gen20](http://www.ggp.org/view/all/matches/91d2cf9cefc7075b33152e0127b1f3e7b12aeef1/)
* [gen20 v gen25](http://www.ggp.org/view/all/matches/dc77c121f3958d2cbefcc75f8430dad8f2b52312/)

gen25 is very aggressive in exchanging pieces, which is an interesting tactic.

Earlier self play games, with only 42 iterations (in other words, playing mostly according to what
the policy network had learned), were admittedly not brilliant, but it has improved since:

* [gen8 v gen2](http://www.ggp.org/view/all/matches/a0d60b298799ea9be497ae54b719acbc0316a365/)
* [gen2 v gen8](http://www.ggp.org/view/all/matches/b761a1029b667935fe13c988246a0bf5bc9d19c6/)


Current status
--------------
The inputs to the neural network are a combination of planes and input nodes. In GGP,
[GDL](http://alloyggp.blogspot.co.uk/2012/12/the-game-description-language.html) defines the 'base'
states via prolog-like rules.

Base states that look as if they are coordinates on a 2D board are turned into planes and fed into
the residual part of the network. The non-coordinate base states go through a single fully
connected layer. GDL does not specify whether these are coordinates or not, and for now this is
hard coded. It shouldn't be too hard to infer them, at least with non-contrived GDL.

Both lines of the network are concatenated and fed to two output heads: a single list of moves and
the final score of the game (similar to policy and score in AlphaZero).

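To make the shape of this concrete, here is a minimal sketch of such a two-branch network in
Keras. The framework choice, layer sizes, plane shape and move count below are all illustrative
placeholders, not the actual implementation:

```python
from tensorflow.keras import layers, models

def build_network(board_shape=(8, 8, 12), num_other_bases=32, num_moves=155):
    # Branch 1: coordinate-like base states stacked as planes, into a small residual tower.
    planes_in = layers.Input(shape=board_shape, name="board_planes")
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(planes_in)
    for _ in range(3):
        shortcut = x
        x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(64, 3, padding="same")(x)
        x = layers.ReLU()(layers.add([x, shortcut]))
    x = layers.Flatten()(x)

    # Branch 2: non-coordinate base states through a single fully connected layer.
    other_in = layers.Input(shape=(num_other_bases,), name="other_bases")
    y = layers.Dense(64, activation="relu")(other_in)

    # Concatenate both branches and attach the two heads: policy over moves and final score.
    joined = layers.concatenate([x, y])
    policy = layers.Dense(num_moves, activation="softmax", name="policy")(joined)
    score = layers.Dense(1, activation="sigmoid", name="score")(joined)
    return models.Model(inputs=[planes_in, other_in], outputs=[policy, score])
```
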
The training data is initially sampled from self play of a dumb Monte Carlo search player. Each
turn takes 0.25 seconds, and that was enough to prime the network and learn the rules of the game.
This differs from AlphaZero, which skips right to the chase and starts from a completely random
network. One thing that seems apparent is that after 10 generations it still plays similarly to
this dumb player, and it feels like it was caught in a local optimum right from the get go. Next
time around I plan to start from a random neural network and set the number of iterations to be
very small, thus quickly learning the rules without learning what a good move is.

Subsequently the network is trained in generations. A single sample is taken from a self play game
using Monte Carlo playouts with the policy and score values, using PUCT and Dirichlet noise for
exploration. The game is played approximately, using different configurations of a Monte Carlo
base player, to arrive at a single state that is evaluated thoroughly (800 iterations) for the
policy part. It then plays to the end without noise to come up with a final score. This cuts down
on computation significantly and is a clever hack (credit ThinkingFastSlow).

A generation is made up of ~10k of these samples, and then the neural network retrains from
scratch.

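Schematically, that generation loop amounts to something like the sketch below; `create_sample`,
`train` and `save_network` are hypothetical helpers (and `build_network` is reused from the sketch
above), not the project's real API:

```python
SAMPLES_PER_GENERATION = 10_000

def run_generations(num_generations):
    sample_buffer = []
    network = build_network()
    for gen in range(num_generations):
        # Self play with the current network to collect the next batch of samples.
        for _ in range(SAMPLES_PER_GENERATION):
            sample_buffer.append(create_sample(network))

        # Retrain the network from scratch on everything gathered so far.
        network = build_network()
        train(network, sample_buffer)
        save_network(network, "gen_%d" % gen)
```

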
Comments
--------
The hyperparameters are very sensitive, and getting them right for both playing and training seems
quite hard.

I've no idea what to do with duplicate states. Should later generations be allowed to update them
with better quality policies and scores, or rather than replacing can I just append to the buffer?
What about duplicate states in the same generation? Also, keeping these in sync across machines is
a pain.

The PUCT constant is the new UCT constant: I can never find quite the right value for all
circumstances, hence why I made it auto tuning for each node in galvanise. And I just found that
the players don't play well at all without Dirichlet noise, which means my training player has
been producing basically rubbish. With each edge initialised to zero, exploration will never
happen, and Dirichlet noise will at least release some of those nodes.

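For reference, this is roughly the kind of PUCT selection rule being discussed, with Dirichlet
noise mixed into the priors at the root. The constants, noise parameters and node/edge structure
are illustrative guesses, not the values or code used in this project:

```python
import math
import numpy as np

CPUCT = 1.0        # the PUCT constant discussed above (illustrative value)
NOISE_EPS = 0.25   # fraction of Dirichlet noise mixed into the root priors (a guess)
NOISE_ALPHA = 0.3  # Dirichlet concentration parameter (a guess)

def select_child(node, is_root):
    """Pick the child edge with the highest PUCT score."""
    priors = np.array([edge.prior for edge in node.edges], dtype=float)
    if is_root:
        # Mixing Dirichlet noise into the root priors forces some exploration,
        # even when every edge's value starts out at zero.
        noise = np.random.dirichlet([NOISE_ALPHA] * len(priors))
        priors = (1.0 - NOISE_EPS) * priors + NOISE_EPS * noise

    best_score, best_edge = -float("inf"), None
    for prior, edge in zip(priors, node.edges):
        # Q: average score of this edge so far (zero if it has never been visited).
        q = edge.total_score / edge.visits if edge.visits else 0.0
        # U: exploration bonus, scaled by the (possibly noised) prior.
        u = CPUCT * prior * math.sqrt(node.visits) / (1.0 + edge.visits)
        if q + u > best_score:
            best_score, best_edge = q + u, edge
    return best_edge
```

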
Current status
--------------
The training data is initially sampled from mostly random rollouts. This is very fast, and is used
just to let the network learn the rules. For each random rollout (which may have some extra bells
such as greedily taking wins) there is no search; one random sample is taken from each rollout.

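A rough sketch of what one priming rollout amounts to; the `sm` game interface below is a
hypothetical stand-in, not the real GGPLib API, and policy targets are omitted:

```python
import random

def priming_sample(sm):
    """Play one random rollout (no search) and take a single random sample from it."""
    sm.reset()
    states = []
    while not sm.is_terminal():
        states.append(sm.current_state())
        sm.play(random.choice(sm.legal_moves()))   # purely random move, no search

    # One random state from the rollout, labelled with the final scores.
    return random.choice(states), sm.scores()
```
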
Subsequently the real training happens. The network is trained in generations. A single sample is
taken from a self play game, where the self play game only uses the policy part of the network
with a random choice of move (moves with higher probability have more chance of being played than
moves with lower probability). Upon selection of a game state, 800 iterations of a PUCT player are
performed, and the distribution of visits on the root node defines a policy distribution to train
on. Finally, the game is played to the end starting from the selected state, via the policy part
of the network only, taking the most probable move at each step. Generating the data this way,
increasing to larger networks as the number of samples increases (i.e. starting with a tiny neural
network), and a bunch of other optimisations all in all sped up training by over an order of
magnitude compared with the first couple of attempts. With the extra bonus that the network does
seem to improve this time around.
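
Putting that together, producing a single training sample looks roughly like the sketch below.
The helper names (`sample_move`, `greedy_move`, `run_puct`, the `sm` interface) are illustrative,
and how the state to evaluate gets selected from the self play game is glossed over here with a
plain random choice:

```python
import random

def generate_sample(sm, network, puct_iterations=800):
    # 1. Self play using only the policy head, sampling moves in proportion to their
    #    predicted probability, then pick one state from that game.
    sm.reset()
    visited = []
    while not sm.is_terminal():
        visited.append(sm.current_state())
        probs = network.policy_probs(sm.current_state())
        sm.play(sample_move(probs))              # weighted random choice of move
    chosen_state = random.choice(visited)

    # 2. Evaluate the chosen state thoroughly: run the PUCT iterations and use the visit
    #    counts on the root's children as the policy training target.
    root = run_puct(sm, network, chosen_state, iterations=puct_iterations)
    policy_target = [child.visits / root.visits for child in root.children]

    # 3. Play out to the end from the chosen state, greedily taking the most probable
    #    policy move at each step, and use the final result as the score target.
    sm.set_state(chosen_state)
    while not sm.is_terminal():
        probs = network.policy_probs(sm.current_state())
        sm.play(greedy_move(probs))              # argmax of the policy
    score_target = sm.scores()

    return chosen_state, policy_target, score_target
```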