A novel parallel UCT algorithm with linear speedup and negligible performance loss. This package provides a demo on Atari games (see Running). To allow easy extension to other environments, we wrote the code in an extendable way, and modification on only two files are needed for other environments (see Run on your own environments).
This work has been accepted by ICLR 2020 for oral full presentation (48/2594).
This Google Colaboratory link contains a demo on the PongNoFrameskip-v0 environment. We thank @lvisdd for building this useful demo!
Note: For full details of WU-UCT, please refer to our Arxiv or OpenReview paper.
We use the above figure to demonstrate the main problem caused by parallelizing UCT. (a) illustrates the four main steps of UCT: selection, expansion, simulation, and backpropagation. (b) a demonstration of the ideal (but unrealistic) parallel algorithm, i.e., the return V (cumulative reward) is available as soon as simulations start (in real-world cases they are observable only after simulations complete). (c) if we parallelize UCT naively, problems such as collapes of exploration or exploitation failure will happen. Specifically, since less statistics are available at the selection step, the algorithm cannot choose the "best" node to query. (d) we propose to keep track of the on-going but non-terminated simulations (called unobserved samples) to correct and compensate the outdated statistics. This allows performing principled selection step on parallel settings, allowing WU-UCT to achieve linear speedup as well as negligible performance loss.
WU-UCT achieves ideal speedup under up to 16 workers, also without performance degradation.
Clear advantage compared to baseline parallel approaches, in terms of both speed and accuracy.
Our implementation of the system consists of a master process and two sets of slave workers, i.e., expansion workers and simulation workers. With a clear division of labor, we parallel the most time-consuming expansion and simulation step, while maintain the sequential structure in the selection and backpropagation step.
The breakdown of time consumption (tested with 16 expansion and simulation workers) indicates we successfully parallelize the most time-consuming expansion and simulation process and maintains time-consumption of other steps relatively small.
- Python 3.x
- PyTorch 1.0
- Gym (with atari) 0.14.0
- Numpy 1.17.2
- Scipy 1.3.1
- OpenCV-Python 4.1.1.26
- Download or clone the repository.
- Run with the default settings:
python3 main.py --model WU-UCT
- For additional hyperparameters please have a look at main.py (they are also listed below), where descriptions are also included. For example, if you want to run the game PongNoFrameskip-v0 with 200 MCTS rollouts, simply run:
python3 main.py --model WU-UCT --env-name PongNoFrameskip-v0 --MCTS-max-steps 200
or if you want to record the video of gameplay, run:
python3 main.py --model WU-UCT --env-name PongNoFrameskip-v0 --record-video
- A full list of parameters
- --model: MCTS model to use (currently support WU-UCT and UCT).
- --env-name: name of the environment.
- --MCTS-max-steps: number of simulation steps in the planning phase.
- --MCTS-max-depth: maximum planning depth.
- --MCTS-max-width: maximum width for each node.
- --gamma: environment discount factor.
- --expansion-worker-num: number of expansion workers.
- --simulation-worker-num: number of simulation workers.
- --seed: random seed for the environment.
- --max-episode-length: a strict upper bound of environment's episode length.
- --policy: default policy (see above).
- --device: support "cpu", "cuda:x", and "cuda". If entered "cuda", it will use all available cuda devices. Usually used to load the policy.
- --record-video: see above.
- --mode: MCTS or Distill, see Planning with prior policy.
The code currently support three default policies (policy used to perform simulation): Random, PPO, DistillPPO (to use them, change the “--policy” parameter). To use the PPO and DistillPPO policy, corresponding policy files need to be put in ./Policy/PPO/PolicyFiles. PPO policy files can be generated by Atari_PPO_training (or see Github). For example, by running
cd Utils/Atari_PPO_training
python3 main.py PongNoFrameskip-v0
a policy file will be generated in ./Utils/Atari_PPO_training/save. To run DistillPPO, we have to run the distill training process by
python3 main.py --mode Distill --env-name PongNoFrameskip-v0
We kindly provide an environment wrapper and a policy wrapper to make easy extensions to other environments. All you need is to modify ./Env/EnvWrapper.py and ./Policy/PolicyWrapper.py, and fit in your own environment. Please follow the below instructions.
-
Edit the class EnvWrapper in ./Env/EnvWrapper.py.
Nest your environment into the wrapper by providing specific functionality in each of the member function of EnvWrapper. There are currently four input arguments to EnvWrapper: env_name, max_episode_length, enable_record, and record_path. If additional information needs to be imported, you may first consider adding them in env_name.
-
Edit the class PolicyWrapper in ./Policy/PolicyWrapper.py.
Similarly, nest your default policy in PolicyWrapper, and pass the corresponding method using --policy. You will need to rewrite get_action, get_value, and get_prior_prob three member functions.
- Refactor prior policy module to support easy reuse (Sep. 26, 2019).
(empty)
Please cite the paper in the following format if you used this code during your research :)
@inproceedings{liu2020watch,
title = {Watch the Unobserved: A Simple Approach to Parallelizing Monte Carlo Tree Search},
author = {Anji Liu and Jianshu Chen and Mingze Yu and Yu Zhai and Xuewen Zhou and Ji Liu},
booktitle = {International Conference on Learning Representations},
month = apr,
year = {2020},
url = "https://openreview.net/forum?id=BJlQtJSKDB"
}