- A tabular implementation of MuZero compatible with gym.
- To elaborate on that: the current implementation uses MuZero's policy iteration update rule (via MCTS), but instead of function approximation (e.g., a neural network) for the learned model, it uses a simple lookup table (see the sketch below).
- In the future I plan to add a version with (i) a neural network model and (ii) distributed training.
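To make "a table instead of a network" concrete, here is a minimal sketch of what a tabular learned model can look like: dictionaries mapping (state, action) pairs to predicted rewards and next states, plus a state-value table. The class and method names are hypothetical and not the repo's actual API.

```python
from collections import defaultdict


class TabularModel:
    """Hypothetical sketch: MuZero's learned model stored as tables, not a network."""

    def __init__(self, lr=0.1):
        self.lr = lr
        self.reward = defaultdict(float)   # (state, action) -> predicted reward
        self.next_state = {}               # (state, action) -> predicted next state
        self.value = defaultdict(float)    # state -> predicted value

    def update(self, state, action, reward, next_state, value_target):
        # Move each table entry toward the observed target (simple running average).
        key = (state, action)
        self.reward[key] += self.lr * (reward - self.reward[key])
        self.next_state[key] = next_state
        self.value[state] += self.lr * (value_target - self.value[state])

    def recurrent_inference(self, state, action):
        # Analogue of MuZero's dynamics + prediction step, read from the tables;
        # MCTS can expand nodes by calling this instead of a neural network.
        key = (state, action)
        return self.reward[key], self.next_state.get(key, state), self.value[state]
```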
- To be added.
- The test environment is a maze; see maze.py for details.
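For readers who don't want to open maze.py, here is a rough, illustrative stand-in (not the actual maze.py): a small gym-compatible grid world with a goal and a pit, using the classic `step` signature returning `(obs, reward, done, info)`. The grid size and cell placements are assumptions based on the description further below.

```python
import gym
from gym import spaces


class SimpleMaze(gym.Env):
    """Illustrative grid maze; not the repo's maze.py."""

    def __init__(self, size=5):
        self.size = size
        self.observation_space = spaces.Discrete(size * size)
        self.action_space = spaces.Discrete(4)   # up, down, left, right
        self.pit = (0, size - 1)                 # assumed: pit in the upper right, -1 reward
        self.goal = (size - 2, size - 2)         # assumed: one step up and left from bottom right
        self.pos = (0, 0)

    def reset(self):
        self.pos = (0, 0)
        return self._obs()

    def step(self, action):
        r, c = self.pos
        dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][action]
        # Clamp the move to stay inside the grid.
        self.pos = (min(max(r + dr, 0), self.size - 1),
                    min(max(c + dc, 0), self.size - 1))
        if self.pos == self.goal:
            return self._obs(), 1.0, True, {}
        if self.pos == self.pit:
            return self._obs(), -1.0, True, {}
        return self._obs(), 0.0, False, {}

    def _obs(self):
        # Flatten the (row, col) position into a single discrete observation.
        return self.pos[0] * self.size + self.pos[1]
```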
- Here's a graph showing average discounted return during training:
- Averaged across 10 training runs
- Standard deviation across runs shown in lighter color
- The algorithm doesn't quite reach optimal performance. This might be because exploration never decays to zero, because a lack of exploration leads to a suboptimal value / policy for some of the initial states, or because there's a bug.
- The performance depends a fair amount on the number of Monte Carlo simulations run when choosing an action at each timestep:
- Here's a gif showing the state-value function of the learned model during training:
- Dark red is the highest value and dark blue is the lowest value
- The blue state in the upper right is a "pit" with -1 reward
- The goal state is the cell one step up and left from the bottom right corner
- Some state values near the pit and the exit don't converge because the agent doesn't visit them often enough
- The implementation of the algorithm is modeled on how other algorithms are implemented in rllab, a predecessor of baselines.
- A simpler implementation for the tabular case is definitely possible.
- The hope is that this approach will transfer to the function approximation case, where distributed / parallel sampling is necessary.
- Here's a description of the types of classes (a rough sketch of how they fit together follows this list):
- Trainer: Orchestrates the training loop (environment interaction and learning)
- Agents: The actual algorithms (e.g., TabularMuZero)
- Samplers: Objects that sample rollouts / trajectories from an environment when passed an agent
- Datasets: Convert trajectories into a form useful for learning
- Planners: Online planning algorithms used by the agents (e.g., mcts), or offline planning algorithms used for comparison (e.g., value iteration)
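A minimal sketch of how these pieces could be wired together in the training loop; apart from the class roles listed above, the method names and signatures here are assumptions for illustration, not the repo's actual API.

```python
class Trainer:
    """Hypothetical orchestration of env interaction and learning."""

    def __init__(self, env, agent, sampler, dataset, n_iterations=100):
        self.env = env
        self.agent = agent          # e.g., a TabularMuZero agent
        self.sampler = sampler
        self.dataset = dataset
        self.n_iterations = n_iterations

    def train(self):
        for _ in range(self.n_iterations):
            # 1. The sampler collects rollouts by running the agent
            #    (which plans online with MCTS) in the environment.
            trajectories = self.sampler.sample(self.env, self.agent)
            # 2. The dataset converts raw trajectories into training targets.
            batch = self.dataset.process(trajectories)
            # 3. The agent updates its tabular model / value estimates.
            self.agent.learn(batch)
```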