This project is aimed at solving a provided Unity Machine Learning Agents Toolkit challenge. In this challenge, a simulated robotic agent must learn to reach a target position with its arm by applying torque to its joint motors. This project uses the 20-agent version of the environment.
- Windows (64 bit)
- Python 3.6
- Unity ML-Agents Toolkit
- Pytorch
- Matplotlib
- Jupyter
The recommended way to install the dependencies is via Anaconda. To create a new Python 3.6 environment, run
conda create --name myenv python=3.6
Activate the environment with
conda activate myenv
Click here for instructions on how to install the Unity ML-Agents Toolkit.
Visit pytorch.org for instructions on installing PyTorch.
Install matplotlib with
conda install -c conda-forge matplotlib
Jupyter should come installed with Anaconda. If not, click here for instructions on how to install Jupyter.
The project can be run with the provided Jupyter notebooks. Reacher_Observe.ipynb allows one to observe a fully trained agent in the environment. Reacher_Training.ipynb can be used to train a new agent or continue training a pre-trained agent. Several partially pre-trained agents and one fully trained agent are stored in the savedata folder.
The environment is a platform with 20 robot arms. Each arm has its own target sphere hovering around it at a random position. The observation space for a single agent consists of 33 continuous variables describing observables such as position, rotation and angular velocity. The action space consists of 4 continuous actions between -1.0 and +1.0, corresponding to the torques applied at the arm's joints: each of the two joints can be rotated in two directions.
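The snippet below is a minimal sketch of how observations and actions flow through this environment. It assumes the older `unityagents` Python API (the import name may differ depending on your toolkit version) and a hypothetical build name `Reacher.exe`; adjust both to your setup.

```python
import numpy as np
from unityagents import UnityEnvironment  # assumption: API name depends on toolkit version

# Assumption: the 20-agent Reacher build sits next to the notebooks.
env = UnityEnvironment(file_name="Reacher.exe")
brain_name = env.brain_names[0]

env_info = env.reset(train_mode=True)[brain_name]
states = env_info.vector_observations      # shape (20, 33): one row per arm
num_agents, state_size = states.shape

# Random actions in [-1, 1], 4 torque values per arm.
actions = np.clip(np.random.randn(num_agents, 4), -1.0, 1.0)
env_info = env.step(actions)[brain_name]
rewards = env_info.rewards                 # list of 20 rewards
dones = env_info.local_done                # list of 20 episode-end flags

env.close()
```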
The training algorithm used here is Proximal Policy Optimization (PPO) with a clipped surrogate objective, as introduced by OpenAI in 2017. A detailed description can be found in the 2017 paper by Schulman et al. As opposed to Deep Deterministic Policy Gradients, PPO works with probabilities for every action. By collecting trajectories of states, actions and rewards, it tries to increase the probability of actions that resulted in a high reward and decrease the probability of actions that resulted in a low reward. The original version of this algorithm is called REINFORCE, but it has several shortcomings that were partially overcome with improvements like Trust Region Policy Optimization (TRPO). PPO greatly simplifies the approach of TRPO and is especially useful for distributed training, where many parallel agents collect experience.
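To make the clipping concrete, the sketch below computes the clipped surrogate loss for a mini-batch of old log-probabilities, new log-probabilities and advantage estimates. The function and tensor names are illustrative assumptions, not the notebook's actual code.

```python
import torch

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, ratio_clip=0.2):
    """PPO clipped surrogate objective, returned as a loss to be minimized."""
    # Probability ratio r_t = pi_new(a|s) / pi_old(a|s), computed in log space.
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Unclipped and clipped surrogate terms.
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - ratio_clip, 1.0 + ratio_clip) * advantages
    # PPO maximizes the element-wise minimum, so the loss is its negative mean.
    return -torch.min(surr1, surr2).mean()
```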
The agent uses a basic actor-critic model. Both actor and critic are standard fully connected feed-forward neural networks with two hidden layers of 64 units each and ReLU activation. The actor predicts 4 action values and then uses them as the mean of a Gaussian distribution from which actions are sampled. The standard deviation is an extra trainable parameter of the model. The critic is trained to estimate an advantage function.
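The sketch below shows one way such a model could be written in PyTorch. Only the layer sizes (two hidden layers of 64 units, 4 action outputs) and the trainable standard deviation come from the description above; everything else, such as the layer names and the Tanh on the action mean, is an assumption.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, state_size=33, action_size=4, hidden=64):
        super().__init__()
        # Actor: state -> mean of a Gaussian over the 4 torque actions.
        # The final Tanh keeps the mean in [-1, 1] (an assumption, the actual model may differ).
        self.actor = nn.Sequential(
            nn.Linear(state_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_size), nn.Tanh(),
        )
        # Critic: state -> scalar value estimate.
        self.critic = nn.Sequential(
            nn.Linear(state_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
        # Standard deviation as an extra trainable parameter (one per action dimension).
        self.log_std = nn.Parameter(torch.zeros(action_size))

    def forward(self, state):
        mean = self.actor(state)
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        value = self.critic(state)
        return dist, value
```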
During training, the agent collects at most tmax steps of experience. It then performs several epochs of PPO on the collected trajectory using mini-batches. The following hyperparameters are used during training (a sketch of how they fit into the training loop follows the table):
parameter | value | description |
---|---|---|
tmax | 300 | max number of steps to collect |
max_episodes | 800 | maximum number of episodes before quitting |
ppo_epochs | 5 | number of PPO training epochs |
batchsize | 500 | size of the mini-batches used during one epoch |
discount | 0.98 | discount value for future returns |
optimizer | Adam | Adam with lr=0.0003 and eps=1e-5 |
ratio_clip | 0.2 | clipping range for the probability ratio, value recommended in the paper |
max_grad_norm | 0.5 | maximum value of the gradient when performing optimization step |
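The following sketch shows how these hyperparameters could fit together in the training loop. It reuses the `ActorCritic` and `clipped_surrogate_loss` sketches from above, assumes the collected trajectory is stored as tensors, and uses `collect_trajectories` as a hypothetical helper standing in for the experience-collection step; none of this is the notebook's actual code.

```python
import torch
import torch.optim as optim

# Hyperparameters from the table above.
TMAX, MAX_EPISODES, PPO_EPOCHS = 300, 800, 5
BATCHSIZE, DISCOUNT, RATIO_CLIP, MAX_GRAD_NORM = 500, 0.98, 0.2, 0.5

model = ActorCritic()
optimizer = optim.Adam(model.parameters(), lr=3e-4, eps=1e-5)

for episode in range(MAX_EPISODES):
    # 1. Collect at most TMAX steps of experience from all 20 arms.
    #    collect_trajectories is a hypothetical helper returning tensors.
    states, actions, old_log_probs, returns, advantages = collect_trajectories(
        model, tmax=TMAX, discount=DISCOUNT)

    # 2. Several PPO epochs over the same trajectory, in random mini-batches.
    for _ in range(PPO_EPOCHS):
        for idx in torch.randperm(len(states)).split(BATCHSIZE):
            dist, values = model(states[idx])
            new_log_probs = dist.log_prob(actions[idx]).sum(-1)

            policy_loss = clipped_surrogate_loss(new_log_probs, old_log_probs[idx],
                                                 advantages[idx], RATIO_CLIP)
            value_loss = (returns[idx] - values.squeeze(-1)).pow(2).mean()

            optimizer.zero_grad()
            (policy_loss + value_loss).backward()
            # Clip the gradient norm before each optimizer step.
            torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
            optimizer.step()
```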
With these settings, the agent should learn to solve the environment within 200 episodes.
Example of an untrained agent:
... and the same agent after 156 episodes of training: