This project is aimed at solving a provided Unity Machine Learning Agents Toolkit challenge. In this challenge, a simulated robotic agent must learn to reach a target position with its arm by applying torque to its joint motors. This project uses the 20-agent version of the environment.
- Windows (64 bit)
- Python 3.6
- Unity ML-Agents Toolkit
- Pytorch
- Matplotlib
- Jupyter
The recommended way to install the dependencies is via Anaconda. To create a new Python 3.6 environment, run
conda create --name myenv python=3.6
Activate the environment with
conda activate myenv
Click here for instructions on how to install the Unity ML-Agents Toolkit.
Visit pytorch.org for instructions on installing PyTorch.
Install matplotlib with
conda install -c conda-forge matplotlib
Jupyter should come installed with Anaconda. If not, click here for instructions on how to install Jupyter.
The project can be run with the provided Jupyter notebooks. Reacher_Observe.ipynb allows one to observe a fully trained agent in the environment. Reacher_Training.ipynb can be used to train a new agent or continue training a pre-trained agent. Several partially pre-trained agents and one fully trained agent are stored in the savedata folder.
The environment is a platform with 20 robot arms. Each arm has its own target sphere hovering around it at a random position. The observation space for a single agent consists of 33 continuous variables describing observables such as position, rotation and angular velocity. The action space consists of 4 continuous actions between -1.0 and +1.0, corresponding to the torques applied at the arm's joints: each of the two joints can be rotated in two directions.
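The snippet below is a minimal sketch of how observations and actions flow through this environment. It assumes the older `unityagents` Python API (the import name may differ depending on your toolkit version) and a hypothetical build name `Reacher.exe`; adjust both to your setup.

```python
import numpy as np
from unityagents import UnityEnvironment  # assumption: API name depends on toolkit version

# Assumption: the 20-agent Reacher build sits next to the notebooks.
env = UnityEnvironment(file_name="Reacher.exe")
brain_name = env.brain_names[0]

env_info = env.reset(train_mode=True)[brain_name]
states = env_info.vector_observations      # shape (20, 33): one row per arm
num_agents, state_size = states.shape

# Random actions in [-1, 1], 4 torque values per arm.
actions = np.clip(np.random.randn(num_agents, 4), -1.0, 1.0)
env_info = env.step(actions)[brain_name]
rewards = env_info.rewards                 # list of 20 rewards
dones = env_info.local_done                # list of 20 episode-end flags

env.close()
```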
The training algorithm used here is Proximal Policy Optimization (PPO) with a clipped surrogate objective, as introduced by OpenAI in 2017. A detailed description can be found in the 2017 paper by Schulman et al. As opposed to Deep Deterministic Policy Gradients, PPO works with probabilities for every action. By collecting trajectories of states, actions and rewards, it tries to increase the probability of actions that resulted in a high reward and decrease the probability of actions that resulted in a low reward. The original version of this algorithm is called REINFORCE, but it has several shortcomings that were partially overcome with improvements like Trust Region Policy Optimization (TRPO). PPO greatly simplifies the approach of TRPO and is especially useful for distributed training, where many parallel agents collect experience.
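To make the clipping concrete, the sketch below computes the clipped surrogate loss for a mini-batch of old log-probabilities, new log-probabilities and advantage estimates. The function and tensor names are illustrative assumptions, not the notebook's actual code.

```python
import torch

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, ratio_clip=0.2):
    """PPO clipped surrogate objective, returned as a loss to be minimized."""
    # Probability ratio r_t = pi_new(a|s) / pi_old(a|s), computed in log space.
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Unclipped and clipped surrogate terms.
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - ratio_clip, 1.0 + ratio_clip) * advantages
    # PPO maximizes the element-wise minimum, so the loss is its negative mean.
    return -torch.min(surr1, surr2).mean()
```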
The agent uses a basic actor-critic model. Both actor and critic are standard fully connected feed-forward neural networks with two hidden layers of 64 units each and ReLU activation. The actor predicts 4 action values and then uses them as the mean of a Gaussian distribution from which actions are sampled. The standard deviation is an extra trainable parameter of the model. The critic is trained to estimate an advantage function.
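The sketch below shows one way such a model could be written in PyTorch. Only the layer sizes (two hidden layers of 64 units, 4 action outputs) and the trainable standard deviation come from the description above; everything else, such as the layer names and the Tanh on the action mean, is an assumption.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, state_size=33, action_size=4, hidden=64):
        super().__init__()
        # Actor: state -> mean of a Gaussian over the 4 torque actions.
        # The final Tanh keeps the mean in [-1, 1] (an assumption, the actual model may differ).
        self.actor = nn.Sequential(
            nn.Linear(state_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_size), nn.Tanh(),
        )
        # Critic: state -> scalar value estimate.
        self.critic = nn.Sequential(
            nn.Linear(state_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
        # Standard deviation as an extra trainable parameter (one per action dimension).
        self.log_std = nn.Parameter(torch.zeros(action_size))

    def forward(self, state):
        mean = self.actor(state)
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        value = self.critic(state)
        return dist, value
```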
During training, the agent collects at most tmax steps of experience. It then performs several epochs of PPO on the collected trajectory using mini-batches. The following hyperparameters are used during training (a sketch of how they fit into the training loop follows the table):
parameter | value | description |
---|---|---|
tmax | 300 | max number of steps to collect |
max_episodes | 800 | maximum number of episodes before quitting |
ppo_epochs | 5 | number of PPO training epochs |
batchsize | 500 | size of the mini-batches used during one epoch |
discount | 0.98 | discount value for future returns |
optimizer | Adam | Adam with lr=0.0003 and eps=1e-5 |
ratio_clip | 0.2 | clipping range for the probability ratio, value recommended in the paper |
max_grad_norm | 0.5 | maximum value of the gradient when performing optimization step |
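The following sketch shows how these hyperparameters could fit together in the training loop. It reuses the `ActorCritic` and `clipped_surrogate_loss` sketches from above, assumes the collected trajectory is stored as tensors, and uses `collect_trajectories` as a hypothetical helper standing in for the experience-collection step; none of this is the notebook's actual code.

```python
import torch
import torch.optim as optim

# Hyperparameters from the table above.
TMAX, MAX_EPISODES, PPO_EPOCHS = 300, 800, 5
BATCHSIZE, DISCOUNT, RATIO_CLIP, MAX_GRAD_NORM = 500, 0.98, 0.2, 0.5

model = ActorCritic()
optimizer = optim.Adam(model.parameters(), lr=3e-4, eps=1e-5)

for episode in range(MAX_EPISODES):
    # 1. Collect at most TMAX steps of experience from all 20 arms.
    #    collect_trajectories is a hypothetical helper returning tensors.
    states, actions, old_log_probs, returns, advantages = collect_trajectories(
        model, tmax=TMAX, discount=DISCOUNT)

    # 2. Several PPO epochs over the same trajectory, in random mini-batches.
    for _ in range(PPO_EPOCHS):
        for idx in torch.randperm(len(states)).split(BATCHSIZE):
            dist, values = model(states[idx])
            new_log_probs = dist.log_prob(actions[idx]).sum(-1)

            policy_loss = clipped_surrogate_loss(new_log_probs, old_log_probs[idx],
                                                 advantages[idx], RATIO_CLIP)
            value_loss = (returns[idx] - values.squeeze(-1)).pow(2).mean()

            optimizer.zero_grad()
            (policy_loss + value_loss).backward()
            # Clip the gradient norm before each optimizer step.
            torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
            optimizer.step()
```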
With these settings, the agent should learn to solve the environment within 200 episodes.
Example of an untrained agent:
... and the same agent after 156 episodes of training: