A simple 2D guidance environment for RL.
The agent's 2D position is (x, y), and θ is the angle between the agent's travel direction and the positive x-axis (a kinematics sketch is given below).
Target: Guide the agent to a designated point.
Action: { velocity | angular velocity }
State: { agent position | target position | relative position }
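A minimal sketch of the kinematics this description suggests, assuming a unicycle model; the time step `dt` and the update rule are illustrative assumptions, not taken from this repo:

```python
import math

def step_kinematics(x, y, theta, velocity, angular_velocity, dt=0.1):
    """Advance the agent one step: translate along the heading, then update the heading.

    The unicycle model and the dt value are assumptions for illustration only.
    """
    x += velocity * math.cos(theta) * dt
    y += velocity * math.sin(theta) * dt
    theta += angular_velocity * dt
    return x, y, theta
```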
Continuous PPO is used in this work, together with the tricks listed below (a short sketch of Tricks 1 and 2 follows the list).
Trick 1: Advantage Normalization.
Trick 2: State Normalization.
Trick 4: Reward Scaling.
Trick 5: Policy Entropy.
Trick 6: Learning Rate Decay.
Trick 7: Gradient Clip.
Trick 8: Orthogonal Initialization.
Trick 9: Adam Optimizer Epsilon Parameter.
Trick 10: Tanh Activation Function.
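For reference, here is a minimal numpy-only sketch of Tricks 1 and 2; the function and class names are illustrative, not the ones used in this repo:

```python
import numpy as np

def normalize_advantages(advantages, eps=1e-8):
    """Trick 1: rescale a batch of advantages to zero mean and unit std."""
    advantages = np.asarray(advantages, dtype=np.float64)
    return (advantages - advantages.mean()) / (advantages.std() + eps)

class RunningStateNormalizer:
    """Trick 2: normalize states with running mean/std estimates (Welford's update)."""

    def __init__(self, state_dim):
        self.count = 0
        self.mean = np.zeros(state_dim)
        self.m2 = np.zeros(state_dim)

    def update(self, state):
        # Update the running first and second moments with one new state.
        self.count += 1
        delta = state - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (state - self.mean)

    def normalize(self, state, eps=1e-8):
        std = np.sqrt(self.m2 / max(self.count, 1)) + eps
        return (state - self.mean) / std
```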
A Beta distribution is used instead of a Gaussian one, so that the agent does not sample too heavily at the edges of the action space.
A sample drawn from the Beta action distribution looks as follows:
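A minimal sketch of drawing one bounded action this way; the shape parameters and action bounds below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng()

def sample_beta_action(alpha, beta, low, high):
    """Sample u ~ Beta(alpha, beta) on (0, 1), then rescale to [low, high].

    Unlike an unbounded Gaussian, the Beta density lives on (0, 1), so the rescaled
    action never has to be clipped and probability mass does not pile up at the
    edges of the action space.
    """
    u = rng.beta(alpha, beta)          # alpha, beta > 1 gives a unimodal density
    return low + (high - low) * u      # affine map onto the real action interval

# Illustrative bounds: velocity in [0, 2], angular velocity in [-1, 1]
velocity = sample_beta_action(2.0, 2.0, 0.0, 2.0)
angular_velocity = sample_beta_action(2.0, 2.0, -1.0, 1.0)
```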
Two reward terms are used in this work: R = T + α * F (a sketch of this computation follows the list).
- F = -0.01 * (current distance - previous distance)
- T = 100 (terminal: target reached); -2 (out of the space); -1 (max episode length reached)
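A hedged sketch of how R = T + α * F could be computed; the function names and the default α are assumptions for illustration:

```python
import math

def shaping_reward(curr_pos, prev_pos, target_pos):
    """F: positive when the agent moved closer to the target, negative otherwise."""
    d_now = math.dist(curr_pos, target_pos)
    d_prev = math.dist(prev_pos, target_pos)
    return -0.01 * (d_now - d_prev)

def terminal_reward(reached_target, out_of_space, max_episode):
    """T: sparse reward given only on episode-ending events."""
    if reached_target:
        return 100.0
    if out_of_space:
        return -2.0
    if max_episode:
        return -1.0
    return 0.0

def total_reward(curr_pos, prev_pos, target_pos,
                 reached_target=False, out_of_space=False, max_episode=False,
                 alpha=1.0):
    """R = T + alpha * F (alpha = 1.0 is an illustrative default)."""
    return (terminal_reward(reached_target, out_of_space, max_episode)
            + alpha * shaping_reward(curr_pos, prev_pos, target_pos))
```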
The terminal reward affects the agent's behavior. Evaluation is run 10 times every 5e3 steps; the results below show the differences:
Terminal Reward = 50:
Terminal Reward = 80:
Terminal Reward = 90:
Terminal Reward = 100:
A set of 10 evaluations was run. The results are shown in /test_img
For example:
Besides the PPO arguments, the normalization arguments (mean and std) must also be used (see the sketch below).
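A minimal sketch of reusing the training-time normalization statistics at evaluation; the file name and the .npz keys are assumptions:

```python
import numpy as np

def load_state_normalizer(path="state_norm.npz"):
    """Load the mean/std collected by the running state normalizer during training."""
    data = np.load(path)
    return data["mean"], data["std"]

def normalize_state(state, mean, std, eps=1e-8):
    """Apply the same statistics at evaluation so the policy sees the same input scale."""
    return (np.asarray(state) - mean) / (std + eps)
```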
numpy
matplotlib
math
gym