Reinforcement Learning

Implementation and experiments with reinforcement learning algorithms and testing them with OpenAI's gym [1].

Success Matching

"When learning from a set of their own trials in iterated decision problems, humans attempt to match not the best taken action but the reward-weighted frequency of their actions and outcomes" Arrow, 1958

Episode-based success matching is a probabilistic policy search method. A distribution over policy parameters is learned by iteratively fitting it to the reward-weighted parameter samples. Each iteration consists of three steps:

  1. Explore: Sample parameters from the parameter distribution
  2. Evaluate: Roll out the resulting policy and collect rewards from the environment
  3. Update: Fit the parameter distribution to the reward-weighted samples with weighted maximum likelihood

This algorithm is known as policy learning by weighting exploration with the returns (PoWER, [2]).
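The loop can be written in a few lines of NumPy. The sketch below assumes a Gaussian parameter distribution and exponentiated, max-shifted returns as weights; the function name, the initial covariance, the temperature beta, and the regularizer are illustrative choices and not necessarily the ones used in this repository.

```python
# Minimal sketch of episode-based success matching with a Gaussian
# parameter distribution (hyperparameters are illustrative assumptions).
import numpy as np

def success_matching(evaluate, dim, epochs=10, samples=5, beta=5.0, seed=0):
    rng = np.random.default_rng(seed)
    mean, cov = np.zeros(dim), np.eye(dim)            # initial parameter distribution
    for _ in range(epochs):
        # 1. Explore: sample policy parameters from the current distribution
        thetas = rng.multivariate_normal(mean, cov, size=samples)
        # 2. Evaluate: one return per parameter sample
        returns = np.array([evaluate(theta) for theta in thetas])
        # 3. Update: weighted maximum likelihood with exponentiated returns
        w = np.exp(beta * (returns - returns.max()))
        w /= w.sum()
        mean = w @ thetas
        diff = thetas - mean
        cov = diff.T @ (w[:, None] * diff) + 1e-6 * np.eye(dim)
    return mean
```

The `evaluate` callback maps a parameter vector to a scalar return, as in the cart pole example below.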

Cart Pole

For the cart pole environment, PoWER works well with a time-independent linear policy and no featurization of the state. The clip below first shows the policy before training and then the policy using the mean of the parameter distribution after each epoch. Training runs for 10 epochs with 5 parameter samples and rollouts per epoch.

cartpole.mp4
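As a rough sketch of this setup, an evaluation function for the linear policy could look like the following. It assumes the classic gym reset/step API and thresholds the linear output into the two discrete CartPole actions; both are assumptions, not necessarily what this repository does.

```python
# Rollout evaluation for CartPole with a time-independent linear policy
# acting directly on the raw 4-dimensional state (illustrative sketch).
import gym
import numpy as np

def evaluate_cartpole(theta, episodes=1):
    env = gym.make("CartPole-v1")
    total = 0.0
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = int(theta @ state > 0.0)         # linear policy, no features
            state, reward, done, _ = env.step(action)
            total += reward
    env.close()
    return total / episodes

# mean = success_matching(evaluate_cartpole, dim=4, epochs=10, samples=5)
```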

Pendulum

For the pendulum environment, PoWER works with a time-independent policy that uses a featurization of the state; the policy is linear in its parameters. Two hurdles make this environment harder: the state is given as positions in task space rather than in joint space (with non-linear dynamics), and the start position varies between rollouts. For the first hurdle the joint angle is computed from the task-space observation; for the second the start position is seeded identically for all rollouts within an epoch. The clip below shows the pendulum starting from different positions; the learned policy solves them but only uses one side for the upswing. This limitation could be addressed with hierarchical relative entropy policy search (HiREPS, [3]), which uses a gating policy to select an option depending on the state and an option policy for the policy parameters depending on the option. In this manner HiREPS can learn versatile solutions.

pendulum.mp4
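A sketch of how both hurdles can be handled, again assuming the classic gym API; the environment id, the feature set, and the torque clipping are illustrative assumptions rather than the exact choices in this repository.

```python
# Recover the joint angle from the task-space observation
# [cos(q), sin(q), q_dot] and build features so the policy stays
# linear in its parameters (feature set is an assumption).
import gym
import numpy as np

def features(obs):
    cos_q, sin_q, q_dot = obs
    q = np.arctan2(sin_q, cos_q)                      # joint angle from task space
    return np.array([q, q_dot, q * q_dot, np.sign(q), 1.0])

def evaluate_pendulum(theta, seed=0):
    env = gym.make("Pendulum-v0")
    env.seed(seed)                                    # same start state for all rollouts in an epoch
    obs = env.reset()
    total = 0.0
    for _ in range(200):
        torque = np.clip(theta @ features(obs), -2.0, 2.0)
        obs, reward, done, _ = env.step([torque])
        total += reward
        if done:
            break
    env.close()
    return total
```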

References

[1] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "OpenAI Gym", 2016.

[2] J. Kober and J. Peters, "Policy search for motor primitives in robotics", in Advances in Neural Information Processing Systems (NIPS), 2008.

[3] C. Daniel, G. Neumann, and J. Peters, "Hierarchical relative entropy policy search", in Artificial Intelligence and Statistics, pp. 273–281, 2012.
