Reinforcement Learning

Implementation and experiments with reinforcement learning algorithms and testing them with OpenAI's gym [1].

Success Matching

"When learning from a set of their own trials in iterated decision problems, humans attempt to match not the best taken action but the reward-weighted frequency of their actions and outcomes" Arrow, 1958

Episode-based success matching is a probabilistic policy search method. A distribution over policy parameters is learned by iteratively fitting it to the reward-weighted parameter samples. Each iteration consists of three steps:

  1. Explore: Sample parameters from the parameter distribution
  2. Evaluate: Roll out the resulting policy and collect rewards from the environment
  3. Update: Fit the parameter distribution to the reward-weighted samples with weighted maximum likelihood

This algorithm is known as policy learning by weighting exploration with the returns (PoWER, [2]).
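The loop can be written in a few lines of NumPy. The sketch below assumes a Gaussian parameter distribution and exponentiated, max-shifted returns as weights; the function name, the initial covariance, the temperature beta, and the regularizer are illustrative choices and not necessarily the ones used in this repository.

```python
# Minimal sketch of episode-based success matching with a Gaussian
# parameter distribution (hyperparameters are illustrative assumptions).
import numpy as np

def success_matching(evaluate, dim, epochs=10, samples=5, beta=5.0, seed=0):
    rng = np.random.default_rng(seed)
    mean, cov = np.zeros(dim), np.eye(dim)            # initial parameter distribution
    for _ in range(epochs):
        # 1. Explore: sample policy parameters from the current distribution
        thetas = rng.multivariate_normal(mean, cov, size=samples)
        # 2. Evaluate: one return per parameter sample
        returns = np.array([evaluate(theta) for theta in thetas])
        # 3. Update: weighted maximum likelihood with exponentiated returns
        w = np.exp(beta * (returns - returns.max()))
        w /= w.sum()
        mean = w @ thetas
        diff = thetas - mean
        cov = diff.T @ (w[:, None] * diff) + 1e-6 * np.eye(dim)
    return mean
```

The `evaluate` callback maps a parameter vector to a scalar return, as in the cart pole example below.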

Cart Pole

For the cart pole environment, PoWER works well with a time-independent linear policy and no featurization of the state. The clip below first shows the policy before training and then the policy using the mean of the parameter distribution after each epoch. Training runs for 10 epochs with 5 parameter samples and rollouts per epoch.

cartpole.mp4
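As a rough sketch of this setup, an evaluation function for the linear policy could look like the following. It assumes the classic gym reset/step API and thresholds the linear output into the two discrete CartPole actions; both are assumptions, not necessarily what this repository does.

```python
# Rollout evaluation for CartPole with a time-independent linear policy
# acting directly on the raw 4-dimensional state (illustrative sketch).
import gym
import numpy as np

def evaluate_cartpole(theta, episodes=1):
    env = gym.make("CartPole-v1")
    total = 0.0
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = int(theta @ state > 0.0)         # linear policy, no features
            state, reward, done, _ = env.step(action)
            total += reward
    env.close()
    return total / episodes

# mean = success_matching(evaluate_cartpole, dim=4, epochs=10, samples=5)
```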

Pendulum

For the pendulum environment, PoWER works with a time-independent policy that uses a featurization of the state; the policy is linear in its parameters. Two hurdles make this environment harder: the state is given as positions in task space rather than in joint space (with non-linear dynamics), and the start position varies between rollouts. For the first hurdle the joint angle is computed from the task-space observation; for the second the start position is seeded identically for all rollouts within an epoch. The clip below shows the pendulum starting from different positions; the learned policy solves them but only uses one side for the upswing. This limitation could be addressed with hierarchical relative entropy policy search (HiREPS, [3]), which uses a gating policy to select an option depending on the state and an option policy for the policy parameters depending on the option. In this manner HiREPS can learn versatile solutions.

pendulum.mp4
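A sketch of how both hurdles can be handled, again assuming the classic gym API; the environment id, the feature set, and the torque clipping are illustrative assumptions rather than the exact choices in this repository.

```python
# Recover the joint angle from the task-space observation
# [cos(q), sin(q), q_dot] and build features so the policy stays
# linear in its parameters (feature set is an assumption).
import gym
import numpy as np

def features(obs):
    cos_q, sin_q, q_dot = obs
    q = np.arctan2(sin_q, cos_q)                      # joint angle from task space
    return np.array([q, q_dot, q * q_dot, np.sign(q), 1.0])

def evaluate_pendulum(theta, seed=0):
    env = gym.make("Pendulum-v0")
    env.seed(seed)                                    # same start state for all rollouts in an epoch
    obs = env.reset()
    total = 0.0
    for _ in range(200):
        torque = np.clip(theta @ features(obs), -2.0, 2.0)
        obs, reward, done, _ = env.step([torque])
        total += reward
        if done:
            break
    env.close()
    return total
```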

References

[1] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "OpenAI Gym", 2016.

[2] J. Kober and J. Peters, "Policy search for motor primitives in robotics", in Advances in Neural Information Processing Systems (NIPS), 2008.

[3] C. Daniel, G. Neumann, and J. Peters, "Hierarchical relative entropy policy search", in Artificial Intelligence and Statistics, pp. 273–281, 2012.
