Doesn't work for continuous_mountain_car #9

Open
joyousrabbit opened this issue May 10, 2017 · 6 comments
Comments

@joyousrabbit

Hello, the algorithm doesn't work for continuous_mountain_car, because its reward is -pow(action[0],2)*0.1. That means the car's initial state is a local reward maximum: any exploration decreases the reward, so the policy cannot evolve.

Of course, if the car happens to find the final solution in a single try, it will work, but the probability of that is negligible.

How do you handle an initial state that is a local maximum like this?
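The local maximum described above can be illustrated with a small sketch. It only models the reward, not the car's physics: the `goal_bonus` of 100, the episode lengths, and the hand-picked step at which the goal is reached are assumptions made for illustration, and the function names are hypothetical.

```python
def step_reward(action, reached_goal, goal_bonus=100.0):
    """Quadratic action penalty, plus an assumed bonus when the goal is reached."""
    reward = goal_bonus if reached_goal else 0.0
    reward -= (action ** 2) * 0.1
    return reward


def episode_return(actions, reached_goal_at=None):
    """Sum step rewards; `reached_goal_at` is set by hand, no physics is simulated."""
    total = 0.0
    for t, action in enumerate(actions):
        done = reached_goal_at is not None and t == reached_goal_at
        total += step_reward(action, done)
        if done:
            break
    return total


# A policy that does nothing pays no penalty at all.
idle = episode_return([0.0] * 200)
# A policy that pushes a little, but never reaches the goal, is strictly worse.
timid = episode_return([0.3] * 200)
# Only a policy that actually reaches the goal beats doing nothing.
solved = episode_return([1.0] * 150, reached_goal_at=149)

print(idle, timid, solved)
```

Under these assumptions, every small perturbation of the idle policy lowers the return unless it happens to reach the goal, which is the situation the question is about.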

@PatrykChrabaszcz

What do you mean when you say it will work if the car finds the final solution in one try?
I think that even if it finds a good solution (reaching the final state) by accident, the update to the weights will be too small anyway, because most of the population will want to keep the "do nothing" policy. Correct me if I'm wrong, but I think that for this experiment you would have to change the way the policy weights are updated so that much better results get more weight and the rest are ignored. You would also have to increase the noise, so that a good policy can be found by adding noise to a policy that does nothing.

This example is quite hard. I managed to get good results for the discrete version (MountainCar-v0) but had no success with this one.
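One way to read the suggestion above ("give more value to much better results and ignore the rest") is a truncation-style weighting, where only the top fraction of the population contributes to the update. The sketch below is not this repository's update rule; `elite_weights`, `elite_update`, `elite_frac`, and the toy numbers are all hypothetical. A larger `sigma` corresponds to the "increase the noise" part of the suggestion.

```python
def elite_weights(returns, elite_frac=0.1):
    """Give equal weight to the top `elite_frac` of returns and zero to the rest."""
    n = len(returns)
    k = max(1, int(n * elite_frac))
    elite = sorted(range(n), key=lambda i: returns[i], reverse=True)[:k]
    weights = [0.0] * n
    for i in elite:
        weights[i] = 1.0 / k
    return weights


def elite_update(theta, noises, returns, sigma, lr=1.0, elite_frac=0.1):
    """Move theta along the weighted sum of the noise vectors of the elite members."""
    weights = elite_weights(returns, elite_frac)
    step = [
        sum(weights[i] * noises[i][j] for i in range(len(noises)))
        for j in range(len(theta))
    ]
    return [t + lr * sigma * s for t, s in zip(theta, step)]


# Toy population of 10 one-parameter policies; only the last one reaches the goal.
theta = [0.0]
noises = [[float(i)] for i in range(10)]
returns = [0.0] * 9 + [85.0]
new_theta = elite_update(theta, noises, returns, sigma=0.5)
print(new_theta)
```

With `elite_frac=0.1` and ten members, the single successful perturbation determines the whole step, so one lucky rollout is not averaged away by the idle majority. The trade-off is a noisier update when that one rollout was a fluke.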

@joyousrabbit

joyousrabbit commented May 11, 2017

@PatrykChrabaszcz Hello, once the solution is found, the new weights will quickly all be based on that solution.

@PatrykChrabaszcz

I don't see how one proper solution would drag the weights of the current policy far enough to make it more probable that the next generation draws more policies that reach the final state (for this environment). With the current default update rule, the influence of the policies that do nothing will be much bigger.

Maybe you mean initializing the current policy (by accident) such that a big part of the first population reaches the goal state.

@joyousrabbit

@PatrykChrabaszcz No, whenever it reaches the goal state, the influence on the following biased and random weights will be big and immediate, because its reward is huge compared with that of the other population members, which do nothing.

@PatrykChrabaszcz

The reward might be huge, but by default, if I understand correctly, it uses a weighted average to update the parameters, and the weights in this average are the centered_rank values, which lie in [-0.5, 0.5]. So if there is only one good solution in the population, it is counted as 0.5, but the next one is counted as about 0.49 (assuming, for example, a population of size 100). That's why I said you could change the way those weights are computed so that the good solution gets higher importance.
Am I right?
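The arithmetic in this comment can be checked with a short sketch of a centered-rank transform. This is a plain-Python version written for illustration under the assumption that ranks are scaled to [-0.5, 0.5] as described above; the function name `centered_ranks` and the example returns are hypothetical, not the repository's code.

```python
def centered_ranks(returns):
    """Map returns to ranks 0..n-1, then scale the ranks into [-0.5, 0.5]."""
    n = len(returns)
    order = sorted(range(n), key=lambda i: returns[i])
    ranks = [0] * n
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return [r / (n - 1) - 0.5 for r in ranks]


# Population of 100: 99 members that barely move (small penalties)
# and one member that reaches the goal with a large return.
returns = [-0.01 * i for i in range(1, 100)] + [85.0]
weights = centered_ranks(returns)

best_weight = weights[-1]    # the member that reached the goal
runner_up_weight = weights[0]  # the best of the members that did not
print(best_weight, runner_up_weight)

# The transform only sees the ordering, so shrinking the winner's return
# to just above the runner-up leaves every weight unchanged.
barely_better = returns[:-1] + [0.0]
same_weights = centered_ranks(barely_better)
```

With 100 members the winner gets 0.5 and the runner-up gets 98/99 - 0.5, roughly 0.49, regardless of how large the gap in raw return is.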

@joyousrabbit

joyousrabbit commented May 11, 2017 via email

2 participants