Don't override terminal observation when using AutoResetWrapper #69
Conversation
Also pinging @dfilan, since he is using AutoResetWrapper and might have some thoughts.
Codecov Report
@@           Coverage Diff           @@
##            master      #69   +/-  ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           26        26
  Lines         1047      1084    +37
=========================================
+ Hits          1047      1084    +37
(Requested a review from a somewhat random person, sorry if I picked the wrong one)
Since you're changing the default behavior, this is a breaking change in the API, so I'm thinking we might want to bump the version to 0.2.x. What do you think, @AdamGleave?
The code does seem to do what you intend it to do; I'm just a bit confused about the mathematical reasoning behind this. The default behavior of ignoring the terminal observation and replacing it with a reset seems bad, since it introduces a trajectory fragment where the learner might think that the action which in reality leads to the terminal state instead leads to a uniformly sampled initial state, which is also non-physical. But in your modification, ignoring the action the user takes conditioned on the terminal observation effectively does the same thing (it introduces a fake transition), with the only difference that the terminal observation is still in the outer trajectory data. I guess an argument in favor of the modification is that the behavior of the agent in transitions beyond termination should not be relied upon in any case, so it doesn't matter if the algorithm doesn't learn anything sensible beyond terminal observations; but for trajectory fragments that are part of the POMDP, you don't want to be excluding important information. You're also manually injecting a reward of zero, which seems fine in many cases, but it's probably not OK to assume all reward functions are non-negative. Please do correct me if I'm misunderstanding anything!
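For concreteness, here is a minimal sketch of the kind of auto-reset step being discussed (not seals' actual implementation; it assumes the older four-tuple gym step API), where the terminal observation is silently replaced by a reset observation:

```python
import gym


class NaiveAutoReset(gym.Wrapper):
    """Illustrative only: reset the underlying env as soon as it reports done."""

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if done:
            # The terminal observation is dropped: the recorded transition
            # appears to go from the pre-terminal state straight to a freshly
            # sampled initial state, i.e. the "fake transition" discussed above.
            obs = self.env.reset()
        # `done` is masked so the outer episode continues.
        return obs, reward, False, info
```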
Yeah, that's right, and those problems exist; I'm not sure there is any other way around that. As a workaround for the problem of always returning reward 0, I could add an optional field specifying the fixed reward that gets returned for the terminal observation, defaulting to 0.
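A minimal sketch of what such an option could look like (the field name `reset_reward` and the surrounding code are illustrative assumptions, not the exact implementation merged here):

```python
import gym


class AutoResetWithFixedReward(gym.Wrapper):
    """Sketch: keep the terminal observation; on the following step, ignore
    the action, reset the underlying env, and return a configurable reward."""

    def __init__(self, env, reset_reward=0.0):
        super().__init__(env)
        self.reset_reward = reset_reward
        self._needs_reset = False

    def reset(self, **kwargs):
        self._needs_reset = False
        return self.env.reset(**kwargs)

    def step(self, action):
        if self._needs_reset:
            # The previous step ended the underlying episode: the action is
            # ignored and the fixed reward stands in for a real one.
            self._needs_reset = False
            return self.env.reset(), self.reset_reward, False, {}
        obs, reward, done, info = self.env.step(action)
        self._needs_reset = done
        # `done` is masked so the outer episode keeps going.
        return obs, reward, False, info
```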
BTW, I somewhat prefer the default behavior suggested in this PR, but it would also be fine to change it so the default is as it was before. Then we wouldn't have to make breaking changes, and we could always introduce them at a later point if we want to.
I think this should work
This would be a nice option in principle, but since I'm not using this feature myself, I'm not sure how useful it would be in practice. It might be worth adding if the change is simple to make and keeps the API clean, but otherwise your previous suggestion is probably fine.
I think this would be ideal, so we can get this merged right away. Otherwise we'd have to run checks on imitation etc. and make a separate decision.
Let me know if you want help implementing this, even though I think it should be pretty straightforward. Happy to make a final review once that's done.
Discard terminal obs by default, set reset reward
LGTM
Linter is currently failing
Hm,
Could be a package issue; could you check whether you have the same versions of everything installed?
Ok, re-running build_venv is not enough; I have to completely recreate it from scratch. Now I can reproduce this error. Using
Will look more into this tomorrow.
Ok, this seems to have been a bug introduced in |
In general, the total number of observations in an episode is always 1 + the number of transitions/actions, because there is always a final next_state observation that the agent does not act on. The AutoResetWrapper effectively combines several episodes of the underlying environment into a single continuing episode.
gym does not anticipate this use case: it only returns a single observation per transition, in addition to the very first observation passed to the agent, which is generally obtained by calling .reset() after an episode is done. Consequently, if we combine n episodes, the question remains how to handle these n-1 extra observations. This PR modifies the AutoResetWrapper to provide two modes: in the first, the terminal observation is discarded and replaced by the reset observation, as before; in the second, the terminal observation is returned as-is, and the reset happens on the following step, which ignores the agent's action and returns a fixed reward (defaulting to 0). I chose to use the latter behavior as the default since it presumably leaks less information.
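A usage sketch of the two modes (the keyword names `discard_terminal_observation` and `reset_reward` are assumptions about the final API and may differ from what was merged):

```python
import gym
from seals.util import AutoResetWrapper

# Mode 1 (discard): the terminal observation is replaced by the reset
# observation, matching the previous behavior.
env_discard = AutoResetWrapper(
    gym.make("CartPole-v1"),
    discard_terminal_observation=True,
)

# Mode 2 (keep): the terminal observation is returned unchanged; the reset
# happens on the next step, which ignores the action and returns `reset_reward`.
env_keep = AutoResetWrapper(
    gym.make("CartPole-v1"),
    discard_terminal_observation=False,
    reset_reward=0.0,
)
```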
This PR also adds a test case for the new behavior.