Migrate to gymnasium #72
Conversation
@ernestum could you take an initial look at this? There are still some errors to fix, but it seems like most of the stuff is updated now.
Also, @AdamGleave, I'm introducing some API changes here, since now that gym has generic types a lot of our weird workarounds are no longer necessary. In particular, at the ABC top level, I am keeping the inherited observation_space and action_space type declarations instead of creating our own custom private variables; users have to set these attributes in their implementations. I am also migrating the old random number generation to the new interface and adding proper numpy generic typing where appropriate. Also, since Discrete is an int64, the state and action values are now also int64.
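To make the intended pattern concrete, here is a minimal sketch of what an implementation could look like under these changes (the TabularEnv name and its dynamics are hypothetical, for illustration only, not code from this PR): it keeps the inherited, generically typed observation_space and action_space attributes, seeds the new np_random Generator via super().reset(seed=...), and works with int64 states and actions since Discrete samples are int64.

```python
from typing import Optional

import gymnasium as gym
import numpy as np


class TabularEnv(gym.Env[np.int64, np.int64]):
    """Illustrative (hypothetical) env with int64 observations and actions."""

    def __init__(self, n_states: int, n_actions: int):
        # Keep the observation_space / action_space attributes inherited from
        # gym.Env (typed through the generic parameters above) rather than
        # shadowing them with custom private variables.
        self.observation_space = gym.spaces.Discrete(n_states)
        self.action_space = gym.spaces.Discrete(n_actions)
        self._n_states = n_states
        self._state = np.int64(0)

    def reset(self, *, seed: Optional[int] = None, options=None):
        # New RNG interface: super().reset(seed=...) seeds self.np_random
        # (a np.random.Generator), replacing the old seeding helpers.
        super().reset(seed=seed)
        self._state = np.int64(self.np_random.integers(self._n_states))
        return self._state, {}

    def step(self, action: np.int64):
        # Discrete spaces sample int64, so states and actions are int64 too.
        self._state = np.int64(self.np_random.integers(self._n_states))
        reward, terminated, truncated = 0.0, False, False
        return self._state, reward, terminated, truncated, {}
```

The key point is that subclasses simply assign the inherited space attributes and draw all randomness from self.np_random, rather than maintaining parallel private state.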
Also, we seem to be getting a CI error:
I don't think this is related to my changes?
Looks good so far. Thanks for the fast progress on this one!
I don't think the CI error should be due to your changes.
I left a few comments, but it wasn't necessarily 100% comprehensive.
One thing to consider, which I don't have a good answer for right now (would need to spend some more time getting used to the library) -- when we hit the end of a fixed-horizon env, should it be a terminated or a truncated situation?
The first-order approximation (in "regular" RL) is that if you're hitting a time limit, it's truncation; if you're reaching a terminal state, it's termination.
The second-order approximation is that if, in regular RL training, you'd use the value estimate at the final step to complete the discounted return estimate, then that's truncation.
Fixed-horizon envs might have a weird interaction with this. In the end, the most important thing is probably that it's consistent between here and imitation. This might also slightly push it towards using truncated, since there are many environments (like Half-Cheetah, for example) which will apply that semantic automatically.
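As a rough illustration of the distinction being discussed (a minimal sketch; the FixedHorizonEnv name, its 100-step horizon, and its dynamics are assumptions for illustration, not code from this PR), a fixed-horizon env under the Gymnasium step API would report the end of the horizon through the truncated flag rather than terminated:

```python
import gymnasium as gym
import numpy as np


class FixedHorizonEnv(gym.Env):
    """Toy env whose episodes always end after exactly `horizon` steps."""

    def __init__(self, horizon: int = 100):
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(2)
        self.horizon = horizon
        self._t = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._t = 0
        return np.zeros(1, dtype=np.float32), {}

    def step(self, action):
        self._t += 1
        obs = np.zeros(1, dtype=np.float32)
        reward = 1.0
        # No terminal states here: hitting the horizon is a time limit, so it
        # is reported as truncation (as the TimeLimit wrapper does for envs
        # like Half-Cheetah), and a value estimate at the final state would be
        # used to bootstrap the return.
        terminated = False
        truncated = self._t >= self.horizon
        return obs, reward, terminated, truncated, {}
```

Whether seals ultimately treats the fixed horizon as truncation or termination is exactly the open question above; the sketch only shows where that choice surfaces in the step API.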
@Rocamonde what's the current status on this? Planning on having someone take a look at the Gymnasium PR in SB3 next week.
I'm currently quite swamped with other things and haven't been able to look at this. I expect that I will have a bit more time in the future.
Closing this in favor of #73.