Reproduce MOPO results #101

Open
jdchang1 opened this issue Jul 23, 2021 · 5 comments
@jdchang1

Hi @takuseno,

I have been trying to reproduce the MOPO results using your library and have been having trouble. I have been following your MOPO script in the reproduce directory and have also experimented with training the dynamics ensemble for much longer. However, even after training for more than 500 epochs, I fail to get an evaluation score higher than 7 on Hopper-medium (d4rl). Could you provide any pointers?

The SAC losses look suspicious; in particular, the critic loss seems to blow up. Thanks!

@takuseno
Owner

@jdchang1 Hello, thank you for the issue. I haven't spent much time checking MOPO's performance lately, but the d4rl dataset conversion was fixed very recently:
8e141c0
It might be worth trying again with the latest master branch.

@TakuyaHiraoka

Hi @takuseno @jdchang1,

The MOPO implementation in d3rlpy does not terminate its model rollouts at terminal states.
The original MOPO checks whether each generated state is terminal by calling the termination function of the true environment, and cuts the rollout short at terminal states (lines 408-446 in [1]).

I found that adding this model-rollout termination improves d3rlpy MOPO's performance and brings it closer to the original.
(d3rlpy MOPO with a quick patch for the model-rollout termination, together with its evaluation results, is available at [2].)

[1] https://github.com/tianheyu927/mopo/blob/master/mopo/algorithms/mopo.py
[2] https://drive.google.com/file/d/1GvHWJj3sU1wl7NGxMibee-ZbIOt65Smb/view?usp=sharing
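For concreteness, the Hopper rule in the static termination functions of the original repo [1] looks roughly like the sketch below, and the loop after it shows where such a check would cut model rollouts short. This is a minimal illustration under stated assumptions, not the actual patch in [2]; `dynamics_step` and `policy` are hypothetical stand-ins for the learned model and the current SAC policy, not d3rlpy APIs.

```python
import numpy as np

def hopper_termination_fn(obs, act, next_obs):
    # Mirrors gym Hopper's healthy-state check: a state is terminal when
    # the torso height drops below 0.7, the torso angle exceeds 0.2 rad,
    # or any state element becomes non-finite / unreasonably large.
    height = next_obs[:, 0]
    angle = next_obs[:, 1]
    not_done = (
        np.isfinite(next_obs).all(axis=-1)
        & (np.abs(next_obs[:, 1:]) < 100).all(axis=-1)
        & (height > 0.7)
        & (np.abs(angle) < 0.2)
    )
    return ~not_done

def rollout(dynamics_step, policy, start_obs, horizon):
    # dynamics_step(obs, act) -> (next_obs, reward) is the learned model;
    # both it and policy are illustrative stand-ins.
    obs = start_obs
    transitions = []
    for _ in range(horizon):
        act = policy(obs)
        next_obs, reward = dynamics_step(obs, act)
        done = hopper_termination_fn(obs, act, next_obs)
        transitions.append((obs, act, next_obs, reward, done))
        obs = next_obs[~done]  # keep rolling out only live trajectories
        if len(obs) == 0:
            break
    return transitions
```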

@takuseno
Owner

takuseno commented Feb 5, 2022

@TakuyaHiraoka Thanks for the info! I hadn't realized they used such a tricky hack. It doesn't seem very practical in general; I would rather add a classifier trained to estimate terminal flags. This probably explains the COMBO issue too.
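A minimal sketch of that classifier idea, assuming transitions and terminal flags have already been extracted from the offline dataset: all array names below are illustrative placeholders, and the scikit-learn model is just one possible choice, not d3rlpy code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for arrays extracted from the offline dataset:
# (obs, act, next_obs) transitions and their recorded terminal flags.
N, obs_dim, act_dim = 10_000, 11, 3
obs = rng.normal(size=(N, obs_dim))
act = rng.normal(size=(N, act_dim))
next_obs = rng.normal(size=(N, obs_dim))
terminals = (next_obs[:, 0] < -1.5).astype(int)  # toy labels

X = np.concatenate([obs, act, next_obs], axis=1)

# Terminal transitions are rare in most datasets, so re-weight classes.
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X, terminals)

def predicted_termination_fn(obs, act, next_obs, threshold=0.5):
    # Estimated terminal flags for model-generated transitions.
    x = np.concatenate([obs, act, next_obs], axis=1)
    return clf.predict_proba(x)[:, 1] > threshold
```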

@IantheChan

Actually, the official implementation of MOReL uses the same trick. Maybe using a known termination function is common in model-based offline RL?

@ZishunYu

I agree with @IantheChan. To my knowledge, several model-based RL works use this trick, and it seems to be critical for model-based RL. It also makes sense for robot learning, since it is usually not too difficult to check whether a robot's sensor readings correspond to a feasible state.
