Reproduce MOPO results #101

Open
jdchang1 opened this issue Jul 23, 2021 · 5 comments
@jdchang1

Hi @takuseno,

I have been trying to reproduce the MOPO results using your library and have been having trouble. I have been following your MOPO script in the reproduce directory and have also experimented with training the dynamics ensemble for much longer. However, even after training for more than 500 epochs, I fail to get an evaluation score higher than 7 on Hopper-medium (d4rl). Could you provide any pointers?

The SAC losses look suspicious; in particular, the critic loss seems to blow up. Thanks!

@takuseno
Owner

@jdchang1 Hello, thank you for the issue. I haven't spent much time checking MOPO's performance lately, but the d4rl dataset conversion was fixed very recently:
8e141c0
It might be worth trying again with the latest master branch.

@TakuyaHiraoka

Hi @takuseno @jdchang1,

The MOPO implementation in d3rlpy does not terminate its model rollouts at terminal states.
The original MOPO checks whether each generated state is terminal by calling the termination function of the true environment, and cuts the rollout short at terminal states (lines 408-446 in [1]).

I found that adding this model-rollout termination improves d3rlpy MOPO's performance and brings it closer to the original.
(d3rlpy MOPO with a quick patch for the model-rollout termination, together with its evaluation results, is available at [2].)

[1] https://github.com/tianheyu927/mopo/blob/master/mopo/algorithms/mopo.py
[2] https://drive.google.com/file/d/1GvHWJj3sU1wl7NGxMibee-ZbIOt65Smb/view?usp=sharing
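For concreteness, the Hopper rule in the static termination functions of the original repo [1] looks roughly like the sketch below, and the loop after it shows where such a check would cut model rollouts short. This is a minimal illustration under stated assumptions, not the actual patch in [2]; `dynamics_step` and `policy` are hypothetical stand-ins for the learned model and the current SAC policy, not d3rlpy APIs.

```python
import numpy as np

def hopper_termination_fn(obs, act, next_obs):
    # Mirrors gym Hopper's healthy-state check: a state is terminal when
    # the torso height drops below 0.7, the torso angle exceeds 0.2 rad,
    # or any state element becomes non-finite / unreasonably large.
    height = next_obs[:, 0]
    angle = next_obs[:, 1]
    not_done = (
        np.isfinite(next_obs).all(axis=-1)
        & (np.abs(next_obs[:, 1:]) < 100).all(axis=-1)
        & (height > 0.7)
        & (np.abs(angle) < 0.2)
    )
    return ~not_done

def rollout(dynamics_step, policy, start_obs, horizon):
    # dynamics_step(obs, act) -> (next_obs, reward) is the learned model;
    # both it and policy are illustrative stand-ins.
    obs = start_obs
    transitions = []
    for _ in range(horizon):
        act = policy(obs)
        next_obs, reward = dynamics_step(obs, act)
        done = hopper_termination_fn(obs, act, next_obs)
        transitions.append((obs, act, next_obs, reward, done))
        obs = next_obs[~done]  # keep rolling out only live trajectories
        if len(obs) == 0:
            break
    return transitions
```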

@takuseno
Owner

takuseno commented Feb 5, 2022

@TakuyaHiraoka Thanks for the info! I hadn't realized they used such a tricky hack. It doesn't seem very practical in general; I would rather add a classifier trained to estimate terminal flags. This probably explains the COMBO issue too.
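A minimal sketch of that classifier idea, assuming transitions and terminal flags have already been extracted from the offline dataset: all array names below are illustrative placeholders, and the scikit-learn model is just one possible choice, not d3rlpy code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for arrays extracted from the offline dataset:
# (obs, act, next_obs) transitions and their recorded terminal flags.
N, obs_dim, act_dim = 10_000, 11, 3
obs = rng.normal(size=(N, obs_dim))
act = rng.normal(size=(N, act_dim))
next_obs = rng.normal(size=(N, obs_dim))
terminals = (next_obs[:, 0] < -1.5).astype(int)  # toy labels

X = np.concatenate([obs, act, next_obs], axis=1)

# Terminal transitions are rare in most datasets, so re-weight classes.
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X, terminals)

def predicted_termination_fn(obs, act, next_obs, threshold=0.5):
    # Estimated terminal flags for model-generated transitions.
    x = np.concatenate([obs, act, next_obs], axis=1)
    return clf.predict_proba(x)[:, 1] > threshold
```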

@IantheChan

Actually, the official implementation of MOReL uses the same trick. Maybe using a known termination function is common in model-based offline RL?

@ZishunYu

I agree with @IantheChan. To my knowledge, several model-based RL works use this trick, and it seems to be critical for model-based RL. It also makes sense for robot learning, since it is usually not too difficult to check whether a robot's sensor readings correspond to a feasible state.
