
Why is MOPO given access to a terminal function in rollout generation? #4

Open
IcarusWizard opened this issue Nov 30, 2020 · 2 comments

Comments

@IcarusWizard

Hi, thanks for sharing your great work. However, I am confused about the rollout generation process.

As I see in the code, the agent has access to a pre-defined terminal function that cuts off unrealistic states during rollouts. Does this assumption generally hold across offline RL settings? To my understanding, in the offline setting the agent should only have access to a fixed dataset and nothing else. It feels a little like cheating to me, especially since the paper argues that one of the differences between MOPO and MOReL is that MOPO's soft penalty, rather than a hard termination, allows the agent to take risky actions.
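Concretely, this is a minimal sketch of the rollout loop I am referring to (all names here are illustrative, not the actual MOPO code): the hand-coded `termination_fn`, rather than anything learned from the offline data, decides when a rollout is cut.

```python
def generate_rollouts(model, policy, termination_fn, start_states, horizon):
    """Sketch of a model-based rollout loop with a hand-coded terminal function.

    `model`, `policy`, and `termination_fn` are placeholders; the point is only
    that `termination_fn` is a pre-defined rule, not something learned from the
    offline dataset.
    """
    states = start_states                    # batch of start states (NumPy array)
    transitions = []
    for _ in range(horizon):
        actions = policy.act(states)
        next_states, rewards = model.step(states, actions)    # learned dynamics
        dones = termination_fn(states, actions, next_states)  # pre-defined rule
        transitions.append((states, actions, rewards, next_states, dones))
        states = next_states[~dones]         # only live rollouts continue
        if len(states) == 0:
            break
    return transitions
```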

Besides, if MOPO really needs a terminal function, why not learn one with a neural net? I have already seen many model-based works on Atari games that use a learned terminal function.
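For example, a learned terminal function could just be a small binary classifier fitted on the offline dataset. A rough sketch of what I have in mind (hypothetical, written in PyTorch, not tied to the MOPO codebase):

```python
import torch
import torch.nn as nn

class TerminalClassifier(nn.Module):
    """Predicts P(done | s, a, s') from offline transitions."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a, s_next):
        return self.net(torch.cat([s, a, s_next], dim=-1))  # logits

def train_terminal_fn(classifier, loader, epochs=10, lr=1e-3):
    """Fit the classifier on batches of (s, a, s_next, done) from the dataset.

    Terminal transitions are rare in most locomotion datasets, so in practice
    the class imbalance needs care (e.g. pos_weight in the loss).
    """
    opt = torch.optim.Adam(classifier.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for s, a, s_next, done in loader:
            logits = classifier(s, a, s_next).squeeze(-1)
            loss = loss_fn(logits, done.float())
            opt.zero_grad()
            loss.backward()
            opt.step()
    return classifier
```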

@vermouth1992

Actually, assuming access to the terminal function is reasonable in practice for model-based RL. We can even assume access to a given reward function when doing model-based learning. The reward and terminal functions are essentially defined by human experts, whereas the transition dynamics are governed by nature. Thus, reverse-engineering the reward and terminal functions from data is not really necessary in practice.
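To make that concrete, here is roughly what such a hand-coded termination rule looks like for Hopper, written from memory based on the standard gym healthy-state condition, so treat the exact thresholds as approximate rather than the actual code:

```python
import numpy as np

def hopper_termination_fn(obs, act, next_obs):
    """Approximate hand-coded Hopper termination rule (batched).

    Mirrors the usual gym condition: the state must be finite, stay within
    bounds, keep the torso above a height threshold, and keep the torso angle
    small. Thresholds here are from memory and may not match the exact code.
    """
    height = next_obs[:, 0]
    angle = next_obs[:, 1]
    healthy = (
        np.isfinite(next_obs).all(axis=-1)
        & (np.abs(next_obs[:, 1:]) < 100).all(axis=-1)
        & (height > 0.7)
        & (np.abs(angle) < 0.2)
    )
    return ~healthy
```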

What worries me the most is that the numbers reported in the accepted version of the paper (the latest one) are different from those in the first arXiv version posted while the paper was under review.

@IcarusWizard
Author

I agree with your point. In real-world scenarios, the reward function and the terminal function are available in most cases (MDP settings and some POMDP settings). I guess future work can take this into account.

About the results, I guess they just redid the experiments in a more standardized way. However, it does seem to take a bit of luck to get a good result even with the correct hyper-parameters.
