Hi, thanks for sharing your great work. However, I am confused about the rollout generation process.
As I see in the code, the agent can access a pre-defined terminal function to cut off unrealistic states. Does this assumption generally hold up for broad cases of offline RL? To my understanding, in the offline setting the agent should only have access to a fixed dataset and nothing else. It feels a little like cheating to me, especially since, in the paper, the authors argue that one of the differences between MOPO and MOReL is that MOPO's soft penalty, rather than a hard terminal, allows the agent to take risky actions.
Besides, if MOPO really needs the terminal function, why not learn one with a neural network? I have already seen many model-based works on Atari games that use a learned terminal function.
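For reference, my mental model of the rollout generation is roughly the following (a sketch with illustrative names, not the actual code in this repo): short model rollouts are started from dataset states, the reward is penalized by the model's uncertainty, and the hand-specified terminal function decides which rollouts get truncated.

```python
import numpy as np

def rollout_with_termination(policy, dynamics_model, termination_fn,
                             start_states, horizon, penalty_coef):
    """Illustrative sketch only: generate short model rollouts and truncate
    them with a pre-defined termination function instead of a learned one."""
    transitions = []
    states = start_states
    for _ in range(horizon):
        if len(states) == 0:
            break
        actions = policy(states)
        # The learned dynamics model predicts next states, rewards, and an
        # uncertainty estimate that is subtracted as a soft reward penalty.
        next_states, rewards, uncertainty = dynamics_model.predict(states, actions)
        penalized_rewards = rewards - penalty_coef * uncertainty
        # The hand-specified terminal function flags terminal / unrealistic states.
        dones = termination_fn(states, actions, next_states)
        transitions.append((states, actions, penalized_rewards, next_states, dones))
        # Only rollouts that have not terminated continue to the next step.
        states = next_states[~dones]
    return transitions
```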
Actually, being given the terminal function is reasonable in practice for model-based RL. We can even assume access to a given reward function when doing model-based learning. The reward and terminal functions are essentially defined by human experts, whereas the transition dynamics are governed by nature. Thus, reverse-engineering the reward and terminal functions from data is not necessary in practice.
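For example, the termination condition for a locomotion task like Hopper is just a few hand-specified checks on the state, along these lines (a rough sketch based on the standard Gym healthy-state condition, not code copied from this repo):

```python
import numpy as np

def hopper_termination_fn(obs, act, next_obs):
    """Sketch of a hand-written Hopper termination rule: the episode ends when
    the torso height or angle leaves the healthy range or the state blows up."""
    height = next_obs[:, 0]
    angle = next_obs[:, 1]
    not_done = (
        np.isfinite(next_obs).all(axis=-1)
        & (np.abs(next_obs[:, 1:]) < 100).all(axis=-1)
        & (height > 0.7)
        & (np.abs(angle) < 0.2)
    )
    return ~not_done
```

A few lines like this fully specify the terminal condition, so there is little to gain from learning it with a neural network.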
What worries me the most is that the numbers reported in the paper after it was accepted (the latest version) differ from those reported in the first arXiv version, when the paper was under review.
I agree with your point. In real-world scenarios, the reward function and the terminal function are available in most cases (MDP settings and some POMDP settings). I guess future work can take this into account.
About the results, I guess they just redid the experiments in a more standardized way. However, it does take a bit of luck to get a good result even with the correct hyper-parameters.