Hierarchical policies #253
Conversation
This review got a bit away from me and feels a bit jumbled: the comments range from minor low-level naming suggestions to high-level thoughts about API consistency. Sorry about that.

Overall, I think this direction is very cool and we definitely need something like your `DirectedGraphicalModel` class. That said, I'm a bit torn:
- As we do more of this directed graphical model stuff, it feels a bit like we're heading down the path of reinventing TensorFlow, i.e. defining a static computation graph, defining inputs, and then running it. Even with the best engineering, this gets quite complicated.
- At the same time, this `DirectedGraphicalModel` class is (correct me if I'm wrong) just the most general means a user has to enable this hierarchical functionality. The other approach would be for them to define their own `Distr` subclass which (effectively) hard-codes all of the inputs, outputs, and conditional distributions. Given this, maybe it doesn't matter that the `DirectedGraphicalModel` approach is a bit complex; people can choose to go another direction if that's what they prefer.

It might be nice to hear @jiasenlu's thoughts as the person who would likely be the first "customer" of this.
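To make the second alternative concrete, here is a minimal sketch (class and attribute names are hypothetical, not the actual `Distr` API) of a hand-coded two-level distribution whose structure — option first, then an action conditioned on it — is fixed in code rather than declared as a graph:

```python
# Hypothetical sketch of the "hand-coded Distr subclass" alternative to
# DirectedGraphicalModel. All names here are illustrative.
import math
import random

class TwoLevelDistr:
    def __init__(self, option_probs, action_probs_per_option):
        self.option_probs = option_probs              # p(option)
        self.action_probs = action_probs_per_option   # p(action | option)

    def sample(self):
        # Sample the high-level option, then the low-level action given it.
        option = random.choices(range(len(self.option_probs)),
                                weights=self.option_probs)[0]
        action = random.choices(range(len(self.action_probs[option])),
                                weights=self.action_probs[option])[0]
        return option, action

    def log_prob(self, option, action):
        # The joint factorizes along the hard-coded graph:
        # log p(option) + log p(action | option)
        return (math.log(self.option_probs[option])
                + math.log(self.action_probs[option][action]))

d = TwoLevelDistr([0.5, 0.5], [[0.25, 0.75], [0.9, 0.1]])
print(round(d.log_prob(0, 1), 4))  # log(0.5 * 0.75) ≈ -0.9808
```

The trade-off is exactly the one above: this is simple and transparent for a fixed hierarchy, but every new structure requires writing a new class.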
```python
def act(self, rollouts: RolloutStorage, dist_wrapper_class: Optional[type] = None):
    dist_wrapper_class = None
    if self.training_pipeline.current_stage.teacher_forcing is not None:
        dist_wrapper_class = TeacherForcingDistr
```
I really like how this `TeacherForcingDistr` streamlines so much of the below code. I'm still going through everything, but my one worry is that it might make calls to `.log_prob` inconsistent among `Distr` classes.

E.g. let's say the teacher forcing prob is 0.1, the expert action is `0`, and the agent's distribution is `[0.25, 0.25, 0.25, 0.25]` for the actions `[0, 1, 2, 3]`. Then the probability that I sample the 0 action is

0.1 * 1 + 0.9 * 0.25 = 0.325

and so if I call `teacher_forcing_distr.prob(0)` it should equal 0.325 (to be consistent with all other uses of `.prob`). But when we're doing IL I don't really care about the prob of the `teacher_forcing_distr`, I care only about the 0.25 prob of the model.
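The mixture probability in the example above can be written out in a few lines (function and parameter names are illustrative, not part of the actual API):

```python
# Probability of sampling `action` under teacher forcing: with prob tf_prob the
# expert's action is taken deterministically; otherwise the model samples.
def teacher_forced_prob(action, expert_action, model_probs, tf_prob):
    forced = 1.0 if action == expert_action else 0.0
    return tf_prob * forced + (1 - tf_prob) * model_probs[action]

print(round(teacher_forced_prob(0, 0, [0.25, 0.25, 0.25, 0.25], 0.1), 3))  # 0.325
```

The inconsistency worry is then: a faithful `.prob(0)` returns 0.325, while the imitation-learning loss wants the model's own 0.25.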
Yes, I see. Currently (if I'm not wrong), I'm still outputting the 0.25 of the model for the chosen action while enforcing actions during sampling (so `TeacherForcingDistr` does not fully behave as a distribution). Maybe we want to call it by a different name? It's used as a distribution, but maybe we need a superclass for `Distr` called `Sampleable`/`Sampler`, which could specialize either as a `Distr` (with all its expected properties) or as a `Teacher`, which kind of hacks some details since it's no longer a `Distr`?
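A rough sketch of what that split could look like (the names `Sampler`/`Distr`/`Teacher` come from the discussion above; everything else is hypothetical):

```python
# Hypothetical class split: everything can sample, but only Distr promises
# full distribution semantics (log_prob etc.). Teacher deliberately omits them.
from abc import ABC, abstractmethod
import math
import random

class Sampler(ABC):
    @abstractmethod
    def sample(self): ...

class Distr(Sampler):
    """A Sampler with full distribution semantics."""
    def __init__(self, probs):
        self.probs = probs
    def sample(self):
        return random.choices(range(len(self.probs)), weights=self.probs)[0]
    def log_prob(self, action):
        return math.log(self.probs[action])

class Teacher(Sampler):
    """Enforces the expert action during sampling; intentionally exposes no
    log_prob, since reporting either the mixture prob or the model's prob
    would be inconsistent (see the discussion above)."""
    def __init__(self, model_distr, expert_action, tf_prob):
        self.model_distr = model_distr
        self.expert_action = expert_action
        self.tf_prob = tf_prob
    def sample(self):
        if random.random() < self.tf_prob:
            return self.expert_action
        return self.model_distr.sample()
```

With this shape, code that only needs `.sample()` can accept any `Sampler`, while losses that call `.log_prob` would require a true `Distr`.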
This pull request introduces 3 alerts when merging 2d5fd70 into 1882ff1 - view on LGTM.com

new alerts:
Looking much more streamlined. This is definitely a big update so I've added a bunch of additional comments. To me it seems like the one missing thing is the ability to not pass all actions to the task. Generally I suspect that, when using hierarchical models, most of the actions are going to be "internal", i.e. they're needed to choose the final action but don't actually mean anything to the task. This might happen if, for example, the top-level agent's action space was `[f"search_for_{x}" for x in object_types]` and the lower-level agent's actions were the movement actions.
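A tiny illustration of the internal-actions point (the `object_types` values and action names below are made up for the example): only the low-level movement action would be forwarded to the task, while the high-level `search_for_*` choice stays internal to the policy.

```python
# Hypothetical two-level action space: the high-level choice conditions the
# low-level policy but is never executed by the environment itself.
object_types = ["mug", "apple"]
high_level_actions = [f"search_for_{x}" for x in object_types]
low_level_actions = ["move_ahead", "rotate_left", "rotate_right"]

def action_to_pass_to_task(high_idx, low_idx):
    # high_idx is "internal": it selects which skill is active, but only the
    # low-level action is meaningful to the task.
    return low_level_actions[low_idx]

print(action_to_pass_to_task(0, 1))  # -> "rotate_left"
```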
- Old Vision Sensors tombstones + capitalized constants in ExpertSensor
- `self.num_active_samplers` from vector tasks in engine
- `dist_wrapper_class` argument in engine's `act`
- Explicit call to `condition_on_input` in `ConditionalDistr`
- Bug fix in ConditionedMiniGridTask stub
…classing abstract expert sensor
…ake it compatible with cross_entropy implementation in imitation
…entropy() in ppo loss)
I went ahead and added a new test (using the MiniGrid hacks I used while developing hierarchical policy support) + a command line option to collect validation results + a command line option to enforce the expert during inference.
Looks great. I left a few comments here and there, mostly asking for some additional documentation in places. As the hierarchical code is pretty involved, it would be great to have a tutorial that really goes through all of the details; we could even reference this tutorial in the code so that people could learn things by example.
```diff
@@ -1416,6 +1374,7 @@ def __init__(
     worker_id: int = 0,
     num_workers: int = 1,
     distributed_port: int = 0,
+    enforce_expert: bool = False,
```
I'd prefer that we didn't offer this option. I know it's off by default, but I want to make it as hard as possible to accidentally run the expert in the inference phase (especially as the tensorboard log would associate these results with the tag from the config file). Perhaps we should instead add an option to run the expert in a "stand-alone" mode (with its own tag) where we don't even require the "ExpertConfig" to have defined a model.
I see what you mean. I still think this is worth it, though. For example, at some development point in Jiasen's project, we wanted to check what happens when we enable the expert for only one/some of the skills in a hierarchical policy. For that, we needed to run the expert and the model during inference. Let me know your thoughts.
I provide an example in MiniGrid of how it can be used. I can see the interface being too hard to use, but I think it's good to iterate after we've all taken a look.