@mxnet-label-bot add [modeling, gluon, question]
I am wondering about the inner mechanism of the gradient calculation in this example. Here are some excerpts I took from model.py:

First, I know that for the actor-critic algorithm the policy advantage should be maximized while the state-value difference should be minimized, and the two quantities have the same absolute value.
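To make this concrete, here is roughly how I would write those two terms as explicit losses (my own sketch with made-up shapes and names, not an excerpt from model.py):

```python
from mxnet import nd

# Made-up shapes and values, just to write the two terms down.
log_policy = nd.log_softmax(nd.random.normal(shape=(4, 2)))   # log pi(a|s), 4 steps, 2 actions
actions    = nd.array([0, 1, 1, 0])                           # actions actually taken
value      = nd.random.normal(shape=(4, 1))                   # critic output V(s)
returns    = nd.random.normal(shape=(4, 1))                   # discounted return R

advantage   = returns - value                                 # A = R - V, shared by both terms
log_prob    = nd.pick(log_policy, actions, axis=1, keepdims=True)
policy_loss = -(log_prob * advantage).sum()                   # maximize the A-weighted log-probability
value_loss  = 0.5 * ((returns - value) ** 2).sum()            # minimize the squared value error
```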
Then, my questions:

1. `log_policy` has to be multiplied by the advantage, and the negated result is then treated as the `policy_loss` passed to `autograd.backward`. But there is no such manipulation in the function `train_step`; does that mean `log_policy` is multiplied implicitly?
2. For the state-value estimation, to minimize the difference, usually an L1Loss or L2Loss is passed to `autograd.backward` in the Gluon version. But there seems to be a trick used here, as the comment says: `# NOTE(reed): The grads of values is actually negative advantages.` Similar to question 1, what exactly is calculated from `v_grads` and the corresponding net output `value` when back-propagating? (The sketch after this list shows the comparison I have in mind.)

I know this question may be somewhat related to the math, and I am sorry for my poor math with gradient calculation. I hope someone can give a clear answer, thanks.
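For reference, below is the comparison I have in mind, written as a minimal self-contained check. It is only my reading of the pattern in `train_step`, not the actual code from model.py; the toy network `TinyAC` and names such as `states` and `returns` are my own assumptions. Version A backpropagates the explicit losses; version B records only the forward pass and feeds head gradients to `autograd.backward`. If my understanding is correct, the printed differences should be close to zero.

```python
import mxnet as mx
from mxnet import nd, autograd, gluon

# Toy two-headed actor-critic net standing in for the real model (my own stand-in).
class TinyAC(gluon.nn.HybridBlock):
    def __init__(self, num_actions=2, **kwargs):
        super(TinyAC, self).__init__(**kwargs)
        with self.name_scope():
            self.policy = gluon.nn.Dense(num_actions)
            self.value = gluon.nn.Dense(1)

    def hybrid_forward(self, F, x):
        return F.log_softmax(self.policy(x)), self.value(x)

net = TinyAC()
net.initialize(mx.init.Xavier())

states  = nd.random.normal(shape=(4, 8))   # 4 steps of an 8-dim state (made up)
actions = nd.array([0, 1, 1, 0])           # actions actually taken
returns = nd.random.normal(shape=(4, 1))   # discounted returns R

# --- Version A: explicit losses, the usual Gluon way --------------------
with autograd.record():
    log_policy, value = net(states)
    advantage = (returns - value).detach()              # constant w.r.t. the policy term
    log_prob = nd.pick(log_policy, actions, axis=1, keepdims=True)
    policy_loss = -(log_prob * advantage).sum()
    value_loss = 0.5 * ((returns - value) ** 2).sum()
    (policy_loss + value_loss).backward()
grads_a = [p.grad().copy() for p in net.collect_params().values()]

# --- Version B: no explicit loss, pass head gradients instead -----------
# d(policy_loss)/d(log_policy) is -advantage at the chosen action and 0 elsewhere,
# d(value_loss)/d(value) is -(R - V), i.e. the negative advantage,
# so feeding these as head gradients should give identical parameter gradients.
with autograd.record():
    log_policy, value = net(states)
advantage = returns - value
onehot = nd.one_hot(actions, depth=log_policy.shape[1])
p_grads = -nd.broadcast_mul(advantage, onehot)           # -A on the taken action only
v_grads = -advantage                                     # "grads of values is ... negative advantages"
autograd.backward([log_policy, value], head_grads=[p_grads, v_grads])
grads_b = [p.grad().copy() for p in net.collect_params().values()]

for ga, gb in zip(grads_a, grads_b):
    print(nd.abs(ga - gb).max().asscalar())              # expect ~0 for every parameter
```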