-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reasoning behind constraint in grads_and_grad_moms
#1
Comments
Hey! The second moment of gradients, which is needed for the gradient variance computation, is computed implicitly using a "trick" explained in this note. This trick is not directly applicable when a variable has more than one incoming gradients during back-propagation, which is the case whenever a variable is consumed by more than one operation. Since we currently don't have an efficient way to compute the gradient moment in such a case, we ruled out multiple consumer operations. We are trying to come up with a better implementation of the gradient moment computation; any ideas/help would be highly appreciated :) |
@lballes I would say the most straight forward way would be to fall back to an inefficient method to compute the second moment, but |
Hi, have you tried a running average estimate like in ADAM:
(taken from here) The requirement for all operations to be |
Hi @hughsalimbeni, we did some general experiments comparing running average estimates to mini-batch estimates and found the former to be quite poor. That's why we went with the mini-batch estimate for this implementation, which is of course very restrictive for practical use in its current implementation. It would definitely be worth exploring how the line search would actually perform using the running average estimates. I don't know whether I personally will find time to do that anytime soon, though. |
Hej,
I'm wondering about the reasoning behind this constraint in
grads_and_grad_moms
.The given variables are indeed used in multiple operations in the loss computation graph, nonetheless that shouldn't hinder it to compute the grad.
Could you shed some light on this?
The text was updated successfully, but these errors were encountered: