
Reasoning behind constraint in grads_and_grad_moms #1

Open

aomader opened this issue Feb 12, 2017 · 4 comments

aomader commented Feb 12, 2017

Hey,

I'm wondering about the reasoning behind this constraint in grads_and_grad_moms.
The given variables are indeed used in multiple operations in the loss computation graph, but that shouldn't prevent computing their gradients.

Could you shed some light on this?

lballes commented Feb 13, 2017

Hey!

The second moment of gradients, which is needed for the gradient variance computation, is computed implicitly using a "trick" explained in this note. This trick is not directly applicable when a variable receives more than one incoming gradient during back-propagation, which is the case whenever the variable is consumed by more than one operation.
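For readers without the note at hand, here is a minimal sketch of the kind of identity such a trick relies on, assuming it is the squared-inputs/squared-output-gradients identity for a single matmul layer (names and setup are illustrative, not the repo's code):

```python
import numpy as np

# Sketch under the above assumption: for a linear layer y = x @ W with
# per-example losses l_i, the per-example gradient of l_i w.r.t. W is the
# outer product of x_i and g_i, where g_i = dl_i/dy_i is the back-propagated
# output gradient. The sum of element-wise *squared* per-example gradients can
# then be formed without materializing any individual per-example gradient:
#
#     sum_i outer(x_i, g_i)**2 == (x**2).T @ (g**2)
#
# which only holds while W receives a single incoming gradient, i.e. is
# consumed by exactly one operation.

rng = np.random.default_rng(0)
batch, d_in, d_out = 8, 5, 3
x = rng.normal(size=(batch, d_in))   # layer inputs, one row per example
g = rng.normal(size=(batch, d_out))  # output gradients dl_i/dy_i, one row per example

# Naive reference: build every per-example gradient explicitly and square it.
naive = sum(np.outer(x[i], g[i]) ** 2 for i in range(batch))

# Implicit computation from squared inputs and squared output gradients.
implicit = (x ** 2).T @ (g ** 2)

assert np.allclose(naive, implicit)
```

If W fed two different operations, its per-example gradient would be a sum of two such outer products, and the square of a sum is not the sum of squares, which is exactly where an identity like this breaks down.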

Since we currently don't have an efficient way to compute the gradient moment in such a case, we ruled out multiple consumer operations. We are trying to come up with a better implementation of the gradient moment computation; any ideas/help would be highly appreciated :)

aomader commented Feb 14, 2017

@lballes I would say the most straightforward way would be to fall back to an inefficient method of computing the second moment, and use warnings.warn() to tell the user that the trick doesn't apply to the supplied objective function.
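A rough sketch of what that fallback could look like (the function and argument names below are hypothetical, not from the repo):

```python
import warnings
import numpy as np

def second_moment_fallback(per_example_grads):
    """Slow but general: mean of element-wise squared per-example gradients.

    `per_example_grads` is an illustrative name for a list of gradient arrays,
    one per training example, obtained by ordinary (inefficient) per-example
    back-propagation.
    """
    warnings.warn(
        "A variable is consumed by multiple operations, so the implicit "
        "second-moment trick does not apply to this objective; falling back "
        "to a slow per-example computation."
    )
    grads = np.stack(per_example_grads)
    return np.mean(grads ** 2, axis=0)
```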

@hughsalimbeni

Hi, have you tried a running average estimate like in ADAM:

m_t <- beta1 * m_{t-1} + (1 - beta1) * g
v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g

(taken from here)

The requirement for all operations to be matmul, add, and conv2d seems rather restrictive.
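For reference, here is a small sketch of a running-average estimate following exactly the two update rules above, together with the variance estimate a line search could derive from it (generic code, not from this repo; Adam's bias correction is omitted because the equations above omit it):

```python
import numpy as np

def update_moments(m, v, g, beta1=0.9, beta2=0.999):
    """One Adam-style step of the exponential moving averages of the gradient
    (first moment m) and the element-wise squared gradient (second moment v)."""
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    return m, v

# Illustrative usage with synthetic mini-batch "gradients".
rng = np.random.default_rng(0)
m = np.zeros(4)
v = np.zeros(4)
for _ in range(100):
    g = rng.normal(loc=1.0, scale=0.5, size=4)  # stand-in for a mini-batch gradient
    m, v = update_moments(m, v, g)

# v - m**2 then serves as a (biased) running estimate of the gradient variance.
variance_estimate = v - m ** 2
```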

lballes commented Jul 27, 2017

Hi @hughsalimbeni, we did some general experiments comparing running-average estimates to mini-batch estimates and found the former to be quite poor. That's why we went with the mini-batch estimate for this implementation, which is of course very restrictive for practical use in its current form. It would definitely be worth exploring how the line search actually performs with running-average estimates, though I don't know whether I personally will find time to do that anytime soon.
