
LARC clipping+documentation #6

Merged · 1 commit · Jul 3, 2018

Conversation

raulpuric (Contributor)

  • Proper implementation of LARC clipping
  • Documentation of LARC class
  • Modification of FP16_Optimizer to absorb the optimizer instance that's being wrapped instead of creating a new optimizer instance of the same class (a rough sketch of this change follows the list below).
optimizer: Pytorch optimizer to wrap and modify learning rate for.
trust_coefficient: Trust coefficient for calculating the lr. See https://arxiv.org/abs/1708.03888
clip: Decides between clipping or scaling mode of LARC. If `clip=True` the learning rate is set to `min(optimizer_lr, local_lr)` for each parameter. If `clip=False` the learning rate is set to `local_lr*optimizer_lr`.
eps: epsilon kludge to help with numerical stability while calculating adaotive_lr
@brettkoonce (Contributor) · Jul 3, 2018

minor sp: adaotive_lr --> adaptive_lr
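
To make the `clip` and `eps` parameters above concrete, here is a hedged sketch of the per-parameter rate computation they describe. The function name and the use of full-tensor norms are illustrative assumptions, not the exact apex implementation; only the behaviour stated in the docstring (`min(optimizer_lr, local_lr)` when `clip=True`, `local_lr*optimizer_lr` otherwise, `eps` for numerical stability) is taken from the text above.

```python
import torch

def larc_local_lr(p, group_lr, trust_coefficient=0.02, clip=True, eps=1e-8):
    """Illustrative per-parameter LARC learning rate (not apex source)."""
    param_norm = torch.norm(p.data)
    grad_norm = torch.norm(p.grad.data)
    if param_norm == 0 or grad_norm == 0:
        # degenerate norms: keep the optimizer's own learning rate
        return group_lr
    # adaptive ("local") lr from the LARS/LARC paper, eps added for stability
    local_lr = (trust_coefficient * param_norm / (grad_norm + eps)).item()
    if clip:
        # clipping mode: lr = min(optimizer_lr, local_lr)
        return min(group_lr, local_lr)
    # scaling mode: lr = local_lr * optimizer_lr
    return local_lr * group_lr
```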

@@ -4,11 +4,45 @@
from torch.nn.parameter import Parameter

class LARC(object):
def __init__(self, optimizer, trust_coefficient=0.02, epsilon=1e-8):
"""
:class:`LARC` is a pytorch implementation of both the scaling and clipping varients of LARC,
@brettkoonce (Contributor) · Jul 3, 2018

minor sp: varients --> variants
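
For readers skimming the diff, a short usage sketch of the constructor shown above. The model and SGD settings are arbitrary, the import path is an assumption about where the class lives in apex, and the step()/zero_grad() delegation to the wrapped optimizer is assumed from the wrapper design rather than shown in this hunk.

```python
import torch
from apex.parallel.LARC import LARC  # assumed import path

model = torch.nn.Linear(10, 2)
base_opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Wrap the existing optimizer; LARC adjusts each param group's lr per step.
optimizer = LARC(base_opt, trust_coefficient=0.02)  # the diff also shows an epsilon kwarg

loss = model(torch.randn(8, 10)).sum()
loss.backward()
optimizer.step()       # assumed to delegate to base_opt after rescaling lrs
optimizer.zero_grad()
```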

@mcarilli (Contributor) commented on Jul 3, 2018

My rationale for creating a new instance of the passed optimizer's class within FP16_Optimizer's constructor was that if the passed optimizer had been used earlier, it might have created some momentum or other ancillary buffers in FP16. I would have to trace through the optimizer's param_groups and cast all these ancillary buffers to FP32 as well. This is doable (it's similar to what torch.Optimizer.load_state_dict does) but seemed more brittle than FP32-ifying the param_groups then using them to create a fresh optimizer instance.

Your proposed change (using the passed optimizer directly) does impose the additional restriction that the passed optimizer has not been used beforehand/does not contain any ancillary buffers (aside from its owned parameters) that might be FP16. All my documentation and examples work this way, although the requirement is not yet stated explicitly, so I suppose it's fine to accept.
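
For context, a rough sketch of the alternative described in the first paragraph: walking the wrapped optimizer's state and casting FP16 ancillary buffers (momentum, etc.) to FP32. This is not apex code and glosses over the details that make the approach brittle; it only illustrates the kind of traversal being discussed.

```python
import torch

def cast_state_buffers_to_fp32(optimizer):
    # Promote any FP16 tensors in the per-parameter state (e.g. momentum
    # buffers) to FP32, loosely analogous to the casting that
    # torch.optim.Optimizer.load_state_dict performs when restoring state.
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value) and value.dtype == torch.float16:
                state[key] = value.float()
```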
