LARC clipping+documentation #6
Conversation
raulpuric
commented
May 31, 2018
- Proper implementation of LARC clipping
- Documentation of the LARC class
- Modification of FP16_Optimizer to absorb the optimizer instance being wrapped, instead of creating a new optimizer instance of the same class.
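For context, here is a minimal sketch (not the code from this PR) of the per-parameter trust-ratio computation that the clipping/scaling distinction above refers to, following the docstring quoted below and the LARS paper linked there. All names are illustrative and the exact formula (e.g. where `eps` enters) is an assumption.

```python
import torch

def larc_local_lr(p, group_lr, trust_coefficient=0.02, clip=True, eps=1e-8):
    """Sketch of a per-parameter LARC learning rate (illustrative only)."""
    param_norm = torch.norm(p.data).item()
    grad_norm = torch.norm(p.grad.data).item()
    if param_norm == 0 or grad_norm == 0:
        # degenerate case: fall back to the optimizer's own lr
        return group_lr
    # local "trust" lr from the ratio of parameter norm to gradient norm
    adaptive_lr = trust_coefficient * param_norm / (grad_norm + eps)
    if clip:
        # clipping mode: lr = min(optimizer_lr, local_lr)
        return min(group_lr, adaptive_lr)
    # scaling mode: lr = local_lr * optimizer_lr
    return adaptive_lr * group_lr
```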
optimizer: Pytorch optimizer to wrap and modify learning rate for.
trust_coefficient: Trust coefficient for calculating the lr. See https://arxiv.org/abs/1708.03888
clip: Decides between clipping or scaling mode of LARC. If `clip=True` the learning rate is set to `min(optimizer_lr, local_lr)` for each parameter. If `clip=False` the learning rate is set to `local_lr*optimizer_lr`.
eps: epsilon kludge to help with numerical stability while calculating adaotive_lr
minor sp: adaotive_lr --> adaptive_lr
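For readers following along, a hedged usage sketch of the interface documented in the docstring above: wrap an existing optimizer so its per-parameter learning rates are adjusted by LARC. The import path, the exact constructor signature, and the wrapper exposing `step()` are assumptions here, not confirmed by this diff.

```python
import torch
from apex.parallel.LARC import LARC  # assumed import path for the LARC class

model = torch.nn.Linear(1024, 1024)
base_optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# clip=True  -> per-parameter lr becomes min(optimizer_lr, local_lr)
# clip=False -> per-parameter lr becomes local_lr * optimizer_lr
optimizer = LARC(base_optimizer, trust_coefficient=0.02, clip=True)

loss = model(torch.randn(8, 1024)).sum()
loss.backward()
optimizer.step()  # assumes the wrapper delegates to the wrapped optimizer
```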
@@ -4,11 +4,45 @@
from torch.nn.parameter import Parameter

class LARC(object):
    def __init__(self, optimizer, trust_coefficient=0.02, epsilon=1e-8):
        """
        :class:`LARC` is a pytorch implementation of both the scaling and clipping varients of LARC,
minor sp: varients --> variants
My rationale for creating a new instance of the passed optimizer's class within FP16_Optimizer's constructor was that if the passed optimizer had been used earlier, it might have created momentum or other ancillary buffers in FP16. I would then have to trace through the optimizer's param_groups and cast all of these ancillary buffers to FP32 as well. That is doable (it's similar to what torch.Optimizer.load_state_dict does), but it seemed more brittle than FP32-ifying the param_groups and then using them to create a fresh optimizer instance.

Your proposed change (using the passed optimizer directly) does impose the additional restriction that the passed optimizer has not been used beforehand and does not contain any ancillary buffers (aside from its owned parameters) that might be FP16. All my documentation and examples already work this way, although the requirement is not yet stated explicitly, so I suppose it's fine to accept.
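To make the restriction concrete, here is a hedged sketch of the kind of ancillary-buffer handling being discussed; it is illustrative only, not code from this PR, and `cast_optimizer_state_to_fp32` is a hypothetical helper.

```python
import torch

def cast_optimizer_state_to_fp32(optimizer):
    """Promote any half-precision tensors in the wrapped optimizer's state
    (e.g. SGD momentum buffers created by earlier .step() calls) to FP32,
    so absorbing a previously-used optimizer would be safe."""
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value) and value.dtype == torch.float16:
                state[key] = value.float()
```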