【Hackathon 7th PPSCI No.12】Support amsgrad in the Adam and AdamW optimizers #68079
Conversation
Your PR was submitted successfully. Thank you for contributing to this open-source project!
Does the added amsgrad affect the original execution logic and memory footprint? From the PR code it looks like, regardless of whether amsgrad is enabled, an extra mom2_max buffer is allocated compared with the original amsgrad-free code, and some redundant variables are introduced.
```cpp
inline HOSTDEVICE void operator()(size_t i) const {
  // Merge all memory access together.
  T g = grad_[i];
  T mom1 = moment1_[i];
  T mom2 = moment2_[i];
  T mom2_max = moment2_max_[i];
```
Is it really necessary to record this?
```cpp
  T mom2_max_;
  if (amsgrad_) {
    mom2_max_ = std::max(mom2, mom2_max);
    p -= lr * (mom1 / (sqrt(mom2_max_) + epsilon_ * sqrt(1 - beta2_pow)));
  } else {
    mom2_max_ = mom2_max;
    p -= lr * (mom1 / (sqrt(mom2) + epsilon_ * sqrt(1 - beta2_pow)));
  }

  // Write back to global memory
  moment1_out_[i] = mom1;
  moment2_out_[i] = mom2;
  moment2_max_out_[i] = mom2_max_;
```
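As background for the question above: under AMSGrad, `mom2_max` is the element-wise running maximum of the second moment across steps, so when the flag is on it is genuinely persistent optimizer state; the concern here is only the extra read/write when the flag is off. A minimal NumPy sketch of one update, mirroring the kernel excerpt (illustrative only, not Paddle's actual code):

```python
import numpy as np

def adam_step(p, g, mom1, mom2, mom2_max, lr, beta1_pow, beta2_pow,
              beta1=0.9, beta2=0.999, eps=1e-8, amsgrad=False):
    """One Adam/AMSGrad update on NumPy arrays, using the kernel's epsilon placement."""
    mom1 = beta1 * mom1 + (1 - beta1) * g
    mom2 = beta2 * mom2 + (1 - beta2) * g * g
    lr_t = lr * np.sqrt(1 - beta2_pow) / (1 - beta1_pow)
    if amsgrad:
        # The running max accumulates over all steps, so when the flag is on
        # it is genuine optimizer state: it must be written out and read back.
        mom2_max = np.maximum(mom2, mom2_max)
        p = p - lr_t * (mom1 / (np.sqrt(mom2_max) + eps * np.sqrt(1 - beta2_pow)))
    else:
        p = p - lr_t * (mom1 / (np.sqrt(mom2) + eps * np.sqrt(1 - beta2_pow)))
    return p, mom1, mom2, mom2_max
```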
Likewise, if amsgrad is not enabled, please do not add any extra variables or related computation logic; just keep the original code path as it was.
```cpp
Eigen::Map<Eigen::Array<T, 1, Eigen::Dynamic>> moment2_max_out{
    moment2_max_out_, static_cast<Eigen::Index>(numel)};
```
Same as above: if amsgrad is not enabled, will there still be redundant mom2_max-related computation and storage?
```cpp
inline HOSTDEVICE void adam_update(size_t i, T g) const {
  // The following code is the same as dense
  T mom1 = moment1_[i];
  T mom2 = moment2_[i];
  T mom2_max = moment2_max_[i];
```
Same as above.
```diff
@@ -14,6 +14,7 @@
 #pragma once

+#include <stdio.h>
```
What is this header for? Does any code depend on it?
It was left over from debugging and I forgot to remove it, sorry ~
python/paddle/optimizer/adam.py
Outdated
```diff
@@ -117,6 +117,7 @@ class Adam(Optimizer):
             The default value is False.
         multi_precision (bool, optional): Whether to use multi-precision during weight updating. Default is false.
         use_multi_tensor (bool, optional): Whether to use multi-tensor strategy to update all parameters at once . Default is false.
+        amsgrad (bool, optional): Whether to use the AMSGrad of this algorithm. Default is false.
```
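A minimal usage sketch of the flag documented above (illustrative only; AdamW gains the same flag in the next file):

```python
import paddle

linear = paddle.nn.Linear(10, 1)
opt = paddle.optimizer.Adam(
    learning_rate=1e-3,
    parameters=linear.parameters(),
    amsgrad=True,  # track the running max of the second moment (AMSGrad)
)

x = paddle.randn([4, 10])
loss = linear(x).mean()
loss.backward()
opt.step()
opt.clear_grad()
```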
python/paddle/optimizer/adamw.py
Outdated
```diff
@@ -104,6 +104,7 @@ class AdamW(Optimizer):
             different semantics with the original Adam algorithm and may lead to different result.
             The default value is False.
         multi_precision (bool, optional): Whether to use multi-precision during weight updating. Default is false.
+        amsgrad (bool, optional): Whether to use the AMSGrad of this algorithm. Default is false.
```
Same as above.
I considered this before. The main reason is that amsgrad currently touches too many places, so I wanted to postpone the optimization work. I'll try making the change now ~
Also, once the change is done, you could run a comparison using ResNet50 or another model with fake data as input, to confirm that GPU memory usage is unchanged when amsgrad is off, and that when it is on the increase is roughly equal to the parameter size.
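A rough sketch of that check (the model, batch size, and measurement calls are illustrative choices, not part of this PR): run it once with the flag off and once with it on, in separate processes, and compare the reported numbers.

```python
import sys
import paddle
from paddle.vision.models import resnet50

amsgrad = len(sys.argv) > 1 and sys.argv[1] == "on"

model = resnet50()
opt = paddle.optimizer.Adam(learning_rate=1e-3,
                            parameters=model.parameters(),
                            amsgrad=amsgrad)
loss_fn = paddle.nn.CrossEntropyLoss()

for _ in range(5):
    x = paddle.randn([8, 3, 224, 224])   # fake data
    y = paddle.randint(0, 1000, [8])     # fake labels
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    opt.clear_grad()

# Expectation: no change with amsgrad off; with it on, the increase should be
# roughly one extra FP32 copy of the parameters (the moment2_max state).
print("amsgrad:", amsgrad,
      "allocated MB:", paddle.device.cuda.memory_allocated() / 1024 / 1024)
```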
af27337
@megemini Hello! We tested the latest version of this PR internally and it should be fine now. Could you please resolve the conflicts?
Thank you very much!!! I heard the process was quite an ordeal 😂😂😂 Thanks ~~~ The conflicts have been resolved ~ The PR-CI-NPU-910B-Paddle CI failure looks like the npu side not handling this correctly.
LGTM
@megemini Could you merge the develop branch once more? The windows and Hygon-DCU-Test jobs failed, which should not be caused by this PR.
@HydrogenSulfate The DCU job has been exempted; it is a randomly failing unit test. I reran the windows job; according to feedback from other developers, the windows pipeline has had network problems recently.
LGTM
LGTM
PR Category
User Experience
PR Types
New features
Description
【Hackathon 7th No.12】Support amsgrad in the Adam and AdamW optimizers
Related:
Compared the results locally against PyTorch; the two are consistent:
Comparison code
Output:
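The collapsed comparison script and its output are not reproduced here; a minimal sketch of such a cross-framework check (the learning rate, shapes, step count, and tiny model are illustrative assumptions) could look like the following:

```python
import numpy as np
import paddle
import torch

np.random.seed(0)
w0 = np.random.randn(4, 4).astype("float32")
x = np.random.randn(8, 4).astype("float32")

# Paddle parameter + optimizer (amsgrad flag added by this PR)
pw = paddle.create_parameter(
    [4, 4], "float32",
    default_initializer=paddle.nn.initializer.Assign(w0))
popt = paddle.optimizer.Adam(learning_rate=0.1, parameters=[pw], amsgrad=True)

# PyTorch parameter + optimizer
tw = torch.nn.Parameter(torch.tensor(w0))
topt = torch.optim.Adam([tw], lr=0.1, amsgrad=True)

for _ in range(10):
    ploss = (paddle.to_tensor(x).matmul(pw) ** 2).mean()
    ploss.backward()
    popt.step()
    popt.clear_grad()

    tloss = (torch.tensor(x).matmul(tw) ** 2).mean()
    tloss.backward()
    topt.step()
    topt.zero_grad()

# The trajectories should stay close (up to the frameworks' epsilon placement).
print(np.abs(pw.numpy() - tw.detach().numpy()).max())
```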
Update 20240908
Completed locally:
Related tests.
Distributed test items still need to be verified in the CI environment.
Other test items still need to be verified in the CI environment.
In addition, for the xpu amsgrad variant: since the underlying xpu interface does not support it yet, only the relevant input/output parameter lists were modified there.