Optimize check_finite_and_unscale_op #31954
Conversation
update Paddle to newest version
Merge newest Paddle code
merge newest Paddle code
Thanks for your contribution!
LGTM.
```cpp
for (int64_t idx = tid; idx < num; idx += gridDim.x * blockDim.x) {
  // get the xs's index of thread
  int xs_index = pre_xs_index;
  while (idx < s_starts[xs_index]) xs_index++;
```
The code at line 48 may never be triggered.
That line has been removed in PR32554.
PR types
Performance optimization
PR changes
OPs
Describe
Background:
CheckFiniteAndUnscale accounts for as much as 5.7% of the timeline of the new ernie doc model. The timeline shows that check_finite_and_unscale_op calls CheckFiniteAndUnscale many times in a single run, up to 300 times, and each call consists of several small kernels, so there is room for optimization.

Code analysis:
The original code contains a for loop: xs and outs are both vector<Tensor*>, and regardless of each tensor's size, the loop calls CheckFiniteAndUnscale once for every tensor in them, as sketched below.
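For illustration, a minimal sketch of that structure, assuming simplified launch parameters and signatures; stream, inverse_scale, and found_inf_data stand in for surrounding context, and this is not the verbatim Paddle source:

```cpp
// Sketch of the original host-side loop: one CheckFiniteAndUnscale
// launch per tensor in xs, however small that tensor is.
for (size_t i = 0; i < xs.size(); ++i) {
  const auto* x = xs[i];
  auto* out = outs[i];
  const int64_t num = x->numel();
  const int block = 1024;
  const int grid = static_cast<int>((num + block - 1) / block);
  CheckFiniteAndUnscale<T><<<grid, block, 0, stream>>>(
      x->data<T>(), inverse_scale, num, found_inf_data,
      out->mutable_data<T>(ctx.GetPlace()));
}
```

With up to 300 calls per run, launch overhead dominates whenever the individual tensors are small.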
Optimization
Approach 1 (commit id: b2eba11):
Clearly, fusing the kernels should bring the most obvious gain: remove the outer for loop so that only one kernel call is needed no matter how large xs.size() is.
Difficulties:
- xs and outs are host-side vector<Tensor*> variables and have to be copied to the device.
- The tensor data in xs and outs are not contiguous in memory, so how does the current thread determine which element of which tensor it is processing? One possible mapping is sketched below.
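One way to answer that question, as a sketch with hypothetical names; the layout of starts with xs.size() + 1 entries is an assumption for convenience, not the PR's exact choice:

```cpp
// Hypothetical device helper: map a flat element index to the tensor
// that owns it plus the offset inside that tensor. `starts` holds
// n + 1 entries; starts[i] is tensor i's first flat index and
// starts[n] equals the total element count.
__device__ void MapIndex(const int64_t* starts, int n, int64_t idx,
                         int* tensor_id, int64_t* offset) {
  int t = 0;
  // Linear scan; a binary search also works when n is large.
  while (t + 1 < n && idx >= starts[t + 1]) ++t;
  *tensor_id = t;
  *offset = idx - starts[t];
}
```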
Optimization points:
- Use memory::Alloc to allocate two pointer arrays of size xs.size(), which store the starting addresses of the data of each Tensor in xs and outs respectively, and copy them to the device with memory::Copy.
- Use memory::Alloc to allocate an int64_t array starts of size xs.size(), in which each element records a tensor's starting flat index, and copy it to the device with memory::Copy. Because this array is read frequently, the kernel stores it in shared memory to avoid the cost of repeated global memory accesses and reduce memory latency.
- Also to avoid the cost of accessing global memory, compute found_inf and scale in registers.
- With one element per thread the grid value becomes very large and a lot of time is spent switching blocks, so each thread now processes 20 elements instead. A condensed kernel sketch follows this list.
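Combining these points, a condensed sketch of what the fused kernel can look like; this is illustrative only: the parameter names, the size + 1 starts layout, and the simplified finiteness check are assumptions, not the exact PR code:

```cpp
// Fused kernel sketch: a single launch covers every tensor in xs.
// xs_ptrs/outs_ptrs are device copies of the per-tensor data pointers;
// starts holds size + 1 flat indices; launch with dynamic shared memory
// of (size + 1) * sizeof(int64_t) and a grid sized so that each thread
// handles roughly 20 elements.
template <typename T>
__global__ void FusedCheckFiniteAndUnscale(const T** xs_ptrs, T** outs_ptrs,
                                           const int64_t* starts, int size,
                                           const T* scale, bool* found_inf,
                                           int64_t total_num) {
  extern __shared__ int64_t s_starts[];
  // Stage the frequently read starts array in shared memory.
  for (int i = threadIdx.x; i <= size; i += blockDim.x) s_starts[i] = starts[i];
  __syncthreads();

  // Keep scale and the local inf/nan flag in registers instead of
  // re-reading global memory on every element.
  const T inverse_scale = static_cast<T>(1.0) / (*scale);
  bool local_found_inf = false;

  int xs_index = 0;
  const int64_t stride = static_cast<int64_t>(gridDim.x) * blockDim.x;
  for (int64_t idx = blockIdx.x * blockDim.x + threadIdx.x; idx < total_num;
       idx += stride) {
    // idx grows monotonically, so the owning tensor index only advances.
    while (xs_index + 1 < size && idx >= s_starts[xs_index + 1]) ++xs_index;
    const int64_t offset = idx - s_starts[xs_index];
    const T val = xs_ptrs[xs_index][offset];
    if (!isfinite(static_cast<float>(val))) local_found_inf = true;
    outs_ptrs[xs_index][offset] = val * inverse_scale;
  }
  if (local_found_inf) *found_inf = true;
}
```

The host side would fill xs_ptrs, outs_ptrs, and starts from the vector<Tensor*> inputs using memory::Alloc and copy them over with memory::Copy before the single launch, as described in the points above.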
Optimization results:
ResNet50 convergence verification. Model script: ResNet50_fp16.sh