Optimize the forward of log_softmax for the case when axis is not the last dimension. #32396
Conversation
Update forked PaddlePaddle
Update my fork
update from PaddlePaddle
Update forked paddle repo
Update USERNAME/paddle
update Paddle USERNAME repo
update username repo
update local paddlepaddle
update paddlepaddle
… log-sftmx-case3
Thanks for your contribution!
Review comments:
Please state clearly in the PR description what this PR does, the optimization approach, and its effect; see #30601 for reference.
}

template <typename T>
__forceinline__ __device__ T BlockReduceAdd(T *shared, T val) {
The purpose of designing the Max and Add functors should be to unify the implementations of BlockReduceMax and BlockReduceAdd.
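A minimal sketch of the unification this comment suggests; the functor names follow the PR discussion, but the reduction body below is an assumption (it requires blockDim.x to be a power of two), not the code in this diff:

```cpp
// Illustrative sketch only: one BlockReduce parameterized by a functor, so
// BlockReduceMax and BlockReduceAdd share a single implementation.
template <typename T>
struct MaxFunctor {
  __device__ T operator()(const T &a, const T &b) const {
    return a > b ? a : b;
  }
};

template <typename T>
struct AddFunctor {
  __device__ T operator()(const T &a, const T &b) const { return a + b; }
};

template <typename T, template <typename> class Functor>
__forceinline__ __device__ T BlockReduce(T *shared, T val) {
  Functor<T> reducer;
  shared[threadIdx.x] = val;
  __syncthreads();
  // Tree reduction over the block's shared-memory buffer.
  for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride) {
      shared[threadIdx.x] =
          reducer(shared[threadIdx.x], shared[threadIdx.x + stride]);
    }
    __syncthreads();
  }
  T result = shared[0];
  __syncthreads();  // make `shared` safe to reuse for the next reduction
  return result;
}

// BlockReduceMax / BlockReduceAdd then become thin wrappers:
template <typename T>
__forceinline__ __device__ T BlockReduceMax(T *shared, T val) {
  return BlockReduce<T, MaxFunctor>(shared, val);
}

template <typename T>
__forceinline__ __device__ T BlockReduceAdd(T *shared, T val) {
  return BlockReduce<T, AddFunctor>(shared, val);
}
```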
// 3. input-max-log_sum and store
for (uint32_t d = threadIdx.x; d < dim_size; d += blockDim.x) {
  output[data_offset + d * dim_stride] = static_cast<T>(
      static_cast<AccT>(input[data_offset + d * dim_stride]) -
L229, L238, and L245 read input three times, which does not seem optimal in terms of efficiency.
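A hypothetical sketch of this suggestion: load each thread's strided elements of input into registers once and reuse them across the three passes, so global memory is read only once. The name LogSoftmaxAlongAxisCached, the kMaxPerThread bound, and the exact BlockReduceMax/BlockReduceAdd signatures are assumptions for illustration, not this PR's code:

```cpp
#include <cmath>
#include <cstdint>

// Assumes dim_size <= kMaxPerThread * blockDim.x, and that shared_max /
// shared_sum are per-block shared-memory buffers for the two reductions.
template <typename T, typename AccT, int kMaxPerThread>
__device__ void LogSoftmaxAlongAxisCached(const T *input, T *output,
                                          AccT *shared_max, AccT *shared_sum,
                                          uint32_t data_offset,
                                          uint32_t dim_size,
                                          uint32_t dim_stride) {
  AccT cached[kMaxPerThread];
  uint32_t count = 0;
  // Single read of global memory.
  for (uint32_t d = threadIdx.x; d < dim_size; d += blockDim.x) {
    cached[count++] = static_cast<AccT>(input[data_offset + d * dim_stride]);
  }
  // 1. max along the axis, reduced across the block.
  AccT local_max = -INFINITY;
  for (uint32_t i = 0; i < count; ++i) {
    local_max = cached[i] > local_max ? cached[i] : local_max;
  }
  AccT max_value = BlockReduceMax(shared_max, local_max);
  // 2. log of the sum of exp(x - max), reduced across the block.
  AccT local_sum = 0;
  for (uint32_t i = 0; i < count; ++i) local_sum += exp(cached[i] - max_value);
  AccT log_sum = log(BlockReduceAdd(shared_sum, local_sum));
  // 3. input - max - log_sum, written without touching `input` again.
  uint32_t i = 0;
  for (uint32_t d = threadIdx.x; d < dim_size; d += blockDim.x, ++i) {
    output[data_offset + d * dim_stride] =
        static_cast<T>(cached[i] - max_value - log_sum);
  }
}
```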
Sorry to inform you that 91dfddf's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
Sorry to inform you that 42fd6f9's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
LGTM
LGTM. Merging this PR first; some of the review suggestions can be addressed together in a follow-up PR.
PR types
Performance optimization
PR changes
OPs
Describe
What this PR does
When axis is -1 and the number of elements along that axis does not exceed 1024, the forward computation of log_softmax is handled by paddle#31630 and the backward computation by paddle#32180. This PR implements the forward logic of log_softmax for the case where axis is not -1, or where axis is -1 but the number of elements along that axis is greater than 1024.
PR performance
The following two configurations are taken from the op benchmark; each test uses repeat = 1000, and the numbers are op execution times.
Method and logic of this PR
The method has two steps. Step one computes the kernel launch configuration, i.e. the block size, shared memory size, and grid size. Step two maps the grid onto the data and executes the kernel logic.
1. How the grid and block are obtained
Computing the launch configuration is a prerequisite for launching the kernel. Generally, the block size is determined first, and the grid is then made as large as possible subject to the GPU's total number of active blocks. The launch configuration here not only adapts dynamically to the input shape, it also maximizes hardware utilization. It is implemented in ComputeLaunchConfigure() and consists of the following four steps.
Step 1: the function GetBlockSize() computes the block size, which depends only on dim_size and inner_size. To explain: if shape = [2, 3, 4, 1] and axis = 1, then the outer dimension is dimension 0, i.e. outer_size = 2; axis corresponds to dimension 1, i.e. dim_size = 3; the inner dimensions are the remaining dimensions 2 and 3, i.e. inner_size = 4 * 1.
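For illustration only (the helper name SplitShape is not from the PR), this is the kind of bookkeeping the explanation above describes:

```cpp
#include <cstdint>
#include <vector>

// Split a shape into outer_size / dim_size / inner_size around `axis`.
// For shape = {2, 3, 4, 1} and axis = 1: outer_size = 2, dim_size = 3,
// inner_size = 4 * 1 = 4.
void SplitShape(const std::vector<int64_t> &shape, int axis,
                int64_t *outer_size, int64_t *dim_size, int64_t *inner_size) {
  *outer_size = 1;
  for (int i = 0; i < axis; ++i) *outer_size *= shape[i];
  *dim_size = shape[axis];
  *inner_size = 1;
  for (int i = axis + 1; i < static_cast<int>(shape.size()); ++i) {
    *inner_size *= shape[i];
  }
}
```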
Step 2: compute the shared memory size as the total number of threads in a block times the size of the data type.
Step 3: compute max_active_blocks. The API cudaOccupancyMaxActiveBlocksPerMultiprocessor() is called to obtain the maximum number of blocks per SM. This API computes resource occupancy; as described in the referenced blog post, "given the block size (from step 1) and the shared memory size (from step 2), the API predicts the occupancy of a kernel". According to the API documentation, it returns the maximum number of blocks per SM, blocks_per_SM. The number of SMs on the device, num_sm, is easy to obtain, and blocks_per_SM * num_sm then gives the GPU's total number of active blocks.
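A short sketch of step 3 using the real CUDA occupancy API (the wrapper name TotalActiveBlocks and the device index 0 are assumptions for illustration):

```cpp
#include <cuda_runtime.h>

// Maximum resident blocks per SM for this kernel and launch config, multiplied
// by the number of SMs, gives the GPU's total number of active blocks.
template <typename KernelT>
int TotalActiveBlocks(KernelT kernel, int block_threads, size_t smem_bytes) {
  int blocks_per_sm = 0;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, kernel,
                                                block_threads, smem_bytes);
  int num_sm = 0;
  cudaDeviceGetAttribute(&num_sm, cudaDevAttrMultiProcessorCount, /*device=*/0);
  return blocks_per_sm * num_sm;
}
```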
Taking this function's result as the control group, the experimental groups set the number of active blocks to 1x, 2x, 4x, and 8x the number of SMs. Since the function yields the number of active blocks, and the number of active blocks only affects the grid configuration, the grid configurations and GPU times were recorded as follows:
Step 4: GetGridSize() computes the grid. Using the GPU's total active block count as a constraint, the grid configuration that maximizes GPU occupancy can be computed: the number of blocks along the grid's x axis = (problem size along x + block size along x - 1) / block size along x, while ensuring that the total number of blocks does not exceed the GPU's total active block count. For example, with shape = [2, 3, 4] and axis = 1, this yields grid(2, 1) and block(1, 4). So when blockDim.x = 1 no block reduce is needed; this case was originally written out separately, and the two cases have now been merged.
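A minimal host-side sketch of this step, assuming the helper name GetGridSizeSketch and the exact capping rule are illustrative rather than the PR's code (which logical sizes map to the grid's x and y is kernel-specific and treated as an input here):

```cpp
#include <algorithm>
#include <cstdint>
#include <cuda_runtime.h>

// Ceil-divide the problem size by the block size, then cap the grid so that
// the total number of blocks stays within the GPU's total active block count.
dim3 GetGridSizeSketch(dim3 block, int64_t problem_x, int64_t problem_y,
                       int total_active_blocks) {
  int64_t grid_x = (problem_x + block.x - 1) / block.x;
  int64_t grid_y = (problem_y + block.y - 1) / block.y;
  // Keep grid_x * grid_y within the active block budget (at least 1 along x).
  int64_t max_x = std::max<int64_t>(
      1, total_active_blocks / std::max<int64_t>(1, grid_y));
  grid_x = std::min(grid_x, max_x);
  return dim3(static_cast<unsigned>(grid_x), static_cast<unsigned>(grid_y));
}
```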
2. Mapping and execution logic
How processing units are mapped to the data and how each block executes is described above under "1. How the grid and block are obtained".
Mathematical logic of the kernel
The math is the same as the forward computation of this case in paddle#31630: the computation proceeds along the axis in three steps, not repeated here.
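For reference, the three steps along the axis are the standard numerically stable log_softmax (this just restates the math of paddle#31630):

$$ m = \max_{d} x_d, \qquad \log s = \log \sum_{d} \exp(x_d - m), \qquad y_d = x_d - m - \log s $$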
Other
MaxFunctor was added to functors.h.