【Hackathon No.34】Optimize poisson op #45160
Conversation
All CI checks have passed. Thanks for reviewing 🙏
I suggest trying the ElementwiseKernel template. If the performance difference is small, it would be better to switch to that form so the code style is more consistent.
I'll try the ElementwiseKernel approach and report back.
Testing with ElementwiseKernel could not reach the target performance. I observed that its speed is unstable across repeated runs, with a large gap between the fastest and slowest times. On average the float32 test cases meet the target but still trail the current implementation, and float64 falls short of the expected improvement over the baseline.
Worst case:
So the current implementation should be kept.
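For context, the ElementwiseKernel form suggested above looks roughly like the sketch below. This is an illustration under assumptions, not the PR's code: PoissonSketchFunctor is a hypothetical placeholder, and the real poisson kernel also needs per-element RNG state (seed/offset), which this plain unary-functor interface does not carry.

// Hedged sketch of the phi ElementwiseKernel pattern.
// PoissonSketchFunctor is hypothetical; the actual op samples from a
// Poisson distribution with curand, which needs seed/offset state that
// this simple functor form does not express.
#include "paddle/phi/kernels/funcs/elementwise_base.h"

template <typename T>
struct PoissonSketchFunctor {
  HOSTDEVICE inline T operator()(const T x) const {
    return x;  // placeholder for the per-element poisson sampling
  }
};

template <typename T>
void PoissonViaElementwise(const phi::GPUContext& ctx,
                           const phi::DenseTensor& x,
                           phi::DenseTensor* out) {
  std::vector<const phi::DenseTensor*> ins = {&x};
  std::vector<phi::DenseTensor*> outs = {out};
  phi::funcs::ElementwiseKernel<T>(ctx, ins, &outs, PoissonSketchFunctor<T>());
}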
__global__ void GetPoisson(
    const T* in, T* out, const int N, unsigned int seed, unsigned int offset) {
  int idx = threadIdx.x + blockIdx.x * blockDim.x;
  if (idx < N) {
This needs to be written as a loop so that all the work still gets done when numel is very large; wrapping it with CUDA_KERNEL_LOOP_TYPE is enough.
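For reference, the grid-stride form the reviewer is asking for looks roughly like this; CUDA_KERNEL_LOOP_TYPE is Paddle's macro for the same pattern, and the curand calls below follow standard cuRAND usage rather than the PR's exact code.

#include <curand_kernel.h>

template <typename T>
__global__ void GetPoisson(
    const T* in, T* out, const int N, unsigned int seed, unsigned int offset) {
  // Grid-stride loop: each thread handles idx, idx + stride, idx + 2*stride...
  // so every element is processed even when N exceeds gridDim.x * blockDim.x.
  // CUDA_KERNEL_LOOP_TYPE(idx, N, int) expands to an equivalent loop.
  for (int idx = threadIdx.x + blockIdx.x * blockDim.x; idx < N;
       idx += blockDim.x * gridDim.x) {
    curandStatePhilox4_32_10_t state;
    curand_init(seed, idx, offset, &state);
    out[idx] = static_cast<T>(curand_poisson(&state, in[idx]));
  }
}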
OK, I'll fix it.
int block_size = std::min(kMaxBlockDim, ctx.GetMaxThreadsPerBlock());
dim3 dim_block(block_size);
dim3 dim_grid((size + block_size - 1) / block_size);
The grid size also needs to be capped at a maximum; see paddle::platform::LimitGridDim(ctx, &grid_dim);
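In context, the suggested cap slots in right after the grid is computed. A minimal sketch:

int block_size = std::min(kMaxBlockDim, ctx.GetMaxThreadsPerBlock());
dim3 dim_block(block_size);
dim3 dim_grid((size + block_size - 1) / block_size);
// Clamp the grid to the device maximum; the grid-stride loop inside the
// kernel then covers any elements beyond gridDim.x * blockDim.x threads.
paddle::platform::LimitGridDim(ctx, &dim_grid);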
OK, I'll fix it.
Fixed.
CI has passed; please review.
@@ -19,6 +19,7 @@ limitations under the License. */
 #include <hiprand_kernel.h>
 #endif

+#include "paddle/fluid/platform/device/gpu/gpu_launch_config.h"
In principle, phi does not allow including fluid header files; try replacing this with #include "paddle/phi/backends/gpu/gpu_launch_config.h"?
paddle::platform::LimitGridDim is still in a header under fluid. A copy of this function needs to be brought over first, with its namespace updated.
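A sketch of what bringing the function over might look like; the destination header and the GetCUDAMaxGridDimSize() accessor are assumptions based on the phi GPU context as I understand it, not verified against the PR.

// Assumed destination: paddle/phi/backends/gpu/gpu_launch_config.h
namespace phi {
namespace backends {
namespace gpu {

// Clamp each grid dimension to the device's reported maximum.
inline void LimitGridDim(const phi::GPUContext& ctx, dim3* grid_dim) {
  auto max_grid_dim = ctx.GetCUDAMaxGridDimSize();
  grid_dim->x = grid_dim->x < max_grid_dim[0] ? grid_dim->x : max_grid_dim[0];
  grid_dim->y = grid_dim->y < max_grid_dim[1] ? grid_dim->y : max_grid_dim[1];
  grid_dim->z = grid_dim->z < max_grid_dim[2] ? grid_dim->z : max_grid_dim[2];
}

}  // namespace gpu
}  // namespace backends
}  // namespace phi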
OK.
Fixed.
@chenwhql After the changes, all CI checks have passed.
LGTM
PR types
Performance optimization
PR changes
OPs
Describe
Experimental environment:
Hardware: Tesla-P4
Software: CUDA 11.2, cuDNN 8
Two optimization approaches were considered for the Poisson operator.
Approach 1: use the GetGpuLaunchConfig1D method from Paddle's existing gpu_launch_config.h to obtain a near-optimal launch configuration.
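A sketch of how approach 1 wires up the launch; the config field names (thread_per_block, block_per_grid) are from gpu_launch_config.h as I recall them and should be treated as assumptions.

// Approach 1 sketch: let Paddle choose the launch configuration.
auto config = phi::backends::gpu::GetGpuLaunchConfig1D(ctx, size);
GetPoisson<T><<<config.block_per_grid, config.thread_per_block, 0,
                ctx.stream()>>>(in_data, out_data, size, seed, offset);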
Performance results:
In testing, this approach gives roughly a 5% speedup on float32 data but roughly a 10% slowdown on float64 data, so it is not the preferred option.
Tesla-P4:
Tesla-P40:
Approach 2: hand-tune the launch parameters for this scenario. Well-performing block sizes are typically 128, 256, or 512; benchmarking all three shows that a one-dimensional grid with BlockSize = 256 gives a large speedup across the board.
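Approach 2 then reduces to a fixed launch configuration, roughly as below (a sketch assembled from the snippets reviewed above; variable names are illustrative).

// Approach 2 sketch: hand-tuned 1-D launch, BlockSize = 256 being the
// best of {128, 256, 512} in the benchmarks.
const int block_size = 256;
dim3 dim_block(block_size);
dim3 dim_grid((size + block_size - 1) / block_size);
phi::backends::gpu::LimitGridDim(ctx, &dim_grid);  // cap at device max
GetPoisson<T><<<dim_grid, dim_block, 0, ctx.stream()>>>(
    in_data, out_data, size, seed, offset);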
Performance results:
Tesla-P4:
Tesla-P40:
Based on this comparison, approach 2 was chosen.