

【Hackathon No.34】Optimize the poisson op #45160

Merged: 9 commits, Aug 24, 2022

Conversation

@Rayman96 (Contributor) commented Aug 15, 2022

PR types

Performance optimization

PR changes

OPs

Describe

Experimental environment:
Hardware: Tesla-P4
Software: CUDA 11.2, cuDNN 8

Two optimization schemes were considered for the Poisson operator.
Scheme 1: use the GetGpuLaunchConfig1D method already provided in Paddle's gpu_launch_config.h to obtain a near-optimal launch configuration (a usage sketch follows the tables below).
Performance results:
In testing, this scheme gives roughly a 5% speedup on float32 data but roughly a 10% slowdown on float64 data, so it is not the preferred scheme.
Tesla-P4:

| Case No. | input_shape | data_type | Paddle_modify Perf(s) | Perf_over_paddle_origin(%) | Perf_over_pytorch(%) |
| --- | --- | --- | --- | --- | --- |
| 1 | [16, 16, 16, 16] | float32 | 0.3190 | +8.00 | -2.00 |
| 2 | [16, 35, 1500] | float32 | 2.7361 | +8.17 | -2.34 |
| 3 | [16, 16, 16, 16] | float64 | 0.3261 | -13.78 | -1.86 |

Tesla-P40:

| Case No. | input_shape | data_type | Paddle_modify Perf(s) | Perf_over_paddle_origin(%) | Perf_over_pytorch(%) |
| --- | --- | --- | --- | --- | --- |
| 1 | [16, 16, 16, 16] | float32 | 0.1684 | -5.00 | +1.00 |
| 2 | [16, 35, 1500] | float32 | 1.3766 | +8.74 | -1.28 |
| 3 | [16, 16, 16, 16] | float64 | 0.1743 | +5.27 | -19.1 |
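For reference, scheme 1 boils down to the launch pattern below. This is a minimal sketch, not the PR's diff: the wrapper name LaunchPoissonAuto is hypothetical, the block_per_grid / thread_per_block members are assumed from phi's gpu_launch_config.h, and GetPoisson is the op's CUDA kernel shown in the review threads below.

```cpp
#include "paddle/phi/backends/gpu/gpu_launch_config.h"

// Sketch of scheme 1: let Paddle derive the 1-D launch configuration.
template <typename T>
void LaunchPoissonAuto(const phi::GPUContext& ctx, const T* in, T* out,
                       int numel, unsigned int seed, unsigned int offset) {
  // GetGpuLaunchConfig1D picks a block/grid pair tuned to the current device;
  // the member names below are assumed from phi's header.
  auto config = phi::backends::gpu::GetGpuLaunchConfig1D(ctx, numel);
  GetPoisson<T><<<config.block_per_grid, config.thread_per_block, 0,
                  ctx.stream()>>>(in, out, numel, seed, offset);
}
```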

Scheme 2: manually benchmark launch parameters for this scenario. BlockSize values that usually perform well are 128, 256, and 512. Testing all three shows that a 1-D grid with BlockSize = 256 gives a large performance gain in every case.
Performance results:

Tesla-P4:

| Case No. | input_shape | data_type | Paddle_modify Perf(s) | Perf_over_paddle_origin(%) | Perf_over_pytorch(%) |
| --- | --- | --- | --- | --- | --- |
| 1 | [16, 16, 16, 16] | float32 | 0.2205 | +36.62 | +29.27 |
| 2 | [16, 35, 1500] | float32 | 2.044 | +31.40 | +23.54 |
| 3 | [16, 16, 16, 16] | float64 | 0.2159 | +24.68 | +32.57 |

Tesla-P40:

| Case No. | input_shape | data_type | Paddle_modify Perf(s) | Perf_over_paddle_origin(%) | Perf_over_pytorch(%) |
| --- | --- | --- | --- | --- | --- |
| 1 | [16, 16, 16, 16] | float32 | 0.1323 | +17.29 | +20.94 |
| 2 | [16, 35, 1500] | float32 | 1.0011 | +33.63 | +26.34 |
| 3 | [16, 16, 16, 16] | float64 | 0.1324 | +28.00 | +9.49 |

Comparing the two, Scheme 2 was chosen; a sketch of the resulting launch configuration follows.
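A minimal sketch of the chosen configuration; the wrapper name LaunchPoissonFixed is hypothetical, and GetPoisson is the op's CUDA kernel shown in the review threads below. Only the fixed 256-thread block and the 1-D grid come from the measurements above.

```cpp
// Sketch of scheme 2: a fixed 256-thread block on a 1-D grid.
constexpr int kBlockSize = 256;  // best of the tested {128, 256, 512}

template <typename T>
void LaunchPoissonFixed(const phi::GPUContext& ctx, const T* in, T* out,
                        int numel, unsigned int seed, unsigned int offset) {
  dim3 block(kBlockSize);
  dim3 grid((numel + kBlockSize - 1) / kBlockSize);  // ceil(numel / 256)
  GetPoisson<T><<<grid, block, 0, ctx.stream()>>>(in, out, numel, seed, offset);
}
```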

@CLAassistant commented Aug 15, 2022

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

1 similar comment

@Rayman96 (Contributor, Author)

All CI checks have passed. Thanks for reviewing 🙏

paddle/phi/kernels/gpu/poisson_kernel.cu: 4 review threads (outdated, resolved)
@ZzSean (Contributor) commented Aug 17, 2022

I suggest trying the ElementwiseKernel template. If the performance difference is small, it would be better to switch to that form so the code style stays more consistent.

@Rayman96 (Contributor, Author)

I will try the ElementwiseKernel approach and report back.

@Rayman96 (Contributor, Author) commented Aug 19, 2022

> I suggest trying the ElementwiseKernel template. If the performance difference is small, it would be better to switch to that form so the code style stays more consistent.

In testing, ElementwiseKernel cannot reach the required performance. Repeated runs show its speed is unstable, with a large gap between the fastest and slowest runs. On average the float32 cases meet the target but still trail the current implementation, and the float64 improvement over the original falls short of the target.
Test environment 1: Tesla P4
Best case

| Case No. | input_shape | data_type | Paddle_ElementWise Perf(s) | Perf_over_paddle_origin(%) |
| --- | --- | --- | --- | --- |
| 1 | [16, 16, 16, 16] | float32 | 0.25622 | +26 (10% behind the current implementation) |
| 2 | [16, 35, 1500] | float32 | 2.24506 | +25 (6% behind the current implementation) |
| 3 | [16, 16, 16, 16] | float64 | 0.27475 | +4 (20% behind the current implementation) |

Worst case

| Case No. | input_shape | data_type | Paddle_ElementWise Perf(s) | Perf_over_paddle_origin(%) |
| --- | --- | --- | --- | --- |
| 1 | [16, 16, 16, 16] | float32 | 0.26173 | +25 (11% behind the current implementation) |
| 2 | [16, 35, 1500] | float32 | 2.2529 | +24 (7% behind the current implementation) |
| 3 | [16, 16, 16, 16] | float64 | 0.28583 | 0 (24% behind the current implementation) |

So the current implementation should be kept.

```cpp
__global__ void GetPoisson(
    const T* in, T* out, const int N, unsigned int seed, unsigned int offset) {
  int idx = threadIdx.x + blockIdx.x * blockDim.x;
  if (idx < N) {
```
Contributor (review comment):

This needs to be rewritten as a loop so that all elements are still processed when numel is very large; wrapping the body with CUDA_KERNEL_LOOP_TYPE is enough (see the sketch below).
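For illustration, a minimal grid-stride version along those lines; a sketch, assuming Paddle's CUDA_KERNEL_LOOP_TYPE macro expands to a standard grid-stride for-loop and that the kernel body samples with cuRAND's Philox Poisson generator as in the existing kernel:

```cpp
#include <curand_kernel.h>

// CUDA_KERNEL_LOOP_TYPE(i, n, IndexT) is assumed to come from Paddle's GPU
// primitives header and expand to a grid-stride for-loop over [0, n).
template <typename T>
__global__ void GetPoisson(
    const T* in, T* out, const int64_t N, unsigned int seed, unsigned int offset) {
  CUDA_KERNEL_LOOP_TYPE(idx, N, int64_t) {
    curandStatePhilox4_32_10_t state;
    curand_init(seed, idx, offset, &state);  // independent Philox subsequence per element
    out[idx] = static_cast<T>(curand_poisson(&state, in[idx]));
  }
}
```

With a capped grid, each thread now loops over idx, idx + stride, idx + 2*stride, ..., so every element is covered regardless of N.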

Contributor Author:

OK, I'll make that change.


```cpp
int block_size = std::min(kMaxBlockDim, ctx.GetMaxThreadsPerBlock());
dim3 dim_block(block_size);
dim3 dim_grid((size + block_size - 1) / block_size);
```
Contributor (review comment):

The grid size also needs an upper bound; see for reference:

```cpp
paddle::platform::LimitGridDim(ctx, &grid_dim);
```
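For concreteness, what such a cap amounts to, sketched by hand; the helper name ClampGridDim is hypothetical, and GetCUDAMaxGridDimSize() is assumed to be the phi::GPUContext accessor for the device's maximum grid dimensions:

```cpp
#include <algorithm>
#include <array>

// Hypothetical stand-in for paddle::platform::LimitGridDim: clamp the 1-D
// grid size to the device's reported maximum grid dimension in x.
inline void ClampGridDim(const phi::GPUContext& ctx, dim3* grid_dim) {
  const std::array<int, 3> max_grid = ctx.GetCUDAMaxGridDimSize();  // assumed accessor
  grid_dim->x = std::min(grid_dim->x, static_cast<unsigned int>(max_grid[0]));
}
```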

Contributor Author:

> The grid size also needs an upper bound; see for reference: paddle::platform::LimitGridDim(ctx, &grid_dim);

OK, I'll make that change.

Contributor Author:

Done.

@Rayman96 (Contributor, Author)

CI has passed; please review.

```diff
@@ -19,6 +19,7 @@ limitations under the License. */
 #include <hiprand_kernel.h>
 #endif
 
+#include "paddle/fluid/platform/device/gpu/gpu_launch_config.h"
```
Contributor (review comment):

In principle, phi does not allow including fluid header files; try replacing this with #include "paddle/phi/backends/gpu/gpu_launch_config.h"?

Contributor (review comment):

paddle::platform::LimitGridDim is still in a header file under fluid; a copy of this function needs to be brought over first, with its namespace updated.

Contributor Author:

OK.

Contributor Author:

> paddle::platform::LimitGridDim is still in a header file under fluid; a copy of this function needs to be brought over first, with its namespace updated.

Done.

@Rayman96 (Contributor, Author)

@chenwhql All CI checks have passed after the changes.

@ZzSean (Contributor) left a comment:

LGTM

@ZzSean merged commit 3c14b09 into PaddlePaddle:develop on Aug 24, 2022