Optimize check_finite_and_unscale_op #31954
Conversation
update Paddle to newest version
Merge newest Paddle code
merge newest Paddle code
Thanks for your contribution!
LGTM.
```cpp
for (int64_t idx = tid; idx < num; idx += gridDim.x * blockDim.x) {
  // get the xs's index of thread
  int xs_index = pre_xs_index;
  while (idx < s_starts[xs_index]) xs_index++;
```
The code at line 48 may never be triggered.
That line has been removed in PR32554.
PR types
Performance optimization
PR changes
OPs
Describe
Background:
CheckFiniteAndUnscale accounts for as much as 5.7% of the timeline of the new ernie doc model. The timeline shows that check_finite_and_unscale_op calls CheckFiniteAndUnscale many times in a single run, up to 300 times, and each call consists of several small kernels, so there is room for optimization.

Code analysis:
The original code contains a for loop: xs and outs are both vector<Tensor*>, and regardless of each tensor's size, the loop calls CheckFiniteAndUnscale once for every tensor in them, as sketched below.
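For illustration, a minimal sketch of that structure, assuming simplified launch parameters and signatures; stream, inverse_scale, and found_inf_data stand in for surrounding context, and this is not the verbatim Paddle source:

```cpp
// Sketch of the original host-side loop: one CheckFiniteAndUnscale
// launch per tensor in xs, however small that tensor is.
for (size_t i = 0; i < xs.size(); ++i) {
  const auto* x = xs[i];
  auto* out = outs[i];
  const int64_t num = x->numel();
  const int block = 1024;
  const int grid = static_cast<int>((num + block - 1) / block);
  CheckFiniteAndUnscale<T><<<grid, block, 0, stream>>>(
      x->data<T>(), inverse_scale, num, found_inf_data,
      out->mutable_data<T>(ctx.GetPlace()));
}
```

With up to 300 calls per run, launch overhead dominates whenever the individual tensors are small.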
Optimization
Approach 1 (commit id: b2eba11):
Clearly, fusing the kernels should bring the most obvious gain: remove the outer for loop so that only one kernel call is needed no matter how large xs.size() is.
Difficulties:
- xs and outs are host-side vector<Tensor*> variables and have to be copied to the device.
- The tensor data in xs and outs are not contiguous in memory, so how does the current thread determine which element of which tensor it is processing? One possible mapping is sketched below.
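One way to answer that question, as a sketch with hypothetical names; the layout of starts with xs.size() + 1 entries is an assumption for convenience, not the PR's exact choice:

```cpp
// Hypothetical device helper: map a flat element index to the tensor
// that owns it plus the offset inside that tensor. `starts` holds
// n + 1 entries; starts[i] is tensor i's first flat index and
// starts[n] equals the total element count.
__device__ void MapIndex(const int64_t* starts, int n, int64_t idx,
                         int* tensor_id, int64_t* offset) {
  int t = 0;
  // Linear scan; a binary search also works when n is large.
  while (t + 1 < n && idx >= starts[t + 1]) ++t;
  *tensor_id = t;
  *offset = idx - starts[t];
}
```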
Optimization points:
- Use memory::Alloc to allocate two pointer arrays of size xs.size(), which store the starting addresses of the data of each Tensor in xs and outs respectively, and copy them to the device with memory::Copy.
- Use memory::Alloc to allocate an int64_t array starts of size xs.size(), in which each element records a tensor's starting flat index, and copy it to the device with memory::Copy. Because this array is read frequently, the kernel stores it in shared memory to avoid the cost of repeated global memory accesses and reduce memory latency.
- Also to avoid the cost of accessing global memory, compute found_inf and scale in registers.
- With one element per thread the grid value becomes very large and a lot of time is spent switching blocks, so each thread now processes 20 elements instead. A condensed kernel sketch follows this list.
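Combining these points, a condensed sketch of what the fused kernel can look like; this is illustrative only: the parameter names, the size + 1 starts layout, and the simplified finiteness check are assumptions, not the exact PR code:

```cpp
// Fused kernel sketch: a single launch covers every tensor in xs.
// xs_ptrs/outs_ptrs are device copies of the per-tensor data pointers;
// starts holds size + 1 flat indices; launch with dynamic shared memory
// of (size + 1) * sizeof(int64_t) and a grid sized so that each thread
// handles roughly 20 elements.
template <typename T>
__global__ void FusedCheckFiniteAndUnscale(const T** xs_ptrs, T** outs_ptrs,
                                           const int64_t* starts, int size,
                                           const T* scale, bool* found_inf,
                                           int64_t total_num) {
  extern __shared__ int64_t s_starts[];
  // Stage the frequently read starts array in shared memory.
  for (int i = threadIdx.x; i <= size; i += blockDim.x) s_starts[i] = starts[i];
  __syncthreads();

  // Keep scale and the local inf/nan flag in registers instead of
  // re-reading global memory on every element.
  const T inverse_scale = static_cast<T>(1.0) / (*scale);
  bool local_found_inf = false;

  int xs_index = 0;
  const int64_t stride = static_cast<int64_t>(gridDim.x) * blockDim.x;
  for (int64_t idx = blockIdx.x * blockDim.x + threadIdx.x; idx < total_num;
       idx += stride) {
    // idx grows monotonically, so the owning tensor index only advances.
    while (xs_index + 1 < size && idx >= s_starts[xs_index + 1]) ++xs_index;
    const int64_t offset = idx - s_starts[xs_index];
    const T val = xs_ptrs[xs_index][offset];
    if (!isfinite(static_cast<float>(val))) local_found_inf = true;
    outs_ptrs[xs_index][offset] = val * inverse_scale;
  }
  if (local_found_inf) *found_inf = true;
}
```

The host side would fill xs_ptrs, outs_ptrs, and starts from the vector<Tensor*> inputs using memory::Alloc and copy them over with memory::Copy before the single launch, as described in the points above.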
Optimization results:
ResNet50 convergence verification. Model script: ResNet50_fp16.sh