From bf480f7fa6b7db757f5782f869810672124b5730 Mon Sep 17 00:00:00 2001
From: Rayman96 <lqr5259@outlook.com>
Date: Wed, 10 Aug 2022 20:47:56 +0800
Subject: [PATCH 1/4] =?UTF-8?q?=E3=80=90Hackathon=20No.34=E3=80=91?=
 =?UTF-8?q?=E4=B8=BA=20Paddle=20=E4=BC=98=E5=8C=96=20poisson=20op=20?=
 =?UTF-8?q?=E5=9C=A8=20GPU=20=E4=B8=8A=E7=9A=84=E8=AE=A1=E7=AE=97=E6=80=A7?=
 =?UTF-8?q?=E8=83=BD?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
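Note on section 2.2 of the RFC below: approach 2 settles on a one-dimensional grid with BlockSize=256. As a rough illustration only, the host-side arithmetic for such a launch could look like the Python sketch below. It assumes one thread per output element and a ceiling-divided grid; the helper name `launch_config_1d` is hypothetical and is not taken from the Paddle source.

```python
from math import prod

BLOCK_SIZE = 256  # value the RFC reports as best among 128, 256 and 512


def launch_config_1d(shape, block_size=BLOCK_SIZE):
    """Return (grid, block) for a 1-D launch with one thread per element.

    Hypothetical helper: a real kernel may handle several elements per
    thread, in which case the grid would be smaller.
    """
    numel = prod(shape)
    grid = (numel + block_size - 1) // block_size  # ceiling division
    return grid, block_size


# Shapes of the benchmark cases listed in the RFC tables.
for shape in ([16, 16, 16, 16], [16, 35, 1500]):
    grid, block = launch_config_1d(shape)
    print(shape, "grid:", grid, "block:", block)
```

The second shape is not a multiple of 256, so the last block is only partly used and the kernel needs a bounds check on the linear index.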
 .../OPs-Perf/20220810_poisson_optimization.md | 74 +++++++++++++++++++
 1 file changed, 74 insertions(+)
 create mode 100644 rfcs/OPs-Perf/20220810_poisson_optimization.md
diff --git a/rfcs/OPs-Perf/20220810_poisson_optimization.md b/rfcs/OPs-Perf/20220810_poisson_optimization.md
new file mode 100644
index 000000000..f4352df2f
--- /dev/null
+++ b/rfcs/OPs-Perf/20220810_poisson_optimization.md
@@ -0,0 +1,74 @@
+Poisson OP Performance Optimization Design Document
+
+| Basic information | Content |
+| ----------------- | ------- |
+| Author | Rayman's team |
+| Submission date | 2022-08-10 |
+| Version | V1.0 |
+| Paddle version dependency | PaddleDevelop |
+| File name | 20220810_poisson_optimization.md |
+
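+# 1 Background and Significance
+
+Paddle's PoissonKernel is currently implemented as a combination of the cuRAND library, for_range, and a Functor. Calling cuRAND on the device side is the most efficient random-number generation approach known to us, so the remaining room for improvement lies in how the threads are organized and launched.
+
+## 1.1 Current State in Paddle
+
+The table below lists this OP's performance in the current Paddle framework (develop branch) for each case in [OP Benchmark](https://github.com/PaddlePaddle/benchmark/tree/master/api/tests_v2), measured on a Tesla P4.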
+| Case No. | input_shape | data_type | Paddle Perf(s) |
+|---|---|---|---|
+| 1 | [16, 16, 16, 16] |float32|0.3478 | 
+| 2 | [16, 35, 1500] |float32| 2.9794|
+| 3 | [16, 16, 16, 16] |float64 |0.2866|
+
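+## 1.2 Survey of Industry Solutions
+
+We benchmarked PyTorch's current Poisson sampling performance; the last column is PyTorch's improvement relative to Paddle.
+| Case No. | input_shape | data_type | PyTorch Perf(s) | Perf_over_percent(%) |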
+|---|---|---|---|---|
+| 1 | [16, 16, 16, 16] |float32|0.3117 | +10.38
+| 2 | [16, 35, 1500] |float32| 2.6733|+10.27
+| 3 | [16, 16, 16, 16] |float64 |0.3202|-11.72
+
+## 1.3 Comparative Analysis
+Comparing Table 1 and Table 2: in case 1 and case 2, which process float32 data, PyTorch is about 10% faster than Paddle. In case 3, which processes float64 data, PyTorch is 11.7% slower than Paddle.
+
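+# 2 Design and Expected Performance
+
+## 2.1 Key Modules and Performance Improvement Points
+ + The performance gap between Paddle and PyTorch is about 10%, which is not large. Source analysis shows that both use the cuRAND library, so trying to achieve a qualitative leap by changing how Poisson samples are generated is not a promising direction.
+ + The key to the improvement is optimizing the host-side code in Poisson_kernel.cu: tuning the GPU grid and block counts to find better launch parameters, with a target of more than 7% improvement.
+
+## 2.2 Host-Side Computation Flow
+Two approaches were tried during the design phase:
+1. Approach 1: obtain a reasonably good launch configuration through the GetGpuLaunchConfig1D method in Paddle's existing gpu_launch_config.h. In testing, this gave about a 5% improvement on float32 data but about a 10% regression on float64 data, so it is not the preferred approach.
+2. Approach 2: manually search for better launch parameters for this scenario. BlockSize values that typically perform well are 128, 256, and 512. Experiments with these three values show that a one-dimensional grid with BlockSize=256 gives a substantial improvement across the different test cases and test environments.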
+
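+## 2.3 Device-Side Computation Flow
+
+The original logic is kept: samples are generated with cuRAND's curand_poisson function.
+
+# 3 Testing and Acceptance Considerations
+
+Test environment 1: Tesla P4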
+| Case No. | input_shape | data_type | Paddle_modify Perf(s) |Perf_over_paddle_origin(%)|Perf_over_pytorch(%)
+|---|---|---|---|---|---|
+| 1 | [16, 16, 16, 16] |float32|0.2205 | +36.62|+29.27
+| 2 | [16, 35, 1500] |float32| 2.044|+31.40|+23.54
+| 3 | [16, 16, 16, 16] |float64 |0.2159|+24.68|+32.57
+
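+# 4 Feasibility Analysis and Schedule
+
+Development and testing are complete; the PR will be submitted once the code style has been cleaned up.
+
+Planned submission date: August 15
+
+
+# 5 Impact
+
+No impact on other modules.
+
+
+ Attachments and References
+
+[1]. [OP Benchmark User Guide](https://github.com/PaddlePaddle/benchmark/blob/master/api/README.md)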

From 4f657ad19162d2e38dab89096edf89ce519b4432 Mon Sep 17 00:00:00 2001
From: Rayman <lqr5259@outlook.com>
Date: Sat, 13 Aug 2022 23:07:08 +0800
Subject: [PATCH 2/4] =?UTF-8?q?=E3=80=90Hackathon=20No.13=E3=80=91?=
 =?UTF-8?q?=E4=B8=BA=20Paddle=20=E6=96=B0=E5=A2=9E=20triu=5Findices=20API?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
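Note on section III of the RFC below: the PyTorch CPU path fills the output sequentially, while the CUDA path lets each thread derive its own (row, col) pair from a linear index. The Python sketch below mirrors both ideas so they can be compared on small inputs. It is an illustration under simplifying assumptions, not a port: `tril_size` counts elements with a plain sum and `triu_coord` walks the trapezoid rows in a loop, whereas PyTorch uses closed-form expressions for both. All function names here are hypothetical.

```python
def tril_size(row, col, offset):
    """Number of (r, c) with c - r <= offset in a row x col matrix."""
    return sum(min(col, max(0, r + offset + 1)) for r in range(row))


def triu_size(row, col, offset):
    """Upper-triangle count: everything not in the tril with offset - 1."""
    return row * col - tril_size(row, col, offset - 1)


def triu_indices_seq(row, col, offset=0):
    """Sequential fill, following the structure of the CPU loop."""
    size = triu_size(row, col, offset)
    rows, cols = [], []
    r, c = 0, max(0, offset)
    while len(rows) < size:
        rows.append(r)
        cols.append(c)
        c += 1
        if c >= col:  # row exhausted: jump to the first valid column below
            r += 1
            c = max(0, r + offset)
    return [rows, cols]


def triu_coord(i, row, col, offset):
    """Map one linear index to (r, c) independently, as a GPU thread would."""
    first_row = max(col - offset, 0) if offset > 0 else col
    rect = min(row, -offset) * col if offset < 0 else 0
    if i < rect:  # top rectangle: every column is part of the result
        return i // col, i % col
    j, t = i - rect, 0  # t is the row index inside the trapezoid
    while j >= first_row - t:  # skip trapezoid rows that end before j
        j -= first_row - t
        t += 1
    return t + rect // col, j + t + max(0, offset)


def triu_indices_par(row, col, offset=0):
    """Build the result from independent per-index computations."""
    size = triu_size(row, col, offset)
    coords = [triu_coord(i, row, col, offset) for i in range(size)]
    return [[r for r, _ in coords], [c for _, c in coords]]


print(triu_indices_seq(4, 3, -1))
print(triu_indices_par(4, 3, -1))
```

Both functions return the `[2, N]` layout described in the RFC (row indices first, column indices second), so either can serve as a small reference when checking kernel output.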
 .../20220813_api_design_for_triu_indices.md   | 372 ++++++++++++++++++
 1 file changed, 372 insertions(+)
 create mode 100644 rfcs/APIs/20220813_api_design_for_triu_indices.md

diff --git a/rfcs/APIs/20220813_api_design_for_triu_indices.md b/rfcs/APIs/20220813_api_design_for_triu_indices.md
new file mode 100644
index 000000000..bc9ab45c7
--- /dev/null
+++ b/rfcs/APIs/20220813_api_design_for_triu_indices.md
@@ -0,0 +1,372 @@
+# paddle.triu_indices Design Document
+
+
+|API name | paddle.triu_indices |
+|---|---|
+|Author | Rayman's team |
+|Submission date | 2022-08-13 |
+|Version | V1.0 |
+|Paddle version dependency | develop |
+|File name | 20220813_api_design_for_triu_indices.md |
+
+
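+# I. Overview
+## 1. Background
+`triu_indices` returns the indices of the upper-triangular elements of a 2-D matrix. The output Tensor has shape $[2, N]$: the first row holds the row indices of the upper-triangular elements and the second row holds their column indices. It is called in the same way as `tril_indices(rows, cols, offset)`. The valid range of offset is $[-rows+1, cols-1]$.
+
+## 2. Functional Goal
+
+Add the `paddle.triu_indices` API to the Paddle framework.
+
+## 3. Significance
+
+Paddle will provide an efficient `triu_indices` API that users can call directly.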
+
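+# II. Current State in Paddle
+Paddle does not currently provide a `triu_indices` API, and its functionality cannot be obtained indirectly by combining existing APIs.
+
+Related interfaces:
+1. Paddle provides the triu function, which takes a matrix and a diagonal parameter and returns the upper-triangular part of the matrix with all other elements set to zero. The call signature is `paddle.triu(input, diagonal=0, name=None)`
+[Source](https://github.com/PaddlePaddle/Paddle/blob/release/2.3/python/paddle/tensor/creation.py#L674)
+[Documentation](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/triu_cn.html#triu)
+
+2. Paddle provides the tril_indices function, which is similar to the desired triu_indices but returns the indices of the lower-triangular elements ([source](https://github.com/PaddlePaddle/Paddle/blob/6d31dc937704380efe2dee97716c3da47b7060f1/python/paddle/tensor/creation.py#L1721)).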
+
+```python
+import paddle
+            
+# example 1, default offset value
+data1 = paddle.tril_indices(4,4,0)
+print(data1)
+# [[0, 1, 1, 2, 2, 2, 3, 3, 3, 3], 
+#  [0, 0, 1, 0, 1, 2, 0, 1, 2, 3]]
+# example 2, positive offset value
+data2 = paddle.tril_indices(4,4,2)
+print(data2)
+# [[0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3], 
+#  [0, 1, 2, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]]
+# example 3, negative offset value
+data3 = paddle.tril_indices(4,4,-1)
+print(data3)
+# [[ 1, 2, 2, 3, 3, 3],
+#  [ 0, 0, 1, 0, 1, 2]]
+```
+
+# III. Survey of Industry Solutions
+Both PyTorch and NumPy provide a triu_indices API.
+## PyTorch
+
+### Implementation Notes
+The interface is declared in PyTorch as:  
+
+```python
+func: triu_indices(int row, int col, int offset=0, *, ScalarType? dtype=long, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor
+  dispatch:
+    CPU: triu_indices_cpu
+    CUDA: triu_indices_cuda
+```
+
+In PyTorch, triu_indices is implemented in C++ and CUDA. The core CPU code is:  
+
+```c++
+Tensor triu_indices_cpu(
+    int64_t row, int64_t col, int64_t offset, c10::optional<ScalarType> dtype_opt,
+    c10::optional<Layout> layout_opt, c10::optional<Device> device_opt, c10::optional<bool> pin_memory_opt) {
+  if (!dtype_opt.has_value()) {
+    dtype_opt = ScalarType::Long;
+  }
+
+  check_args(row, col, layout_opt);
+
+  auto triu_size = row * col - get_tril_size(row, col, offset - 1);
+
+  // create an empty Tensor with correct size
+  auto result = at::native::empty_cpu({2, triu_size}, dtype_opt, layout_opt, device_opt, pin_memory_opt);
+
+  AT_DISPATCH_ALL_TYPES_AND(kBFloat16, result.scalar_type(), "triu_indices", [&]() -> void {
+    // fill the Tensor with correct values
+    scalar_t* result_data = result.data_ptr<scalar_t>();
+    int64_t i = 0;
+    // not typing std::max with scalar_t as it could be an unsigned type
+    // NOTE: no need to check if the returned value of std::max overflows
+    // scalar_t, as i and triu_size act as a guard.
+    scalar_t c = std::max<int64_t>(0, offset), r = 0;
+    while (i < triu_size) {
+      result_data[i] = r;
+      result_data[triu_size + i++] = c;
+
+      // move to the next column and check if (r, c) is still in bound
+      c += 1;
+      if (c >= col) {
+        r += 1;
+        // not typing std::max with scalar_t as it could be an unsigned type
+        // NOTE: not necessary to check if c is less than col or overflows here,
+        // because i and triu_size act as a guard.
+        c = std::max<int64_t>(0, r + offset);
+      }
+    }
+  });
+
+  return result;
+}
+```
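+The main logic of the CPU code is:
+
+1. Check the input arguments.
+2. Reuse get_tril_size(): subtract the size of the lower-triangular region from the total number of elements to get the required output size.
+3. Create an empty Tensor and fill it with the correct values.
+
+The core GPU code is: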
+```c++
+template <typename scalar_t>
+__global__
+void triu_indices_kernel(scalar_t * tensor,
+                         int64_t col_offset,
+                         int64_t m_first_row,
+                         int64_t col,
+                         int64_t rectangle_size,
+                         int64_t triu_size) {
+  int64_t linear_index = blockIdx.x * blockDim.x + threadIdx.x;
+
+  if (linear_index < triu_size) {
+    int64_t r, c;
+    if (linear_index < rectangle_size) {
+      // the coordinate is within the top rectangle
+      r = linear_index / col;
+      c = linear_index % col;
+    } else {
+      // the coordinate falls in the bottom trapezoid
+      get_coordinate_in_triu_trapezoid(
+        m_first_row, linear_index - rectangle_size, r, c);
+      r += rectangle_size / col;
+    }
+
+    c += col_offset;
+    tensor[linear_index] = r;
+    tensor[linear_index + triu_size] = c;
+  }
+}
+
+// Some Large test cases for the fallback binary search path is disabled by
+// default to speed up CI tests and to avoid OOM error. When modifying the
+// implementation, please enable them in test/test_cuda.py and make sure they
+// pass on your local server.
+Tensor triu_indices_cuda(
+    int64_t row, int64_t col, int64_t offset, c10::optional<ScalarType> dtype_opt,
+    c10::optional<Layout> layout_opt, c10::optional<Device> device_opt, c10::optional<bool> pin_memory_opt) {
+  check_args(row, col, layout_opt);
+
+  auto triu_size = row * col - get_tril_size(row, col, offset - 1);
+  auto tensor = empty_cuda({2, triu_size}, dtype_opt, layout_opt, device_opt, pin_memory_opt);
+
+  if (triu_size > 0) {
+    // # of triu elements in the first row
+    auto m_first_row = offset > 0 ?
+      std::max<int64_t>(col - offset, 0) : // upper bounded by col
+      col;
+
+    // size of the top rectangle
+    int64_t rectangle_size = 0;
+    if (offset < 0) {
+      rectangle_size = std::min<int64_t>(row, -offset) * col;
+    }
+
+    dim3 dim_block = cuda::getApplyBlock();
+    dim3 dim_grid;
+
+    // using triu_size instead of tensor.numel(), as each thread takes care of
+    // two elements in the tensor.
+    TORCH_CHECK(
+      cuda::getApplyGrid(triu_size, dim_grid, tensor.get_device()),
+      "unable to get dim grid");
+
+    AT_DISPATCH_ALL_TYPES_AND(at::ScalarType::Half, tensor.scalar_type(), "triu_indices_cuda", [&] {
+      triu_indices_kernel<<<
+          dim_grid, dim_block, 0, at::cuda::getCurrentCUDAStream()>>>(
+        tensor.data_ptr<scalar_t>(),
+        std::max<int64_t>(0, offset),
+        m_first_row,
+        col,
+        rectangle_size,
+        triu_size);
+      C10_CUDA_KERNEL_LAUNCH_CHECK();
+    });
+  }
+
+  return tensor;
+}
+```
+
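+The CUDA code above is the GPU implementation of the same computation. It applies no special optimizations overall, and the logic is clear and concise.
+
+
+### Usage Example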
+
+```python
+>>> import torch
+>>> a = torch.triu_indices(3, 3)
+>>> a
+tensor([[0, 0, 0, 1, 1, 2],
+        [0, 1, 2, 1, 2, 2]])
+
+>>> a = torch.triu_indices(4, 3, -1)
+>>> a
+tensor([[0, 0, 0, 1, 1, 1, 2, 2, 3],
+        [0, 1, 2, 0, 1, 2, 1, 2, 2]])
+
+>>> a = torch.triu_indices(4, 3, 1)
+>>> a
+tensor([[0, 0, 1],
+        [1, 2, 2]])
+```
+
+## NumPy
+
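+### Implementation Notes
+
+The call signature is `numpy.triu_indices(n, k=0, m=None)`, where n is the number of rows, m is the number of columns (optional, defaults to n), and k is the diagonal offset; positive values shift toward the upper right.  
+It returns a tuple of two arrays holding the row and column indices of the selected elements.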
+
+```python
+def triu_indices(n, k=0, m=None):
+    tri_ = ~tri(n, m, k=k - 1, dtype=bool)
+
+    return tuple(broadcast_to(inds, tri_.shape)[tri_]
+                 for inds in indices(tri_.shape, sparse=True))
+```
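+The code above calls `tri()` and inverts the result to obtain an n×m boolean matrix whose upper-triangular elements are True and whose remaining elements are False. Internally, `tri()` is implemented with the umath library.
+It then uses indices() to obtain the row and column subscripts of this matrix and broadcast_to() to expand the coordinates.
+
+### Usage Example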
+
+```python
+>>> import numpy as np
+>>> iu1 = np.triu_indices(4)
+>>> iu2 = np.triu_indices(4, 2)
+
+Here is how they can be used with a sample array:
+
+>>> a = np.arange(16).reshape(4, 4)
+>>> a
+array([[ 0,  1,  2,  3],
+        [ 4,  5,  6,  7],
+        [ 8,  9, 10, 11],
+        [12, 13, 14, 15]])
+
+Both for indexing:
+
+>>> a[iu1]
+array([ 0,  1,  2, ..., 10, 11, 15])
+
+And for assigning values:
+
+>>> a[iu1] = -1
+>>> a
+array([[-1, -1, -1, -1],
+        [ 4, -1, -1, -1],
+        [ 8,  9, -1, -1],
+        [12, 13, 14, -1]])
+
+These cover only a small part of the whole array (two diagonals right
+of the main one):
+
+>>> a[iu2] = -10
+>>> a
+array([[ -1,  -1, -10, -10],
+        [  4,  -1,  -1, -10],
+        [  8,   9,  -1,  -1],
+        [ 12,  13,  14,  -1]])
+
+```
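+With the triu_indices interface you can obtain the row and column coordinates of the elements on and above a chosen diagonal, and use them to modify those elements of the matrix.
+
+# IV. Comparative Analysis
+ `numpy.triu_indices` and `torch.triu_indices` provide the same functionality, but their implementations differ slightly.  
+ PyTorch computes the output indices directly from the definition, whereas NumPy composes a series of helper functions to produce them. 
+
+ NumPy's implementation is elegant, but its intermediate arrays take a lot of memory, which hurts performance at large sizes, and it does not support GPU acceleration.
+ 
+ In addition, `paddle.tril_indices` is implemented with logic similar to PyTorch's, so `paddle.triu_indices` mainly follows PyTorch's approach to keep the code consistent.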
+
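+# V. Design and Implementation
+
+## Naming and Parameter Design
+The API is designed as `paddle.triu_indices(rows, cols, offset, dtype=None)`. It produces a 2-D array with 2 rows and N columns that stores the coordinates of the selected upper-triangular region; the first row holds the row coordinates and the second row holds the column coordinates.
+
+Parameter type requirements:
+
+- `rows`, `cols`, and `offset` are of type `int`
+- The output `Tensor` dtype is 'int64' when the dtype argument is left at its default of None; otherwise the user-supplied dtype is used
+
+## Low-Level OP Design
+
+Add the description of the triu_indices operator in `paddle/fluid/operators/triu_indices_op.cc`.
+
+Declare the shape-inference function prototype in `paddle/phi/infermeta/nullary.h` and implement it in `paddle/phi/infermeta/nullary.cc`.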
+
+```c++
+void TriuIndicesInferMeta(const int& rows,
+                       const int& cols,
+                       const int& offset,
+                       MetaTensor* out);
+```
+
+Declare the kernel function prototype in `paddle/phi/kernels/triu_indices_kernel.h`  
+
+```c++
+template <typename Context>
+void TriuIndicesKernel( const Context& dev_ctx,
+                        const int& rows,
+                        const int& cols,
+                        const int& offset,
+                        DataType dtype,
+                        DenseTensor* out);
+```
+
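+Register and implement the kernels in `paddle/phi/kernels/cpu/triu_indices_kernel.cc` and `paddle/phi/kernels/gpu/triu_indices_kernel.cu`.  
+The implementation follows PyTorch in computing the indices directly.  
+CPU implementation: compute the output array size, allocate the buffer, and iterate over every position, writing its row and column coordinates.  
+GPU implementation: compute the output array size, compute the original rows and columns each block is responsible for, split the work evenly according to the output array size, and implement the assignment kernel for each block. (The current PyTorch logic incurs some warp divergence on the device side; if time permits, we can try to optimize this.)
+
+## Python API Implementation
+
+Add the `triu_indices` function in `python/paddle/fluid/layers/tensor.py` with an English docstring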
+
+```python
+def triu_indices(rows, cols, offset, dtype=None):
+    # ...
+    # Check arguments; convert non-integer values to integers and issue a warning
+    # ...
+    if dtype is None:
+        dtype = 'int64'
+    # ...
+    # Call the kernel function
+    TriuIndicesKernel(dev_ctx,rows,cols,offset,dtype,out)
+    # ...
+    return out
+```
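+## Unit Tests and Documentation
+Add `test_triu_indices.py` under `python/paddle/fluid/tests/unittests/` for unit testing. The tests compute the expected result with NumPy and compare against it, so the behavior is aligned with NumPy.  
+Add the Chinese API documentation under `docs/api/paddle/`
+
+# VI. Testing and Acceptance Considerations
+
+- Validate that inputs are legal and valid;
+
+- Check consistency with NumPy results
+  across the following cases:
+  $(m>n \lor n>m \lor offset \in [1-rows, cols-1] \lor offset \notin [1-rows, cols-1])$
+
+- CPU and GPU tests.
+
+# VII. Feasibility Analysis and Schedule
+The main development is complete; unit tests will be finished and submitted before August 21.
+
+# VIII. Impact
+triu_indices is a standalone API and does not affect other APIs.
+
+# Glossary
+None
+
+# Attachments and References
+None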

From 3ae628a3e031b114e5c5f3a910085fe2e2ee346d Mon Sep 17 00:00:00 2001
From: Rayman96 <lqr5259@outlook.com>
Date: Sat, 13 Aug 2022 23:09:38 +0800
Subject: [PATCH 3/4] =?UTF-8?q?=E3=80=90Hackathon=20No.13=E3=80=91?=
 =?UTF-8?q?=E4=B8=BA=20Paddle=20=E6=96=B0=E5=A2=9E=20triu=5Findices=20API?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .../OPs-Perf/20220810_poisson_optimization.md | 74 -------------------
 1 file changed, 74 deletions(-)
 delete mode 100644 rfcs/OPs-Perf/20220810_poisson_optimization.md

diff --git a/rfcs/OPs-Perf/20220810_poisson_optimization.md b/rfcs/OPs-Perf/20220810_poisson_optimization.md
deleted file mode 100644
index f4352df2f..000000000
--- a/rfcs/OPs-Perf/20220810_poisson_optimization.md
+++ /dev/null
@@ -1,74 +0,0 @@
-Poisson OP Performance Optimization Design Document
-
-| Basic information | Content |
-| ----------------- | ------- |
-| Author | Rayman's team |
-| Submission date | 2022-08-10 |
-| Version | V1.0 |
-| Paddle version dependency | PaddleDevelop |
-| File name | 20220810_poisson_optimization.md |
-
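-# 1 Background and Significance
-
-Paddle's PoissonKernel is currently implemented as a combination of the cuRAND library, for_range, and a Functor. Calling cuRAND on the device side is the most efficient random-number generation approach known to us, so the remaining room for improvement lies in how the threads are organized and launched.
-
-## 1.1 Current State in Paddle
-
-The table below lists this OP's performance in the current Paddle framework (develop branch) for each case in [OP Benchmark](https://github.com/PaddlePaddle/benchmark/tree/master/api/tests_v2), measured on a Tesla P4.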
-| Case No. | input_shape | data_type | Paddle Perf(s) |
-|---|---|---|---|
-| 1 | [16, 16, 16, 16] |float32|0.3478 | 
-| 2 | [16, 35, 1500] |float32| 2.9794|
-| 3 | [16, 16, 16, 16] |float64 |0.2866|
-
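-## 1.2 Survey of Industry Solutions
-
-We benchmarked PyTorch's current Poisson sampling performance; the last column is PyTorch's improvement relative to Paddle.
-| Case No. | input_shape | data_type | PyTorch Perf(s) | Perf_over_percent(%) |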
-|---|---|---|---|---|
-| 1 | [16, 16, 16, 16] |float32|0.3117 | +10.38
-| 2 | [16, 35, 1500] |float32| 2.6733|+10.27
-| 3 | [16, 16, 16, 16] |float64 |0.3202|-11.72
-
-## 1.3 Comparative Analysis
-Comparing Table 1 and Table 2: in case 1 and case 2, which process float32 data, PyTorch is about 10% faster than Paddle. In case 3, which processes float64 data, PyTorch is 11.7% slower than Paddle.
-
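-# 2 Design and Expected Performance
-
-## 2.1 Key Modules and Performance Improvement Points
- + The performance gap between Paddle and PyTorch is about 10%, which is not large. Source analysis shows that both use the cuRAND library, so trying to achieve a qualitative leap by changing how Poisson samples are generated is not a promising direction.
- + The key to the improvement is optimizing the host-side code in Poisson_kernel.cu: tuning the GPU grid and block counts to find better launch parameters, with a target of more than 7% improvement.
-
-## 2.2 Host-Side Computation Flow
-Two approaches were tried during the design phase:
-1. Approach 1: obtain a reasonably good launch configuration through the GetGpuLaunchConfig1D method in Paddle's existing gpu_launch_config.h. In testing, this gave about a 5% improvement on float32 data but about a 10% regression on float64 data, so it is not the preferred approach.
-2. Approach 2: manually search for better launch parameters for this scenario. BlockSize values that typically perform well are 128, 256, and 512. Experiments with these three values show that a one-dimensional grid with BlockSize=256 gives a substantial improvement across the different test cases and test environments.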
-
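-## 2.3 Device-Side Computation Flow
-
-The original logic is kept: samples are generated with cuRAND's curand_poisson function.
-
-# 3 Testing and Acceptance Considerations
-
-Test environment 1: Tesla P4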
-| Case No. | input_shape | data_type | Paddle_modify Perf(s) |Perf_over_paddle_origin(%)|Perf_over_pytorch(%)
-|---|---|---|---|---|---|
-| 1 | [16, 16, 16, 16] |float32|0.2205 | +36.62|+29.27
-| 2 | [16, 35, 1500] |float32| 2.044|+31.40|+23.54
-| 3 | [16, 16, 16, 16] |float64 |0.2159|+24.68|+32.57
-
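-# 4 Feasibility Analysis and Schedule
-
-Development and testing are complete; the PR will be submitted once the code style has been cleaned up.
-
-Planned submission date: August 15
-
-
-# 5 Impact
-
-No impact on other modules.
-
-
- Attachments and References
-
-[1]. [OP Benchmark User Guide](https://github.com/PaddlePaddle/benchmark/blob/master/api/README.md)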

From 5b909e2b9a49269d84c624975daa1b80899663ea Mon Sep 17 00:00:00 2001
From: Rayman96 <lqr5259@outlook.com>
Date: Mon, 15 Aug 2022 15:36:49 +0800
Subject: [PATCH 4/4] modify doc

---
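Note on the revised Python stub in this patch: its body is reduced to an argument check and an operator call. The Python sketch below shows one way the checks and the expected output width could be expressed for use in unit tests. The accepted dtypes, the error types, and the helper names `check_triu_indices_args` and `triu_count` are assumptions made for illustration; the RFC does not specify them.

```python
def check_triu_indices_args(row, col, offset=0, dtype="int64"):
    """Validate arguments; raise on anything this sketch treats as illegal."""
    for name, value in (("row", row), ("col", col), ("offset", offset)):
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"{name} must be an int, got {type(value).__name__}")
    if row < 0 or col < 0:
        raise ValueError("row and col must be non-negative")
    if dtype not in ("int32", "int64"):  # assumed set of index dtypes
        raise ValueError(f"unsupported dtype: {dtype}")


def triu_count(row, col, offset=0):
    """Width N of the [2, N] output: number of (r, c) with c - r >= offset."""
    check_triu_indices_args(row, col, offset)
    return sum(max(0, col - max(0, r + offset)) for r in range(row))


# offset = 1 - row already selects every element; offset = col selects none.
print(triu_count(4, 3, -3), triu_count(4, 3, 0), triu_count(4, 3, 3))
```

Offsets outside $[1-rows, cols-1]$ are therefore still meaningful inputs: they saturate to the full matrix or to an empty result, which is why section VI of the RFC lists them as separate test cases.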
 .../20220813_api_design_for_triu_indices.md   | 19 ++++++-------------
 1 file changed, 6 insertions(+), 13 deletions(-)

diff --git a/rfcs/APIs/20220813_api_design_for_triu_indices.md b/rfcs/APIs/20220813_api_design_for_triu_indices.md
index bc9ab45c7..42dac004f 100644
--- a/rfcs/APIs/20220813_api_design_for_triu_indices.md
+++ b/rfcs/APIs/20220813_api_design_for_triu_indices.md
@@ -57,13 +57,10 @@ PyTorch和Numpy中都有triu_indices这个API
 ## PyTorch
 
 ### Implementation Notes
-The interface is declared in PyTorch as:  
+PyTorch's interface is documented as follows ([online documentation](https://pytorch.org/docs/stable/generated/torch.triu_indices.html?highlight=triu_indices#torch.triu_indices)):
 
 ```python
-func: triu_indices(int row, int col, int offset=0, *, ScalarType? dtype=long, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor
-  dispatch:
-    CPU: triu_indices_cpu
-    CUDA: triu_indices_cuda
+  torch.triu_indices(row, col, offset=0, *, dtype=torch.long, device='cpu', layout=torch.strided) → Tensor
 ```
 
 In PyTorch, triu_indices is implemented in C++ and CUDA. The core CPU code is:  
@@ -330,18 +327,14 @@ GPU实现逻辑：计算输出数组大小，计算每个block负责的原始行
 
 ## Python API Implementation
 
-Add the `triu_indices` function in `python/paddle/fluid/layers/tensor.py` with an English docstring
+Add the `triu_indices` function in `python/paddle/tensor/creation.py` with an English docstring
 
 ```python
-def triu_indices(rows, cols, offset, dtype=None):
+def triu_indices(row, col, offset=0, dtype='int64'):
     # ...
-    # Check arguments; convert non-integer values to integers and issue a warning
+    # Check arguments
     # ...
-    if dtype is None:
-        dtype = 'int64'
-    # ...
-    # Call the kernel function
-    TriuIndicesKernel(dev_ctx,rows,cols,offset,dtype,out)
+    # Add the operator
     # ...
     return out
 ```