From 6e3e9de190b321c9c104268337e6c1ac16cde8fa Mon Sep 17 00:00:00 2001 From: Li-fAngyU <56572498+Li-fAngyU@users.noreply.github.com> Date: Fri, 8 Jul 2022 16:40:24 +0800 Subject: [PATCH 1/4] Create 20220708_api_design_for_bucketize.md --- .../APIs/20220708_api_design_for_bucketize.md | 132 ++++++++++++++++++ 1 file changed, 132 insertions(+) create mode 100644 rfcs/APIs/20220708_api_design_for_bucketize.md diff --git a/rfcs/APIs/20220708_api_design_for_bucketize.md b/rfcs/APIs/20220708_api_design_for_bucketize.md new file mode 100644 index 000000000..11c8a5e94 --- /dev/null +++ b/rfcs/APIs/20220708_api_design_for_bucketize.md @@ -0,0 +1,132 @@ +# paddle.Tensor.bucketize 设计文档 + +|API名称 | paddle.bucketize | +|---|---| +|提交作者 | 李芳钰 | +|提交时间 | 2022-07-8 | +|版本号 | V1.0 | +|依赖飞桨版本 | develop | +|文件名 | 20220708_api_design_for_bucketize.md
| + +# 一、概述 + +## 1、相关背景 +为了提升飞桨API丰富度,Paddle需要扩充API`paddle.bucketize`的功能。 +## 2、功能目标 +增加API`paddle.bucketize`,实现根据边界返回输入值的桶索引。 +## 3、意义 +飞桨支持`paddle.bucketize`的API功能。 + +# 二、飞桨现状 +目前paddle可直接由`paddle.searchsorted`API,直接实现该功能。 + +paddle已经实现了[paddle.searchsorted](https://github.com/PaddlePaddle/Paddle/blob/release/2.3/python/paddle/tensor/search.py#L910)API,所以只需要调用该API既可以实现该功能。 + +需要注意的是`paddle.bucketize`处理的sorted_sequence特殊要求为1-D Tensor。 + +# 三、业内方案调研 +## Numpy +### 实现方法 +以现有numpy python API组合实现,[代码位置](https://github.com/numpy/numpy/blob/v1.23.0/numpy/lib/function_base.py#L5447-L5555). +其中核心代码为: +```Python + x = _nx.asarray(x) + bins = _nx.asarray(bins) + + # here for compatibility, searchsorted below is happy to take this + if np.issubdtype(x.dtype, _nx.complexfloating): + raise TypeError("x may not be complex") + + mono = _monotonicity(bins) + if mono == 0: + raise ValueError("bins must be monotonically increasing or decreasing") + + # this is backwards because the arguments below are swapped + side = 'left' if right else 'right' + if mono == -1: + # reverse the bins, and invert the results + return len(bins) - _nx.searchsorted(bins[::-1], x, side=side) + else: + return _nx.searchsorted(bins, x, side=side) +``` +整体逻辑为: + +- 通过`_monotonicity`判断箱子是否单调递增或者递减。 +- 然后根据`mono`和参数`right`决定是否需要反转箱子。 +- 最后也是通过`searchsorted`直接返回输入对应的箱子索引。 + +## Pytorch +Pytorch中有API`torch.bucketize(input, boundaries, *, out_int32=False, right=False, out=None) → Tensor`。在pytorch中,介绍为: +``` +Returns the indices of the buckets to which each value in the input belongs, where the boundaries of the buckets are set by boundaries. Return a new tensor with the same size as input. If right is False (default), then the left boundary is closed. +``` + +### 实现方法 +在实现方法上,Pytorch的整体逻辑与Numpy基本一致,[代码位置](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/Bucketization.cpp)。其中核心代码为: +```c++ +Tensor& bucketize_out_cpu(const Tensor& self, const Tensor& boundaries, bool out_int32, bool right, Tensor& result) { + TORCH_CHECK(boundaries.dim() == 1, "boundaries tensor must be 1 dimension, but got dim(", boundaries.dim(), ")"); + at::native::searchsorted_out_cpu(boundaries, self, out_int32, right, nullopt, nullopt, result); + return result; +} +``` +整体逻辑为: +- 检查输入参数`boundaries`。 +- 然后直接利用`searchsorted_out_cpu`返回结果。 + +## Tensorflow +Tensorflow`tft.bucketize( + x: common_types.ConsistentTensorType, + num_buckets: int, + epsilon: Optional[float] = None, + weights: Optional[tf.Tensor] = None, + elementwise: bool = False, + name: Optional[str] = None +) -> common_types.ConsistentTensorType`。在Tensorflow中,介绍为: +Returns a bucketized column, with a bucket index assigned to each input. + +### 实现方法 +在实现方法上,Tensorflow的API参数设计于Numpy和Pytorch都不大相同,[代码位置](https://github.com/tensorflow/transform/blob/v1.9.0/tensorflow_transform/mappers.py#L1690-L1770)。这里就不具体分析其核心代码了,因为和我们想要实现的功能有很大的差距。 + + +# 四、对比分析 +- 使用场景与功能:Pytorch会比Numpy更贴和我们想要实现的功能,因为Pytorch也是仅针对1-D Tensor,而Numpy支持多维。 + +# 五、方案设计 +## 命名与参数设计 +API设计为`paddle.bucketize(x, sorted_sequence, out_int32=False, right=False, name=None)` +命名与参数顺序为:形参名`input`->`x`, 与paddle其他API保持一致性,不影响实际功能使用。 +参数类型中,`x`为N-D Tensor,`sorted_sequence`为1-D Tensor。 + +## 底层OP设计 +使用已有API组合实现,不再单独设计OP。 + +## API实现方案 +主要按下列步骤进行实现,实现位置为`paddle/tensor/math.py`与`searchsorted`方法放在一起: +1. 使用`len(sorted_sequence)`检验参数`sorted_sequence`的维度。 +2. 使用`paddle.searchsorted`得到输入的桶索引。 + + +# 六、测试和验收的考量 +测试考虑的case如下: + +- 和pytorch结果的数值的一致性, `paddle.bucketize`,和`torch.bucketize`结果是否一致; +- 参数`right`为True和False时输出的正确性; +- `out_int32`为True和False时输出dtype正确性; +- 未输入`right`时的输出正确性; +- 未输入`out_int32`时的输出正确性; +- 错误检查:输入`x`不是Tensor时,能否正确抛出错误; +- 错误检查:`axis`所指维度在当前Tensor中不合法时能正确抛出错误。 + +# 七、可行性分析及规划排期 + +方案主要依赖现有paddle api组合而成,且依赖的`paddle.searchsorted`已经在 Paddle repo 的 python/paddle/tensor/search.py [目录中](https://github.com/PaddlePaddle/Paddle/blob/release/2.3/python/paddle/tensor/search.py#L910)。工期上可以满足在当前版本周期内开发完成。 + +# 八、影响面 +为独立新增API,对其他模块没有影响 + +# 名词解释 +无 +# 附件及参考资料 +无 + From 0c5b1c693e2a9e9ce9501f949463723ae0a50484 Mon Sep 17 00:00:00 2001 From: Li-fAngyU <56572498+Li-fAngyU@users.noreply.github.com> Date: Mon, 11 Jul 2022 18:39:55 +0800 Subject: [PATCH 2/4] Update 20220708_api_design_for_bucketize.md --- rfcs/APIs/20220708_api_design_for_bucketize.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/APIs/20220708_api_design_for_bucketize.md b/rfcs/APIs/20220708_api_design_for_bucketize.md index 11c8a5e94..6539d140c 100644 --- a/rfcs/APIs/20220708_api_design_for_bucketize.md +++ b/rfcs/APIs/20220708_api_design_for_bucketize.md @@ -110,7 +110,7 @@ API设计为`paddle.bucketize(x, sorted_sequence, out_int32=False, right=False, # 六、测试和验收的考量 测试考虑的case如下: -- 和pytorch结果的数值的一致性, `paddle.bucketize`,和`torch.bucketize`结果是否一致; +- 和numpy结果的数值的一致性, `paddle.bucketize`,和`numpy.searchsorted`结果是否一致; - 参数`right`为True和False时输出的正确性; - `out_int32`为True和False时输出dtype正确性; - 未输入`right`时的输出正确性; From 9ff77c89407786d7094cf9865278aba5f6fcd650 Mon Sep 17 00:00:00 2001 From: Li-fAngyU <56572498+Li-fAngyU@users.noreply.github.com> Date: Thu, 14 Jul 2022 23:09:54 +0800 Subject: [PATCH 3/4] Update 20220708_api_design_for_bucketize.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 更新测试案例说明 --- rfcs/APIs/20220708_api_design_for_bucketize.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/APIs/20220708_api_design_for_bucketize.md b/rfcs/APIs/20220708_api_design_for_bucketize.md index 6539d140c..75b72b1dd 100644 --- a/rfcs/APIs/20220708_api_design_for_bucketize.md +++ b/rfcs/APIs/20220708_api_design_for_bucketize.md @@ -116,7 +116,7 @@ API设计为`paddle.bucketize(x, sorted_sequence, out_int32=False, right=False, - 未输入`right`时的输出正确性; - 未输入`out_int32`时的输出正确性; - 错误检查:输入`x`不是Tensor时,能否正确抛出错误; -- 错误检查:`axis`所指维度在当前Tensor中不合法时能正确抛出错误。 +- 错误检查:输入`sorted_sequence`不是一维张量时,能否正确抛出错误; # 七、可行性分析及规划排期 From dc76ea2a8c3b6030a3c3b429b2498958caea8789 Mon Sep 17 00:00:00 2001 From: Li-fAngyU <56572498+Li-fAngyU@users.noreply.github.com> Date: Thu, 14 Jul 2022 23:14:02 +0800 Subject: [PATCH 4/4] update test example. --- rfcs/APIs/20220708_api_design_for_bucketize.md | 1 + 1 file changed, 1 insertion(+) diff --git a/rfcs/APIs/20220708_api_design_for_bucketize.md b/rfcs/APIs/20220708_api_design_for_bucketize.md index 75b72b1dd..b73d855f3 100644 --- a/rfcs/APIs/20220708_api_design_for_bucketize.md +++ b/rfcs/APIs/20220708_api_design_for_bucketize.md @@ -117,6 +117,7 @@ API设计为`paddle.bucketize(x, sorted_sequence, out_int32=False, right=False, - 未输入`out_int32`时的输出正确性; - 错误检查:输入`x`不是Tensor时,能否正确抛出错误; - 错误检查:输入`sorted_sequence`不是一维张量时,能否正确抛出错误; +- 错误检查:未输入`x`和`sorted_sequence`时,能否正确抛出错误; # 七、可行性分析及规划排期