[Type Hints] Introduce a Tensor stub file in Paddle #63953

Merged
merged 25 commits into from
May 23, 2024

Conversation

megemini
Contributor

@megemini megemini commented Apr 28, 2024

PR Category

User Experience

PR Types

Others

Description

Related PR: #63597

Task: 1-2

Generate tensor.pyi based on #50211.

Files involved:

  • .pre-commit-config.yaml: pre-commit does not check tensor.pyi
  • python/CMakeLists.txt: run gen_tensor_stub.py after the build
  • python/paddle/__init__.py: import Tensor from the tensor directory when type checking
  • python/paddle/py.typed: marker file
  • python/setup.py.in: add py.typed and tensor.pyi
  • setup.py: add py.typed and tensor.pyi
  • tools/gen_tensor_stub.py: generates tensor.pyi
  • python/paddle/tensor/tensor.pyi.in: template for the stub file

Temporary file:

  • python/paddle/tensor/tensor.pyi: the generated stub file, provided for reference

As of 20240428, a few questions remain open:

  • Where is the right place to add the gen_tensor_stub.py command at build time? python/CMakeLists.txt? Is there any plan for this?
  • Do both python/setup.py.in and setup.py need to add py.typed and tensor.pyi?
  • gen_tensor_stub.py currently just reuses the code from "[Don't merge][Type Hints] add tensor.pyi generator (at runtime)" #50211 and still needs optimization. Now: the stub file is generated from the tensor.pyi.in template.


paddle-bot bot commented Apr 28, 2024

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@megemini megemini changed the title from "[WIP][Type Hints] Introduce the Tensor stub and Paddle/python/paddle/py.typed files in Paddle" to "[Type Hints] Introduce the Tensor stub and Paddle/python/paddle/py.typed files in Paddle" May 5, 2024
@megemini
Contributor Author

megemini commented May 5, 2024

Update 20240505

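gen_tensor_stub.py now generates tensor.pyi from the template tensor.pyi.in.

This follows the approach in #50211, with these main differences:

  • Use a template instead of generating the stub file directly.

    Main logic: insert each API (signature & docstring) into the template, or only the API's docstring (for APIs that have a signature but no docstring in the template). If a name returned by get_tensor_members matches one in the template, the template's signature and docstring take precedence.

    Main reason: generating the stub purely from code is hard in places. For example, some interfaces defined in C++ expose only a bare method name, so adding types or a docstring is cumbersome. Some methods also need overloads (such as __init__), which is simpler with a template.

    Going forward, tensor.pyi.in can be improved incrementally: anything that is inconvenient to change in C++ can simply be added to the template, which takes priority over get_tensor_members.

    Also, tensor.pyi.in is still incomplete; some API signatures have not been updated yet and will need to be filled in gradually.

  • Members for which inspect.isdatadescriptor(member) is true are treated as @property rather than as attributes, mainly because most of them have docstrings (e.g. data) that would be lost if they were emitted as attributes.

  • The signature of Tensor becomes class Tensor(Generic[_ShapeType, _DType]), adding two type parameters for shape and dtype, so that e.g. def __eq__(self, y: Tensor) -> Tensor[Any, bool] can be written, and it is easy to extend later.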
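The data-descriptor rule from the list above can be illustrated with a small sketch. `FakeTensor` and `stub_lines_for` are hypothetical names and are not taken from gen_tensor_stub.py; the sketch only assumes that members are discovered with `inspect.getmembers` and shows why emitting data descriptors as `@property` keeps their docstrings available.

```python
import inspect


class FakeTensor:
    """Hypothetical stand-in for paddle.Tensor, used only for illustration."""

    @property
    def data(self):
        """The underlying data of this tensor."""
        return self

    def abs(self):
        """Return the absolute value."""
        return self


def stub_lines_for(cls):
    """Render public members as stub text; data descriptors become @property."""
    lines = []
    for name, member in inspect.getmembers(cls):
        if name.startswith("_"):
            continue  # private and special members are skipped in this sketch
        doc = inspect.getdoc(member) or ""
        if inspect.isdatadescriptor(member):
            # A property is a data descriptor, so its docstring survives here.
            lines += ["@property", f"def {name}(self): ...  # {doc}"]
        elif inspect.isfunction(member):
            lines.append(f"def {name}{inspect.signature(member)}: ...  # {doc}")
    return lines


print("\n".join(stub_lines_for(FakeTensor)))
```

Emitting `data` as a plain attribute (`data: Any`) would leave nowhere to attach the docstring, which is the loss the bullet describes.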

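Also, tensor.py is not modified. I originally wanted to move Tensor = framework.core.eager.Tensor from python/paddle/__init__.py into tensor.py, but some places in the current source use paddle.Tensor as a type annotation, and moving it into tensor.py would cause a circular import. To be safe, it is handled as follows: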

if typing.TYPE_CHECKING:
    from .tensor.tensor import Tensor
else:
    Tensor = framework.core.eager.Tensor
    Tensor.__qualname__ = 'Tensor'

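python/paddle/tensor/tensor.pyi is the generated stub file, provided for reference.

paddle.Tensor currently has 550 members, of which:

  • 478 methods added
  • 10 aliases added
  • 1 alias added: __qualname__ = "Tensor"
  • 11 is_inherited_member members not added
  • 48 private members not added
  • 3 special members not added: __array_ufunc__, __module__, __new__

That makes 551 in total (excluding overloads).

@SigureMo please review.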


paddle-ci-bot bot commented May 14, 2024

Sorry to inform you that 770c25b's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

@SigureMo
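Member

> Use a template instead of generating the stub file directly

The one problem with a template is that it is not covered by tools such as Ruff, so its code style cannot be guaranteed.

> The signature of Tensor becomes class Tensor(Generic[_ShapeType, _DType]), adding two type parameters for shape and dtype

My suggestion is not to add generics until the API shape has been thought through. Once they are added, changing them later is a breaking change, whereas going from non-generic to generic is a compatible change outside strict mode.

> e.g. def __eq__(self, y: Tensor) -> Tensor[Any, bool] can be written, and it is easy to extend later

Besides, this signature is not right either: where does a bool dtype come from?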

@@ -5,7 +5,8 @@ exclude: |
 paddle/fluid/framework/fleet/heter_ps/cudf/.+|
 paddle/fluid/distributed/ps/thirdparty/round_robin.h|
 python/paddle/utils/gast/.+|
-third_party/.+
+third_party/.+|
+python/paddle/tensor/tensor\.pyi|
Member

Note that this must be removed before merging; this file should not exist in the source tree.

Member

This file should already have been added in another PR.

Member

You could compare against https://github.com/cattidea/paddlepaddle-stubs/blob/main/paddle-stubs/_typing/tensor.pyi to see whether anything is missing; for example, __or__ seems to be missing.

Contributor Author

It is there:

def __or__(self, y, out=None, name=None):

It is generated automatically.

Contributor Author

I compared against what is currently in paddle's Tensor; I will compare against https://github.com/cattidea/paddlepaddle-stubs/blob/main/paddle-stubs/_typing/tensor.pyi later.

Member

The search probably stalled because there were too many results, but __ror__ should indeed be missing.

# Add docstring, attributes, methods and alias with type annotaions for `Tensor`
# if not conveniently coding in original place (like c++ source file).

from __future__ import annotations
Member

A .pyi file does not need PEP 563 (from __future__ import annotations), because it has no runtime to speak of.

Contributor Author

I treat this file as a Python file so that it is guaranteed to pass mypy, which is why I added it. If the generated pyi does not need it, it can be removed from the generated tensor.pyi. How about that?

Member

Hm? Why not remove it at the source? This file should not need a mypy check.

@megemini
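Contributor Author

> > Use a template instead of generating the stub file directly

> The one problem with a template is that it is not covered by tools such as Ruff, so its code style cannot be guaranteed.

Yes, so in principle this template file should be a Python file that passes mypy, rather than a torch-style template that mixes in ${xxx} placeholders for string substitution.

It passes mypy locally; should we consider adding it to the CI checks later?

> > The signature of Tensor becomes class Tensor(Generic[_ShapeType, _DType]), adding two type parameters for shape and dtype

> My suggestion is not to add generics until the API shape has been thought through. Once they are added, changing them later is a breaking change, whereas going from non-generic to generic is a compatible change outside strict mode.

> > e.g. def __eq__(self, y: Tensor) -> Tensor[Any, bool] can be written, and it is easy to extend later

> Besides, this signature is not right either: where does a bool dtype come from?

Originally there were no generics. They were added because the __eq__ and __ne__ methods return a Tensor of bool dtype, and writing just Tensor felt semantically vague. I have not seen anywhere else that needs generics for now, so shall we just remove them?

Also, dtype does have bool: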

In [2]: a = paddle.to_tensor(False)

In [3]: a
Out[3]: 
Tensor(shape=[], dtype=bool, place=Place(gpu:0), stop_gradient=True,
       False)

Why would it not exist?

@megemini
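Contributor Author

Some additional notes on how tensor.pyi is generated from the template tensor.pyi.in:

  • Use a regex to insert docstrings, methods, etc. after # annotation: ${xxx}
  • Use a regex to insert documentation into def xxx()... methods
  • For a method already defined in the template that lacks documentation, insert only the documentation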
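The steps above can be sketched roughly as follows. This is a minimal illustration and not the code in gen_tensor_stub.py: the helper names `insert_after_marker` and `insert_docstring`, the marker name `tensor_methods`, and the tiny template are all assumptions made for the example.

```python
import re

# Hypothetical miniature template; the real tensor.pyi.in is much larger.
TEMPLATE = '''\
class Tensor:
    # annotation: ${tensor_methods}
    def clone(self) -> Tensor: ...
'''


def insert_after_marker(template, marker, code):
    """Insert `code` on the lines following `# annotation: ${marker}`."""
    pattern = re.compile(
        r"^(?P<indent>[ \t]*)# annotation: \$\{" + re.escape(marker) + r"\}[ \t]*$",
        re.MULTILINE,
    )

    def repl(match):
        indent = match.group("indent")
        body = "\n".join(indent + line for line in code.splitlines())
        return match.group(0) + "\n" + body

    return pattern.sub(repl, template, count=1)


def insert_docstring(template, func_name, doc):
    """Turn `def name(...) -> T: ...` into a def whose body starts with a docstring."""
    pattern = re.compile(
        r"^(?P<indent>[ \t]*)(?P<sig>def " + re.escape(func_name)
        + r"\(.*\).*:)[ \t]*\.\.\.[ \t]*$",
        re.MULTILINE,
    )

    def repl(match):
        indent = match.group("indent")
        inner = indent + "    "
        return f'{indent}{match.group("sig")}\n{inner}"""{doc}"""\n{inner}...'

    return pattern.sub(repl, template, count=1)


# A generated method goes in after the marker; a method already present in the
# template only receives its documentation.
stub = insert_after_marker(TEMPLATE, "tensor_methods", "def abs(self) -> Tensor: ...")
stub = insert_docstring(stub, "clone", "Return a copy of this tensor.")
print(stub)
```

Using a replacement function rather than a replacement string avoids surprises when a docstring contains backslashes, which `re.sub` would otherwise interpret as escapes.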

@megemini
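Contributor Author

As for maintaining the tensor.pyi.in template, my view is that it needs no special maintenance, because CI will later check the typing of the APIs. If a tensor interface changes and a valid signature cannot be generated automatically, then unless tensor.pyi.in is edited to add it by hand, the CI check may fail (if there is no example code or typing test case, there is nothing to check against...).
This can be spelled out in the "Type Hints in Paddle" development documentation.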

@SigureMo
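Member

> Also, dtype does have bool

That is paddle.bool, right? But even writing paddle.bool would be wrong: a generic parameter cannot be a "value" such as paddle.bool. Some way of expressing this would need to be designed, and clearly we do not have one yet.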
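To make the value-versus-type distinction concrete, here is a small sketch. Everything in it is hypothetical: `DType`, `bool_` and the marker class `Bool` are invented for illustration, and marker types are only one conceivable design, not something this PR adopts (the generics were later removed).

```python
from typing import Any, Generic, TypeVar, get_args

_ShapeT = TypeVar("_ShapeT")
_DTypeT = TypeVar("_DTypeT")


class Tensor(Generic[_ShapeT, _DTypeT]):
    """Hypothetical generic tensor; not Paddle's actual class."""


class DType:
    """Hypothetical runtime dtype object, playing the role of paddle.bool."""

    def __init__(self, name: str) -> None:
        self.name = name


bool_ = DType("bool")  # a *value*, like paddle.bool


class Bool:
    """Hypothetical marker *type* that could stand for the bool dtype."""


# A type checker rejects Tensor[Any, bool_] because bool_ is a variable, not a
# type. A marker type such as Bool is accepted as a type argument.
def eq(x: Tensor[Any, Any], y: Tensor[Any, Any]) -> Tensor[Any, Bool]:
    ...


print(get_args(eq.__annotations__["return"]))
```

The built-in `bool` in `Tensor[Any, bool]` is accepted by a type checker, but it names Python's bool class rather than the dtype object, so it does not express the intended meaning.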

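> It passes mypy locally; should we consider adding it to the CI checks later?

If it is a valid pyi file, I suggest using the .pyi suffix, so that all the checking tools (black, Ruff, etc.) pick it up naturally.

Consider renaming it to something like tensor.prototype.pyi.

> As for maintaining the tensor.pyi.in template, my view is that it needs no special maintenance, because CI will later check the typing of the APIs

Right, I am not concerned about the typing of this file itself, because we already check that through the leaf nodes (the APIs). I am only talking about other general-purpose checking tools, and the fact that editors do not provide syntax highlighting for .in files.

For example, #63256 simply changed the suffix because this had been annoying for a long time.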

@overload
def __init__(self, *args: Any, **kwargs: Any) -> None:
"""
ref: paddle/fluid/pybind/eager.cc
Member

Written this way, won't this be treated as a docstring? Could it get in the way of users reading the documentation? For a comment you can use #.
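The difference the reviewer points at is easy to reproduce in plain Python. This is a generic illustration with hypothetical function names, not code from the PR:

```python
def with_string():
    """
    ref: paddle/fluid/pybind/eager.cc
    """


def with_comment():
    # ref: paddle/fluid/pybind/eager.cc
    ...


# A leading string literal becomes the function's __doc__; a # comment does not.
print(repr(with_string.__doc__))
print(repr(with_comment.__doc__))
```

Tools that display documentation, such as editors and `help()`, read `__doc__` or the equivalent string in a stub, which is why a reference note written as a string literal shows up to users.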

Contributor Author

I do intend it as a docstring here. Do we need it? If not, it can simply be deleted.

It does not really affect reading; in VS Code, for example, the overloads of an API can be viewed one by one.

Member

@SigureMo SigureMo May 15, 2024

In eager.cc, Tensor's __doc__ sits right above this comment:

PyDoc_STRVAR( // NOLINT
TensorDoc,
R"DOC(Tensor($self, /, value, place, persistable, zero_copy, name, stop_gradient, dims, dtype, type)
--
Tensor is the basic data structure in PaddlePaddle. There are some ways to create a Tensor:
- Use the exsiting ``data`` to create a Tensor, please refer to :ref:`api_paddle_to_tensor`.
- Create a Tensor with a specified ``shape``, please refer to :ref:`api_paddle_ones`,
:ref:`api_paddle_zeros`, :ref:`api_paddle_full`.
- Create a Tensor with the same ``shape`` and ``dtype`` as other Tensor, please refer to
:ref:`api_paddle_ones_like`, :ref:`api_paddle_zeros_like`, :ref:`api_paddle_full_like`.
)DOC");

Contributor Author

Right, the two do not conflict.

[two screenshots]

If it is not needed, just delete the one in __init__.

Member

Oh I see, so Tensor itself has one too. Then this is fine.

@megemini
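Contributor Author

> Including __ror__: these are methods that I feel "should" exist, but the stub should stay consistent with the runtime here, so do not add the ones that do not exist.

OK.

> tensor.pyi.in does need to be checked; otherwise how can we guarantee that developers editing this file have not broken something?

> Have you decided how to check it? Consider reusing pre-commit. I do not recommend adding separate logic in CI: it splits off from the development workflow, problems only surface in CI after the PR is submitted, and contributors unfamiliar with it will find it hard to debug.

Yes! It has been renamed to tensor.prototype.pyi, so pre-commit can check it directly.

In addition, I made the following changes:

  • setup packages only tensor.pyi, i.e. the generated stub file, and not tensor.prototype.pyi.
  • Removed from __future__ import annotations from tensor.prototype.pyi; with pre-commit doing the checking it is indeed no longer needed.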

@megemini megemini requested a review from SigureMo May 22, 2024 13:14
# avoid same name: Tensor.slice
_Slice: TypeAlias = slice

class Tensor(Generic[_ShapeType, _DType]):
Member

The generics have not been removed here yet, right?

Contributor Author

I missed that; I will fix it.

name: std::string)
"""
...
def __add__(self, y: Tensor) -> Tensor: ...
Member

The rhs of these magic methods should support not only Tensor but also TensorLike values, e.g. x + 1.

In contrast, paddle.add supports only Tensor: paddle.add(x, y) (including x.add(y)).

Contributor Author

Yes! _typing had not been merged at that point; I will fix it.

Comment on lines 131 to 136
def __and__(
self,
y: Tensor,
out: Tensor | None = None,
name: str | None = None,
) -> Tensor: ...
Member

magic_method_func = [
('__and__', 'bitwise_and'),
('__or__', 'bitwise_or'),
('__xor__', 'bitwise_xor'),
('__invert__', 'bitwise_not'),
]

These signatures should not include out and name. They are included because the corresponding bitwise APIs are currently patched on directly, but that is not proper; I suggest not exposing the last two parameters in the signatures.
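Why the extra parameters leak into the generated signature can be seen in a small sketch. `FakeTensor` and this `bitwise_and` are hypothetical stand-ins, not Paddle's implementations; the sketch only assumes that the magic method is attached by assigning the function object directly, as the list above suggests.

```python
import inspect


def bitwise_and(x, y, out=None, name=None):
    """Hypothetical stand-in for paddle.bitwise_and."""
    return x.value & y.value


class FakeTensor:
    def __init__(self, value):
        self.value = value


# Patching the function on directly makes the magic method inherit its signature.
setattr(FakeTensor, "__and__", bitwise_and)

params = list(inspect.signature(FakeTensor.__and__).parameters)
print(params)  # ['x', 'y', 'out', 'name']
print(FakeTensor(6) & FakeTensor(3))  # 2

# The operator syntax can never pass out or name, so the stub can hide them:
#     def __and__(self, y: Tensor) -> Tensor: ...
```

A generator that copies `inspect.signature` verbatim therefore reports `out` and `name`, and overriding these four methods in the template is one way to keep them out of the stub.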

Member

I also suggest ordering these by group, for example:

  • bitwise (__lshift__, __or__, etc.)
  • comparison (__eq__, __lt__, etc.)
  • math operation
    • unary (__neg__ etc.)
    • binary (__add__, __matmul__, etc., including the reversed and in-place versions)
  • type cast (__bool__, __int__, __float__, etc.)
  • numpy specific (__array__)

@SigureMo
Member

Also, for better stub checking, consider introducing Ruff's flake8-pyi rules (as a separate task).

https://docs.astral.sh/ruff/rules/#flake8-pyi-pyi

"""
...
# rich comparison
def __eq__(self, y: _typing.Numberic) -> Tensor: ... # type: ignore[override]
Member

I had not noticed that the current Numberic includes Tensor, but Numberic should denote a type that can represent a number, which covers numpy scalars and 0-D Tensors.

TensorLike is a different type: it covers not only such 0-D values but also Tensors/ndarrays of any dimension. See:

https://github.com/cattidea/paddlepaddle-stubs/blob/03ab2c251d8239beb4a65aaf662e46aebdb00c1c/paddle-stubs/_typing/tensor.pyi#L19

I am no longer sure why list[TensorLike] | tuple[TensorLike, ...] was added there, but the last three members should be present.

Contributor Author

I was puzzled too: I could not find TensorLike and assumed it had been renamed to Numberic, but that did not seem right either...

TensorLike should be quite useful; shall we add it to _typing.basic.py?

list[TensorLike] | tuple[TensorLike, ...] is useful, e.g. for the inputs (Tensor|list(Tensor)) parameter of paddle.atleast_xd, which is unpacked inside the method. Still, I would rather not put it into TensorLike; in places like __eq__ here, the form below is what will mostly be used:

TensorLike: TypeAlias = npt.NDArray[Any] | Tensor | Numberic

Member

> TensorLike should be quite useful; shall we add it to _typing.basic.py?

Sure.

> Still, I would rather not put it into TensorLike; in places like __eq__ here, the form below is what will mostly be used:

Sure, that works.
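The distinction agreed on in this thread, operators accepting TensorLike while the named method accepts only Tensor, can be sketched with a toy class. This is an assumption-laden illustration: `Numeric` is a simplified stand-in for Numberic, the alias omits npt.NDArray[Any] so the example needs no third-party packages, and the toy `Tensor` is not Paddle's.

```python
from typing import List, Union

# Hypothetical simplification of Numberic: plain Python scalars only.
Numeric = Union[int, float, complex, bool]


class Tensor:
    """Toy 1-D tensor, used only to illustrate the signatures."""

    def __init__(self, values: List[float]) -> None:
        self.values = values

    def __add__(self, y: "TensorLike") -> "Tensor":
        # The operator accepts tensors and scalars, e.g. x + 1.
        other = y.values if isinstance(y, Tensor) else [y] * len(self.values)
        return Tensor([a + b for a, b in zip(self.values, other)])

    def add(self, y: "Tensor") -> "Tensor":
        # The named method is annotated, and here enforced, as Tensor-only.
        if not isinstance(y, Tensor):
            raise TypeError("add() expects a Tensor")
        return self + y


# The alias in the thread also includes npt.NDArray[Any]; it is left out here.
TensorLike = Union[Tensor, Numeric]

print((Tensor([1.0, 2.0]) + 1).values)
print(Tensor([1.0]).add(Tensor([2.0])).values)
```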

@property
def strides(self) -> list[int]: ...
@property
def type(self) -> _typing.DTypeLike: ...
Member

Tensor's type is presumably not a dtype.

Contributor Author

Right, but I am not sure what it should be. Tensor?

In [8]: a.type
Out[8]: <VarType.LOD_TENSOR: 7>

Member

VarType belongs to the old system and I would rather not expose it; just use Any here.

def detach(self) -> Tensor: ...
def detach_(self) -> Tensor: ...
@property
def dtype(self) -> _typing.DTypeLike: ...
Member

Return values are generally not XxxLike; here, for example, it should be the concrete paddle.dtype.

def set_string_list(self, value: str) -> None: ...
def set_vocab(self, value: dict) -> None: ...
@property
def shape(self) -> _typing.ShapeLike: ...
Member

Same as above; this should be the concrete list[int].
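The convention behind these two comments, accept a broad XxxLike on input but return one concrete type, can be sketched as follows. `ShapeLike` here is a hypothetical, simplified alias and the toy `Tensor` is not Paddle's class.

```python
from typing import List, Sequence, Union

# Hypothetical, simplified input alias: a single int or a sequence of ints.
ShapeLike = Union[Sequence[int], int]


class Tensor:
    """Toy class showing a broad parameter type and a concrete return type."""

    def __init__(self, shape: ShapeLike) -> None:
        # Inputs are normalized once, so readers never see the broad type again.
        self._shape = [shape] if isinstance(shape, int) else list(shape)

    @property
    def shape(self) -> List[int]:
        return self._shape


print(Tensor((2, 3)).shape)
print(Tensor(4).shape)
```

Annotating the property as `ShapeLike` would force every caller to handle the int and tuple cases even though the getter never produces them.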

@property
def persistable(self) -> bool: ...
@property
def place(self) -> _typing.PlaceLike: ...
Member

The type needs checking here as well. Don't these Place classes share a common base class?

def placements(self) -> list[paddle.distributed.Placement] | None: ...
@property
def process_mesh(self) -> paddle.distributed.ProcessMesh | None: ...
def rows(self) -> paddle.core.SelectedRows: ...
Member

Is this really the type? It carries no extra information on the Python side, though, so it may not be very friendly.

Contributor Author

A few of these interfaces are quite unfamiliar to me... After some digging, it looks like it should be list[int]:

In [32]:             >>> import paddle
    ...:             >>> import paddle.base as base
    ...:             >>> from paddle.base import core
    ...:             >>> paddle.enable_static()
    ...:             >>> scope = core.Scope()
    ...:             >>> block = paddle.static.default_main_program().global_block()
    ...:             >>> x_rows = [0, 5, 5, 4, 19]
    ...:             >>> height = 20
    ...:             >>> x = scope.var('X').get_selected_rows()
    ...:             >>> x.set_rows(x_rows)
    ...: 

In [33]: x
Out[33]: <paddle.base.libpaddle.SelectedRows at 0x7f319b781d70>

In [36]: x.rows()
Out[36]: [0, 5, 5, 4, 19]

In [37]: type(x.rows())
Out[37]: list

Member

> A few of these interfaces are quite unfamiliar to me...

I am not sure about these either 😂; they are basically never used.

@megemini megemini requested a review from zrr1999 as a code owner May 23, 2024 02:26
SigureMo previously approved these changes May 23, 2024
Member

@SigureMo SigureMo left a comment

LGTMeow 🐾

SigureMo previously approved these changes May 23, 2024
Member

@SigureMo SigureMo left a comment

LGTMeow 🐾

@megemini
Contributor Author

Update 20240523:

  • Added TensorLike to _typing.basic.py
  • Added Place to core.pyi
  • Updated tensor.prototype.pyi

@SigureMo SigureMo changed the title from "[Type Hints] Introduce the Tensor stub and Paddle/python/paddle/py.typed files in Paddle" to "[Type Hints] Introduce a Tensor stub file in Paddle" May 23, 2024
Contributor

@risemeup1 risemeup1 left a comment

LGTM

Contributor

@zyfncg zyfncg left a comment

LGTM for setup.py.in

Member

@SigureMo SigureMo left a comment

LGTMeow 🐾

@SigureMo SigureMo merged commit 67b30ec into PaddlePaddle:develop May 23, 2024
32 checks passed
chen2016013 pushed a commit to chen2016013/Paddle that referenced this pull request May 26, 2024
6 participants