Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Single release for PaddlePaddle CPU Image #1607

Closed
wants to merge 6 commits into from
Closed

Single release for PaddlePaddle CPU Image #1607

wants to merge 6 commits into from

Conversation

gangliao
Copy link
Contributor

@gangliao gangliao commented Mar 13, 2017

Please give me some ack.... 😄


### Background

Currently, PaddlePaddle supports AVX and SSE3 intrinsics (extensions to the x86 instruction set architecture). When using CMake to compile PaddlePaddle source code, it will check and detect the host which SIMD instruction is supported, then automatically set the legal one. Developer or user also could manually set CMake option `WITH_AVX=ON/OFF` before PaddlePaddle compilation. That's good for local usage.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里我想强调一下hosttarget。目前我们常用的local usage方式,host就是target,所以没有什么区别。如果hosttarget不同,check and detect the host which SIMD instruction is supported其实没有意义。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

正是如此。

} else if (HAS_SIMD(SIMD_SSE3)) {
sse3_stub();
}
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里接口上可以简单一些吧,比如可以定义一个SIMD的宏(或者其他方式),然后调用出简化为SIMD(stub())

Copy link
Contributor Author

@gangliao gangliao Mar 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

其实CpuID.h里面也是有宏的, 可以直接用, 函数调用和宏都支持

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/**
 * @brief   Check SIMD flags at runtime.
 *
 * 1. Check all SIMD flags at runtime:
 *
 * @code{.cpp}
 * if (HAS_AVX && HAS_AVX2) {
 *      avx2_stub();
 * }
 * @endcod
 *
 * 2. Check one SIMD flag at runtime:
 *
 * @code{.cpp}
 * if (HAS_SSE41 || HAS_SSE42) {
 *      sse4_stub();
 * }
 * @endcode
 */

| |---sse3 -- sse3_stub()
|
arm--- ...
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naive指的是纯c++的实现吧?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

恩, 顺手写了,当然可以直接放在x86目录下,不需要naive目录

| | |- gru.cc
| |- ...
|- gpu -- ...
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里我有两点不确定:

  • gpu的实现以后是不是都会移到function里面去
  • 这个目录还需要保持includesrc两个目录吗

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. gpu这一块可以先放着不动,先增加一个cpu目录开展工作,并不会冲突。
  2. 觉得还是需要include目录来存放接口,清晰规范一些,方便调用。

Copy link
Contributor Author

@gangliao gangliao Mar 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naive目录直接去掉,相应代码直接放x86目录很合理。

PaddlePaddle will crash and throw out `illegal instruction is used`. This problem will appear
frequently on cluster environment, like Kubernetes. **It must be addressed before PaddlePaddle on Cloud**

2. Once new version is ready to deliver, we have to release more products to users, for example, `no-avx-cpu`, `avx-cpu`, `no-avx-gpu`, `avx=gpu`. Users do not need to care about details. It sucks!
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

avx=gpu ==> avx-gpu

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


```c++
if (HAS_SIMD(SIMD_AVX2 | SIMD_FMA4)) {
avx2_fm4_stub();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xxx_xx_stub不太合适。

stub似乎指测试时候模拟某种行为的函数,好像是用在单元测试里的术语。

https://zh.wikipedia.org/wiki/%E6%A1%A9_(%E8%AE%A1%E7%AE%97%E6%9C%BA)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

有道理。

`avx2_fm4_stub` and `sse3_stub` could be located in different directory:

```text
------x84---naive
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naive的实现应该是和x86, arm并列的一种实现。因为ARM和X86实现某一个操作的时候,应该会调用naive的代码进行尾数处理。

譬如,目录结构可以是

---simd
        `---naive
        `---X86
                  `--- AVX
        `---ARM

另外,谨慎怀疑我们应该直接使用一些封装好的,成熟的SIMD库做这个事情。直接把Paddle内部的SIMD优化删掉,依赖其他开源项目也许更方便。

例如 Vc

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

补充回复下, Vc 不太能用。因为那个库里面没有做运行时的实现切换。

所以还是把目录结构改一下。

@@ -0,0 +1,79 @@
## Runtime Check SIMD for x86 architecture
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个design doc的标题可以修改一下,Runtime Check是其中一个工作;整个工作(或design)的目标应该是,只发布一个Paddle支持各种CPU环境。
我的理解这里至少包括三块工作,
a. Runtime Check
b. 代码修改和目录结构调整 (比如下面讲到的把naive/sse/avx的一些实现放到不同的目录里;另外,在issue #1116 里面也有相关的内容)
c. 最后是编译相关的工作;比如,如何编译一个即包含naive实现,也包含sse、avx实现的paddle(这里涉及编译以外,也涉及代码修改相关工作)。

虽然,后面的实现上也都写到了一些相关的技术细节,但建议design doc还是从整个工作包含哪些方面去讲,这样基于这个design doc可以创建一个个相关issue,比如cuda下面的那些代码怎么调整,把细节的讨论放到issue里面去(一些细节问题可以在issue里面讨论完之后再merge回这个design doc)。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


### How to implement it?

Since the current `cuda` directory includes heterogeneous source code, we want to refactor `cuda` directory as follows:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我理解这里目录结构是一个example?有几个问题:

  1. kernels是替换cuda目录?还是新增一个目录?
  2. 这里面好些文件名与当前paddle里面的文件对应不上,比如activation.cc是指的哪一个?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我的理解应该是替换,之后再issue里面讨论目录问题

Here, each directory uses the different compile options (`-mavx` or `-msse`) to generate the corresponding binaries. Then, at
runtime, `if(HAS_SIMD(__flags)` can select the supported branch (intrinsics) to execute.

The method could fix the releases and deployment problems.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

并不是所有代码都需要写naive和avx两份,有的只需要写naive一份,编译的时候通过是否加-mavx生成两份,对应到不同的CPU环境下执行。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

是的,我的理解也是这样。

Copy link
Collaborator

@reyoung reyoung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 如果其他同学再看的话。


3. [Pending] Modify CMake files.

Different simd intrinsics will be inside the different directories. we need to modified CMake files to support this solution. Each directory uses the different compile options (`-mavx` or `-msse`) to generate the corresponding binaries. Then, at runtime, using SIMD flags `HAS_AVX`, `HAS_SSE` automatically detect and select the supported branch (intrinsics) to execute.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这部分逻辑需要这么做再看一下,简单的做法#1634 (comment)


2. [Pending] Adjust `cuda` Directory.

Since the current `cuda` directory includes heterogeneous source code (cpu and gpu), we want to refactor `cuda` directory. For simplicity, different simd intrinsics will be inside the different directories. we need to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cuda目录里面的代码需要调整,但是different simd intrinsics will be inside the different directories会怎加一些sse/avx目录,这样感觉并不是很好,每个目录里面可能没有几个文件;另外,我觉得相同功能的代码放在一起比相同指令集的代码放在一起更重要。

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

有一些开源库的做法是,把一些simd intrinsic做一层封装[fftw],上层的功能都是基于这层封装开发的,毕竟大部分用intrinsic实现的功能都只是指令的不一样,而常用的指令也就是load、store、add、mul


### Conclusion

The method could fix the releases and deployment problems.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

所以,releases and deployment 时的环境不一致,从而带来的运行时的一些困惑是这个design的目的。而Single release for PaddlePaddle CPU Image 是其中一种解决方法;#1634 (comment) 是另外一种解决方法。需要再比较一下这两种方法。
另外,我的建议是,是否有必要去做Single release for PaddlePaddle CPU Image ;如果后续引入一些AVX2/AVX512的代码,当前的设计是否能够支持(3. [Pending] Modify CMake files里面只提到-mavx/-msse)?

@gangliao gangliao changed the title Runtime check SIMD design docs Single release for PaddlePaddle CPU Image Mar 20, 2017
@luotao1
Copy link
Contributor

luotao1 commented Dec 8, 2017

@gangliao @Xreki @hedaoyuan @reyoung
How about this pull request: updated or closed? Since there is an extra branch in Paddle repo:
image

@livc livc removed their request for review December 8, 2017 06:54
@luotao1 luotao1 closed this Dec 19, 2017
@luotao1 luotao1 deleted the avx_docs branch December 19, 2017 07:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants