
[Gelu x86] Finish intrinsic with elempack merged(fast version) #4144

Merged: 6 commits, Sep 18, 2022

@LRY89757 LRY89757 commented Aug 15, 2022

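  • Implemented the x86 optimization of Gelu.
  • Only the fast gelu version is used, i.e. the approximate version of erfc (the formula image is not preserved).
  • Added sse/avx/avx512 mathfun implementations of tanh.
  • Added test samples to improve coverage.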
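
For reference, the exact GELU and the tanh-based approximation commonly called fast GELU are usually written as below. This is an assumption about what the image in the comment showed, not a copy of it.

```latex
\mathrm{GELU}(x) = \tfrac{1}{2}\,x\,\operatorname{erfc}\!\left(-\tfrac{x}{\sqrt{2}}\right)
\qquad
\mathrm{GELU}_{\mathrm{fast}}(x) = \tfrac{1}{2}\,x\left(1 + \tanh\!\left(\sqrt{2/\pi}\,\bigl(x + 0.044715\,x^{3}\bigr)\right)\right)
```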

codecov-commenter commented Aug 15, 2022

Codecov Report

Merging #4144 (6e9cf57) into master (9f59711) will increase coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4144      +/-   ##
==========================================
+ Coverage   94.41%   94.44%   +0.03%     
==========================================
  Files         749      750       +1     
  Lines      179061   179180     +119     
==========================================
+ Hits       169053   169222     +169     
+ Misses      10008     9958      -50     
| Impacted Files | Coverage Δ |
|---|---|
| src/layer/x86/avx512_mathfun.h | 100.00% <100.00%> (ø) |
| src/layer/x86/avx_mathfun.h | 100.00% <100.00%> (ø) |
| src/layer/x86/gelu_x86.cpp | 100.00% <100.00%> (ø) |
| src/layer/x86/sse_mathfun.h | 100.00% <100.00%> (ø) |
| src/layer/riscv/convolution1d_riscv.cpp | 99.00% <0.00%> (+0.24%) ⬆️ |
| src/layer/riscv/convolution_3x3_packn_fp16s.h | 99.48% <0.00%> (+0.51%) ⬆️ |
| src/layer/riscv/convolution_3x3_pack1ton_fp16s.h | 100.00% <0.00%> (+10.85%) ⬆️ |

@LRY89757 LRY89757 changed the title [Gelu x86] Finish intrinsic with elempack merged(fast version) [WIP][Gelu x86] Finish intrinsic with elempack merged(fast version) Aug 16, 2022

LRY89757 commented Aug 16, 2022

  1. A question about the macros in mathfun.h: why are the constants defined through macros like these?
/* declare some AVX constants -- why can't I figure a better way to do that? */
#define _PS512_CONST(Name, Val) \
    static const ALIGN64_BEG float _ps512_##Name[16] ALIGN64_END = {Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val}
#define _PI32_CONST512(Name, Val) \
    static const ALIGN64_BEG int _pi32_512_##Name[16] ALIGN64_END = {Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val}
#define _PS512_CONST_TYPE(Name, Type, Val) \
    static const ALIGN64_BEG Type _ps512_##Name[16] ALIGN64_END = {Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val}

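Wouldn't it be enough to call functions such as _mm_set1_ps directly? Is there a deeper reason behind this?

  2. The tanh implementation here temporarily borrows exp, which is a very naive approach; a dedicated x86 SIMD optimization is in progress [WIP] (now implemented)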
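
As a rough illustration of the second item, a tanh that simply borrows exp can be sketched in scalar C++ with the identity tanh(x) = 1 - 2 / (exp(2x) + 1). The function name `tanh_via_exp` is hypothetical, and the PR's actual SIMD code is not shown in this thread.

```cpp
#include <cmath>

// Hypothetical scalar sketch: tanh built on top of exp, using
// tanh(x) = 1 - 2 / (exp(2x) + 1).
// For large positive x, exp(2x) overflows to +inf and the result is 1;
// for large negative x, exp(2x) underflows to 0 and the result is -1.
static inline float tanh_via_exp(float x)
{
    const float e2x = std::exp(2.0f * x);
    return 1.0f - 2.0f / (e2x + 1.0f);
}
```

A vectorized variant would apply the same identity lane by lane with a vector exp routine. Note that this form loses relative precision near zero because of cancellation, which is one reason a dedicated implementation can be preferable.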


LRY89757 commented Aug 16, 2022

Implemented the fast x86 SIMD version of tanh.

@LRY89757 LRY89757 closed this Aug 16, 2022
@LRY89757 LRY89757 reopened this Aug 16, 2022
@LRY89757 LRY89757 changed the title [WIP][Gelu x86] Finish intrinsic with elempack merged(fast version) [Gelu x86] Finish intrinsic with elempack merged(fast version) Aug 16, 2022

nihui commented Sep 17, 2022

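On the macro question: you would probably have to ask the original author. The current form is most likely the outcome of a battle of wits with the compiler.

My guesses:

  • Register types such as __m256 cannot be written as global static variables, so the end result is that the compiler places these values in the global static data area and loads them at run time.
  • The compiler might store the data unaligned, so the author simply wrote the constants as arrays and forced their alignment, which makes the run-time loads more efficient.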
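
To make the two options concrete, here is a minimal SSE-width sketch of the pattern under discussion: a constant stored as an aligned static array and loaded at run time, next to the `_mm_set1_ps` alternative. The macro `MY_PS_CONST` and both function names are hypothetical, and the sketch uses C++11 `alignas` rather than the `ALIGN64_BEG`/`ALIGN64_END` macros from mathfun.h. Which form generates better code depends on the compiler.

```cpp
#include <xmmintrin.h> // SSE intrinsics

// Hypothetical 128-bit analogue of the _PS512_CONST pattern: the value is
// replicated into a 16-byte-aligned static array.
#define MY_PS_CONST(Name, Val) \
    alignas(16) static const float my_ps_##Name[4] = {Val, Val, Val, Val}

MY_PS_CONST(half, 0.5f);

// Option A: aligned load from the static array.
static inline __m128 scale_by_half_array(__m128 v)
{
    return _mm_mul_ps(v, _mm_load_ps(my_ps_half));
}

// Option B: broadcast the scalar at the point of use.
static inline __m128 scale_by_half_set1(__m128 v)
{
    return _mm_mul_ps(v, _mm_set1_ps(0.5f));
}
```

Both functions compute the same result; they differ only in how the constant reaches a register.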

@LRY89757

Got it, thanks for the guidance. I will improve the code as soon as possible.


LRY89757 commented Sep 17, 2022

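I have fixed the code formatting as suggested in the review, and also added a create_pipeline function to handle the fallback.

A follow-up question: my x86 SIMD code so far uses fast_gelu for all computation, and the tests report no errors. Does that mean there is no need to use erfc for inference, and that we can simply use the fast_gelu version everywhere, since the two produce almost identical results?
If necessary, I will add a SIMD version of erfc.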


nihui commented Sep 17, 2022

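Not necessary. Having erfc in the naive implementation is enough; it is only there as a reference.
Currently, every gelu that pnnx converts to ncnn is emitted with fast_gelu=1.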
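
The observation that the two variants agree closely can be spot-checked with a small scalar sketch. It assumes the usual definitions, exact GELU as 0.5 * x * erfc(-x / sqrt(2)) and fast GELU as the tanh approximation. The function names are hypothetical and this is not the ncnn implementation.

```cpp
#include <algorithm>
#include <cmath>

// Exact GELU written with erfc: 0.5 * x * erfc(-x / sqrt(2)).
static inline float gelu_erfc(float x)
{
    return 0.5f * x * std::erfc(-0.70710678f * x);
}

// Tanh-based approximation, often called fast GELU.
// 0.79788456f is sqrt(2 / pi) rounded to float.
static inline float gelu_fast(float x)
{
    return 0.5f * x * (1.0f + std::tanh(0.79788456f * (x + 0.044715f * x * x * x)));
}

// Largest absolute difference between the two on a uniform grid over [lo, hi].
static inline float max_abs_diff(float lo, float hi, int steps)
{
    float worst = 0.0f;
    for (int i = 0; i <= steps; i++)
    {
        const float x = lo + (hi - lo) * i / steps;
        worst = std::max(worst, std::fabs(gelu_erfc(x) - gelu_fast(x)));
    }
    return worst;
}
```

Calling `max_abs_diff(-6.0f, 6.0f, 1200)` gives a quick estimate of the worst-case gap over a typical activation range, which can then be compared against the tolerance used by the layer tests.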

@LRY89757 LRY89757 closed this Sep 17, 2022
@LRY89757 LRY89757 reopened this Sep 17, 2022
@nihui nihui merged commit 5eb56b2 into Tencent:master Sep 18, 2022

nihui commented Sep 18, 2022

Thanks for your contribution!

csukuangfj added a commit to csukuangfj/ncnn that referenced this pull request Dec 1, 2022
* remove duplicated newline (Tencent#4187)

* remove duplicated newline (Tencent#4188)

* optmize softmax arm neon (Tencent#4171)

* [docs] Fix typo (Tencent#4201)

* [Prelu x86] Finish intrinsic with elempack merged (Tencent#4177)

* changed size of images for pretty formatting of page (Tencent#4193)

* [Gelu x86] Finish intrinsic with elempack merged(fast version) (Tencent#4144)

* Finish the gelu x86 intrinsics
* Finish the fast tanh x86 simd impl

* Ignore .xmake directory (Tencent#4212)

* Bump pypa/cibuildwheel from 2.9.0 to 2.10.1 (Tencent#4207)

Bumps [pypa/cibuildwheel](https://github.com/pypa/cibuildwheel) from 2.9.0 to 2.10.1.
- [Release notes](https://github.com/pypa/cibuildwheel/releases)
- [Changelog](https://github.com/pypa/cibuildwheel/blob/main/docs/changelog.md)
- [Commits](pypa/cibuildwheel@v2.9.0...v2.10.1)

---
updated-dependencies:
- dependency-name: pypa/cibuildwheel
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* style: space alignment (Tencent#4217)

* Ignore CMakeSettings.json, the Visual Studio CMake schema file (Tencent#4228)

* RVV: use new interface for segment load/store & change word_type to size_t&add clang ci (part Tencent#4100) (Tencent#4118)

* RVV: use size_t for vl

* RVV: replace vsseg.v tuple type by using regex

-----

search:
vsseg([1-9])e(8|16|32)_v_(f|i|u)\2m(1|2|4|8)x\1\(([ -~]+), vcreate_\3\2m\4x\1\(([ -~]+)\), vl\);

substitute by:
vsseg$1e$2_v_$3$2m$4($5, $6, vl);

* RVV: replace vssseg.v tuple types by using regex

---

search:
vssseg([1-9])e(8|16|32)_v_f\2m1x\1\(([ -~]+), vcreate_f\2m1x\1\(([ -~]+)\), vl\);

substitute by:
vssseg$1e$2_v_f$2m1($3, $4, vl);

* RVV: replace vlseg.v tuple types in load/store

* RVV: replace vloxseg2ei32.v tuple types

* RVV: add a wrapper for old compilers

* RVV: add segment load/store wrapper in pakcing

* RVV: fix cmake test

* RVV: make clang happy by dropping VLAs in sgemm

* RVV: add clang cmake toolchain configure

* RVV: add clang ci, riscv64-unknown-linux-gnu

Co-authored-by: thelastlin <[email protected]>
Co-authored-by: nihui <[email protected]>

* Bump pypa/cibuildwheel from 2.10.1 to 2.10.2 (Tencent#4220)

Bumps [pypa/cibuildwheel](https://github.com/pypa/cibuildwheel) from 2.10.1 to 2.10.2.
- [Release notes](https://github.com/pypa/cibuildwheel/releases)
- [Changelog](https://github.com/pypa/cibuildwheel/blob/main/docs/changelog.md)
- [Commits](pypa/cibuildwheel@v2.10.1...v2.10.2)

---
updated-dependencies:
- dependency-name: pypa/cibuildwheel
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* add c906 build ci (Tencent#4232)

* Add benchmark result of T-Head TH1520 (Tencent#4240)

`cpuinfo`: 

```
isa             : rv64imafdcvsu
mmu             : sv39
cpu-freq                : 1.848Ghz
cpu-icache              : 64KB
cpu-dcache              : 64KB
cpu-l2cache             : 1MB
cpu-tlb         : 1024 4-ways
cpu-cacheline           : 64Bytes
cpu-vector              : 0.7.1
```

Compiled with `-DCMAKE_TOOLCHAIN_FILE=../toolchains/c910-v240.toolchain.cmake -DCMAKE_BUILD_TYPE=release -DNCNN_OPENMP=OFF -DNCNN_THREADS=OFF -DNCNN_RUNTIME_CPU=OFF -DNCNN_RVV=ON -DNCNN_SIMPLEOCV=ON -DNCNN_BUILD_EXAMPLES=ON` 

Seems much worse than expected 🤔

* fix param parsing issue when layer/blob name exceeds 255 (Tencent#4236)

* fix param parsing issue when layer/blob name exceeds 255

* apply code-format changes

Co-authored-by: ZhangGe6 <[email protected]>

* Memory Pool Improvement For Variadic Sized Inputs (Tencent#4190)

* Simple miss count for better space efficiency

* Simple double ended greedy;

* Add size drop threshold setter;

* set workspace allocator cr to zero as we had some sort of recylcing capability :P

Co-authored-by: LinHeLurking <[email protected]>
Co-authored-by: nihuini <[email protected]>

* docs: disable fp16 when wrong results encountered caused by overflow (Tencent#4248)

* pnnx math operation (Tencent#4251)

* more stricter armv7 fp16 and armv84 bf16 compiler check, fix Tencent#4147 fix Tencent#4222 (Tencent#4247)

* modified the param axes of expanddims in modelwriter (Tencent#4259)

* Add TH1520 (4*C910V) toolchain support.  (Tencent#4267)

* implement lstm proj_size (Tencent#4263)

* Optimize x86 DeformableConv2D (Tencent#4128)

* fix compile warning with gcc 9.1.0 including simplestl.h file (Tencent#4274)

* fix compile warning with gcc 9.1.0 including simplestl.h file

* apply code-format changes

Co-authored-by: veahow <[email protected]>

* add benchmark for rk3588 on rock5b (Tencent#4275)

* linux-x64-cpu-gcc on tencent ci

* implement layer feature disabled bit (Tencent#4278)

* add elu vulkan operator (Tencent#4280)

* fix tencent ci (Tencent#4277)

* implement GLU and pnnx conversion (Tencent#4283)

* Bump pypa/cibuildwheel from 2.10.2 to 2.11.1 (Tencent#4271)

Bumps [pypa/cibuildwheel](https://github.com/pypa/cibuildwheel) from 2.10.2 to 2.11.1.
- [Release notes](https://github.com/pypa/cibuildwheel/releases)
- [Changelog](https://github.com/pypa/cibuildwheel/blob/main/docs/changelog.md)
- [Commits](pypa/cibuildwheel@v2.10.2...v2.11.1)

---
updated-dependencies:
- dependency-name: pypa/cibuildwheel
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* fix pnnx softmax/normalize/slice negative axis conversion to ncnn (Tencent#4284)

* pnnx glu batchindex aware conversion (Tencent#4285)

* 1. Fix typo in readme (Tencent#4287)

* x86 sse2/avx2 optimization for convolution sgemm/winograd int8 family (Tencent#4286)

* pnnx skip dynamic size evaluation (Tencent#4291)

* Fix linux build error(Tencent#4265) (Tencent#4294)

Co-authored-by: wangyu <[email protected]>

* general cpu feature detection on macos/ios, enable bf16 and i8mm on a15 a16 and m2 (Tencent#4300)

* x86 unified fc fp32/fp16s (Tencent#4303)

* more fma
* more transpose utility function

* Bump pypa/cibuildwheel from 2.11.1 to 2.11.2 (Tencent#4308)

Bumps [pypa/cibuildwheel](https://github.com/pypa/cibuildwheel) from 2.11.1 to 2.11.2.
- [Release notes](https://github.com/pypa/cibuildwheel/releases)
- [Changelog](https://github.com/pypa/cibuildwheel/blob/main/docs/changelog.md)
- [Commits](pypa/cibuildwheel@v2.11.1...v2.11.2)

---
updated-dependencies:
- dependency-name: pypa/cibuildwheel
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* pnnx pytorch 1.13 (Tencent#4314)

* fix Tencent#4315 (Tencent#4316)

* get_physical_cpu_count api family (Tencent#4302)

* get_physical_cpu_count api family

* set default to physical big cpu

* always treat smt core as big core

* is_smt_cpu

* get max freq mhz on windows

* windows thread affinity

* groupnorm 1d/2d/4d (Tencent#4312)

* fix slice end index, fix fp16 model weight alignment (Tencent#4317)

* tencent ci test-coverage pnnx (Tencent#4305)

* RVV: BatchNorm with fp16s(a) support (Tencent#4075)

* RVV: InstanceNorm with fp16s(a) support (Tencent#4078)

* fix ci pnnx build

* fold new_full and full_like (Tencent#4323)

* pnnx convert nn.Softmax2d (Tencent#4324)

* pnnx convert fold unfold (Tencent#4325)

* support yolov5 6.2 (Tencent#4328)

* implement ncnn fold and unfold (Tencent#4326)

* pnnx load gpu torchscript and reset device (Tencent#4330)

* fix:pnnx-softmax (Tencent#4333)

* pnnx save onnx zero (Tencent#4077)

* save foldable constants in file for reducing memory usage (Tencent#4337)

* match inplace slice copy pattern, rewrite copy uses (Tencent#4338)

* add vector optimization for loongarch64 (Tencent#4242)

* ci loongarch64 lsx (Tencent#4344)

* gridsample op support (Tencent#4288)



Co-authored-by: LRY89757 <[email protected]>
Co-authored-by: nihuini <[email protected]>
Co-authored-by: nihui <[email protected]>

* squeeze and expanddims 4d (Tencent#4346)

* implement MultiheadAttention kdim vdim (Tencent#4347)

* pnnx convert torch bitwise left_shift right_shift (Tencent#4349)

* pnnx fp16 option for ncnn and onnx weight type (Tencent#4350)

* pnnx fuse more function to module (Tencent#4351)

* pnnx fuse more function to module

* rename some pass name

* fuse adjacent reshape, fuse pad conv2d

* fuse pad conv1d

* split tests (Tencent#4354)

* Support mat.numpy() in Python (Tencent#4356)

* Fix typo in stb_image.h (Tencent#4358)

exitting -> exiting

* Fix windows-arm64 build for non-neon case (Tencent#4227)

* update release ci (Tencent#4359)

* update release ci

* find modern glslang

* parallel jobs on windows

* Fix c api allocator (Tencent#4360)

* add some c_api interfaces related to allocator setup.

* fix errors in allocator parameters in c_api.

* test c api allocator

Co-authored-by: zhangtongshe <[email protected]>

* update glslang (Tencent#4361)

* disable out-of-line atomics since ndk23+ for resolving linking issue with old ndk (Tencent#4362)

* I added one more project to the list of examples. (Tencent#4205)

* Dedicated to coloring black and white photographs.

* add example project link (Tencent#4365)

* fix(pybind11): build error (Tencent#4368)

* fix openmp affinity abort when cpu goes offline (Tencent#4370)

* Update release-python.yml

* small fixes

* unpack list input

* Remove LSTM2

* fix LSTM

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Molly Sophia <[email protected]>
Co-authored-by: Menci <[email protected]>
Co-authored-by: luqiang guo <[email protected]>
Co-authored-by: Lry89757 <[email protected]>
Co-authored-by: magicse <[email protected]>
Co-authored-by: Zhuo Zhang <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: 汤圆奶昔 <[email protected]>
Co-authored-by: Xavier Hsinyuan <[email protected]>
Co-authored-by: thelastlin <[email protected]>
Co-authored-by: nihui <[email protected]>
Co-authored-by: 柚木鉉 <[email protected]>
Co-authored-by: Zhang Ge <[email protected]>
Co-authored-by: ZhangGe6 <[email protected]>
Co-authored-by: LinHe <[email protected]>
Co-authored-by: LinHeLurking <[email protected]>
Co-authored-by: nihuini <[email protected]>
Co-authored-by: MisakaBit <[email protected]>
Co-authored-by: LiuYi-Up <[email protected]>
Co-authored-by: 陸 言 <[email protected]>
Co-authored-by: miemie2013 <[email protected]>
Co-authored-by: Eahow Chen <[email protected]>
Co-authored-by: veahow <[email protected]>
Co-authored-by: li mengyang <[email protected]>
Co-authored-by: Yoh <[email protected]>
Co-authored-by: Caize Wu <[email protected]>
Co-authored-by: bestpower <[email protected]>
Co-authored-by: wangyu <[email protected]>
Co-authored-by: shaoshengsong <[email protected]>
Co-authored-by: WuJinxuan <[email protected]>
Co-authored-by: junchao-loongson <[email protected]>
Co-authored-by: LRY89757 <[email protected]>
Co-authored-by: Ikko Ashimine <[email protected]>
Co-authored-by: zhangtongshe <[email protected]>
Co-authored-by: tpoisonooo <[email protected]>