Optimize and tidy up affine transform code. #3663

Closed
wants to merge 2 commits into from

Conversation

Sopel97
Member

@Sopel97 Sopel97 commented Aug 17, 2021

The new network initially caused some issues because of the very narrow set of neurons between the first two FC layers, and the necessary changes were hacked together to make it work. This patch is a more mature approach that makes the affine transform code faster, more readable, and easier to maintain should the layer sizes change again. The following changes were made:

  • ClippedReLU always produces a multiple of 32 outputs. This is about as good a solution for AffineTransform's SIMD requirements as it can get without a bigger rewrite.
  • All self-contained simd helpers are moved to a separate file (simd.h). Inline asm is utilized to work around GCC's issues with code generation and register assignment. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101693, https://godbolt.org/z/da76fY1n7
  • AffineTransform has 2 specializations. While it's more lines of code due to the boilerplate, the logic in both is significantly reduced, as these two are impossible to nicely combine into one.
    • The first specialization handles layers with >=128 inputs. It uses a different approach to perform the affine transform and can make full use of AVX512 without any edge cases. It also has higher theoretical throughput because fewer loads are needed in the hot path; only a fixed number of instructions is required for the horizontal additions at the end, and their cost is amortized over the large number of inputs.
    • The second specialization handles smaller layers, where performance still matters but edge cases must be handled. The AVX512 implementation for this was omitted by mistake, a remnant from the temporary implementation for the new... This could easily be reintroduced if needed. A slightly more detailed description of both implementations is in the code.
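
As a rough illustration of the points in the list above, here is a minimal scalar sketch, not the code from the patch. The names `ceil_to_multiple`, `LargeLayerThreshold`, `select_path` and `affine_reference` are hypothetical. The sketch assumes the threshold of 128 is compared against the padded input size, and that inputs and weight rows are zero-padded to a multiple of 32.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: round n up to the next multiple of base.
constexpr std::size_t ceil_to_multiple(std::size_t n, std::size_t base) {
    return (n + base - 1) / base * base;
}

// Assumption: threshold taken from the ">=128 inputs" wording above.
constexpr std::size_t LargeLayerThreshold = 128;

enum class Path { Large, Small };

// Picks which specialization a layer with InDims inputs would use
// (assumption: the comparison is made on the padded size).
template <std::size_t InDims>
constexpr Path select_path() {
    return ceil_to_multiple(InDims, 32) >= LargeLayerThreshold ? Path::Large
                                                               : Path::Small;
}

// Scalar reference for output[i] = biases[i] + sum_j weights[i][j] * input[j].
// Both the input and every weight row are assumed to be zero-padded to
// a multiple of 32 entries, so the inner loop needs no edge-case handling.
template <std::size_t InDims, std::size_t OutDims>
void affine_reference(const std::uint8_t* input, const std::int8_t* weights,
                      const std::int32_t* biases, std::int32_t* output) {
    assert(input && weights && biases && output);
    constexpr std::size_t Padded = ceil_to_multiple(InDims, 32);
    for (std::size_t i = 0; i < OutDims; ++i) {
        std::int32_t sum = biases[i];
        for (std::size_t j = 0; j < Padded; ++j)
            sum += std::int32_t(weights[i * Padded + j]) * std::int32_t(input[j]);
        output[i] = sum;
    }
}
```

Because the padding entries are zero, they contribute nothing to the sum, which is why padding the ClippedReLU output lets the SIMD loops run over whole vectors only.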

Overall it should be a minor speedup (or at least neutral) with AVX2/BMI2 targets. The biggest gains are expected with SSSE3, AVX512, VNNI256, VNNI512 targets. The VNNI targets were particularly affected by the GCC bug linked above.

Example hot path comparison with x86-64-vnni256 target (GCC 10.1 in MSYS2):
master:

XDIS 14003bd80: DATAXFER  AVX        C5FD6F09                 vmovdqa ymm1, ymmword ptr [rcx]
XDIS 14003bd84: DATAXFER  AVX        C5FD6F4120               vmovdqa ymm0, ymmword ptr [rcx+0x20]
XDIS 14003bd89: BINARY    BASE       4883C140                 add rcx, 0x40
XDIS 14003bd8d: BINARY    BASE       4883C040                 add rax, 0x40
XDIS 14003bd91: AVX512    AVX512EVEX 62F2752850583E           vpdpbusd ymm3, ymm1, ymmword ptr [rax+0x7c0]
XDIS 14003bd98: AVX512    AVX512EVEX 62E275285080C01F0000     vpdpbusd ymm16, ymm1, ymmword ptr [rax+0x1fc0]
XDIS 14003bda2: AVX512    AVX512EVEX 62E275285088C0270000     vpdpbusd ymm17, ymm1, ymmword ptr [rax+0x27c0]
XDIS 14003bdac: AVX512    AVX512EVEX 62E275285090C02F0000     vpdpbusd ymm18, ymm1, ymmword ptr [rax+0x2fc0]
XDIS 14003bdb6: AVX512    AVX512EVEX 62E275285098C0370000     vpdpbusd ymm19, ymm1, ymmword ptr [rax+0x37c0]
XDIS 14003bdc0: AVX512    AVX512EVEX 62F275285050FE           vpdpbusd ymm2, ymm1, ymmword ptr [rax-0x40]
XDIS 14003bdc7: AVX512    AVX512EVEX 62F2752850607E           vpdpbusd ymm4, ymm1, ymmword ptr [rax+0xfc0]
XDIS 14003bdce: DATAXFER  AVX        C5FD6FFB                 vmovdqa ymm7, ymm3
XDIS 14003bdd2: AVX512    AVX512EVEX 62F2752850A8C0170000     vpdpbusd ymm5, ymm1, ymmword ptr [rax+0x17c0]
XDIS 14003bddc: DATAXFER  AVX512EVEX 62317D286FD0             vmovdqa32 ymm10, ymm16
XDIS 14003bde2: AVX512    AVX512EVEX 62F27D2850783F           vpdpbusd ymm7, ymm0, ymmword ptr [rax+0x7e0]
XDIS 14003bde9: DATAXFER  AVX512EVEX 62317D286FD9             vmovdqa32 ymm11, ymm17
XDIS 14003bdef: AVX512    AVX512EVEX 62727D285090E01F0000     vpdpbusd ymm10, ymm0, ymmword ptr [rax+0x1fe0]
XDIS 14003bdf9: DATAXFER  AVX512EVEX 62317D286FE2             vmovdqa32 ymm12, ymm18
XDIS 14003bdff: AVX512    AVX512EVEX 62727D285098E0270000     vpdpbusd ymm11, ymm0, ymmword ptr [rax+0x27e0]
XDIS 14003be09: DATAXFER  AVX512EVEX 62B17D286FCB             vmovdqa32 ymm1, ymm19
XDIS 14003be0f: AVX512    AVX512EVEX 62F27D285050FF           vpdpbusd ymm2, ymm0, ymmword ptr [rax-0x20]
XDIS 14003be16: AVX512    AVX512EVEX 62F27D2850607F           vpdpbusd ymm4, ymm0, ymmword ptr [rax+0xfe0]
XDIS 14003be1d: AVX512    AVX512EVEX 62F27D2850A8E0170000     vpdpbusd ymm5, ymm0, ymmword ptr [rax+0x17e0]
XDIS 14003be27: AVX512    AVX512EVEX 62727D2850A0E02F0000     vpdpbusd ymm12, ymm0, ymmword ptr [rax+0x2fe0]
XDIS 14003be31: DATAXFER  AVX        C5FD6FDF                 vmovdqa ymm3, ymm7
XDIS 14003be35: AVX512    AVX512EVEX 62F27D285088E0370000     vpdpbusd ymm1, ymm0, ymmword ptr [rax+0x37e0]
XDIS 14003be3f: DATAXFER  AVX512EVEX 62C1FD286FC2             vmovdqa64 ymm16, ymm10
XDIS 14003be45: DATAXFER  AVX512EVEX 62C1FD286FCB             vmovdqa64 ymm17, ymm11
XDIS 14003be4b: DATAXFER  AVX        C5FD6FF2                 vmovdqa ymm6, ymm2
XDIS 14003be4f: DATAXFER  AVX        C57D6FC4                 vmovdqa ymm8, ymm4
XDIS 14003be53: DATAXFER  AVX        C57D6FCD                 vmovdqa ymm9, ymm5
XDIS 14003be57: DATAXFER  AVX512EVEX 62C1FD286FD4             vmovdqa64 ymm18, ymm12
XDIS 14003be5d: DATAXFER  AVX512EVEX 62E1FD286FD9             vmovdqa64 ymm19, ymm1
XDIS 14003be63: BINARY    BASE       4839D1                   cmp rcx, rdx
XDIS 14003be66: COND_BR   BASE       0F8514FFFFFF             jnz 0x14003bd80

patch:

XDIS 14003bc88: DATAXFER  AVX        C5FD6F01                 vmovdqa ymm0, ymmword ptr [rcx]
XDIS 14003bc8c: DATAXFER  AVX        C5FD6F4920               vmovdqa ymm1, ymmword ptr [rcx+0x20]
XDIS 14003bc91: BINARY    BASE       4883C140                 add rcx, 0x40
XDIS 14003bc95: AVX512    AVX512EVEX 62F27D285018             vpdpbusd ymm3, ymm0, ymmword ptr [rax]
XDIS 14003bc9b: AVX512    AVX512EVEX 62F27528505808           vpdpbusd ymm3, ymm1, ymmword ptr [rax+0x100]
XDIS 14003bca2: AVX512    AVX512EVEX 62E27D28505801           vpdpbusd ymm19, ymm0, ymmword ptr [rax+0x20]
XDIS 14003bca9: AVX512    AVX512EVEX 62E27528505809           vpdpbusd ymm19, ymm1, ymmword ptr [rax+0x120]
XDIS 14003bcb0: AVX512    AVX512EVEX 62F27D28506802           vpdpbusd ymm5, ymm0, ymmword ptr [rax+0x40]
XDIS 14003bcb7: AVX512    AVX512EVEX 62F2752850680A           vpdpbusd ymm5, ymm1, ymmword ptr [rax+0x140]
XDIS 14003bcbe: AVX512    AVX512EVEX 62E27D28505003           vpdpbusd ymm18, ymm0, ymmword ptr [rax+0x60]
XDIS 14003bcc5: AVX512    AVX512EVEX 62E2752850500B           vpdpbusd ymm18, ymm1, ymmword ptr [rax+0x160]
XDIS 14003bccc: AVX512    AVX512EVEX 62F27D28505004           vpdpbusd ymm2, ymm0, ymmword ptr [rax+0x80]
XDIS 14003bcd3: AVX512    AVX512EVEX 62F2752850500C           vpdpbusd ymm2, ymm1, ymmword ptr [rax+0x180]
XDIS 14003bcda: AVX512    AVX512EVEX 62E27D28504805           vpdpbusd ymm17, ymm0, ymmword ptr [rax+0xa0]
XDIS 14003bce1: AVX512    AVX512EVEX 62E2752850480D           vpdpbusd ymm17, ymm1, ymmword ptr [rax+0x1a0]
XDIS 14003bce8: AVX512    AVX512EVEX 62F27D28506006           vpdpbusd ymm4, ymm0, ymmword ptr [rax+0xc0]
XDIS 14003bcef: AVX512    AVX512EVEX 62F2752850600E           vpdpbusd ymm4, ymm1, ymmword ptr [rax+0x1c0]
XDIS 14003bcf6: AVX512    AVX512EVEX 62E27D28504007           vpdpbusd ymm16, ymm0, ymmword ptr [rax+0xe0]
XDIS 14003bcfd: AVX512    AVX512EVEX 62E2752850400F           vpdpbusd ymm16, ymm1, ymmword ptr [rax+0x1e0]
XDIS 14003bd04: BINARY    BASE       480500020000             add rax, 0x200
XDIS 14003bd0a: BINARY    BASE       4839CA                   cmp rdx, rcx
XDIS 14003bd0d: COND_BR   BASE       0F8575FFFFFF             jnz 0x14003bc88
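
For readers less familiar with VNNI, the following scalar sketch models what the patched loop above does per iteration, read directly off the listing. The function names are hypothetical, accumulator overflow is not modelled, and the sketch says nothing about which network outputs the accumulators correspond to.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar model of one 32-bit lane of vpdpbusd:
// acc += sum over 4 bytes of (unsigned input byte * signed weight byte).
inline void dpbusd_lane(std::int32_t& acc, const std::uint8_t* in,
                        const std::int8_t* w) {
    for (int k = 0; k < 4; ++k)
        acc += std::int32_t(in[k]) * std::int32_t(w[k]);
}

// Model of a 256-bit vpdpbusd: 8 independent 32-bit lanes over 32 bytes.
inline void dpbusd_256(std::int32_t acc[8], const std::uint8_t* in,
                       const std::int8_t* w) {
    for (int lane = 0; lane < 8; ++lane)
        dpbusd_lane(acc[lane], in + 4 * lane, w + 4 * lane);
}

// One iteration of the patched loop as it appears in the listing:
// two 32-byte input vectors are loaded once, and each of the 8 accumulator
// registers receives two updates, reading weights at byte offsets
// k*0x20 and k*0x20 + 0x100. The caller then advances `in` by 0x40 bytes
// and `w` by 0x200 bytes.
inline void patched_iteration(std::int32_t acc[8][8], const std::uint8_t* in,
                              const std::int8_t* w) {
    const std::uint8_t* in0 = in;
    const std::uint8_t* in1 = in + 32;
    for (int k = 0; k < 8; ++k) {
        dpbusd_256(acc[k], in0, w + 32 * k);
        dpbusd_256(acc[k], in1, w + 256 + 32 * k);
    }
}
```

In this model each accumulator is updated in place, which matches the patched listing having no register-to-register vmovdqa copies inside the loop.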

Comparison for x86-64-modern:
(here the difference is harder to spot because in the patch 2x more work is done in each loop iteration. Notably it doesn't use the shuffles, which are slow, with everything else being about the same.)
master:

XDIS 14003b4e0: DATAXFER  SSE2       660F6E3A                 movd xmm7, dword ptr [rdx]
XDIS 14003b4e4: DATAXFER  SSE2       660F6E7204               movd xmm6, dword ptr [rdx+0x4]
XDIS 14003b4e9: BINARY    BASE       4883C210                 add rdx, 0x10
XDIS 14003b4ed: SSE       SSE2       660F70CF00               pshufd xmm1, xmm7, 0x0
XDIS 14003b4f2: SSE       SSE2       660F70FE00               pshufd xmm7, xmm6, 0x0
XDIS 14003b4f7: DATAXFER  SSE2       660F6E72F8               movd xmm6, dword ptr [rdx-0x8]
XDIS 14003b4fc: DATAXFER  SSE2       660F6FD9                 movdqa xmm3, xmm1
XDIS 14003b500: DATAXFER  SSE2       66440F6FD7               movdqa xmm10, xmm7
XDIS 14003b505: SSE       SSSE3      660F380418               pmaddubsw xmm3, xmmword ptr [rax]
XDIS 14003b50a: SSE       SSE2       660F70D600               pshufd xmm2, xmm6, 0x0
XDIS 14003b50f: DATAXFER  SSE2       660F6E72FC               movd xmm6, dword ptr [rdx-0x4]
XDIS 14003b514: SSE       SSSE3      66440F38045020           pmaddubsw xmm10, xmmword ptr [rax+0x20]
XDIS 14003b51b: SSE       SSSE3      660F38044810             pmaddubsw xmm1, xmmword ptr [rax+0x10]
XDIS 14003b521: SSE       SSSE3      660F38047830             pmaddubsw xmm7, xmmword ptr [rax+0x30]
XDIS 14003b527: DATAXFER  SSE2       66440F6FC2               movdqa xmm8, xmm2
XDIS 14003b52c: SSE       SSE2       660F70F600               pshufd xmm6, xmm6, 0x0
XDIS 14003b531: SSE       SSSE3      660F38045050             pmaddubsw xmm2, xmmword ptr [rax+0x50]
XDIS 14003b537: SSE       SSE2       66410FEDDA               paddsw xmm3, xmm10
XDIS 14003b53c: SSE       SSSE3      66440F38044040           pmaddubsw xmm8, xmmword ptr [rax+0x40]
XDIS 14003b543: DATAXFER  SSE2       66440F6FCE               movdqa xmm9, xmm6
XDIS 14003b548: SSE       SSSE3      660F38047070             pmaddubsw xmm6, xmmword ptr [rax+0x70]
XDIS 14003b54e: SSE       SSSE3      66440F38044860           pmaddubsw xmm9, xmmword ptr [rax+0x60]
XDIS 14003b555: SSE       SSE2       660FEDCF                 paddsw xmm1, xmm7
XDIS 14003b559: BINARY    BASE       4883E880                 sub rax, 0xffffffffffffff80
XDIS 14003b55d: SSE       SSE2       660FEDD6                 paddsw xmm2, xmm6
XDIS 14003b561: SSE       SSE2       660FF5D8                 pmaddwd xmm3, xmm0
XDIS 14003b565: SSE       SSE2       66450FEDC1               paddsw xmm8, xmm9
XDIS 14003b56a: SSE       SSE2       660FF5C8                 pmaddwd xmm1, xmm0
XDIS 14003b56e: SSE       SSE2       66440FF5C0               pmaddwd xmm8, xmm0
XDIS 14003b573: SSE       SSE2       660FF5D0                 pmaddwd xmm2, xmm0
XDIS 14003b577: SSE       SSE2       66410FFED8               paddd xmm3, xmm8
XDIS 14003b57c: SSE       SSE2       660FFECA                 paddd xmm1, xmm2
XDIS 14003b580: SSE       SSE2       660FFEDD                 paddd xmm3, xmm5
XDIS 14003b584: SSE       SSE2       660FFECC                 paddd xmm1, xmm4
XDIS 14003b588: DATAXFER  SSE2       660F6FEB                 movdqa xmm5, xmm3
XDIS 14003b58c: DATAXFER  SSE2       660F6FE1                 movdqa xmm4, xmm1
XDIS 14003b590: BINARY    BASE       4839D1                   cmp rcx, rdx
XDIS 14003b593: COND_BR   BASE       0F8547FFFFFF             jnz 0x14003b4e0

patch:

XDIS 14003b220: DATAXFER  SSE2       660F6F01                 movdqa xmm0, xmmword ptr [rcx]
XDIS 14003b224: DATAXFER  SSE2       660F6F4910               movdqa xmm1, xmmword ptr [rcx+0x10]
XDIS 14003b229: BINARY    BASE       4883C120                 add rcx, 0x20
XDIS 14003b22d: DATAXFER  SSE2       66440F6FD8               movdqa xmm11, xmm0
XDIS 14003b232: DATAXFER  SSE2       66440F6FE1               movdqa xmm12, xmm1
XDIS 14003b237: SSE       SSSE3      66440F380418             pmaddubsw xmm11, xmmword ptr [rax]
XDIS 14003b23d: SSE       SSSE3      66440F3804A080000000     pmaddubsw xmm12, xmmword ptr [rax+0x80]
XDIS 14003b247: SSE       SSE2       66450FEDDC               paddsw xmm11, xmm12
XDIS 14003b24c: SSE       SSE2       66440FF5DA               pmaddwd xmm11, xmm2
XDIS 14003b251: SSE       SSE2       66410FFEFB               paddd xmm7, xmm11
XDIS 14003b256: DATAXFER  SSE2       66440F6FD8               movdqa xmm11, xmm0
XDIS 14003b25b: DATAXFER  SSE2       66440F6FE1               movdqa xmm12, xmm1
XDIS 14003b260: SSE       SSSE3      66440F38045810           pmaddubsw xmm11, xmmword ptr [rax+0x10]
XDIS 14003b267: SSE       SSSE3      66440F3804A090000000     pmaddubsw xmm12, xmmword ptr [rax+0x90]
XDIS 14003b271: SSE       SSE2       66450FEDDC               paddsw xmm11, xmm12
XDIS 14003b276: SSE       SSE2       66440FF5DA               pmaddwd xmm11, xmm2
XDIS 14003b27b: SSE       SSE2       66450FFED3               paddd xmm10, xmm11
XDIS 14003b280: DATAXFER  SSE2       66440F6FD8               movdqa xmm11, xmm0
XDIS 14003b285: DATAXFER  SSE2       66440F6FE1               movdqa xmm12, xmm1
XDIS 14003b28a: SSE       SSSE3      66440F38045820           pmaddubsw xmm11, xmmword ptr [rax+0x20]
XDIS 14003b291: SSE       SSSE3      66440F3804A0A0000000     pmaddubsw xmm12, xmmword ptr [rax+0xa0]
XDIS 14003b29b: SSE       SSE2       66450FEDDC               paddsw xmm11, xmm12
XDIS 14003b2a0: SSE       SSE2       66440FF5DA               pmaddwd xmm11, xmm2
XDIS 14003b2a5: SSE       SSE2       66450FFEC3               paddd xmm8, xmm11
XDIS 14003b2aa: DATAXFER  SSE2       66440F6FD8               movdqa xmm11, xmm0
XDIS 14003b2af: DATAXFER  SSE2       66440F6FE1               movdqa xmm12, xmm1
XDIS 14003b2b4: SSE       SSSE3      66440F38045830           pmaddubsw xmm11, xmmword ptr [rax+0x30]
XDIS 14003b2bb: SSE       SSSE3      66440F3804A0B0000000     pmaddubsw xmm12, xmmword ptr [rax+0xb0]
XDIS 14003b2c5: SSE       SSE2       66450FEDDC               paddsw xmm11, xmm12
XDIS 14003b2ca: SSE       SSE2       66440FF5DA               pmaddwd xmm11, xmm2
XDIS 14003b2cf: SSE       SSE2       66450FFECB               paddd xmm9, xmm11
XDIS 14003b2d4: DATAXFER  SSE2       66440F6FD8               movdqa xmm11, xmm0
XDIS 14003b2d9: DATAXFER  SSE2       66440F6FE1               movdqa xmm12, xmm1
XDIS 14003b2de: SSE       SSSE3      66440F38045840           pmaddubsw xmm11, xmmword ptr [rax+0x40]
XDIS 14003b2e5: SSE       SSSE3      66440F3804A0C0000000     pmaddubsw xmm12, xmmword ptr [rax+0xc0]
XDIS 14003b2ef: SSE       SSE2       66450FEDDC               paddsw xmm11, xmm12
XDIS 14003b2f4: SSE       SSE2       66440FF5DA               pmaddwd xmm11, xmm2
XDIS 14003b2f9: SSE       SSE2       66410FFEF3               paddd xmm6, xmm11
XDIS 14003b2fe: DATAXFER  SSE2       66440F6FD8               movdqa xmm11, xmm0
XDIS 14003b303: DATAXFER  SSE2       66440F6FE1               movdqa xmm12, xmm1
XDIS 14003b308: SSE       SSSE3      66440F38045850           pmaddubsw xmm11, xmmword ptr [rax+0x50]
XDIS 14003b30f: SSE       SSSE3      66440F3804A0D0000000     pmaddubsw xmm12, xmmword ptr [rax+0xd0]
XDIS 14003b319: SSE       SSE2       66450FEDDC               paddsw xmm11, xmm12
XDIS 14003b31e: SSE       SSE2       66440FF5DA               pmaddwd xmm11, xmm2
XDIS 14003b323: SSE       SSE2       66410FFEEB               paddd xmm5, xmm11
XDIS 14003b328: DATAXFER  SSE2       66440F6FD8               movdqa xmm11, xmm0
XDIS 14003b32d: DATAXFER  SSE2       66440F6FE1               movdqa xmm12, xmm1
XDIS 14003b332: SSE       SSSE3      66440F38045860           pmaddubsw xmm11, xmmword ptr [rax+0x60]
XDIS 14003b339: SSE       SSSE3      66440F3804A0E0000000     pmaddubsw xmm12, xmmword ptr [rax+0xe0]
XDIS 14003b343: SSE       SSSE3      660F38044070             pmaddubsw xmm0, xmmword ptr [rax+0x70]
XDIS 14003b349: SSE       SSE2       66450FEDDC               paddsw xmm11, xmm12
XDIS 14003b34e: SSE       SSE2       66440FF5DA               pmaddwd xmm11, xmm2
XDIS 14003b353: SSE       SSE2       66410FFEE3               paddd xmm4, xmm11
XDIS 14003b358: SSE       SSSE3      660F380488F0000000       pmaddubsw xmm1, xmmword ptr [rax+0xf0]
XDIS 14003b361: BINARY    BASE       480500010000             add rax, 0x100
XDIS 14003b367: SSE       SSE2       660FEDC1                 paddsw xmm0, xmm1
XDIS 14003b36b: SSE       SSE2       660FF5C2                 pmaddwd xmm0, xmm2
XDIS 14003b36f: SSE       SSE2       660FFED8                 paddd xmm3, xmm0
XDIS 14003b373: BINARY    BASE       4839D1                   cmp rcx, rdx
XDIS 14003b376: COND_BR   BASE       0F85A4FEFFFF             jnz 0x14003b220
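
The instruction chain repeated in the patched loop above (pmaddubsw, paddsw, pmaddwd, paddd) can be modelled in scalar code for a single 32-bit output lane. This is a sketch with hypothetical function names. It assumes that xmm2 holds a vector of 16-bit ones, which the listing does not show.

```cpp
#include <cassert>
#include <cstdint>

// Clamp a 32-bit value to the signed 16-bit range (signed saturation).
inline std::int16_t sat16(std::int32_t v) {
    if (v > 32767) return 32767;
    if (v < -32768) return -32768;
    return static_cast<std::int16_t>(v);
}

// pmaddubsw, one 16-bit lane: unsigned bytes times signed bytes,
// adjacent pair summed with signed saturation.
inline std::int16_t maddubs(std::uint8_t a0, std::uint8_t a1,
                            std::int8_t b0, std::int8_t b1) {
    return sat16(std::int32_t(a0) * b0 + std::int32_t(a1) * b1);
}

// paddsw, one 16-bit lane: saturating signed add.
inline std::int16_t adds16(std::int16_t x, std::int16_t y) {
    return sat16(std::int32_t(x) + std::int32_t(y));
}

// pmaddwd against a vector of ones (assumed), one 32-bit lane:
// simply widens and adds two adjacent 16-bit lanes.
inline std::int32_t madd_ones(std::int16_t lo, std::int16_t hi) {
    return std::int32_t(lo) + std::int32_t(hi);
}

// Contribution to one 32-bit accumulator lane from 4 bytes of each of the
// two input vectors (in0, in1) and the matching weight bytes (w0, w1).
// The final paddd adds this value to the running accumulator.
inline std::int32_t chain_lane(const std::uint8_t in0[4], const std::int8_t w0[4],
                               const std::uint8_t in1[4], const std::int8_t w1[4]) {
    const std::int16_t p0 = maddubs(in0[0], in0[1], w0[0], w0[1]);
    const std::int16_t p1 = maddubs(in0[2], in0[3], w0[2], w0[3]);
    const std::int16_t q0 = maddubs(in1[0], in1[1], w1[0], w1[1]);
    const std::int16_t q1 = maddubs(in1[2], in1[3], w1[2], w1[3]);
    return madd_ones(adds16(p0, q0), adds16(p1, q1));
}
```

As long as no intermediate 16-bit value saturates, the result equals the plain dot product of the eight input bytes with the eight weight bytes.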

Bench of x86-64-modern target on my i7-920:

run       base       test     diff
  1     647015     662153   +15138
  2     623499     643276   +19777
  3     641628     652663   +11035
  4     640316     652238   +11922
  5     639743     671114   +31371
  6     646014     659538   +13524
  7     616516     647350   +30834
  8     640398     663469   +23071
  9     640562     670754   +30192
 10     649616     659712   +10096

Result of  10 runs
==================
base (...tockfish.exe) =     638531  +/- 6469
test (...una/tuna.exe) =     658227  +/- 5756
diff                   =     +19696  +/- 5334

speedup        = +0.0308
P(speedup > 0) =  1.0000
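
The summary figures above appear consistent with speedup being the mean difference divided by the base mean. The sketch below only reproduces that arithmetic. `BenchSummary` and `summarize` are hypothetical names, and the `+/-` and `P(speedup > 0)` columns are not modelled because the thread does not state how they are computed.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct BenchSummary {
    double base_mean;  // mean nodes/second of the base build
    double test_mean;  // mean nodes/second of the test build
    double diff;       // test_mean - base_mean
    double speedup;    // diff / base_mean (assumed definition)
};

// Computes per-build means and the relative speedup from paired runs.
inline BenchSummary summarize(const std::vector<double>& base,
                              const std::vector<double>& test) {
    assert(!base.empty() && base.size() == test.size());
    double base_sum = 0.0, test_sum = 0.0;
    for (std::size_t i = 0; i < base.size(); ++i) {
        base_sum += base[i];
        test_sum += test[i];
    }
    const double n = static_cast<double>(base.size());
    const double base_mean = base_sum / n;
    const double test_mean = test_sum / n;
    return {base_mean, test_mean, test_mean - base_mean,
            (test_mean - base_mean) / base_mean};
}
```

Feeding in the ten base/test pairs from the table yields means of about 638531 and 658227 and a speedup of about 0.0308, matching the reported summary.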

VNNI and AVX512 benchmarks are welcome.

STC:
LLR: 2.96 (-2.94,2.94) <-0.50,2.50>
Total: 51520 W: 4074 L: 3888 D: 43558
Ptnml(0-2): 111, 3136, 19097, 3288, 128

No functional changes.

@Sopel97
Member Author

Sopel97 commented Aug 17, 2021

Looks like SSE2 (old code path) wasn't hit on fishtest and failed to compile because I forgot that OutputType is not declared there now that it was moved out. I fixed it and tested that the bench is correct.

@Torom
Contributor

Torom commented Aug 17, 2021

make profile-build ARCH=x86-64-bmi2

clang 12.0.0:

Result of 100 runs
==================
base (...es/stockfish) =    2028801  +/- 4827
test (....optimize_fc) =    2062108  +/- 4916
diff                   =     +33307  +/- 2710

speedup        = +0.0164
P(speedup > 0) =  1.0000

CPU: 4 x Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Hyperthreading: on

gcc 10.3.0:

Result of 100 runs
==================
base (...tockfish_gcc) =    2013952  +/- 2761
test (...imize_fc_gcc) =    2033194  +/- 2860
diff                   =     +19242  +/- 3444

speedup        = +0.0096
P(speedup > 0) =  1.0000

CPU: 4 x Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Hyperthreading: on

@vondele
Member

vondele commented Aug 17, 2021

on Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz

Base:  ./stockfish.bmi2
Test:  ./stockfish.patch.bmi2
run       base       test     diff
  1    1417768    1457051   +39283
  2    1417768    1459330   +41562
  3    1397693    1440361   +42668
  4    1398301    1443453   +45152
  5    1399587    1439860   +40273
  6    1398707    1436213   +37506
  7    1390703    1434077   +43374
  8    1409751    1455731   +45980
  9    1388967    1437069   +48102
 10    1402233    1444606   +42373

Result of  10 runs
==================
base (./stockfish.bmi2         ) =    1402148  +/- 6201
test (./stockfish.patch.bmi2   ) =    1444775  +/- 5744
diff                             =     +42627  +/- 1982

speedup        = +0.0304
P(speedup > 0) =  1.0000

@vondele
Member

vondele commented Aug 17, 2021

Another result

Result of  10 runs
==================
base (./stockfish.master       ) =    1804306  +/- 7843
test (./stockfish.patch        ) =    1796735  +/- 6373
diff                             =      -7571  +/- 7870

speedup        = -0.0042
P(speedup > 0) =  0.0299

CPU: 16 x AMD Ryzen 9 3950X 16-Core Processor

@vondele
Member

vondele commented Aug 17, 2021

Some more results on an AWS cascade lake box:

Base:  ./stockfish.master.vnni256
Test:  ./stockfish.patch.vnni256
run       base       test     diff
  1    1524803    1545994   +21191
  2    1487628    1497952   +10324
  3    1503481    1486404   -17077
  4    1526171    1515217   -10954
  5    1525446    1546490   +21044
  6    1546738    1507320   -39418
  7    1503090    1511495    +8405
  8    1537284    1523518   -13766
  9    1523678    1546242   +22564
 10    1500984    1500439     -545

Result of  10 runs
==================
base (...ockfish.master.vnni256) =    1517930  +/- 11381
test (./stockfish.patch.vnni256) =    1518107  +/- 13522
diff                             =       +177  +/- 12605

speedup        = +0.0001
P(speedup > 0) =  0.5110

CPU: 48 x Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Hyperthreading: on 

Base:  ./stockfish.master.vnni512
Test:  ./stockfish.patch.vnni512
run       base       test     diff
  1    1496867    1496479     -388
  2    1499505    1469040   -30465
  3    1493389    1497564    +4175
  4    1466730    1453828   -12902
  5    1493080    1497642    +4562
  6    1512048    1478351   -33697
  7    1483888    1486557    +2669
  8    1520234    1463243   -56991
  9    1480697    1449021   -31676
 10    1511732    1476012   -35720

Result of  10 runs
==================
base (...ockfish.master.vnni512) =    1495817  +/- 10018
test (./stockfish.patch.vnni512) =    1476774  +/- 11099
diff                             =     -19043  +/- 13332

speedup        = -0.0127
P(speedup > 0) =  0.0026

CPU: 48 x Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Hyperthreading: on 

Base:  ./stockfish.master.avx512
Test:  ./stockfish.patch.avx512
run       base       test     diff
  1    1457785    1461838    +4053
  2    1487245    1472106   -15139
  3    1473606    1482899    +9293
  4    1493929    1455731   -38198
  5    1462059    1474808   +12749
  6    1444462    1455584   +11122
  7    1453463    1478956   +25493
  8    1439286    1452222   +12936
  9    1456317    1492926   +36609
 10    1485488    1470534   -14954

Result of  10 runs
==================
base (./stockfish.master.avx512) =    1465364  +/- 11627
test (./stockfish.patch.avx512 ) =    1469760  +/- 8235
diff                             =      +4396  +/- 13473

speedup        = +0.0030
P(speedup > 0) =  0.7385

CPU: 48 x Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Hyperthreading: on 

Base:  ./stockfish.master.avx2
Test:  ./stockfish.patch.avx2
run       base       test     diff
  1    1448731    1508893   +60162
  2    1465690    1483203   +17513
  3    1436284    1491310   +55026
  4    1420413    1461912   +41499
  5    1450620    1494315   +43695
  6    1465541    1482671   +17130
  7    1447426    1503246   +55820
  8    1465467    1483051   +17584
  9    1448078    1508972   +60894
 10    1467251    1484345   +17094

Result of  10 runs
==================
base (./stockfish.master.avx2  ) =    1451550  +/- 9401
test (./stockfish.patch.avx2   ) =    1490192  +/- 8948
diff                             =     +38642  +/- 11981

speedup        = +0.0266
P(speedup > 0) =  1.0000

CPU: 48 x Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Hyperthreading: on 

Base:  ./stockfish.master.modern
Test:  ./stockfish.patch.modern
run       base       test     diff
  1    1233420    1220151   -13269
  2    1261502    1252976    -8526
  3    1243121    1252976    +9855
  4    1232368    1225632    -6736
  5    1253846    1266196   +12350
  6    1242854    1224698   -18156
  7    1253574    1258483    +4909
  8    1248218    1239286    -8932
  9    1253791    1256515    +2724
 10    1256733    1243977   -12756

Result of  10 runs
==================
base (./stockfish.master.modern) =    1247943  +/- 6066
test (./stockfish.patch.modern ) =    1244089  +/- 9939
diff                             =      -3854  +/- 6520

speedup        = -0.0031
P(speedup > 0) =  0.1237

CPU: 48 x Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Hyperthreading: on 

@Sopel97
Member Author

Sopel97 commented Aug 17, 2021

bench 16 1 15 default depth NNUE
20 runs each
ran sequentially

build               nps (pure nnue)    variance    diff     speedup %

modern_master:       851571             608     
modern_patch:        891296             757        +39725   +0.046649076
modern_master_pgo:   892006             788     
modern_patch_pgo:    895688             898        +3682    +0.004127775

bmi2_master:        1057312             936     
bmi2_patch:         1095579            1349        +38267   +0.036192723
bmi2_master_pgo:    1080779            1225        
bmi2_patch_pgo:     1114563            1831        +33784   +0.031258935

avx512_master:       939460            1495        
avx512_patch:        958149             625        +18689   +0.019893343
avx512_master_pgo:   974273             910     
avx512_patch_pgo:    998200            1039        +23927   +0.024558825
                
Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz

@vondele
Member

vondele commented Aug 17, 2021

And same box, using 45 threads (so a bit more noise):

Base:  ./stockfish.master.vnni256
Test:  ./stockfish.patch.vnni256
run       base       test     diff
  1   40486829   41252272  +765443
  2   40262899   39238735 -1024164
  3   39285181   39352859   +67678
  4   45709654   49785016 +4075362
  5   40579900   38667586 -1912314
  6   40421223   45196140 +4774917
  7   40054553   41625985 +1571432
  8   39683729   45316464 +5632735
  9   41728480   39070233 -2658247
 10   41349140   40547315  -801825

Result of  10 runs
==================
base (...ockfish.master.vnni256) =   40956159  +/- 1124292
test (./stockfish.patch.vnni256) =   42005260  +/- 2250029
diff                             =   +1049102  +/- 1793827

speedup        = +0.0256
P(speedup > 0) =  0.8738

CPU: 48 x Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Hyperthreading: on 

Base:  ./stockfish.master.vnni512
Test:  ./stockfish.patch.vnni512
run       base       test     diff
  1   36870850   39314766 +2443916
  2   37495210   41702452 +4207242
  3   37561328   43967716 +6406388
  4   37471786   37469275    -2511
  5   36345396   41663268 +5317872
  6   40092814   36917402 -3175412
  7   40998898   37830183 -3168715
  8   36929072   37022521   +93449
  9   38035889   36935526 -1100363
 10   45789237   37586156 -8203081

Result of  10 runs
==================
base (...ockfish.master.vnni512) =   38759048  +/- 1776735
test (./stockfish.patch.vnni512) =   39040926  +/- 1562696
diff                             =    +281878  +/- 2774255

speedup        = +0.0073
P(speedup > 0) =  0.5788

CPU: 48 x Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Hyperthreading: on 

Base:  ./stockfish.master.avx512
Test:  ./stockfish.patch.avx512
run       base       test     diff
  1   37051464   36081928  -969536
  2   39531513   47324415 +7792902
  3   38082775   35676794 -2405981
  4   39662159   38237984 -1424175
  5   37145706   34948791 -2196915
  6   40223839   36788188 -3435651
  7   39284089   41048152 +1764063
  8   40747299   40117043  -630256
  9   43999092   37859598 -6139494
 10   35119494   42175408 +7055914

Result of  10 runs
==================
base (./stockfish.master.avx512) =   39084743  +/- 1507656
test (./stockfish.patch.avx512 ) =   39025830  +/- 2330033
diff                             =     -58913  +/- 2745227

speedup        = -0.0015
P(speedup > 0) =  0.4832

CPU: 48 x Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Hyperthreading: on 

Base:  ./stockfish.master.avx2
Test:  ./stockfish.patch.avx2
run       base       test     diff
  1   39225598   38549765  -675833
  2   40841537   38234411 -2607126
  3   40419143   38974217 -1444926
  4   40220467   41712561 +1492094
  5   38787126   38985397  +198271
  6   42052588   42311397  +258809
  7   43609382   40542323 -3067059
  8   37791503   41872346 +4080843
  9   46262951   46666249  +403298
 10   42783945   38751235 -4032710

Result of  10 runs
==================
base (./stockfish.master.avx2  ) =   41199424  +/- 1567226
test (./stockfish.patch.avx2   ) =   40659990  +/- 1611539
diff                             =    -539434  +/- 1476904

speedup        = -0.0131
P(speedup > 0) =  0.2374

CPU: 48 x Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Hyperthreading: on 

Base:  ./stockfish.master.modern
Test:  ./stockfish.patch.modern
run       base       test     diff
  1   37897963   36757368 -1140595
  2   33173615   35588156 +2414541
  3   34440251   32333059 -2107192
  4   38289123   33022845 -5266278
  5   33719721   33734059   +14338
  6   33827488   35471522 +1644034
  7   34968985   40462016 +5493031
  8   34894353   33200094 -1694259
  9   35616696   35453202  -163494
 10   41118151   33249460 -7868691

Result of  10 runs
==================
base (./stockfish.master.modern) =   35794635  +/- 1562862
test (./stockfish.patch.modern ) =   34927178  +/- 1495817
diff                             =    -867456  +/- 2347417

speedup        = -0.0242
P(speedup > 0) =  0.2348

CPU: 48 x Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Hyperthreading: on 


@JavaMast

JavaMast commented Aug 17, 2021

Core i5-11400F

Screenshot_222

Screenshot_223

@vondele
Member

vondele commented Aug 17, 2021

@JavaMast out of curiosity, could you also test -bmi2 on the rocket lake CPU you have there?

@JavaMast

JavaMast commented Aug 17, 2021

@vondele
Screenshot_224

Screenshot_225

Speed for old network architecture: #3457 (comment)

@JavaMast JavaMast mentioned this pull request Aug 17, 2021
@mstembera
Contributor

mstembera commented Aug 18, 2021

My BMI2 results on Skylake-X:

Results for 100 tests for each version:

            Base      Test      Diff      
    Mean    1008902   1026457   -17555    
    StDev   67246     70496     90024     

p-value: 0.577
speedup: 0.017

Would it make sense to put simd.h in the src\nnue\layers directory?
Also should it be possible to remove the horizontal adds as it was in the previous architecture?

@vondele
Member

vondele commented Aug 18, 2021

I think the recent numbers with avx512 and vnni are not really much better than avx2. I wonder if we should consider dropping those implementations, and maybe revisit once the difference is more significant?

@NightlyKing
Contributor

> I think the recent number with avx512 and vnni are not really much better than avx2, I wonder if we should consider dropping those implementations, and maybe revisit once the difference is more significant?

A problem with currently available CPUs that are capable of AVX512 is the high dependency on luck (aka silicon lottery). Some chips just get a bit hotter and poof, the CPU throttles down so much that it may even lose speed compared to AVX2. Earlier models go as far as to always throttle once AVX512 instructions are used without a setting to change how aggressive it is in doing so. Then of course several other factors are at play such as thermal paste, cooler, and airflow of the PC case/server rack.
This is pretty unique since other instruction sets don't require the CPU to slow down its clock.
On the other hand, it seems that our current NN structure and compiler don't "like" it.
Traditionally, Ipmanchess has had great speedup from AVX512 and vnni. If even he doesn't get a speedup over avx2+bmi2 then I think there's no point in supporting it until the net changes again.

@Fanael
Contributor

Fanael commented Aug 18, 2021

Inline assembly constraints are incorrect in almost every place: they're writing to input operands, which can potentially confuse the compiler because it assumes input operands don't change. They probably should be changed to input/output operands with non-overlapping constraint, that is, +&v.
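
To illustrate the kind of constraint described above, here is a minimal sketch, not the code from `simd.h`. The function `add2_epi32` is hypothetical. It uses the `x` register class and SSE2 instructions instead of `v` so that it runs on any x86-64 machine, and it falls back to scalar code elsewhere. The accumulator is both read and written, and it is written by the first instruction before the second input is consumed, so it is declared as an early-clobber read/write operand (`+&`).

```cpp
#include <cassert>
#include <cstdint>

#if defined(__x86_64__) && defined(__GNUC__)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2_ASM 1
#endif

// Lane-wise acc += a + b on four 32-bit integers.
inline void add2_epi32(std::int32_t acc[4], const std::int32_t a[4],
                       const std::int32_t b[4]) {
#ifdef SKETCH_HAS_SSE2_ASM
    __m128i vacc = _mm_loadu_si128(reinterpret_cast<const __m128i*>(acc));
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    // AT&T operand order: "paddd src, dst" computes dst += src.
    // [acc] is "+&x": read and written, and must not share a register
    // with [a] or [b], because it is modified before [b] is read.
    asm("paddd %[a], %[acc]\n\t"
        "paddd %[b], %[acc]"
        : [acc] "+&x"(vacc)
        : [a] "x"(va), [b] "x"(vb));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(acc), vacc);
#else
    // Portable fallback for non-x86-64 targets or other compilers.
    for (int i = 0; i < 4; ++i)
        acc[i] += a[i] + b[i];
#endif
}
```

Declaring the accumulator as a plain input while the assembly writes to it would leave the compiler free to assume its value is unchanged afterwards, which is the hazard the comment above points out.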

@Sopel97
Member Author

Sopel97 commented Aug 18, 2021

Hmm, you're right. Someone mentioned it in the bug report too. I changed it now and inspected the assembly with XED, the results are the same aside from register assignment differences. I don't think a retest will be necessary for this?

@vondele vondele added the to be merged Will be merged shortly label Aug 20, 2021
@vondele
Member

vondele commented Aug 20, 2021

so, I will merge this as I think separating out some of the simd functionality is a good step. It is somewhat unfortunate that we need to use inline asm to work around the gcc issues, that should be considered an intermediate fix, and not the direction of things.

The question of removing avx512/vnni support is still open. If it isn't useful (or if Intel drops or reduces support in hardware), we probably should drop it.

@vondele vondele closed this in 18dcf1f Aug 20, 2021
@Sopel97
Member Author

Sopel97 commented Aug 20, 2021

There are some processors that do benefit from AVX512, and the situation with the next gen ryzens is uncertain.
