FEXCore: Add non-atomic Memcpy and Memset IR fast paths #3478

bylaws · 2024-02-29T19:36:04Z

When TSO is disabled, vector LDP/STP can be used for a two-instruction 32-byte memory copy, which is significantly faster than the current byte-by-byte copy. Performing two such copies directly after one another also marginally increases copy speed for all sizes >= 64.
Before:

-----------------------------------------------------------
Benchmark                 Time             CPU   Iterations
-----------------------------------------------------------
BM_SyscallBasic        20.2 ns         20.3 ns     35000000
BM_repmovsb/8          3.72 ns         3.70 ns    189189189 bytes_per_second=2.01367Gi/s
BM_repmovsb/16         6.73 ns         6.60 ns    100000000 bytes_per_second=2.22405Gi/s
BM_repmovsb/32         12.9 ns         12.8 ns     53846154 bytes_per_second=2.32571Gi/s
BM_repmovsb/64         25.5 ns         25.6 ns     26923077 bytes_per_second=2.32571Gi/s
BM_repmovsb/128        50.6 ns         50.0 ns     10000000 bytes_per_second=2.38419Gi/s
BM_repmovsb/256       100.0 ns         98.6 ns      7000000 bytes_per_second=2.41874Gi/s
BM_repmovsb/512         190 ns          191 ns      3500000 bytes_per_second=2.49094Gi/s
BM_repmovsb/1024        378 ns          375 ns      1842105 bytes_per_second=2.54604Gi/s
BM_repmovsb/2048        746 ns          740 ns      1000000 bytes_per_second=2.5775Gi/s
BM_repmovsb/4096       1473 ns         1479 ns       466667 bytes_per_second=2.57999Gi/s
BM_repmovsb/8192       2905 ns         2900 ns       241379 bytes_per_second=2.63082Gi/s
BM_repstosb/8          3.68 ns         3.70 ns    189189189 bytes_per_second=2.01367Gi/s
BM_repstosb/16         6.36 ns         6.34 ns    116666667 bytes_per_second=2.34928Gi/s
BM_repstosb/32         11.7 ns         11.8 ns     63636364 bytes_per_second=2.52868Gi/s
BM_repstosb/64         22.4 ns         22.6 ns     31818182 bytes_per_second=2.63404Gi/s
BM_repstosb/128        44.2 ns         44.0 ns     15909091 bytes_per_second=2.7093Gi/s
BM_repstosb/256        86.9 ns         86.1 ns      7777778 bytes_per_second=2.76771Gi/s
BM_repstosb/512         179 ns          177 ns      3888889 bytes_per_second=2.68749Gi/s
BM_repstosb/1024        350 ns          350 ns      2000000 bytes_per_second=2.72478Gi/s
BM_repstosb/2048        692 ns          690 ns      1000000 bytes_per_second=2.76427Gi/s
BM_repstosb/4096       1379 ns         1393 ns       538462 bytes_per_second=2.73876Gi/s
BM_repstosb/8192       2750 ns         2739 ns       259259 bytes_per_second=2.7859Gi/s

After:

BM_repmovsb/8          4.34 ns         4.36 ns    162790698 bytes_per_second=1.70829Gi/s
BM_repmovsb/16         7.39 ns         7.31 ns     87500000 bytes_per_second=2.03727Gi/s
BM_repmovsb/32         2.41 ns         2.30 ns    291666667 bytes_per_second=12.4176Gi/s
BM_repmovsb/64         2.35 ns         2.33 ns    304347826 bytes_per_second=25.5501Gi/s
BM_repmovsb/128        2.87 ns         2.86 ns    241379310 bytes_per_second=41.7024Gi/s
BM_repmovsb/256        4.19 ns         4.22 ns    170731707 bytes_per_second=56.5356Gi/s
BM_repmovsb/512        7.23 ns         7.10 ns     87500000 bytes_per_second=66.2274Gi/s
BM_repmovsb/1024       13.2 ns         13.0 ns     50000000 bytes_per_second=73.3596Gi/s
BM_repmovsb/2048       24.4 ns         24.3 ns     28000000 bytes_per_second=78.5379Gi/s
BM_repmovsb/4096       53.7 ns         53.1 ns     11666667 bytes_per_second=71.7819Gi/s
BM_repmovsb/8192       99.0 ns         98.6 ns      7000000 bytes_per_second=77.3997Gi/s
BM_repstosb/8          3.68 ns         3.70 ns    194444444 bytes_per_second=2.01212Gi/s
BM_repstosb/16         6.35 ns         6.20 ns    100000000 bytes_per_second=2.36526Gi/s
BM_repstosb/32         2.07 ns         2.09 ns    350000000 bytes_per_second=14.2888Gi/s
BM_repstosb/64         2.08 ns         2.07 ns    333333333 bytes_per_second=28.7945Gi/s
BM_repstosb/128        2.17 ns         2.17 ns    318181818 bytes_per_second=54.9713Gi/s
BM_repstosb/256        2.84 ns         2.84 ns    250000000 bytes_per_second=83.9502Gi/s
BM_repstosb/512        5.68 ns         5.70 ns    100000000 bytes_per_second=83.6556Gi/s
BM_repstosb/1024       11.1 ns         11.0 ns     63636364 bytes_per_second=86.6977Gi/s
BM_repstosb/2048       21.9 ns         22.0 ns     31818182 bytes_per_second=86.6977Gi/s
BM_repstosb/4096       43.1 ns         43.0 ns     16279070 bytes_per_second=88.7139Gi/s
BM_repstosb/8192       86.3 ns         86.1 ns      7777778 bytes_per_second=88.5668Gi/s
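
The copy strategy in the description can be sketched in portable C++. This is only an illustration, not the JIT code from this PR: `CopyBlock32` and `ForwardCopySketch` are hypothetical names, the buffers are assumed not to overlap, and whether a compiler lowers the fixed-size `std::memcpy` to a vector LDP/STP pair on AArch64 is an assumption rather than a guarantee.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copy one 32-byte block through a temporary.
// A fixed-size memcpy like this may compile to a paired vector
// load/store on AArch64, but that is left to the compiler.
static inline void CopyBlock32(std::uint8_t* dst, const std::uint8_t* src) {
  std::uint8_t tmp[32];
  std::memcpy(tmp, src, 32);
  std::memcpy(dst, tmp, 32);
}

// Hypothetical forward, non-atomic copy for non-overlapping buffers.
void ForwardCopySketch(std::uint8_t* dst, const std::uint8_t* src, std::size_t size) {
  // Main loop: two 32-byte block copies directly after one another.
  while (size >= 64) {
    CopyBlock32(dst, src);
    CopyBlock32(dst + 32, src + 32);
    dst += 64;
    src += 64;
    size -= 64;
  }
  // At most one remaining 32-byte block.
  if (size >= 32) {
    CopyBlock32(dst, src);
    dst += 32;
    src += 32;
    size -= 32;
  }
  // Tail of fewer than 32 bytes: byte-by-byte copy.
  while (size != 0) {
    *dst++ = *src++;
    --size;
  }
}
```

In this sketch, sizes below 32 bytes never reach the block path, which would be consistent with the 8- and 16-byte results above staying in the same range before and after.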

bylaws · 2024-02-29T20:03:57Z

Seems as if this is broken only for 32-bit code

unittests/InstructionCountCI/FEXOpt/MultiInst.json

bylaws · 2024-03-05T16:27:03Z

Ready for review

FEXCore/Source/Interface/Core/JIT/Arm64/MemoryOps.cpp

Sonicadvance1

This causes a crash in the backwards stos/movs benchmark that needs to be fixed.
See https://gist.github.com/Sonicadvance1/d893db3dd2c96f272f8639b4b24da234 for the stos memset bench.

This unit test recreates the error condition that FEX-Emu#3478 causes. When a string operation is a backwards copy, the optimization reads past the end of the page and crashes. This seemingly only happens with backwards string operations, but the test covers both forward and backward directions.
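
The failure mode described above can be illustrated with a small C++ sketch of a backwards block copy. This is not the fix from the PR, and the exact cause inside the JIT is not stated in this thread. `Copy32` and `BackwardCopySketch` are hypothetical names, the buffers are assumed not to overlap, and the pointers are assumed to address the last (highest) byte of each range, as with a backwards x86 string operation. The point is that each 32-byte block has to end at the current position rather than start there, so no access goes above the starting byte.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copy one 32-byte block through a temporary.
static inline void Copy32(std::uint8_t* dst, const std::uint8_t* src) {
  std::uint8_t tmp[32];
  std::memcpy(tmp, src, 32);
  std::memcpy(dst, tmp, 32);
}

// Hypothetical backwards copy: dst_last and src_last point at the LAST byte
// of each range, and the copy proceeds towards lower addresses.
void BackwardCopySketch(std::uint8_t* dst_last, const std::uint8_t* src_last, std::size_t size) {
  // Work with one-past-the-end pointers so every block covers [end - 32, end).
  std::uint8_t* dst_end = dst_last + 1;
  const std::uint8_t* src_end = src_last + 1;

  while (size >= 32) {
    // Step down first, then copy: the block ends at the old position,
    // so nothing above the starting byte is read or written.
    dst_end -= 32;
    src_end -= 32;
    Copy32(dst_end, src_end);
    size -= 32;
  }
  // Tail of fewer than 32 bytes: byte-by-byte, still moving downwards.
  while (size != 0) {
    *--dst_end = *--src_end;
    --size;
  }
}
```

A block copy that instead started at the last byte and extended 32 bytes upwards would touch memory above the range, which is the kind of access that can fault when the range ends at a page boundary.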

unittests/ASM: Implements a unit test for #3478


bylaws · 2024-03-18T23:33:57Z

Fixed the issue with backwards copies

Sonicadvance1

Did a bunch of testing on this and couldn't find a regression.
Seems like pretty much entirely a win. Good job!

Caused by FEX-Emu#3478. The review missed that it could cause issues. bylaws already has a fix incoming that will get this unit test working.

bylaws force-pushed the memcpy branch from 99071b2 to 1fe8e91 Compare February 29, 2024 20:03

bylaws force-pushed the memcpy branch from 1fe8e91 to e87ca02 Compare February 29, 2024 20:04

alyssarosenzweig reviewed Feb 29, 2024

unittests/InstructionCountCI/FEXOpt/MultiInst.json (outdated)

bylaws force-pushed the memcpy branch from 5067372 to 49b4763 Compare March 5, 2024 16:05

bylaws force-pushed the memcpy branch from 49b4763 to 91f9b1d Compare March 5, 2024 17:49

alyssarosenzweig reviewed Mar 6, 2024

FEXCore/Source/Interface/Core/JIT/Arm64/MemoryOps.cpp (outdated)

bylaws force-pushed the memcpy branch from 91f9b1d to 343efca Compare March 6, 2024 10:25

alyssarosenzweig approved these changes Mar 6, 2024

Sonicadvance1 requested changes Mar 6, 2024

Sonicadvance1 mentioned this pull request Mar 14, 2024

unittests/ASM: Implements a unit test for #3478 #3493

Merged

lioncash added a commit that referenced this pull request Mar 14, 2024

Merge pull request #3493 from Sonicadvance1/bug_for_3478

caff3cb

unittests/ASM: Implements a unit test for #3478

bylaws added 2 commits March 18, 2024 23:28

Update InstCountCI

29b05f6

bylaws force-pushed the memcpy branch from 343efca to 29b05f6 Compare March 18, 2024 23:31

Sonicadvance1 approved these changes Mar 19, 2024

Sonicadvance1 merged commit 7dcacfe into FEX-Emu:main Mar 19, 2024
10 checks passed

Sonicadvance1 mentioned this pull request Mar 20, 2024

unittests/ASM: Adds a test for overlapping memcpy using rep movs #3499

Merged
