FEXCore: Add non-atomic Memcpy and Memset IR fast paths #3478

Merged
2 commits merged into FEX-Emu:main
Mar 19, 2024

Conversation

bylaws
Collaborator

@bylaws bylaws commented Feb 29, 2024

When TSO is disabled, vector LDP/STP can be used for a two-instruction, 32-byte memory copy, which is significantly faster than the current byte-by-byte copy. Performing two such copies directly after one another also marginally increases copy speed for all sizes >= 64 bytes.
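A minimal C++ sketch of the copy strategy described above, for illustration only. It is not the FEXCore IR or JIT code from this PR: `CopyForwardSketch` is a hypothetical helper, each fixed-size 32-byte `std::memcpy` stands in for one vector LDP/STP pair, and the source and destination are assumed not to overlap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not FEX's implementation): forward copy of `size` bytes.
// Assumes non-overlapping regions and no atomicity requirement (TSO disabled).
inline void CopyForwardSketch(uint8_t* dst, const uint8_t* src, size_t size) {
  assert(dst != nullptr && src != nullptr);

  // Sizes >= 64: two 32-byte block copies directly after one another.
  while (size >= 64) {
    std::memcpy(dst, src, 32);
    std::memcpy(dst + 32, src + 32, 32);
    dst += 64;
    src += 64;
    size -= 64;
  }

  // One remaining 32-byte block, if any.
  if (size >= 32) {
    std::memcpy(dst, src, 32);
    dst += 32;
    src += 32;
    size -= 32;
  }

  // Tail shorter than 32 bytes: fall back to the byte-by-byte copy.
  while (size > 0) {
    *dst++ = *src++;
    --size;
  }
}
```

The byte-by-byte tail is consistent with the 8- and 16-byte benchmark rows, which do not speed up with this change.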
Before:

-----------------------------------------------------------
Benchmark                 Time             CPU   Iterations
-----------------------------------------------------------
BM_SyscallBasic        20.2 ns         20.3 ns     35000000
BM_repmovsb/8          3.72 ns         3.70 ns    189189189 bytes_per_second=2.01367Gi/s
BM_repmovsb/16         6.73 ns         6.60 ns    100000000 bytes_per_second=2.22405Gi/s
BM_repmovsb/32         12.9 ns         12.8 ns     53846154 bytes_per_second=2.32571Gi/s
BM_repmovsb/64         25.5 ns         25.6 ns     26923077 bytes_per_second=2.32571Gi/s
BM_repmovsb/128        50.6 ns         50.0 ns     10000000 bytes_per_second=2.38419Gi/s
BM_repmovsb/256       100.0 ns         98.6 ns      7000000 bytes_per_second=2.41874Gi/s
BM_repmovsb/512         190 ns          191 ns      3500000 bytes_per_second=2.49094Gi/s
BM_repmovsb/1024        378 ns          375 ns      1842105 bytes_per_second=2.54604Gi/s
BM_repmovsb/2048        746 ns          740 ns      1000000 bytes_per_second=2.5775Gi/s
BM_repmovsb/4096       1473 ns         1479 ns       466667 bytes_per_second=2.57999Gi/s
BM_repmovsb/8192       2905 ns         2900 ns       241379 bytes_per_second=2.63082Gi/s
BM_repstosb/8          3.68 ns         3.70 ns    189189189 bytes_per_second=2.01367Gi/s
BM_repstosb/16         6.36 ns         6.34 ns    116666667 bytes_per_second=2.34928Gi/s
BM_repstosb/32         11.7 ns         11.8 ns     63636364 bytes_per_second=2.52868Gi/s
BM_repstosb/64         22.4 ns         22.6 ns     31818182 bytes_per_second=2.63404Gi/s
BM_repstosb/128        44.2 ns         44.0 ns     15909091 bytes_per_second=2.7093Gi/s
BM_repstosb/256        86.9 ns         86.1 ns      7777778 bytes_per_second=2.76771Gi/s
BM_repstosb/512         179 ns          177 ns      3888889 bytes_per_second=2.68749Gi/s
BM_repstosb/1024        350 ns          350 ns      2000000 bytes_per_second=2.72478Gi/s
BM_repstosb/2048        692 ns          690 ns      1000000 bytes_per_second=2.76427Gi/s
BM_repstosb/4096       1379 ns         1393 ns       538462 bytes_per_second=2.73876Gi/s
BM_repstosb/8192       2750 ns         2739 ns       259259 bytes_per_second=2.7859Gi/s

After:

BM_repmovsb/8          4.34 ns         4.36 ns    162790698 bytes_per_second=1.70829Gi/s
BM_repmovsb/16         7.39 ns         7.31 ns     87500000 bytes_per_second=2.03727Gi/s
BM_repmovsb/32         2.41 ns         2.30 ns    291666667 bytes_per_second=12.4176Gi/s
BM_repmovsb/64         2.35 ns         2.33 ns    304347826 bytes_per_second=25.5501Gi/s
BM_repmovsb/128        2.87 ns         2.86 ns    241379310 bytes_per_second=41.7024Gi/s
BM_repmovsb/256        4.19 ns         4.22 ns    170731707 bytes_per_second=56.5356Gi/s
BM_repmovsb/512        7.23 ns         7.10 ns     87500000 bytes_per_second=66.2274Gi/s
BM_repmovsb/1024       13.2 ns         13.0 ns     50000000 bytes_per_second=73.3596Gi/s
BM_repmovsb/2048       24.4 ns         24.3 ns     28000000 bytes_per_second=78.5379Gi/s
BM_repmovsb/4096       53.7 ns         53.1 ns     11666667 bytes_per_second=71.7819Gi/s
BM_repmovsb/8192       99.0 ns         98.6 ns      7000000 bytes_per_second=77.3997Gi/s
BM_repstosb/8          3.68 ns         3.70 ns    194444444 bytes_per_second=2.01212Gi/s
BM_repstosb/16         6.35 ns         6.20 ns    100000000 bytes_per_second=2.36526Gi/s
BM_repstosb/32         2.07 ns         2.09 ns    350000000 bytes_per_second=14.2888Gi/s
BM_repstosb/64         2.08 ns         2.07 ns    333333333 bytes_per_second=28.7945Gi/s
BM_repstosb/128        2.17 ns         2.17 ns    318181818 bytes_per_second=54.9713Gi/s
BM_repstosb/256        2.84 ns         2.84 ns    250000000 bytes_per_second=83.9502Gi/s
BM_repstosb/512        5.68 ns         5.70 ns    100000000 bytes_per_second=83.6556Gi/s
BM_repstosb/1024       11.1 ns         11.0 ns     63636364 bytes_per_second=86.6977Gi/s
BM_repstosb/2048       21.9 ns         22.0 ns     31818182 bytes_per_second=86.6977Gi/s
BM_repstosb/4096       43.1 ns         43.0 ns     16279070 bytes_per_second=88.7139Gi/s
BM_repstosb/8192       86.3 ns         86.1 ns      7777778 bytes_per_second=88.5668Gi/s
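To read the two tables side by side, the throughput column can be approximated from the per-iteration time. The sketch below is an assumption about how to interpret the numbers, not part of the benchmark source: `GiBPerSecond` and `Speedup` are hypothetical helpers, and because the displayed times are rounded, the results only roughly match the reported `bytes_per_second` values.

```cpp
#include <cassert>

// Hypothetical helper: throughput in GiB/s for `bytes` processed in `ns` nanoseconds.
inline double GiBPerSecond(double bytes, double ns) {
  assert(ns > 0.0);
  const double bytes_per_second = bytes / (ns * 1e-9);
  return bytes_per_second / (1024.0 * 1024.0 * 1024.0);
}

// Hypothetical helper: how many times faster the "after" run is than the "before" run.
inline double Speedup(double before_ns, double after_ns) {
  assert(after_ns > 0.0);
  return before_ns / after_ns;
}
```

For example, `GiBPerSecond(8192, 98.6)` gives about 77.4 GiB/s for the `BM_repmovsb/8192` row after the change, and `Speedup(2900, 98.6)` gives roughly a 29x improvement in CPU time for that size.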

@bylaws
Collaborator Author

bylaws commented Feb 29, 2024

Seems as if this is broken only for 32-bit code

@bylaws
Collaborator Author

bylaws commented Mar 5, 2024

Ready for review

Member

@Sonicadvance1 Sonicadvance1 left a comment

This causes a crash in the backwards stos/movs benchmark that needs to be fixed.
See https://gist.github.com/Sonicadvance1/d893db3dd2c96f272f8639b4b24da234 for the stos memset bench.

Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Mar 14, 2024
This unit test recreates the error condition that FEX-Emu#3478 causes.
When a string operation is a backwards copy, the optimization
reads past the end of the page, resulting in a crash.

This seemingly only happens with backwards string operations, but the
test covers both forward and backward operations.
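The hazard described in this commit message can be illustrated with a small C++ sketch. It is an assumption about the general shape of the problem, not the actual bug or fix in FEX: `CopyBackwardSketch` is a hypothetical helper that copies from the highest address down, and it takes base pointers plus a size for simplicity, whereas a guest backwards string operation starts with its pointers at the highest element. The point is that every 32-byte block must lie entirely inside the region; a block that starts at the current byte and extends upwards would touch memory beyond the end of the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not FEX's implementation): copies `size` bytes starting
// from the highest address and moving down. Assumes non-overlapping regions.
inline void CopyBackwardSketch(uint8_t* dst, const uint8_t* src, size_t size) {
  assert(dst != nullptr && src != nullptr);

  // `end` is one past the highest byte that has not been copied yet.
  size_t end = size;

  // Each block covers [end - 32, end), so it never extends past the region.
  while (end >= 32) {
    end -= 32;
    std::memcpy(dst + end, src + end, 32);
  }

  // Remaining low bytes, still walking downwards one byte at a time.
  while (end > 0) {
    --end;
    dst[end] = src[end];
  }
}
```

A unit test for this kind of bug would typically place the buffer right at the end of a mapped page so that any out-of-bounds access faults immediately.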
lioncash added a commit that referenced this pull request Mar 14, 2024
unittests/ASM: Implements a unit test for #3478
bylaws added 2 commits March 18, 2024 23:28
When TSO is disabled, vector LDP/STP can be used for a two-instruction,
32-byte memory copy, which is significantly faster than the
current byte-by-byte copy. Performing two such copies directly after
one another also marginally increases copy speed for all sizes >= 64 bytes.
@bylaws
Collaborator Author

bylaws commented Mar 18, 2024

Fixed the issue with backwards copies

Member

@Sonicadvance1 Sonicadvance1 left a comment

Did a bunch of testing on this and couldn't find a regression.
Seems like pretty much entirely a win. Good job!

@Sonicadvance1 Sonicadvance1 merged commit 7dcacfe into FEX-Emu:main Mar 19, 2024
10 checks passed
Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Mar 20, 2024
Caused by FEX-Emu#3478

The review missed that this change could cause issues. bylaws already
has a fix incoming that will get this unit test working.