
Fixing Buffer::BlockCopy, JIT_MemCpy, and JIT_MemSet to just call the appropriate CRT functions for x64 Windows, as is already done for all other platforms/targets #25750

Merged 3 commits on Jul 18, 2019

Conversation

tannergooding (Member)

This resolves https://github.com/dotnet/coreclr/issues/25505.

Based on the original issue where JIT_MemCpy was changed to use rep movsb (see #7198), there was:

  • minor improvement (~5%) for arrays of length 0 to 120
  • good improvement (~40%) for arrays of length 130 to 510
    • ~47% for arrays of length 130 to 310
    • ~39% for arrays of length 320 to 440
    • ~27% for arrays of length 450 to 510
  • little improvement (~1%) for arrays above 510 in length
    • This was only tested for 520 and 1000 bytes

However, on AMD processors there are additional limitations around rep movsb and when it is beneficial to use it. The common conditions under which it is used in the JIT_MemCpy method today actually cause a 2x performance decrease for arrays larger than 512 bytes.
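For anyone wanting to reproduce this kind of comparison locally, a minimal micro-benchmark sketch follows (this is not the harness used for the numbers above). It pits rep movsb, via the MSVC __movsb intrinsic, against the CRT memcpy across a few buffer sizes; it assumes MSVC targeting x64, and the sizes and iteration count are arbitrary.

#include <intrin.h>     // __movsb (MSVC, x64)
#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

int main()
{
    const size_t sizes[] = { 64, 128, 256, 512, 1024, 4096 };
    std::vector<unsigned char> src(4096, 0xAB), dst(4096);
    volatile unsigned char sink = 0;    // keeps the copy loops from being optimized away

    for (size_t size : sizes)
    {
        const int iterations = 1000000;

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < iterations; i++)
            __movsb(dst.data(), src.data(), size);      // rep movsb
        sink ^= dst[0];

        auto t1 = std::chrono::steady_clock::now();
        for (int i = 0; i < iterations; i++)
            memcpy(dst.data(), src.data(), size);       // CRT memcpy
        sink ^= dst[0];

        auto t2 = std::chrono::steady_clock::now();
        auto us = [](auto a, auto b) {
            return std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
        };
        printf("%5zu bytes: rep movsb %lld us, memcpy %lld us\n",
               size, (long long)us(t0, t1), (long long)us(t1, t2));
    }
    return 0;
}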

Having a custom memcpy routine adds maintenance burden, can be error prone, is generally not as widely tested, and does not get many of the optimizations that the CRT implementations receive. This, coupled with the overall minor improvements for small arrays on Intel processors and the 2x regression for arrays over 512 bytes on AMD processors, is why the custom memcpy routine is being removed.

It would be beneficial for any future improvements to memcpy to be made directly against glibc and the CRT instead.
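For readers skimming the diff, the shape of the change can be summarized in C-level terms. The sketch below is illustrative only: the _concept names are made up for this summary, and the real helpers stay in assembly (CrtHelpers.asm / crthelpers.S) so that faults inside them can be attributed to the managed caller, as discussed further down. Each helper becomes a thin wrapper that bails out on a zero length and otherwise defers to the CRT.

#include <cstddef>
#include <cstring>

// Illustrative C++ rendering of the new behavior; not the actual implementation.
extern "C" void JIT_MemSet_concept(void* dest, int c, size_t count)
{
    if (count != 0)                 // zero length is a no-op, and dest is not touched
        memset(dest, c, count);
}

extern "C" void JIT_MemCpy_concept(void* dest, const void* src, size_t count)
{
    if (count != 0)
        memmove(dest, src, count);  // Windows forwards to memmove to keep overlap-tolerant
                                    // behavior; the Unix helper forwards to memcpy instead
                                    // (see the review discussion below)
}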

@tannergooding (Member Author) commented Jul 17, 2019

I did a git grep -in JIT_Mem and it looks like there are no actual remaining usages of JIT_MemSet or JIT_MemCpy in the repo:

C:\repos\coreclr [fix-25505 ≡]> git grep -in JIT_Mem
src/inc/jithelpers.h:260:    JITHELPER(CORINFO_HELP_MEMSET,              JIT_MemSet,         CORINFO_HELP_SIG_REG_ONLY)
src/inc/jithelpers.h:261:    JITHELPER(CORINFO_HELP_MEMCPY,              JIT_MemCpy,         CORINFO_HELP_SIG_REG_ONLY)
src/vm/amd64/CrtHelpers.asm:43:LEAF_ENTRY JIT_MemSet, _TEXT
src/vm/amd64/CrtHelpers.asm:149:LEAF_END_MARKED JIT_MemSet, _TEXT
src/vm/amd64/CrtHelpers.asm:151:;JIT_MemCpy - Copy source buffer to destination buffer
src/vm/amd64/CrtHelpers.asm:154:;   JIT_MemCpy() copies a source memory buffer to a destination memory
src/vm/amd64/CrtHelpers.asm:183:LEAF_ENTRY JIT_MemCpy, _TEXT
src/vm/amd64/CrtHelpers.asm:329:LEAF_END_MARKED JIT_MemCpy, _TEXT
src/vm/amd64/crthelpers.S:9:// JIT_MemSet/JIT_MemCpy
src/vm/amd64/crthelpers.S:16:LEAF_ENTRY JIT_MemSet, _TEXT
src/vm/amd64/crthelpers.S:27:LEAF_END_MARKED JIT_MemSet, _TEXT
src/vm/amd64/crthelpers.S:29:LEAF_ENTRY JIT_MemCpy, _TEXT
src/vm/amd64/crthelpers.S:41:LEAF_END_MARKED JIT_MemCpy, _TEXT
src/vm/arm/CrtHelpers.asm:23:; JIT_MemSet/JIT_MemCpy
src/vm/arm/CrtHelpers.asm:31:;EXTERN_C void __stdcall JIT_MemSet(void* _dest, int c, size_t count)
src/vm/arm/CrtHelpers.asm:32:        LEAF_ENTRY JIT_MemSet
src/vm/arm/CrtHelpers.asm:101:        LEAF_END_MARKED JIT_MemSet
src/vm/arm/CrtHelpers.asm:104:;EXTERN_C void __stdcall JIT_MemCpy(void* _dest, const void *_src, size_t count)
src/vm/arm/CrtHelpers.asm:105:        LEAF_ENTRY JIT_MemCpy
src/vm/arm/CrtHelpers.asm:159:        LEAF_END_MARKED JIT_MemCpy
src/vm/arm/crthelpers.S:21:// JIT_MemSet/JIT_MemCpy
src/vm/arm/crthelpers.S:29://EXTERN_C void __stdcall JIT_MemSet(void* _dest, int c, size_t count)
src/vm/arm/crthelpers.S:30:LEAF_ENTRY JIT_MemSet, _TEXT
src/vm/arm/crthelpers.S:40:LEAF_END_MARKED JIT_MemSet, _TEXT
src/vm/arm/crthelpers.S:43://EXTERN_C void __stdcall JIT_MemCpy(void* _dest, const void *_src, size_t count)
src/vm/arm/crthelpers.S:44:LEAF_ENTRY JIT_MemCpy, _TEXT
src/vm/arm/crthelpers.S:56:LEAF_END_MARKED JIT_MemCpy, _TEXT
src/vm/arm64/crthelpers.S:7:// JIT_MemSet/JIT_MemCpy
src/vm/arm64/crthelpers.S:13:LEAF_ENTRY JIT_MemSet, _TEXT
src/vm/arm64/crthelpers.S:14:    cbz x2, LOCAL_LABEL(JIT_MemSet_ret)
src/vm/arm64/crthelpers.S:20:LOCAL_LABEL(JIT_MemSet_ret):
src/vm/arm64/crthelpers.S:22:LEAF_END_MARKED JIT_MemSet, _TEXT
src/vm/arm64/crthelpers.S:24:LEAF_ENTRY JIT_MemCpy, _TEXT
src/vm/arm64/crthelpers.S:25:    cbz x2, LOCAL_LABEL(JIT_MemCpy_ret)
src/vm/arm64/crthelpers.S:32:LOCAL_LABEL(JIT_MemCpy_ret):
src/vm/arm64/crthelpers.S:34:LEAF_END_MARKED JIT_MemCpy, _TEXT
src/vm/arm64/crthelpers.asm:15:;void JIT_MemSet(void *dst, int val, SIZE_T count)
src/vm/arm64/crthelpers.asm:58:; Assembly code corresponding to above C++ method. JIT_MemSet can AV and clr exception personality routine needs to
src/vm/arm64/crthelpers.asm:59:; determine if the exception has taken place inside JIT_Memset in order to throw corresponding managed exception.
src/vm/arm64/crthelpers.asm:60:; Determining this is slow if the method were implemented as C++ method (using unwind info). In .asm file by adding JIT_MemSet_End
src/vm/arm64/crthelpers.asm:61:; marker it can be easily determined if exception happened in JIT_MemSet. Therefore, JIT_MemSet has been written in assembly instead of
src/vm/arm64/crthelpers.asm:64:    LEAF_ENTRY JIT_MemSet
src/vm/arm64/crthelpers.asm:70:    b           JIT_MemSet_bottom
src/vm/arm64/crthelpers.asm:71:JIT_MemSet_top
src/vm/arm64/crthelpers.asm:73:JIT_MemSet_bottom
src/vm/arm64/crthelpers.asm:75:    bge        JIT_MemSet_top
src/vm/arm64/crthelpers.asm:77:    tbz         x2, #3, JIT_MemSet_tbz4
src/vm/arm64/crthelpers.asm:79:JIT_MemSet_tbz4
src/vm/arm64/crthelpers.asm:80:    tbz         x2, #2, JIT_MemSet_tbz2
src/vm/arm64/crthelpers.asm:82:JIT_MemSet_tbz2
src/vm/arm64/crthelpers.asm:83:    tbz         x2, #1, JIT_MemSet_tbz1
src/vm/arm64/crthelpers.asm:85:JIT_MemSet_tbz1
src/vm/arm64/crthelpers.asm:86:    tbz         x2, #0, JIT_MemSet_ret
src/vm/arm64/crthelpers.asm:88:JIT_MemSet_ret
src/vm/arm64/crthelpers.asm:92:    LEAF_ENTRY JIT_MemSet_End
src/vm/arm64/crthelpers.asm:97:; See comments above for JIT_MemSet
src/vm/arm64/crthelpers.asm:99:;void JIT_MemCpy(void *dst, const void *src, SIZE_T count)
src/vm/arm64/crthelpers.asm:143:; See comments above for JIT_MemSet method
src/vm/arm64/crthelpers.asm:144:    LEAF_ENTRY JIT_MemCpy
src/vm/arm64/crthelpers.asm:145:    b           JIT_MemCpy_bottom
src/vm/arm64/crthelpers.asm:146:JIT_MemCpy_top
src/vm/arm64/crthelpers.asm:149:JIT_MemCpy_bottom
src/vm/arm64/crthelpers.asm:151:    bge         JIT_MemCpy_top
src/vm/arm64/crthelpers.asm:153:    tbz         x2, #3, JIT_MemCpy_tbz4
src/vm/arm64/crthelpers.asm:156:JIT_MemCpy_tbz4
src/vm/arm64/crthelpers.asm:157:    tbz         x2, #2, JIT_MemCpy_tbz2
src/vm/arm64/crthelpers.asm:160:JIT_MemCpy_tbz2
src/vm/arm64/crthelpers.asm:161:    tbz         x2, #1, JIT_MemCpy_tbz1
src/vm/arm64/crthelpers.asm:164:JIT_MemCpy_tbz1
src/vm/arm64/crthelpers.asm:165:    tbz         x2, #0, JIT_MemCpy_ret
src/vm/arm64/crthelpers.asm:168:JIT_MemCpy_ret
src/vm/arm64/crthelpers.asm:172:    LEAF_ENTRY JIT_MemCpy_End
src/vm/excep.cpp:6702:EXTERN_C void JIT_MemSet_End();
src/vm/excep.cpp:6703:EXTERN_C void JIT_MemCpy_End();
src/vm/excep.cpp:6762:    CHECK_RANGE(JIT_MemSet)
src/vm/excep.cpp:6763:    CHECK_RANGE(JIT_MemCpy)
src/vm/exceptionhandling.cpp:5213:        // in one of the JIT helper functions (JIT_MemSet, ...)
src/vm/jitinterface.h:434:    void STDCALL JIT_MemSet(void *dest, int c, SIZE_T count);
src/vm/jitinterface.h:435:    void STDCALL JIT_MemCpy(void *dest, const void *src, SIZE_T count);

Should we consider just removing the assembly helpers and the CHECK_RANGE sanity checks that exist?

Edit: Fixed spacing in grep results

@tannergooding (Member Author)

CC. @jkotas, @janvorli

@janvorli (Member)

JIT still uses these helpers, see:

JITHELPER(CORINFO_HELP_MEMSET, JIT_MemSet, CORINFO_HELP_SIG_REG_ONLY)
JITHELPER(CORINFO_HELP_MEMCPY, JIT_MemCpy, CORINFO_HELP_SIG_REG_ONLY)

The CHECK_RANGE entries are not just sanity checks; they are actually necessary. If JIT_MemCpy or JIT_MemSet gets a null reference as an argument, we need to handle that as if it had happened in the managed caller.
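For context, a rough sketch of what that machinery looks like follows. This is not the actual excep.cpp code, and IsIPInJitMemSet is a hypothetical name used for illustration; the real code expresses the pattern through the CHECK_RANGE macro seen in the grep output above. The idea is that the assembly exports a start label and a matching *_End marker, and the fault handler checks whether the faulting instruction pointer falls between them so the access violation can be surfaced as a NullReferenceException in the managed caller.

#include <cstddef>
#include <cstdint>

// Labels provided by the assembly (LEAF_ENTRY / LEAF_END_MARKED in CrtHelpers.asm).
extern "C" void JIT_MemSet(void* dest, int c, size_t count);
extern "C" void JIT_MemSet_End();

// Hypothetical helper name; the real code wraps this pattern in CHECK_RANGE.
static bool IsIPInJitMemSet(uintptr_t faultingIP)
{
    return faultingIP >= reinterpret_cast<uintptr_t>(&JIT_MemSet) &&
           faultingIP <  reinterpret_cast<uintptr_t>(&JIT_MemSet_End);
}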

@janvorli (Member)

To use memcpy / memset from the CRT, we should do the same thing that we do on AMD64 Linux:

LEAF_ENTRY JIT_MemSet, _TEXT
        test    rdx, rdx                    // skip the call entirely when count is zero
        jz      Exit_MemSet
        cmp     byte ptr [rdi], 0           // probe dest so a null pointer faults here, inside the helper
        jmp     C_PLTFUNC(memset)           // tail call the CRT memset
Exit_MemSet:
        ret
LEAF_END_MARKED JIT_MemSet, _TEXT

LEAF_ENTRY JIT_MemCpy, _TEXT
        test    rdx, rdx                    // skip the call entirely when count is zero
        jz      Exit_MemCpy
        cmp     byte ptr [rdi], 0           // probe dest and src so a null pointer faults here,
        cmp     byte ptr [rsi], 0           // inside the helper, where CHECK_RANGE can attribute it
        jmp     C_PLTFUNC(memcpy)           // tail call the CRT memcpy
Exit_MemCpy:
        ret
LEAF_END_MARKED JIT_MemCpy, _TEXT

That way, we check the input parameters in the JIT_xxx functions and then tail call to the runtime functions.

@tannergooding (Member Author)

Fixed up the JIT_MemSet and JIT_MemCpy implementations to forward to the CRT versions. I also cleaned up the comments in the Unix version to match those in the x64 version (where applicable).


-        jmp     C_PLTFUNC(memcpy)
+        jmp     C_PLTFUNC(memmove) // forward to the CRT implementation, using memmove to handle overlapping buffers
tannergooding (Member Author)

This was the only code change to the Unix version. The Windows version was very explicit about needing to handle overlapping buffers, and so memmove is the appropriate thing to forward to.

I would assume the same conditions apply to Unix.

Member

> The Windows version was very explicit about needing to handle overlapping buffers

That was for better compatibility with .NET Framework. It is not required by ECMA-335: `The behavior of cpblk is unspecified if the source and destination areas overlap.`

memmove is a bit slower. I do not think we need to be making the JIT helper slower on Unix...

tannergooding (Member Author)

Do we still need it for compat on Windows? I would think consistency here is also important.

(I'll keep memmove on Windows and memcpy on Unix for now, but I think it would be a good discussion to have).

tannergooding (Member Author) commented Jul 17, 2019

If nothing else, it might be good to have JIT_MemCpy and JIT_MemMove, and then the JIT can more readily decide which is appropriate based on the scenario. There may be scenarios where overlapping is desirable and several where it is not needed.

Edit: And it appears as though more than just cpblk goes through these. Never mind, it looks like there were just some comments that said MemCpy when they meant MemSet.

Member

IIRC, memcpy is an alias for memmove in the Windows x64 CRT. So even if we changed Windows to call memcpy, the unspecified behavior is still going to be different between Windows and Unix.
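A small standalone demonstration of the semantic difference under discussion (not part of the PR): memmove is defined for overlapping buffers while memcpy is not, so a cpblk lowered to a plain memcpy can legitimately behave differently from one CRT to another when the regions overlap.

#include <cstdio>
#include <cstring>

int main()
{
    char a[] = "abcdefgh";
    char b[] = "abcdefgh";

    memmove(a + 2, a, 4);   // well-defined for overlap: a becomes "ababcdgh"
    memcpy (b + 2, b, 4);   // undefined for overlap: result may or may not match memmove

    printf("memmove: %s\nmemcpy : %s\n", a, b);
    return 0;
}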

@tannergooding changed the title from "Fixing Buffer::BlockCopy to just call the CRT memmove for x64 Windows, as is already done for all other platforms/targets" to "Fixing Buffer::BlockCopy, JIT_MemCpy, and JIT_MemSet to just call the appropriate CRT functions for x64 Windows, as is already done for all other platforms/targets" on Jul 17, 2019
; void *dst = pointer to destination buffer
; const void *src = pointer to source buffer
; size_t count = number of bytes to copy
; Exit:
tannergooding (Member Author) commented Jul 17, 2019

The declarations of these methods in jitinterface.h have them returning void, rather than void* (which is what the CRT implementations return):

void STDCALL JIT_MemSet(void *dest, int c, SIZE_T count);
void STDCALL JIT_MemCpy(void *dest, const void *src, SIZE_T count);
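For reference, the corresponding CRT prototypes return the destination pointer, which is the mismatch being pointed out:

// Standard C prototypes, as declared in <string.h>
void* memset (void* dest, int c, size_t count);
void* memcpy (void* dest, const void* src, size_t count);
void* memmove(void* dest, const void* src, size_t count);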

@tannergooding (Member Author)

CoreFX legs are failing due to:

It was not possible to find any compatible framework version
The specified framework 'Microsoft.NETCore.App', version '3.0.0' was not found.
  - The following frameworks were found:
      5.0.0 at [/home/helixbot/work/8698dcd6-71bd-4dec-90db-5aaae809ddac/Payload/shared/Microsoft.NETCore.App]

@wtgodbe, was this something missed or are things just not entirely in sync yet?

@wtgodbe (Member) commented Jul 17, 2019

We may need to update some TFMs to netcoreapp5.0 - see #25704 (comment). @BruceForstall where are the runtimeconfig.json files you mentioned generated? @ericstj, is it alright if we hard-code a 5.0 TFM?

@BruceForstall (Member)

@wtgodbe The runtimeconfig.json files are built with the corefx tests in the corefx official build, and downloaded by the corefx test legs in coreclr repo CI.

Maybe the 3.0 versions here: https://github.com/dotnet/coreclr/blob/master/tests/src/Common/CoreFX/CoreFX.depproj need to be updated?

@tannergooding (Member Author)

Tests aren't running...

@tannergooding (Member Author)

/azp help

@tannergooding (Member Author)

Closed and reopened as per https://github.com/dotnet/corefx/issues/39593

@tannergooding tannergooding merged commit 54d9d9b into dotnet:master Jul 18, 2019
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
… appropriate CRT functions for x64 Windows, as is already done for all other platforms/targets (dotnet/coreclr#25750)

* Fixing Buffer::BlockCopy to just call the CRT memmove for x64 Windows, as is already done for all other platforms/targets

* Fixing up the x64 CrtHelpers.asm to just forward to the CRT implementations for JIT_MemSet and JIT_MemCpy

* Keep unix using memcpy and clarify that Windows uses memmove for full framework compat.


Commit migrated from dotnet/coreclr@54d9d9b

Successfully merging this pull request may close these issues.

MemoryStream.Write() slow on Ryzen CPUs
5 participants