Arm64: Forward memset/memcpy to CRT implementation #67326
Tagging subscribers to this area: @JulieLeeMSFT

Issue Details

In x64, memset and memmove are forwarded to the CRT implementation, as seen below:

runtime/src/coreclr/vm/amd64/CrtHelpers.asm, line 49 in a85a2f5
runtime/src/coreclr/vm/amd64/CrtHelpers.asm, line 83 in a85a2f5

However, in Arm64 they are hand-written in assembly, as seen in https://github.com/dotnet/runtime/blob/2453f16807b85b279efc26d17d6f20de87801c09/src/coreclr/vm/arm64/crthelpers.asm. Experiment to see whether the CRT implementation of memset/memmove for Arm64 is faster and, if so, just use it. We might also need to readjust the heuristics we use today to unroll the copy block.

Here is the benchmark run difference between x64 (base) and arm64 (diff).

Here is the x64 code for the CopyBlock128() benchmark, which just uses memcpy:

G_M19447_IG03:
lea rcx, bword ptr [rsp+08H]
lea rdx, bword ptr [rsp+88H]
mov r8d, 128
call CORINFO_HELP_MEMCPY
inc edi
cmp edi, 100
jl SHORT G_M19447_IG03

But Arm64 unrolls the block copy instead:

G_M19447_IG03:
ldr x1, [fp,#152]
str x1, [fp,#24]
ldp q16, q17, [fp,#160]
stp q16, q17, [fp,#32]
ldp q16, q17, [fp,#192]
stp q16, q17, [fp,#64]
ldp q16, q17, [fp,#224]
stp q16, q17, [fp,#96]
ldr q16, [fp,#0xd1ffab1e]
str q16, [fp,#128]
ldr x1, [fp,#0xd1ffab1e]
str x1, [fp,#144]
add w0, w0, #1
cmp w0, #100
blt G_M19447_IG03
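For reference, here is a rough C# sketch of the shape of such a benchmark (an assumption about its form, not the actual CopyBlock128() benchmark code; the use of stackalloc, Unsafe.CopyBlockUnaligned, and the 100-iteration loop are illustrative):

using System.Runtime.CompilerServices;

public static unsafe class CopyBlockSketch
{
    // Hypothetical benchmark body: copy a fixed 128-byte block in a small loop.
    // The stack buffers mirror the [rsp+...]/[fp,...] addressing seen in the listings above.
    // x64 emits a CORINFO_HELP_MEMCPY call for the copy, while Arm64 unrolls it into
    // pairs of SIMD loads/stores (ldp/stp q16, q17).
    public static void CopyBlock128()
    {
        byte* src = stackalloc byte[128];
        byte* dst = stackalloc byte[128];

        for (int i = 0; i < 100; i++)
        {
            Unsafe.CopyBlockUnaligned(dst, src, 128);
        }
    }
}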
I will perform some experiments and update the results here.
@dotnet/jit-contrib
@a74nh - FYI
If we switch to CRT: https://github.com/llvm/llvm-project/blob/main/libc/AOR_v20.02/string/aarch64/memset.S. Likewise.
From my understanding we do use the native memset on windows-arm64; it's just *nix-arm64 that doesn't - should be easy to fix.
This is correct, although I would not expect to see much of a performance improvement (especially for smaller block sizes) only from using DC ZVA. Using DC ZVA requires zeroing up to 64 bytes at the beginning and the end of the block manually. It also requires checking for DC ZVA availability on the target platform at runtime (e.g. as in https://github.com/llvm/llvm-project/blob/881350a92d821d4f8e4fa648443ed1d17e251188/libc/AOR_v20.02/string/aarch64/memset.S#L81). It might be interesting to have two …
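To make the head/tail point concrete, here is a hedged C# sketch (not code from the thread or from CoreCLR; the 64-byte granule and all names are assumptions, and a real implementation would read DCZID_EL0 for the actual DC ZVA line size) of how a DC ZVA-style memset has to split the range:

using System;

public static unsafe class ZvaSketch
{
    private const int ZvaLineSize = 64; // assumed DC ZVA granule

    // Zero [dst, dst + length): ordinary stores for the unaligned head and tail,
    // and whole 64-byte lines in the middle - the only region DC ZVA could cover.
    public static void ZeroMemory(byte* dst, nuint length)
    {
        byte* end = dst + length;

        // Head: up to 63 bytes until dst reaches a 64-byte boundary.
        while (dst < end && ((nuint)dst & (nuint)(ZvaLineSize - 1)) != 0)
            *dst++ = 0;

        // Body: whole 64-byte lines; an AArch64 implementation would use "dc zva" here.
        while (end - dst >= ZvaLineSize)
        {
            new Span<byte>(dst, ZvaLineSize).Clear(); // stand-in for dc zva
            dst += ZvaLineSize;
        }

        // Tail: up to 63 remaining bytes.
        while (dst < end)
            *dst++ = 0;
    }
}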
A few years ago we already switched to using the platform memset/memcpy on linux/arm64 in dotnet/coreclr#17536, and likewise for windows and linux on x64 in dotnet/coreclr#25750. The only one left out was windows/arm64, which was still using the hand-written assembly code. The platform implementation is highly optimized and has ways to pick appropriate routines based on the size of the block.
All the benchmarks access the array element using …
For some benchmarks that involve the loop below, the relevant code is runtime/src/coreclr/jit/codegenarmarch.cpp, lines 2679 to 2685 in e5eee9a:
for (int startOffset = 0; startOffset < Size; startOffset += 64)
{
Unsafe.InitBlock(ref _dstData[startOffset], 255, 64);
}

Because of that, here is the diff (left = x64 and right = arm64).
I will probably experiment with loosening the heuristics: if the loads/stores are happening in a loop, ignore the alignment part and just use SIMD registers.
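As an illustration of what "ignore the alignment part and just use SIMD registers" would mean for each 64-byte InitBlock above, here is a hedged C# sketch (illustrative only; the JIT would emit stp of q-registers directly rather than anything like this managed code):

using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

public static class InitBlockSketch
{
    // Fill 64 bytes starting at 'dst' with 'value' using four 16-byte vector stores,
    // regardless of whether 'dst' is known to be 16-byte aligned.
    public static void Fill64(ref byte dst, byte value)
    {
        Vector128<byte> v = Vector128.Create(value);
        Unsafe.WriteUnaligned(ref dst, v);
        Unsafe.WriteUnaligned(ref Unsafe.Add(ref dst, 16), v);
        Unsafe.WriteUnaligned(ref Unsafe.Add(ref dst, 32), v);
        Unsafe.WriteUnaligned(ref Unsafe.Add(ref dst, 48), v);
    }
}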
@EgorChesakov - Did you experiment with this?
It seems like none of the C++ compilers have any such restriction, and they use SIMD registers wherever possible. @TamarChristinaArm - any thoughts on whether we should also use SIMD registers and drop the condition about alignment?
@kunalspathak Yes, this was how I implemented the algorithm in the first place. However, during testing I found that some benchmarks regressed due to using SIMD loads/stores with unaligned addresses. Hence, I extended the algorithm to check for src/dst 16-byte alignment. But if you loosen the heuristic you will get an even better code-size improvement than I had originally.
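As a conceptual summary of the trade-off being discussed (this is not the code in codegenarmarch.cpp; the strategy names and the 128-byte cut-off are made up for illustration), the decision boils down to roughly:

public static class UnrollHeuristicSketch
{
    public enum CopyStrategy { UnrollSimd, UnrollScalar, CallHelper }

    // size: compile-time-known block size; the alignment flags stand for what the
    // JIT can prove about the source/destination addresses.
    public static CopyStrategy Choose(int size, bool srcAligned16, bool dstAligned16)
    {
        const int UnrollLimit = 128; // assumed cut-off; the real threshold may differ

        if (size > UnrollLimit)
            return CopyStrategy.CallHelper;  // large blocks: call the memcpy/memset helper

        if (srcAligned16 && dstAligned16)
            return CopyStrategy.UnrollSimd;  // current behavior: SIMD pairs only when aligned

        // Proposed loosening: return UnrollSimd here too, accepting unaligned
        // SIMD loads/stores instead of falling back to integer registers.
        return CopyStrategy.UnrollScalar;
    }
}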
Thanks for confirming, @EgorChesakov. In case you recall, was that from local testing or from the perf lab? If it was local testing, I am inclined to get this change in and see how the perf lab reacts.
I did a private run of the perf lab pipeline, and when the results came back I could reproduce the regressions locally on a Surface Pro X.
Yes, C/C++ compilers do mostly always use SIMD for copies unless strict alignment is required. So in your example link, if you look, … For stack-based copies the structure is aligned on the stack (https://godbolt.org/z/8fv1eY7va), so again, unless the user has specified a custom alignment, it works out. Which leaves general heap copies, which normally have a bound higher than we would inline, so those get punted to …
I think @EgorChesakov's earlier experiment has shown that for the dotnet case this doesn't seem to be a win in general. The behavior here is somewhat dependent on the core, how it handles pairs, and the total available memory bandwidth. I have been meaning to ask what the actual alignment and padding for structs in .NET is.