
Arm64: Forward memset/memcpy to CRT implementation #67326

Closed · kunalspathak opened this issue Mar 30, 2022 · 17 comments
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)
Assignee: kunalspathak
Milestone: 7.0.0
@kunalspathak (Member) commented Mar 30, 2022

On x64, memset and memmove are forwarded to the CRT implementation, as seen below:

jmp memset ; forward to the CRT implementation

jmp memmove ; forward to the CRT implementation

However, on Arm64 they are hand-written in assembly, as seen in https://github.com/dotnet/runtime/blob/2453f16807b85b279efc26d17d6f20de87801c09/src/coreclr/vm/arm64/crthelpers.asm. We should experiment to see whether the CRT implementation of memset/memmove for Arm64 is faster and, if so, just use it. We might also need to readjust the heuristics we use today to unroll the copy block.
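
To spell out what "forward to the CRT implementation" means, here is a rough C++ sketch (hypothetical names; the real x64 stubs are just the assembly tail-jumps shown above and live in the VM):

#include <cstddef>
#include <cstring>

// Illustrative sketch only: the real helpers are assembly stubs that tail-jump to the
// CRT (jmp memset / jmp memmove), but the effect is equivalent to the following.
extern "C" void* JIT_MemSet_Sketch(void* dst, int value, size_t count)
{
    return memset(dst, value, count);    // defer to the platform's optimized memset
}

extern "C" void* JIT_MemCpy_Sketch(void* dst, const void* src, size_t count)
{
    return memmove(dst, src, count);     // defer to the platform's optimized memmove
}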

Here is the benchmark run difference between x64 (base) and arm64 (diff):

(image: perf_diff benchmark comparison)

Here is the x64 code for the CopyBlock128() benchmark, which just uses memcpy:

G_M19447_IG03:
       lea      rcx, bword ptr [rsp+08H]
       lea      rdx, bword ptr [rsp+88H]
       mov      r8d, 128
       call     CORINFO_HELP_MEMCPY
       inc      edi
       cmp      edi, 100
       jl       SHORT G_M19447_IG03

But on Arm64 the block copy is unrolled inline:

G_M19447_IG03:
            ldr     x1, [fp,#152]
            str     x1, [fp,#24]
            ldp     q16, q17, [fp,#160]
            stp     q16, q17, [fp,#32]
            ldp     q16, q17, [fp,#192]
            stp     q16, q17, [fp,#64]
            ldp     q16, q17, [fp,#224]
            stp     q16, q17, [fp,#96]
            ldr     q16, [fp,#0xd1ffab1e]
            str     q16, [fp,#128]
            ldr     x1, [fp,#0xd1ffab1e]
            str     x1, [fp,#144]
            add     w0, w0, #1
            cmp     w0, #100
            blt     G_M19447_IG03

I will perform some experiments and update the results here.

kunalspathak self-assigned this on Mar 30, 2022
The dotnet-issue-labeler bot added the area-CodeGen-coreclr and untriaged labels on Mar 30, 2022
@ghost commented Mar 30, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

@kunalspathak (Member, Author) commented:

@dotnet/jit-contrib

@kunalspathak (Member, Author) commented:

@a74nh - FYI

@EgorBo (Member) commented Mar 30, 2022

From my understanding we do use the native memset on windows-arm64; it's just *nix-arm64 that doesn't, so it should be easy to fix.

@echesakov (Contributor) commented:

> If we switch to CRT memset we will see some benefits (at least) on linux because LLVM uses DC ZVA for that.

This is correct, although I would not expect to see much of a performance improvement (especially for smaller block sizes) just from using DC ZVA. Using DC ZVA requires manually zeroing up to 64 bytes at the beginning and the end of the block. It also requires checking at runtime whether DC ZVA is available on the target platform (e.g. as in https://github.com/llvm/llvm-project/blob/881350a92d821d4f8e4fa648443ed1d17e251188/libc/AOR_v20.02/string/aarch64/memset.S#L81).

It might be interesting to have two JIT_MemSet helpers: one that uses DC ZVA and another that doesn't. The JIT would do the check for DC ZVA availability at compile time and emit a call to the corresponding helper.
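
For reference, the runtime availability check boils down to reading DCZID_EL0. A minimal C++ sketch (hypothetical helper name, not the runtime's actual code):

#include <cstdint>

// DCZID_EL0.DZP (bit 4) set means DC ZVA is prohibited; DCZID_EL0.BS (bits 3:0) is
// log2 of the zeroing block size in 4-byte words (64 bytes on most cores).
static bool TryGetDcZvaBlockSize(uint64_t* blockSizeBytes)
{
#if defined(__aarch64__)
    uint64_t dczid;
    __asm__ __volatile__("mrs %0, dczid_el0" : "=r"(dczid));
    if (dczid & (1ULL << 4))                  // DZP == 1: DC ZVA not permitted
        return false;
    *blockSizeBytes = 4ULL << (dczid & 0xF);  // block size in bytes
    return true;
#else
    (void)blockSizeBytes;
    return false;
#endif
}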

JulieLeeMSFT removed the untriaged label on Apr 4, 2022
JulieLeeMSFT added this to the 7.0.0 milestone on Apr 4, 2022
@kunalspathak (Member, Author) commented:

A few years ago we already switched to the platform memset/memcpy on linux/arm64 in dotnet/coreclr#17536, and likewise for windows and linux on x64 in dotnet/coreclr#25750. The only one left out was windows/arm64, which was still using the hand-written assembly code. The platform implementation is highly optimized and picks appropriate routines based on the block size.

@kunalspathak (Member, Author) commented:

All the benchmarks access the array elements using ref, and currently we don't generate decent addressing modes for them. Tracked in #64819.

@kunalspathak (Member, Author) commented Apr 14, 2022

For some benchmarks involving AnyLocation init/copy blocks of 32+ bytes, we have logic to use vector registers (which would reduce the number of instructions), but we don't do that optimization if the destination address alignment is not known or the address is not 16-byte aligned.

// Store operations that cross a 16-byte boundary reduce bandwidth or incur additional latency.
// The following condition prevents using 16-byte stores when dstRegAddrAlignment is:
// 1) unknown (i.e. dstReg is neither FP nor SP) or
// 2) non-zero (i.e. dstRegAddr is not 16-byte aligned).
const bool hasAvailableSimdReg = isDstRegAddrAlignmentKnown && (size > FP_REGSIZE_BYTES);
const bool canUse16ByteWideInstrs =
    hasAvailableSimdReg && (dstRegAddrAlignment == 0) && helper.CanEncodeAllOffsets(FP_REGSIZE_BYTES);

The benchmark loop in question looks like:

for (int startOffset = 0; startOffset < Size; startOffset += 64)
{
    Unsafe.InitBlock(ref _dstData[startOffset], 255, 64);
}

Because of that, here is the diff (left = x64, right = arm64):

(image: benchmark diff)

Reference: (image)

@kunalspathak (Member, Author) commented:

I will probably experiment with loosening the heuristic: if the loads/stores happen inside a loop, ignore the alignment requirement and just use SIMD registers.

@kunalspathak (Member, Author) commented:

> I will probably experiment with loosening the heuristic: if the loads/stores happen inside a loop, ignore the alignment requirement and just use SIMD registers.

@EgorChesakov - did you experiment with this?

@kunalspathak (Member, Author) commented:

It seems none of the C++ compilers have any such restriction; they use SIMD registers wherever possible:
https://godbolt.org/z/eb53xPvYT

@TamarChristinaArm - any thoughts on whether we should also use SIMD registers and drop the alignment condition?

@echesakov (Contributor) commented:

> I will probably experiment with loosening the heuristic: if the loads/stores happen inside a loop, ignore the alignment requirement and just use SIMD registers.
>
> @EgorChesakov - did you experiment with this?

@kunalspathak Yes, this was how I implemented the algorithm in the first place. However, during testing I found that some benchmarks regressed due to using SIMD loads/stores with unaligned addresses. Hence, I extended the algorithm to check for src/dst 16-byte alignment. But if you loosen the heuristic, you will get the even better code-size improvement that I had originally.

@kunalspathak (Member, Author) commented:

> However, during testing I found that some benchmarks regressed due to using SIMD loads/stores with unaligned addresses.

Thanks for confirming, @EgorChesakov. In case you recall, was that from local testing or from the perf lab? If it was local testing, I am inclined to get this change in and see how the perf lab reacts.

@echesakov (Contributor) commented:

> Thanks for confirming, @EgorChesakov. In case you recall, was that from local testing or from the perf lab? If it was local testing, I am inclined to get this change in and see how the perf lab reacts.

I did a private run of the perf lab pipeline, and when the results came back I could reproduce the regressions locally on a Surface Pro X.

@TamarChristinaArm (Contributor) commented:

> It seems none of the C++ compilers have any such restriction; they use SIMD registers wherever possible: https://godbolt.org/z/eb53xPvYT

Yes, C/C++ compilers almost always use SIMD for copies unless strict alignment is required (-mstrict-align), in which case they punt to memcpy. The reason we can do this without a perf loss in general is the padding and alignment of structures in C.

So in your example link, if you look, the .bss section is always 16-byte aligned, and the structs are padded to a multiple of 8. See e.g. https://godbolt.org/z/cbn8d98ez and look at the .bss section, particularly the .align and .size directives. So fldTa and fldTb are at an aligned location.

For stack-based copies the structure is aligned on the stack (https://godbolt.org/z/8fv1eY7va), so again, unless the user has specified a custom alignment, it works out.

That leaves general heap copies, which normally have a size above what we would inline, so those get punted to memcpy.
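
To illustrate the point with a sketch in the spirit of the godbolt links above (not the exact example from those links): with default padding and alignment, a small fixed-size struct copy is typically inlined by GCC/Clang using SIMD loads/stores rather than a call to memcpy.

// A 64-byte struct with the default (natural) alignment and padding.
struct Block64
{
    long long f0, f1, f2, f3, f4, f5, f6, f7;   // 8 x 8 bytes = 64 bytes
};

// At -O2 on AArch64, compilers typically lower this to ldp/stp of q registers,
// since the size is known at compile time and unaligned SIMD accesses are allowed
// unless -mstrict-align is given.
void CopyBlock64(Block64* dst, const Block64* src)
{
    *dst = *src;
}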

> @TamarChristinaArm - any thoughts on whether we should also use SIMD registers and drop the alignment condition?

I think @EgorChesakov's earlier experiments have shown that for the dotnet case this doesn't seem to be a win in general. The behavior here is somewhat dependent on the core, how it handles pairs, and the total available memory bandwidth.

I have been meaning to ask what the actual alignment and padding for structs in .NET is.

@kunalspathak (Member, Author) commented:

Fixed via #67788 and #68085

ghost locked this issue as resolved and limited conversation to collaborators on May 23, 2022