
Arm64: Forward memset/memcpy to CRT implementation #67326

Closed · kunalspathak opened this issue Mar 30, 2022 · 17 comments
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)
Assignee: kunalspathak
Milestone: 7.0.0
@kunalspathak (Member) commented Mar 30, 2022

On x64, memset and memmove are forwarded to the CRT implementation, as seen below:

jmp memset ; forward to the CRT implementation

jmp memmove ; forward to the CRT implementation

However, on Arm64 they are hand-written in assembly, as seen in https://github.com/dotnet/runtime/blob/2453f16807b85b279efc26d17d6f20de87801c09/src/coreclr/vm/arm64/crthelpers.asm. We should experiment to see whether the CRT implementation of memset/memmove for Arm64 is faster and, if so, just use it. We might also need to readjust the heuristics we use today to unroll the copy block.
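
To spell out what "forward to the CRT implementation" means, here is a rough C++ sketch (hypothetical names; the real x64 stubs are just the assembly tail-jumps shown above and live in the VM):

#include <cstddef>
#include <cstring>

// Illustrative sketch only: the real helpers are assembly stubs that tail-jump to the
// CRT (jmp memset / jmp memmove), but the effect is equivalent to the following.
extern "C" void* JIT_MemSet_Sketch(void* dst, int value, size_t count)
{
    return memset(dst, value, count);    // defer to the platform's optimized memset
}

extern "C" void* JIT_MemCpy_Sketch(void* dst, const void* src, size_t count)
{
    return memmove(dst, src, count);     // defer to the platform's optimized memmove
}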

Here is the benchmark run difference between x64 (base) and arm64 (diff):

(image: perf_diff benchmark comparison)

Here is the x64 code for the CopyBlock128() benchmark, which just uses memcpy:

G_M19447_IG03:
       lea      rcx, bword ptr [rsp+08H]
       lea      rdx, bword ptr [rsp+88H]
       mov      r8d, 128
       call     CORINFO_HELP_MEMCPY
       inc      edi
       cmp      edi, 100
       jl       SHORT G_M19447_IG03

But on Arm64 the block copy is unrolled inline:

G_M19447_IG03:
            ldr     x1, [fp,#152]
            str     x1, [fp,#24]
            ldp     q16, q17, [fp,#160]
            stp     q16, q17, [fp,#32]
            ldp     q16, q17, [fp,#192]
            stp     q16, q17, [fp,#64]
            ldp     q16, q17, [fp,#224]
            stp     q16, q17, [fp,#96]
            ldr     q16, [fp,#0xd1ffab1e]
            str     q16, [fp,#128]
            ldr     x1, [fp,#0xd1ffab1e]
            str     x1, [fp,#144]
            add     w0, w0, #1
            cmp     w0, #100
            blt     G_M19447_IG03

I will perform some experiments and update the results here.

kunalspathak self-assigned this on Mar 30, 2022
The dotnet-issue-labeler bot added the area-CodeGen-coreclr and untriaged labels on Mar 30, 2022
@ghost commented Mar 30, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

@kunalspathak (Member, Author) commented:

@dotnet/jit-contrib

@kunalspathak (Member, Author) commented:

@a74nh - FYI

@EgorBo (Member) commented Mar 30, 2022

From my understanding we do use the native memset on windows-arm64; it's just *nix-arm64 that doesn't, so it should be easy to fix.

@echesakov (Contributor) commented:

> If we switch to CRT memset we will see some benefits (at least) on linux because LLVM uses DC ZVA for that.

This is correct, although I would not expect to see much of a performance improvement (especially for smaller block sizes) just from using DC ZVA. Using DC ZVA requires manually zeroing up to 64 bytes at the beginning and the end of the block. It also requires checking at runtime whether DC ZVA is available on the target platform (e.g. as in https://github.com/llvm/llvm-project/blob/881350a92d821d4f8e4fa648443ed1d17e251188/libc/AOR_v20.02/string/aarch64/memset.S#L81).

It might be interesting to have two JIT_MemSet helpers: one that uses DC ZVA and another that doesn't. The JIT would do the check for DC ZVA availability at compile time and emit a call to the corresponding helper.
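
For reference, the runtime availability check boils down to reading DCZID_EL0. A minimal C++ sketch (hypothetical helper name, not the runtime's actual code):

#include <cstdint>

// DCZID_EL0.DZP (bit 4) set means DC ZVA is prohibited; DCZID_EL0.BS (bits 3:0) is
// log2 of the zeroing block size in 4-byte words (64 bytes on most cores).
static bool TryGetDcZvaBlockSize(uint64_t* blockSizeBytes)
{
#if defined(__aarch64__)
    uint64_t dczid;
    __asm__ __volatile__("mrs %0, dczid_el0" : "=r"(dczid));
    if (dczid & (1ULL << 4))                  // DZP == 1: DC ZVA not permitted
        return false;
    *blockSizeBytes = 4ULL << (dczid & 0xF);  // block size in bytes
    return true;
#else
    (void)blockSizeBytes;
    return false;
#endif
}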

JulieLeeMSFT removed the untriaged label on Apr 4, 2022
JulieLeeMSFT added this to the 7.0.0 milestone on Apr 4, 2022
@kunalspathak (Member, Author) commented:

A few years ago we already switched to the platform memset/memcpy on linux/arm64 in dotnet/coreclr#17536, and likewise for windows and linux on x64 in dotnet/coreclr#25750. The only one left out was windows/arm64, which was still using the hand-written assembly code. The platform implementation is highly optimized and picks appropriate routines based on the block size.

@kunalspathak (Member, Author) commented:

All the benchmarks access the array elements using ref, and currently we don't generate decent addressing modes for them. Tracked in #64819.

@kunalspathak (Member, Author) commented Apr 14, 2022

For some benchmarks involving AnyLocation init/copy blocks of 32+ bytes, we have logic to use vector registers (which would reduce the number of instructions), but we don't do that optimization if the destination address alignment is not known or the address is not 16-byte aligned.

// Store operations that cross a 16-byte boundary reduce bandwidth or incur additional latency.
// The following condition prevents using 16-byte stores when dstRegAddrAlignment is:
// 1) unknown (i.e. dstReg is neither FP nor SP) or
// 2) non-zero (i.e. dstRegAddr is not 16-byte aligned).
const bool hasAvailableSimdReg = isDstRegAddrAlignmentKnown && (size > FP_REGSIZE_BYTES);
const bool canUse16ByteWideInstrs =
    hasAvailableSimdReg && (dstRegAddrAlignment == 0) && helper.CanEncodeAllOffsets(FP_REGSIZE_BYTES);

The benchmark loop in question looks like:

for (int startOffset = 0; startOffset < Size; startOffset += 64)
{
    Unsafe.InitBlock(ref _dstData[startOffset], 255, 64);
}

Because of that, here is the diff (left = x64, right = arm64):

(image: benchmark diff)

Reference: (image)

@kunalspathak (Member, Author) commented:

I will probably experiment with loosening the heuristic: if the loads/stores happen inside a loop, ignore the alignment requirement and just use SIMD registers.

@kunalspathak (Member, Author) commented:

> I will probably experiment with loosening the heuristic: if the loads/stores happen inside a loop, ignore the alignment requirement and just use SIMD registers.

@EgorChesakov - did you experiment with this?

@kunalspathak (Member, Author) commented:

It seems none of the C++ compilers have any such restriction; they use SIMD registers wherever possible:
https://godbolt.org/z/eb53xPvYT

@TamarChristinaArm - any thoughts on whether we should also use SIMD registers and drop the alignment condition?

@echesakov (Contributor) commented:

> I will probably experiment with loosening the heuristic: if the loads/stores happen inside a loop, ignore the alignment requirement and just use SIMD registers.
>
> @EgorChesakov - did you experiment with this?

@kunalspathak Yes, this was how I implemented the algorithm in the first place. However, during testing I found that some benchmarks regressed due to using SIMD loads/stores with unaligned addresses. Hence, I extended the algorithm to check for src/dst 16-byte alignment. But if you loosen the heuristic, you will get the even better code-size improvement that I had originally.

@kunalspathak (Member, Author) commented:

> However, during testing I found that some benchmarks regressed due to using SIMD loads/stores with unaligned addresses.

Thanks for confirming, @EgorChesakov. In case you recall, was that from local testing or from the perf lab? If it was local testing, I am inclined to get this change in and see how the perf lab reacts.

@echesakov (Contributor) commented:

> Thanks for confirming, @EgorChesakov. In case you recall, was that from local testing or from the perf lab? If it was local testing, I am inclined to get this change in and see how the perf lab reacts.

I did a private run of the perf lab pipeline, and when the results came back I could reproduce the regressions locally on a Surface Pro X.

@TamarChristinaArm (Contributor) commented:

> It seems none of the C++ compilers have any such restriction; they use SIMD registers wherever possible: https://godbolt.org/z/eb53xPvYT

Yes, C/C++ compilers almost always use SIMD for copies unless strict alignment is required (-mstrict-align), in which case they punt to memcpy. The reason we can do this without a perf loss in general is the padding and alignment of structures in C.

So in your example link, if you look, the .bss section is always 16-byte aligned, and the structs are padded to a multiple of 8. See e.g. https://godbolt.org/z/cbn8d98ez and look at the .bss section, particularly the .align and .size directives. So fldTa and fldTb are at an aligned location.

For stack-based copies the structure is aligned on the stack (https://godbolt.org/z/8fv1eY7va), so again, unless the user has specified a custom alignment, it works out.

That leaves general heap copies, which normally have a size above what we would inline, so those get punted to memcpy.
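
To illustrate the point with a sketch in the spirit of the godbolt links above (not the exact example from those links): with default padding and alignment, a small fixed-size struct copy is typically inlined by GCC/Clang using SIMD loads/stores rather than a call to memcpy.

// A 64-byte struct with the default (natural) alignment and padding.
struct Block64
{
    long long f0, f1, f2, f3, f4, f5, f6, f7;   // 8 x 8 bytes = 64 bytes
};

// At -O2 on AArch64, compilers typically lower this to ldp/stp of q registers,
// since the size is known at compile time and unaligned SIMD accesses are allowed
// unless -mstrict-align is given.
void CopyBlock64(Block64* dst, const Block64* src)
{
    *dst = *src;
}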

> @TamarChristinaArm - any thoughts on whether we should also use SIMD registers and drop the alignment condition?

I think @EgorChesakov's earlier experiments have shown that for the dotnet case this doesn't seem to be a win in general. The behavior here is somewhat dependent on the core, how it handles pairs, and the total available memory bandwidth.

I have been meaning to ask what the actual alignment and padding for structs in .NET is.

@kunalspathak (Member, Author) commented:

Fixed via #67788 and #68085

ghost locked this issue as resolved and limited conversation to collaborators on May 23, 2022