Enable AVX-512 for block unrollings (both copying and zeroing) #85389

Merged
merged 6 commits into from
Apr 27, 2023

@EgorBo EgorBo commented Apr 26, 2023

Closes #83798

This PR enables AVX-512 for the various block unrollings that go through GT_BLK: stackalloc zeroing, struct copy/initialization, Unsafe.InitBlock/Unsafe.CopyBlock calls, etc.

Examples:

struct MyStruct {
    long a,b,c,d,e,f,g,h;
}

// Copying
MyStruct StructCopy(MyStruct s)
{
    return s;
}

// Zeroing
void StackallocZeroing()
{
    byte* ptr = stackalloc byte[300];
    Consume(ptr);
}
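
The Unsafe.InitBlock/Unsafe.CopyBlock family mentioned above can be exercised the same way. The following is only a sketch: the class and method names are hypothetical, and whether the JIT actually unrolls these calls with ZMM registers depends on the byte count being a constant it can see and on the hardware it runs on.

```csharp
using System;
using System.Runtime.CompilerServices;

public static class BlockOpsSketch
{
    // Hypothetical 64-byte payload, mirroring MyStruct from the examples above.
    public struct MyStruct
    {
        public long a, b, c, d, e, f, g, h;
    }

    // Zeroes the struct in place; the byte count is a JIT-time constant (64).
    public static void Zero(ref MyStruct s)
    {
        Unsafe.InitBlockUnaligned(
            ref Unsafe.As<MyStruct, byte>(ref s),
            0,
            (uint)Unsafe.SizeOf<MyStruct>());
    }

    // Copies src into dst as a raw 64-byte block.
    public static void Copy(ref MyStruct dst, ref MyStruct src)
    {
        Unsafe.CopyBlockUnaligned(
            ref Unsafe.As<MyStruct, byte>(ref dst),
            ref Unsafe.As<MyStruct, byte>(ref src),
            (uint)Unsafe.SizeOf<MyStruct>());
    }
}
```

Only the zero-fill and copy forms are shown here, since those are the cases this PR covers.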

Codegen diff: https://www.diffchecker.com/cxc6UYLf/ (this PR is on the right)

As a result, the range of block sizes that get unrolled grows: sizes that previously fell back to memcpy/memset calls are now handled inline.

Benchmark:

[Benchmark]
public void Test()
{
    var ptr = stackalloc long[42]; 
    Consume(ptr);
}

[MethodImpl(MethodImplOptions.NoInlining)]
static void Consume(void* ptr) { }

| Method | Job | Toolchain | Mean |
|---|---|---|---|
| Test | Job-FEQJHS | \Core_Root\corerun.exe | 4.229 ns |
| Test | Job-CMMDYR | \Core_Root_PR\corerun.exe | 2.262 ns |

Ryzen 7950x, avx512, win-x64
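
To reproduce numbers like these, it may help to first check what the runtime reports for the machine. A minimal sketch, assuming .NET 8 or later (where Vector512 exists); the class name is hypothetical, and these properties are only a hint, since the JIT makes its own decision about which vector width to use for unrolling.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class Avx512Probe
{
    // Reports what the runtime exposes; this is not a guarantee
    // that any particular block operation is unrolled with ZMM registers.
    public static string Describe()
    {
        return $"Avx512F.IsSupported={Avx512F.IsSupported}, " +
               $"Vector512.IsHardwareAccelerated={Vector512.IsHardwareAccelerated}";
    }
}
```

Calling Console.WriteLine(Avx512Probe.Describe()) before running the benchmark shows whether the AVX-512 path is even a candidate on that machine.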

dotnet-issue-labeler bot added the area-CodeGen-coreclr label Apr 26, 2023
ghost assigned EgorBo Apr 26, 2023
ghost commented Apr 26, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

old codegen:

; Method Tests:StructCopy(Tests+MyStruct):Tests+MyStruct:this
       vzeroupper 
       vmovdqu  ymm0, ymmword ptr [r8]
       vmovdqu  ymmword ptr [rdx], ymm0
       vmovdqu  ymm0, ymmword ptr [r8+20H]
       vmovdqu  ymmword ptr [rdx+20H], ymm0
       mov      rax, rdx
       vzeroupper 
       ret      
; Total bytes of code: 30

; Method Tests:StackallocZeroing():this
       push     rbp
       sub      rsp, 48
       lea      rbp, [rsp+20H]
       mov      rax, 0xD1FFAB1E
       mov      qword ptr [rbp], rax
       test     dword ptr [rsp], esp
       sub      rsp, 304
       lea      rcx, [rsp+20H]
       mov      qword ptr [rbp+08H], rcx
       xor      edx, edx
       mov      r8d, 304
       call     CORINFO_HELP_MEMSET
       mov      rcx, qword ptr [rbp+08H]
       call     [Tests:Consume(ulong)]
       mov      rcx, 0xD1FFAB1E
       cmp      qword ptr [rbp], rcx
       je       SHORT G_M16409_IG03
       call     CORINFO_HELP_FAIL_FAST
       nop      
       lea      rsp, [rbp+10H]
       pop      rbp
       ret      
; Total bytes of code: 94

new codegen:

; Method Tests:StructCopy(Tests+MyStruct):Tests+MyStruct:this
       vzeroupper 
       vmovdqu32 zmm0, zmmword ptr [r8]
       vmovdqu32 zmmword ptr [rdx], zmm0
       mov      rax, rdx
       vzeroupper 
       ret      
; Total bytes of code: 22


; Method Tests:StackallocZeroing():this
       push     rbp
       sub      rsp, 48
       vzeroupper 
       lea      rbp, [rsp+20H]
       mov      rax, 0xD1FFAB1E
       mov      qword ptr [rbp+08H], rax
       test     dword ptr [rsp], esp
       sub      rsp, 304
       lea      rcx, [rsp+20H]
       vxorps   zmm0, zmm0
       vmovdqu32 zmmword ptr [rcx], zmm0
       vmovdqu32 zmmword ptr [rcx+40H], zmm0
       vmovdqu32 zmmword ptr [rcx+80H], zmm0
       vmovdqu32 zmmword ptr [rcx+C0H], zmm0
       vmovdqu32 zmmword ptr [rcx+F0H], zmm0
       call     [Tests:Consume(ulong)]
       mov      rcx, 0xD1FFAB1E
       cmp      qword ptr [rbp+08H], rcx
       je       SHORT G_M16409_IG03
       call     CORINFO_HELP_FAIL_FAST
       nop      
       lea      rsp, [rbp+10H]
       pop      rbp
       ret      
; Total bytes of code: 119
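
The store offsets in the new StackallocZeroing codegen (00H, 40H, 80H, C0H, F0H for 304 bytes) follow a simple pattern: full 64-byte stores from the start, then one final store that overlaps the previous one so that it ends exactly at the end of the block. The helper below reproduces that arithmetic. It is an illustration with hypothetical names, not the JIT's actual implementation.

```csharp
using System;
using System.Collections.Generic;

public static class UnrollOffsetsSketch
{
    // Returns the byte offsets at which vector-sized stores would be placed
    // to cover `size` bytes, using an overlapping final store for the tail.
    public static List<int> StoreOffsets(int size, int vectorSize)
    {
        if (vectorSize <= 0 || size < vectorSize)
            throw new ArgumentException("size must be at least one vector wide");

        var offsets = new List<int>();
        int offset = 0;

        // Full, non-overlapping stores.
        for (; offset + vectorSize <= size; offset += vectorSize)
            offsets.Add(offset);

        // Tail: back up so the last store ends exactly at `size`.
        if (offset < size)
            offsets.Add(size - vectorSize);

        return offsets;
    }
}
```

StoreOffsets(304, 64) yields 0, 64, 128, 192, 240, which is 00H, 40H, 80H, C0H, F0H in the listing above; StoreOffsets(64, 64) yields the single offset 0, matching the one ZMM load/store pair in StructCopy.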
Author: EgorBo
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

EgorBo added the avx512 label Apr 26, 2023
EgorBo commented Apr 26, 2023

/azp list

EgorBo commented Apr 26, 2023

/azp run runtime-coreclr outerloop, runtime-coreclr jitstress-isas-x86

@azure-pipelines
Azure Pipelines successfully started running 2 pipeline(s).

EgorBo commented Apr 26, 2023

Diffs
Size regressions are expected, since a call to memset/memcpy takes fewer bytes than the unrolled sequence.

EgorBo commented Apr 26, 2023

@tannergooding @BruceForstall @dotnet/avx512-contrib PTAL. I didn't enable it for non-zeroing init (e.g. Unsafe.InitBlockUnaligned(ref a, value: 42, count: 32)) because I plan to work on that separately; even the XMM/YMM path might be improved there.
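
For reference, a "non-zeroing init" is a block fill with a constant other than zero. A minimal sketch of that case, with a hypothetical class name and buffer; per the comment above, this form is not switched to AVX-512 by this PR.

```csharp
using System;
using System.Runtime.CompilerServices;

public static class NonZeroInitSketch
{
    // Fills a 32-byte buffer with the value 42, the same shape as
    // Unsafe.InitBlockUnaligned(ref a, value: 42, count: 32) in the comment above.
    public static byte[] Fill42()
    {
        var buffer = new byte[32];
        Unsafe.InitBlockUnaligned(ref buffer[0], 42, 32);
        return buffer;
    }
}
```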

EgorBo commented Apr 26, 2023

Failures are mostly #85403

EgorBo commented Apr 27, 2023

benchmark:

[Benchmark]
public void Test()
{
    var ptr = stackalloc long[42]; 
    Consume(ptr);
}

[MethodImpl(MethodImplOptions.NoInlining)]
static void Consume(void* ptr) { }

| Method | Job | Toolchain | Mean |
|---|---|---|---|
| Test | Job-FEQJHS | \Core_Root\corerun.exe | 4.229 ns |
| Test | Job-CMMDYR | \Core_Root_PR\corerun.exe | 2.262 ns |

@EgorBo EgorBo merged commit 953d290 into dotnet:main Apr 27, 2023
@EgorBo EgorBo deleted the blk-avx-512 branch April 27, 2023 23:48
@ghost ghost locked as resolved and limited conversation to collaborators May 28, 2023
Development

Successfully merging this pull request may close these issues.

Optimize block unrolling operations using AVX-512