Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove redundant sign/zero extension for SIMD broadcasts #108824

Merged
merged 3 commits into from
Nov 25, 2024

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented Oct 13, 2024

Closes #108820

static Vector512<sbyte> Create1(ulong value) =>
    Vector512.Create((sbyte)value);
; Method Bencha:Create1(ulong)
-      movsx    rdx, dl
       vpbroadcastb zmm0, edx
       vmovups  zmmword ptr [rcx], zmm0
       mov      rax, rcx
       vzeroupper 
       ret      
-; Total bytes of code: 23
+; Total bytes of code: 19

Diffs

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 13, 2024
@EgorBo
Copy link
Member Author

EgorBo commented Oct 14, 2024

@tannergooding @kunalspathak @dotnet/jit-contrib PTAL

@xtqqczze
Copy link
Contributor

xtqqczze commented Oct 15, 2024

Diffs show additional improvement with memory source:

-       mov      eax, dword ptr [rbp-0x4C]
-       movzx    rax, ax
-       vpbroadcastw ymm0, eax
+       vpbroadcastw ymm0, word  ptr [rbp-0x4C]

@xtqqczze
Copy link
Contributor

xtqqczze commented Oct 15, 2024

@EgorBo

Missed redundant sign-extension with memory source:

System.PackedSpanHelpers:IndexOf[System.SpanHelpers+DontNegate`1[short],System.PackedSpanHelpers+NopTransform](byref,short,int):int (Instrumented Tier0)

        mov      eax, dword ptr [rbp-0x34]
        cwde     
        vpbroadcastb xmm0, eax

@tannergooding
Copy link
Member

I'm confident that changes like this are better:

Diffs show additional improvement with memory source:

-       mov      eax, dword ptr [rbp-0x4C]
-       movzx    rax, ax
-       vpbroadcastw ymm0, eax
+       vpbroadcastw ymm0, word  ptr [rbp-0x4C]

I'm not so confident on changes like:

-      movsx    rdx, dl
       vpbroadcastb zmm0, edx

movzx and movsx can avoid stalls otherwise caused by a partial write/merge and so while removing them should be better in the case that a previous instruction initialized either the full 32-bit or 64-bit state, it will likely be worse in cases where something instead only wrote the lower 8/16-bits of state prior to the broadcast.

It might be that the typical JIT patterns are already avoiding such 8/16-bit register writes in most cases, but it would be good to double check this.

Comment on lines +10364 to +10365
if (!varTypeIsFloating(cast->CastToType()) &&
!varTypeIsFloating(cast->CastFromType()) &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this rather check for varTypeIsInteger on both?

We have several types (integer, floating, simd, and mask) so a negative test may accidentally include the wrong things

@EgorBo
Copy link
Member Author

EgorBo commented Oct 17, 2024

movzx and movsx can avoid stalls otherwise caused by a partial write/merge and so while removing them should be better in the case that a previous instruction initialized either the full 32-bit or 64-bit state, it will likely be worse in cases where something instead only wrote the lower 8/16-bits of state prior to the broadcast.

It might be that the typical JIT patterns are already avoiding such 8/16-bit register writes in most cases, but it would be good to double check this.

I can probably bench it (my bot is currently down), but looks like native compilers don't emit any movs e.g. https://godbolt.org/z/hKrGnz7nj

@tannergooding
Copy link
Member

I can probably bench it (my bot is currently down), but looks like native compilers don't emit any movs e.g. https://godbolt.org/z/hKrGnz7nj

Notably this is coming from memory, where its safe to elide.

The consideration would be when it's coming from register and the value was because of a partial register write (i.e. it wrote just al or just ax). I think this is fine to take overall (hence the sign-off), just something we should watch out for in the perf regressions and potentially consider if tweak is needed as a follow up.

@xtqqczze
Copy link
Contributor

@MihuBot

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unnecessary sign extension in Vector512.Create(sbyte) and Vector512.Create(short)
3 participants