Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve codegen for Vector128.Shift* operations where a direct intrinsic is not available #82564

Open
MihaZupan opened this issue Feb 23, 2023 · 4 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@MihaZupan
Copy link
Member

MihaZupan commented Feb 23, 2023

(applies to Vector256 as well)

Consider Vector128.ShiftRightLogical(ref byte) where X86 does not have a ShiftRightLogical instruction that operates on bytes:

Vector128<byte> v0 = Vector128.LoadUnsafe(ref source);
Vector128<byte> v1 = Vector128.ShiftRightLogical(v0, 4);

Which currently emits a scalar fallback

TestClass.Foo(Byte ByRef)
    L0000: push rsi
    L0001: sub rsp, 0x40
    L0005: vzeroupper
    L0008: vmovdqu xmm0, [rcx]
    L000c: vmovapd [rsp+0x20], xmm0
    L0012: xor esi, esi
    L0014: lea rcx, [rsp+0x20]
    L0019: movsxd rdx, esi
    L001c: movzx ecx, byte ptr [rcx+rdx]
    L0020: mov edx, 4
    L0025: mov rax, 0x7ffa0845bc60
    L002f: call qword ptr [rax]
    L0031: lea rdx, [rsp+0x30]
    L0036: movsxd rcx, esi
    L0039: mov [rdx+rcx], al
    L003c: inc esi
    L003e: cmp esi, 0x10
    L0041: jl short L0014
    L0043: vmovapd xmm0, [rsp+0x30]
    L0049: vpmovmskb eax, xmm0
    L004d: add rsp, 0x40
    L0051: pop rsi
    L0052: ret

where it could instead emit a 32-bit shift and an AND to clear the overlapping bits

Vector128<byte> v0 = Vector128.LoadUnsafe(ref source);
Vector128<byte> v1 = Vector128.ShiftRightLogical(v0.AsInt32(), 4).AsByte() & Vector128.Create((byte)0xF);
TestClass.Bar(Byte ByRef)
    L0000: vzeroupper
    L0003: vmovdqu xmm0, [rcx]
    L0007: vpsrld xmm0, xmm0, 4
    L000c: vpand xmm0, xmm0, [0x7ffa087600d0]
    L0014: vpmovmskb eax, xmm0
    L0018: ret

We have a few places in runtime that are aware of this issue and employ workarounds, e.g.:

@MihaZupan MihaZupan added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 23, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Feb 23, 2023
@ghost
Copy link

ghost commented Feb 23, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

Consider Vector128.ShiftRightLogical(ref byte) where X86 does not have a ShiftRightLogical instruction that operates on bytes:

Vector128<byte> v0 = Vector128.LoadUnsafe(ref source);
Vector128<byte> v1 = Vector128.ShiftRightLogical(v0, 4);

Which currently emits a scalar fallback

TestClass.Foo(Byte ByRef)
    L0000: push rsi
    L0001: sub rsp, 0x40
    L0005: vzeroupper
    L0008: vmovdqu xmm0, [rcx]
    L000c: vmovapd [rsp+0x20], xmm0
    L0012: xor esi, esi
    L0014: lea rcx, [rsp+0x20]
    L0019: movsxd rdx, esi
    L001c: movzx ecx, byte ptr [rcx+rdx]
    L0020: mov edx, 4
    L0025: mov rax, 0x7ffa0845bc60
    L002f: call qword ptr [rax]
    L0031: lea rdx, [rsp+0x30]
    L0036: movsxd rcx, esi
    L0039: mov [rdx+rcx], al
    L003c: inc esi
    L003e: cmp esi, 0x10
    L0041: jl short L0014
    L0043: vmovapd xmm0, [rsp+0x30]
    L0049: vpmovmskb eax, xmm0
    L004d: add rsp, 0x40
    L0051: pop rsi
    L0052: ret

where it could instead emit a 32-bit shift and an AND to clear the overlapping bits

Vector128<byte> v0 = Vector128.LoadUnsafe(ref source);
Vector128<byte> v1 = Vector128.ShiftRightLogical(v0.AsInt32(), 4).AsByte() & Vector128.Create((byte)0xF);
TestClass.Bar(Byte ByRef)
    L0000: vzeroupper
    L0003: vmovdqu xmm0, [rcx]
    L0007: vpsrld xmm0, xmm0, 4
    L000c: vpand xmm0, xmm0, [0x7ffa087600d0]
    L0014: vpmovmskb eax, xmm0
    L0018: ret

We have a few places in runtime that are aware of this issue and employ workarounds, e.g.:

Author: MihaZupan
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@gfoidl
Copy link
Member

gfoidl commented Feb 24, 2023

For

Vector128<byte> hiNibbles = Vector128.ShiftRightLogical(str.AsInt32(), 4).AsByte() & mask2F;
just noting that with the proposed codegen a extra register for Vector128.Create((byte)0xF) is needed which was avoided by intention by re-using the already present 0x2F which has effectively the same masking effect.

I don't think any compiler is smart enough these days to have that knowledge / information in order to do such optimizations.
On the other hand I don't know how much perf would regress by using the simpler C#-code that this issue proposes.

@JulieLeeMSFT
Copy link
Member

Assigning to @tannergooding to respond to the request.

@JulieLeeMSFT JulieLeeMSFT added this to the Future milestone Mar 14, 2023
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Mar 14, 2023
@JulieLeeMSFT
Copy link
Member

JulieLeeMSFT commented Mar 14, 2023

We will not have time to implement this code optimization in .NET8.
Cc @kunalspathak @TIHan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

No branches or pull requests

4 participants