Improve codegen for Vector128.Shift* operations where a direct intrinsic is not available #82564

MihaZupan · 2023-02-23T23:20:28Z

(applies to Vector256 as well)

Consider Vector128.ShiftRightLogical applied to a Vector128<byte>. X86 does not have a logical right shift instruction that operates on byte elements:

Vector128<byte> v0 = Vector128.LoadUnsafe(ref source);
Vector128<byte> v1 = Vector128.ShiftRightLogical(v0, 4);

This currently emits a scalar fallback:

TestClass.Foo(Byte ByRef)
    L0000: push rsi
    L0001: sub rsp, 0x40
    L0005: vzeroupper
    L0008: vmovdqu xmm0, [rcx]
    L000c: vmovapd [rsp+0x20], xmm0
    L0012: xor esi, esi
    L0014: lea rcx, [rsp+0x20]
    L0019: movsxd rdx, esi
    L001c: movzx ecx, byte ptr [rcx+rdx]
    L0020: mov edx, 4
    L0025: mov rax, 0x7ffa0845bc60
    L002f: call qword ptr [rax]
    L0031: lea rdx, [rsp+0x30]
    L0036: movsxd rcx, esi
    L0039: mov [rdx+rcx], al
    L003c: inc esi
    L003e: cmp esi, 0x10
    L0041: jl short L0014
    L0043: vmovapd xmm0, [rsp+0x30]
    L0049: vpmovmskb eax, xmm0
    L004d: add rsp, 0x40
    L0051: pop rsi
    L0052: ret

It could instead emit a 32-bit shift followed by an AND that clears the bits shifted in from neighboring bytes:

Vector128<byte> v0 = Vector128.LoadUnsafe(ref source);
Vector128<byte> v1 = Vector128.ShiftRightLogical(v0.AsInt32(), 4).AsByte() & Vector128.Create((byte)0xF);

TestClass.Bar(Byte ByRef)
    L0000: vzeroupper
    L0003: vmovdqu xmm0, [rcx]
    L0007: vpsrld xmm0, xmm0, 4
    L000c: vpand xmm0, xmm0, [0x7ffa087600d0]
    L0014: vpmovmskb eax, xmm0
    L0018: ret
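
The rewrite above generalizes to other constant shift counts. The following is a minimal sketch, not the JIT's implementation: `ShiftRightLogicalViaInt32` is a hypothetical helper name, and the `0xFF >> shiftCount` mask is an assumption derived from the 4-bit case shown above (it yields 0x0F for a shift of 4). It assumes shift counts in the range 0..7.

```csharp
using System;
using System.Runtime.Intrinsics;

static class ByteShiftSketch
{
    // Hypothetical helper, not a runtime API: emulates a per-byte logical
    // right shift with a 32-bit shift plus a mask.
    public static Vector128<byte> ShiftRightLogicalViaInt32(Vector128<byte> value, int shiftCount)
    {
        if ((uint)shiftCount > 7)
            throw new ArgumentOutOfRangeException(nameof(shiftCount));

        // The 32-bit shift moves the low bits of each byte's neighbor into
        // the top `shiftCount` bits of that byte. The mask keeps only the
        // bits that belong to the byte itself.
        byte mask = (byte)(0xFF >> shiftCount);

        return Vector128.ShiftRightLogical(value.AsInt32(), shiftCount).AsByte()
             & Vector128.Create(mask);
    }
}
```

For any shift count in that range, the result should match `Vector128.ShiftRightLogical(value, shiftCount)` on the byte vector. The difference is only in the code the JIT emits for it.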

We have a few places in runtime that are aware of this issue and employ workarounds, e.g.:

runtime/src/libraries/System.Private.CoreLib/src/System/IndexOfAnyValues/IndexOfAnyAsciiSearcher.cs (line 875 in c1abf87):

    : Sse2.ShiftRightLogical(source.AsInt32(), 4).AsByte() & Vector128.Create((byte)0xF);

runtime/src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs (line 594 in dc6ad37):

    Vector128<byte> hiNibbles = Vector128.ShiftRightLogical(str.AsInt32(), 4).AsByte() & mask2F;

https://github.com/dotnet/runtime/blob/8482f562a8b5d96bb0a0fb201bfabea7e5e6b115/src/libraries/System.Private.CoreLib/src/System/IndexOfAnyValues/ProbabilisticMap.cs#L168-L170

ghost · 2023-02-23T23:20:34Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

gfoidl · 2023-02-24T10:54:35Z

For runtime/src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs (line 594 in dc6ad37):

    Vector128<byte> hiNibbles = Vector128.ShiftRightLogical(str.AsInt32(), 4).AsByte() & mask2F;

just noting that the proposed codegen needs an extra register for Vector128.Create((byte)0xF). The existing code intentionally avoids this by reusing the already present 0x2F mask, which has effectively the same masking effect.

I don't think any compiler is smart enough these days to have that knowledge / information in order to do such optimizations.
On the other hand I don't know how much perf would regress by using the simpler C#-code that this issue proposes.

JulieLeeMSFT · 2023-03-06T18:50:02Z

Assigning to @tannergooding to respond to the request.

JulieLeeMSFT · 2023-03-14T11:39:13Z

We will not have time to implement this code optimization in .NET 8.
Cc @kunalspathak @TIHan.

MihaZupan added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) · Feb 23, 2023

ghost added the untriaged label (New issue has not been triaged by the area owner) · Feb 23, 2023

MihaZupan mentioned this issue in "Vectorize ProbabilisticMap.IndexOfAny" #80963 (merged) · Feb 23, 2023

JulieLeeMSFT assigned tannergooding · Mar 6, 2023

JulieLeeMSFT added this to the Future milestone · Mar 14, 2023

ghost removed the untriaged label · Mar 14, 2023

MihaZupan mentioned this issue in "JIT: Optimize const ShiftRightLogical for byte values on XArch" #86841 (merged) · May 27, 2023
