
Add vector support to System.Numerics.Tensors.TensorPrimitives.LeadingZeroCount for Byte and Int16 #110333

Merged (11 commits) on Dec 19, 2024

Conversation

alexcovington (Contributor)

This PR adds vector support for the Byte and Int16 integer types to System.Numerics.Tensors.TensorPrimitives.LeadingZeroCount.

To verify the performance improvement, I ran against the existing microbenchmarks here. They do not currently include coverage for Int16, so I built a version locally that also covers short.

On my AMD64 system, I see the following improvements:

Baseline

| Type                                      | Method           | BufferLength | Mean      | Error    | StdDev   | Median    | Min       | Max       | Allocated |
|------------------------------------------ |----------------- |------------- |----------:|---------:|---------:|----------:|----------:|----------:|----------:|
| Perf_BinaryIntegerTensorPrimitives<Byte>  | LeadingZeroCount | 128          |  34.38 ns | 0.163 ns | 0.136 ns |  34.40 ns |  34.14 ns |  34.62 ns |         - |
| Perf_BinaryIntegerTensorPrimitives<Int16> | LeadingZeroCount | 128          |  52.31 ns | 1.148 ns | 1.322 ns |  51.66 ns |  50.78 ns |  54.56 ns |         - |
| Perf_BinaryIntegerTensorPrimitives<Byte>  | LeadingZeroCount | 3079         | 702.93 ns | 2.372 ns | 2.218 ns | 703.63 ns | 699.32 ns | 706.21 ns |         - |
| Perf_BinaryIntegerTensorPrimitives<Int16> | LeadingZeroCount | 3079         | 793.33 ns | 4.117 ns | 3.650 ns | 793.13 ns | 788.37 ns | 800.02 ns |         - |

Diff

| Type                                      | Method           | BufferLength | Mean      | Error     | StdDev    | Median    | Min       | Max       | Allocated |
|------------------------------------------ |----------------- |------------- |----------:|----------:|----------:|----------:|----------:|----------:|----------:|
| Perf_BinaryIntegerTensorPrimitives<Byte>  | LeadingZeroCount | 128          |  6.407 ns | 0.0571 ns | 0.0534 ns |  6.394 ns |  6.344 ns |  6.509 ns |         - |
| Perf_BinaryIntegerTensorPrimitives<Int16> | LeadingZeroCount | 128          | 10.175 ns | 0.0432 ns | 0.0361 ns | 10.184 ns | 10.107 ns | 10.224 ns |         - |
| Perf_BinaryIntegerTensorPrimitives<Byte>  | LeadingZeroCount | 3079         | 24.931 ns | 0.1143 ns | 0.1069 ns | 24.936 ns | 24.783 ns | 25.110 ns |         - |
| Perf_BinaryIntegerTensorPrimitives<Int16> | LeadingZeroCount | 3079         | 67.412 ns | 0.5054 ns | 0.4221 ns | 67.317 ns | 66.923 ns | 68.465 ns |         - |
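
For reference, the API these benchmarks exercise is the span-based TensorPrimitives.LeadingZeroCount. Below is a minimal usage sketch, assuming the System.Numerics.Tensors package that ships this method; the values in the final comment are the expected element-wise results:

using System;
using System.Numerics.Tensors;

class LeadingZeroCountExample
{
    static void Main()
    {
        ReadOnlySpan<short> source = new short[] { 0, 1, 255, short.MaxValue, -1 };
        Span<short> destination = new short[source.Length];

        // Element-wise leading-zero count; this is the code path the PR vectorizes for short.
        TensorPrimitives.LeadingZeroCount(source, destination);

        // destination now holds { 16, 15, 8, 1, 0 }.
    }
}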

Alex Covington (Advanced Micro Devices) added 3 commits December 2, 2024 13:41
@dotnet-policy-service bot added the community-contribution label (Indicates that the PR has been added by a community member) on Dec 2, 2024
Contributor

Tagging subscribers to this area: @dotnet/area-system-numerics-tensors
See info in area-owners.md if you want to be subscribed.

@alexcovington marked this pull request as ready for review on December 3, 2024 00:09
@stephentoub (Member) left a comment


Thanks! Seems reasonable to me, but @tannergooding should have a look, too.

…ors/netcore/TensorPrimitives.LeadingZeroCount.cs

Co-authored-by: Tanner Gooding <[email protected]>
Comment on lines +60 to +66
Vector128<uint> lowHalf = Vector128.Create((uint)0x0000FFFF);
Vector128<uint> x_bot16 = Sse2.Or(Sse2.ShiftLeftLogical(x.AsUInt32(), 16), lowHalf);
Vector128<uint> x_top16 = Sse2.Or(x.AsUInt32(), lowHalf);
Vector128<uint> lz_bot16 = Avx512CD.VL.LeadingZeroCount(x_bot16);
Vector128<uint> lz_top16 = Avx512CD.VL.LeadingZeroCount(x_top16);
Vector128<uint> lz_top16_shift = Sse2.ShiftLeftLogical(lz_top16, 16);
return Sse2.Or(lz_bot16, lz_top16_shift).AsUInt16().As<ushort, T>();
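
For context on the quoted snippet: there is no 16-bit vector leading-zero-count instruction, so each ushort lane is counted with the 32-bit vplzcntd by parking the 16-bit value in the upper half of a 32-bit lane and filling the lower half with ones. A minimal scalar sketch of the same identity (the helper name is hypothetical):

using System.Numerics;

static class LzcntSketch
{
    // The all-ones low half caps the 32-bit count at 16 (for value == 0) and otherwise
    // leaves exactly the leading zeros of the 16-bit value in the upper half.
    public static int LeadingZeroCount16(ushort value) =>
        BitOperations.LeadingZeroCount(((uint)value << 16) | 0x0000FFFFu);
}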
@tannergooding (Member) commented Dec 4, 2024


Is this cheaper than:

Vector256<int> x32 = Avx2.ConvertToVector256Int32(x.AsUInt16());
Vector256<int> lz = Avx512CD.VL.LeadingZeroCount(x32);
return Avx512F.VL.ConvertToVector128UInt16(lz) - Vector128.Create(16);
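
For context on this alternative: zero-extending a 16-bit value to 32 bits adds exactly 16 leading zeros, so subtracting 16 from the 32-bit count recovers the 16-bit count for every input, including zero (lzcnt of 0 is 32, and 32 - 16 = 16). A scalar sketch of that identity, with a hypothetical helper name:

using System.Numerics;

static class WidenSketch
{
    public static int LeadingZeroCount16ViaWiden(ushort value) =>
        BitOperations.LeadingZeroCount((uint)value) - 16;
}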

alexcovington (Contributor, Author)


Widening-unwidening has slightly worse performance here:

| Type                                      | Method           | Job        | Toolchain                                       | BufferLength | Mean      | Error     | StdDev    | Median    | Min       | Max       | Ratio | RatioSD | Allocated | Alloc Ratio |
|------------------------------------------ |----------------- |----------- |------------------------------------------------ |------------- |----------:|----------:|----------:|----------:|----------:|----------:|------:|--------:|----------:|------------:|
| Perf_BinaryIntegerTensorPrimitives<Int16> | LeadingZeroCount | Job-WWKLJQ | Current PR                                      | 8            |  2.499 ns | 0.0319 ns | 0.0299 ns |  2.505 ns |  2.401 ns |  2.528 ns |  1.00 |    0.02 |         - |          NA |
| Perf_BinaryIntegerTensorPrimitives<Int16> | LeadingZeroCount | Job-VFJFZO | Widen                                           | 8            |  2.633 ns | 0.0478 ns | 0.0447 ns |  2.650 ns |  2.570 ns |  2.676 ns |  1.05 |    0.02 |         - |          NA |

Member


This comment (and the related ones below) is still pending a response.

The current pattern used looks significantly more expensive (in size, instruction count, and micro-ops) than the more naive method of widen + lzcnt + narrow + subtract

alexcovington (Contributor, Author)


> The current pattern used looks significantly more expensive (in size, instruction count, and micro-ops) than the more naive method of widen + lzcnt + narrow + subtract

I agree that it does look more expensive because of the increased instruction count; however, the benchmark shows that the current pattern is more performant because it avoids the widening and narrowing overhead.

Another example, this one using BufferLength=3079 as in the original benchmark:

| Type                                      | Method           | Job        | Toolchain                                                                                      | BufferLength | Mean          | Error       | StdDev      | Median        | Min           | Max           | Ratio | RatioSD | Allocated | Alloc Ratio |
|------------------------------------------ |----------------- |----------- |----------------------------------------------------------------------------------------------- |------------- |--------------:|------------:|------------:|--------------:|--------------:|--------------:|------:|--------:|----------:|------------:|
| Perf_BinaryIntegerTensorPrimitives<Int16> | LeadingZeroCount | Job-QIBRWQ | Current PR                                                                                     | 3079         |     66.700 ns |   0.3187 ns |   0.2981 ns |     66.761 ns |     66.319 ns |     67.168 ns |  1.00 |    0.01 |         - |          NA |
| Perf_BinaryIntegerTensorPrimitives<Int16> | LeadingZeroCount | Job-TEWLEG | Widen + LZCNT + Subtract + Narrow                                                              | 3079         |    133.049 ns |   2.4410 ns |   2.1639 ns |    132.240 ns |    131.039 ns |    138.318 ns |  1.99 |    0.03 |         - |          NA |

And looking at the codegen for this case, the current PR actually generates a smaller function than the widen+narrow suggestion.

This function is where the hot loop of the benchmark is. The current PR generates a function with 1915 bytes of code; Widen+Narrow generates a function with 2356 bytes of code:

Codegen - Current PR
; Assembly listing for method System.Numerics.Tensors.TensorPrimitives:<InvokeSpanIntoSpan>g__Vectorized512|105_3[short,short,System.Numerics.Tensors.TensorPrimitives+LeadingZeroCountOperator`1[short]](byref,byref,ulong) (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; FullOpts code
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 24 single block inlinees; 25 inlinees without PGO data
G_M000_IG01:                ;; offset=0x0000
       push     rbx
       sub      rsp, 16
       xor      eax, eax
       mov      qword ptr [rsp+0x08], rax
       mov      qword ptr [rsp], rax

G_M000_IG02:                ;; offset=0x0010
       mov      rax, rdx
       vmovups  zmm0, zmmword ptr [rcx]
       vmovups  zmm1, zmmword ptr [reloc @RWD00]
       vmovaps  zmm2, zmm1
       vpslld   zmm3, zmm0, 16
       vpord    zmm3, zmm3, zmm2
       vplzcntd zmm3, zmm3
       vpord    zmm0, zmm0, zmm2
       vplzcntd zmm0, zmm0
       vpslld   zmm0, zmm0, 16
       vpord    zmm0, zmm0, zmm3
       vmovups  zmm2, zmmword ptr [rcx+2*r8-0x40]
       vmovaps  zmm3, zmm1
       vpslld   zmm4, zmm2, 16
       vpord    zmm4, zmm4, zmm3
       vplzcntd zmm4, zmm4
       vpord    zmm2, zmm2, zmm3
       vplzcntd zmm2, zmm2
       vpslld   zmm2, zmm2, 16
       vpord    zmm2, zmm2, zmm4
       cmp      r8, 256
       jbe      G_M000_IG13

G_M000_IG03:                ;; offset=0x009C
       mov      bword ptr [rsp+0x08], rcx
       mov      bword ptr [rsp], rax
       mov      rdx, rcx
       mov      r10, rax
       mov      r9, r10
       test     r9b, 1
       sete     r11b
       movzx    r11, r11b
       test     r11d, r11d
       je       SHORT G_M000_IG04
       mov      rdx, r10
       and      rdx, 63
       neg      rdx
       add      rdx, 64
       shr      rdx, 1
       lea      r9, [rdx+rdx]
       add      rcx, r9
       add      r9, r10
       sub      r8, rdx
       mov      rdx, rcx

G_M000_IG04:                ;; offset=0x00E0
       cmp      r8, 0x20000
       seta     cl
       movzx    rcx, cl
       test     ecx, r11d
       je       G_M000_IG07
       jmp      G_M000_IG10

G_M000_IG05:                ;; offset=0x00FB
       vmovups  zmm3, zmmword ptr [rdx]
       vmovaps  zmm4, zmm1
       vpslld   zmm5, zmm3, 16
       vpord    zmm5, zmm5, zmm4
       vplzcntd zmm5, zmm5
       vpord    zmm3, zmm3, zmm4
       vplzcntd zmm3, zmm3
       vpslld   zmm3, zmm3, 16
       vpord    zmm3, zmm3, zmm5
       vmovups  zmm4, zmmword ptr [rdx+0x40]
       vmovaps  zmm5, zmm1
       vpslld   zmm16, zmm4, 16
       vpord    zmm16, zmm16, zmm5
       vplzcntd zmm16, zmm16
       vpord    zmm4, zmm4, zmm5
       vplzcntd zmm4, zmm4
       vpslld   zmm4, zmm4, 16
       vpord    zmm4, zmm4, zmm16
       vmovups  zmm5, zmmword ptr [rdx+0x80]
       vmovaps  zmm16, zmm1
       vpslld   zmm17, zmm5, 16
       vpord    zmm17, zmm17, zmm16
       vplzcntd zmm17, zmm17
       vpord    zmm5, zmm5, zmm16
       vplzcntd zmm5, zmm5
       vpslld   zmm5, zmm5, 16
       vpord    zmm5, zmm5, zmm17
       vmovups  zmm16, zmmword ptr [rdx+0xC0]
       vmovaps  zmm17, zmm1
       vpslld   zmm18, zmm16, 16
       vpord    zmm18, zmm18, zmm17
       vplzcntd zmm18, zmm18
       vpord    zmm16, zmm16, zmm17
       vplzcntd zmm16, zmm16
       vpslld   zmm16, zmm16, 16
       vpord    zmm16, zmm16, zmm18
       vmovups  zmmword ptr [r9], zmm3
       vmovups  zmmword ptr [r9+0x40], zmm4
       vmovups  zmmword ptr [r9+0x80], zmm5
       vmovups  zmmword ptr [r9+0xC0], zmm16
       vmovups  zmm3, zmmword ptr [rdx+0x100]
       vmovaps  zmm4, zmm1
       vpslld   zmm5, zmm3, 16
       vpord    zmm5, zmm5, zmm4
       vplzcntd zmm5, zmm5
       vpord    zmm3, zmm3, zmm4
       vplzcntd zmm3, zmm3
       vpslld   zmm3, zmm3, 16
       vpord    zmm3, zmm3, zmm5
       vmovups  zmm4, zmmword ptr [rdx+0x140]
       vmovaps  zmm5, zmm1
       vpslld   zmm16, zmm4, 16
       vpord    zmm16, zmm16, zmm5
       vplzcntd zmm16, zmm16
       vpord    zmm4, zmm4, zmm5
       vplzcntd zmm4, zmm4
       vpslld   zmm4, zmm4, 16
       vpord    zmm4, zmm4, zmm16
       vmovups  zmm5, zmmword ptr [rdx+0x180]
       vmovaps  zmm16, zmm1
       vpslld   zmm17, zmm5, 16
       vpord    zmm17, zmm17, zmm16
       vplzcntd zmm17, zmm17
       vpord    zmm5, zmm5, zmm16
       vplzcntd zmm5, zmm5
       vpslld   zmm5, zmm5, 16
       vpord    zmm5, zmm5, zmm17
       vmovups  zmm16, zmmword ptr [rdx+0x1C0]
       vmovaps  zmm17, zmm1
       vpslld   zmm18, zmm16, 16
       vpord    zmm18, zmm18, zmm17

G_M000_IG06:                ;; offset=0x02BE
       vplzcntd zmm18, zmm18
       vpord    zmm16, zmm16, zmm17
       vplzcntd zmm16, zmm16
       vpslld   zmm16, zmm16, 16
       vpord    zmm16, zmm16, zmm18
       vmovups  zmmword ptr [r9+0x100], zmm3
       vmovups  zmmword ptr [r9+0x140], zmm4
       vmovups  zmmword ptr [r9+0x180], zmm5
       vmovups  zmmword ptr [r9+0x1C0], zmm16
       add      rdx, 512
       add      r9, 512
       add      r8, -256

G_M000_IG07:                ;; offset=0x030E
       cmp      r8, 256
       jae      G_M000_IG05
       jmp      G_M000_IG11
       align    [0 bytes for IG08]

G_M000_IG08:                ;; offset=0x0320
       vmovups  zmm3, zmmword ptr [rdx]
       vmovaps  zmm4, zmm1
       vpslld   zmm5, zmm3, 16
       vpord    zmm5, zmm5, zmm4
       vplzcntd zmm5, zmm5
       vpord    zmm3, zmm3, zmm4
       vplzcntd zmm3, zmm3
       vpslld   zmm3, zmm3, 16
       vpord    zmm3, zmm3, zmm5
       vmovups  zmm4, zmmword ptr [rdx+0x40]
       vmovaps  zmm5, zmm1
       vpslld   zmm16, zmm4, 16
       vpord    zmm16, zmm16, zmm5
       vplzcntd zmm16, zmm16
       vpord    zmm4, zmm4, zmm5
       vplzcntd zmm4, zmm4
       vpslld   zmm4, zmm4, 16
       vpord    zmm4, zmm4, zmm16
       vmovups  zmm5, zmmword ptr [rdx+0x80]
       vmovaps  zmm16, zmm1
       vpslld   zmm17, zmm5, 16
       vpord    zmm17, zmm17, zmm16
       vplzcntd zmm17, zmm17
       vpord    zmm5, zmm5, zmm16
       vplzcntd zmm5, zmm5
       vpslld   zmm5, zmm5, 16
       vpord    zmm5, zmm5, zmm17
       vmovups  zmm16, zmmword ptr [rdx+0xC0]
       vmovaps  zmm17, zmm1
       vpslld   zmm18, zmm16, 16
       vpord    zmm18, zmm18, zmm17
       vplzcntd zmm18, zmm18
       vpord    zmm16, zmm16, zmm17
       vplzcntd zmm16, zmm16
       vpslld   zmm16, zmm16, 16
       vpord    zmm16, zmm16, zmm18
       vmovntdq zmmword ptr [r9], zmm3
       vmovntdq zmmword ptr [r9+0x40], zmm4
       vmovntdq zmmword ptr [r9+0x80], zmm5
       vmovntdq zmmword ptr [r9+0xC0], zmm16
       vmovups  zmm3, zmmword ptr [rdx+0x100]
       vmovaps  zmm4, zmm1
       vpslld   zmm5, zmm3, 16
       vpord    zmm5, zmm5, zmm4
       vplzcntd zmm5, zmm5
       vpord    zmm3, zmm3, zmm4
       vplzcntd zmm3, zmm3
       vpslld   zmm3, zmm3, 16
       vpord    zmm3, zmm3, zmm5
       vmovups  zmm4, zmmword ptr [rdx+0x140]
       vmovaps  zmm5, zmm1
       vpslld   zmm16, zmm4, 16
       vpord    zmm16, zmm16, zmm5
       vplzcntd zmm16, zmm16
       vpord    zmm4, zmm4, zmm5
       vplzcntd zmm4, zmm4
       vpslld   zmm4, zmm4, 16
       vpord    zmm4, zmm4, zmm16
       vmovups  zmm5, zmmword ptr [rdx+0x180]
       vmovaps  zmm16, zmm1
       vpslld   zmm17, zmm5, 16
       vpord    zmm17, zmm17, zmm16
       vplzcntd zmm17, zmm17
       vpord    zmm5, zmm5, zmm16
       vplzcntd zmm5, zmm5
       vpslld   zmm5, zmm5, 16
       vpord    zmm5, zmm5, zmm17
       vmovups  zmm16, zmmword ptr [rdx+0x1C0]
       vmovaps  zmm17, zmm1
       vpslld   zmm18, zmm16, 16
       vpord    zmm18, zmm18, zmm17

G_M000_IG09:                ;; offset=0x04E3
       vplzcntd zmm18, zmm18
       vpord    zmm16, zmm16, zmm17
       vplzcntd zmm16, zmm16
       vpslld   zmm16, zmm16, 16
       vpord    zmm16, zmm16, zmm18
       vmovntdq zmmword ptr [r9+0x100], zmm3
       vmovntdq zmmword ptr [r9+0x140], zmm4
       vmovntdq zmmword ptr [r9+0x180], zmm5
       vmovntdq zmmword ptr [r9+0x1C0], zmm16
       add      rdx, 512
       add      r9, 512
       add      r8, -256

G_M000_IG10:                ;; offset=0x0533
       cmp      r8, 256
       jae      G_M000_IG08

G_M000_IG11:                ;; offset=0x0540
       mov      rcx, rdx
       mov      rdx, r9
       xor      r10d, r10d
       mov      bword ptr [rsp], r10

G_M000_IG12:                ;; offset=0x054D
       mov      bword ptr [rsp+0x08], r10

G_M000_IG13:                ;; offset=0x0552
       mov      r10, r8
       lea      r8, [r10+0x1F]
       and      r8, -32
       mov      r9, r8
       shr      r9, 5
       cmp      r9, 8
       ja       G_M000_IG26

G_M000_IG14:                ;; offset=0x056E
       cmp      r9d, 8
       ja       G_M000_IG26

G_M000_IG15:                ;; offset=0x0578
       mov      r9d, r9d
       lea      r11, [reloc @RWD64]
       mov      r11d, dword ptr [r11+4*r9]
       lea      rbx, G_M000_IG02
       add      r11, rbx
       jmp      r11

G_M000_IG16:                ;; offset=0x0593
       vmovups  zmm3, zmmword ptr [rcx+2*r8-0x200]
       vmovaps  zmm4, zmm1
       vpslld   zmm5, zmm3, 16
       vpord    zmm5, zmm5, zmm4
       vplzcntd zmm5, zmm5
       vpord    zmm3, zmm3, zmm4
       vplzcntd zmm3, zmm3
       vpslld   zmm3, zmm3, 16
       vpord    zmm3, zmm3, zmm5
       vmovups  zmmword ptr [rdx+2*r8-0x200], zmm3

G_M000_IG17:                ;; offset=0x05D5
       vmovups  zmm3, zmmword ptr [rcx+2*r8-0x1C0]
       vmovaps  zmm4, zmm1
       vpslld   zmm5, zmm3, 16
       vpord    zmm5, zmm5, zmm4
       vplzcntd zmm5, zmm5
       vpord    zmm3, zmm3, zmm4
       vplzcntd zmm3, zmm3
       vpslld   zmm3, zmm3, 16
       vpord    zmm3, zmm3, zmm5
       vmovups  zmmword ptr [rdx+2*r8-0x1C0], zmm3

G_M000_IG18:                ;; offset=0x0617
       vmovups  zmm3, zmmword ptr [rcx+2*r8-0x180]
       vmovaps  zmm4, zmm1
       vpslld   zmm5, zmm3, 16
       vpord    zmm5, zmm5, zmm4
       vplzcntd zmm5, zmm5
       vpord    zmm3, zmm3, zmm4
       vplzcntd zmm3, zmm3
       vpslld   zmm3, zmm3, 16
       vpord    zmm3, zmm3, zmm5
       vmovups  zmmword ptr [rdx+2*r8-0x180], zmm3

G_M000_IG19:                ;; offset=0x0659
       vmovups  zmm3, zmmword ptr [rcx+2*r8-0x140]
       vmovaps  zmm4, zmm1
       vpslld   zmm5, zmm3, 16
       vpord    zmm5, zmm5, zmm4
       vplzcntd zmm5, zmm5
       vpord    zmm3, zmm3, zmm4
       vplzcntd zmm3, zmm3
       vpslld   zmm3, zmm3, 16
       vpord    zmm3, zmm3, zmm5
       vmovups  zmmword ptr [rdx+2*r8-0x140], zmm3

G_M000_IG20:                ;; offset=0x069B
       vmovups  zmm3, zmmword ptr [rcx+2*r8-0x100]
       vmovaps  zmm4, zmm1
       vpslld   zmm5, zmm3, 16
       vpord    zmm5, zmm5, zmm4
       vplzcntd zmm5, zmm5
       vpord    zmm3, zmm3, zmm4
       vplzcntd zmm3, zmm3
       vpslld   zmm3, zmm3, 16
       vpord    zmm3, zmm3, zmm5
       vmovups  zmmword ptr [rdx+2*r8-0x100], zmm3

G_M000_IG21:                ;; offset=0x06DD
       vmovups  zmm3, zmmword ptr [rcx+2*r8-0xC0]
       vmovaps  zmm4, zmm1
       vpslld   zmm5, zmm3, 16
       vpord    zmm5, zmm5, zmm4
       vplzcntd zmm5, zmm5
       vpord    zmm3, zmm3, zmm4
       vplzcntd zmm3, zmm3
       vpslld   zmm3, zmm3, 16
       vpord    zmm3, zmm3, zmm5
       vmovups  zmmword ptr [rdx+2*r8-0xC0], zmm3

G_M000_IG22:                ;; offset=0x071F
       vmovups  zmm3, zmmword ptr [rcx+2*r8-0x80]
       vpslld   zmm4, zmm3, 16
       vpord    zmm4, zmm4, zmm1
       vplzcntd zmm4, zmm4
       vpord    zmm1, zmm3, zmm1
       vplzcntd zmm1, zmm1
       vpslld   zmm1, zmm1, 16
       vpord    zmm1, zmm1, zmm4
       vmovups  zmmword ptr [rdx+2*r8-0x80], zmm1

G_M000_IG23:                ;; offset=0x075B
       vmovups  zmmword ptr [rdx+2*r10-0x40], zmm2

G_M000_IG24:                ;; offset=0x0763
       vmovups  zmmword ptr [rax], zmm0

G_M000_IG25:                ;; offset=0x0769
       vzeroupper
       add      rsp, 16
       pop      rbx
       ret

G_M000_IG26:                ;; offset=0x0772
       vzeroupper
       add      rsp, 16
       pop      rbx
       ret

RWD00   dq      0000FFFF0000FFFFh, 0000FFFF0000FFFFh, 0000FFFF0000FFFFh, 0000FFFF0000FFFFh, 0000FFFF0000FFFFh, 0000FFFF0000FFFFh, 0000FFFF0000FFFFh, 0000FFFF0000FFFFh
RWD64   dd      00000753h ; case G_M000_IG24
        dd      0000074Bh ; case G_M000_IG23
        dd      0000070Fh ; case G_M000_IG22
        dd      000006CDh ; case G_M000_IG21
        dd      0000068Bh ; case G_M000_IG20
        dd      00000649h ; case G_M000_IG19
        dd      00000607h ; case G_M000_IG18
        dd      000005C5h ; case G_M000_IG17
        dd      00000583h ; case G_M000_IG16
; Total bytes of code 1915
Codegen - Widen+narrow
; Assembly listing for method System.Numerics.Tensors.TensorPrimitives:<InvokeSpanIntoSpan>g__Vectorized512|105_3[short,short,System.Numerics.Tensors.TensorPrimitives+LeadingZeroCountOperator`1[short]](byref,byref,ulong) (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; FullOpts code
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 124 single block inlinees; 50 inlinees without PGO data
G_M000_IG01:                ;; offset=0x0000
       push     rbx
       sub      rsp, 16
       xor      eax, eax
       mov      qword ptr [rsp+0x08], rax
       mov      qword ptr [rsp], rax
       vxorps   xmm0, xmm0, xmm0
       vmovaps  xmm1, xmm0
       vmovaps  xmm2, xmm0
       vmovaps  xmm3, xmm0
       vmovaps  xmm4, xmm0
       vmovaps  xmm5, xmm0
       vmovaps  xmm16, xmm0
       vmovaps  xmm17, xmm0
       vmovaps  xmm18, xmm0
       vmovaps  xmm19, xmm0
       vmovaps  xmm20, xmm0
       vmovaps  xmm21, xmm0
       vmovaps  xmm22, xmm0
       vmovaps  xmm23, xmm0
       vmovaps  xmm24, xmm0
       vmovaps  xmm25, xmm0

G_M000_IG02:                ;; offset=0x0064
       mov      rax, rdx
       vmovups  zmm26, zmmword ptr [rcx]
       vextracti32x8 ymm27, zmm26, 1
       vpmovzxwd zmm27, zmm27
       vplzcntd zmm27, zmm27
       vpmovzxwd zmm26, zmm26
       vplzcntd zmm26, zmm26
       vpmovdw  zmm26, zmm26
       vpmovdw  zmm27, zmm27
       vxorps   ymm28, ymm28, ymm28
       vinsertf64x4 zmm26, zmm28, ymm26, 0
       vinsertf64x4 zmm26, zmm26, ymm27, 1
       vmovups  zmm27, zmmword ptr [reloc @RWD00]
       vpsubw   zmm26, zmm26, zmm27
       vmovups  zmm28, zmmword ptr [rcx+2*r8-0x40]
       vextracti32x8 ymm29, zmm28, 1
       vpmovzxwd zmm29, zmm29
       vplzcntd zmm29, zmm29
       vpmovzxwd zmm28, zmm28
       vplzcntd zmm28, zmm28
       vpmovdw  zmm28, zmm28
       vpmovdw  zmm29, zmm29
       vxorps   ymm30, ymm30, ymm30
       vinsertf64x4 zmm28, zmm30, ymm28, 0
       vinsertf64x4 zmm28, zmm28, ymm29, 1
       vpsubw   zmm28, zmm28, zmm27
       cmp      r8, 256
       jbe      G_M000_IG13

G_M000_IG03:                ;; offset=0x0116
       mov      bword ptr [rsp+0x08], rcx
       mov      bword ptr [rsp], rax
       mov      rdx, rcx
       mov      r10, rax
       mov      r9, r10
       test     r9b, 1
       sete     r11b
       movzx    r11, r11b
       test     r11d, r11d
       je       SHORT G_M000_IG04
       mov      rdx, r10
       and      rdx, 63
       neg      rdx
       add      rdx, 64
       shr      rdx, 1
       lea      r9, [rdx+rdx]
       add      rcx, r9
       add      r9, r10
       sub      r8, rdx
       mov      rdx, rcx

G_M000_IG04:                ;; offset=0x015A
       cmp      r8, 0x20000
       seta     cl
       movzx    rcx, cl
       test     ecx, r11d
       jne      G_M000_IG10
       align    [0 bytes for IG05]

G_M000_IG05:                ;; offset=0x0170
       cmp      r8, 256
       jb       G_M000_IG11

G_M000_IG06:                ;; offset=0x017D
       vmovups  zmm0, zmmword ptr [rdx]
       vextracti32x8 ymm1, zmm0, 1
       vpmovzxwd zmm2, zmm1
       vplzcntd zmm3, zmm2
       vpmovzxwd zmm4, zmm0
       vplzcntd zmm5, zmm4
       vpmovdw  zmm5, zmm16
       vpmovdw  zmm3, zmm17
       vinsertf64x4 zmm18, zmm18, ymm16, 0
       vinsertf64x4 zmm18, zmm18, ymm17, 1
       vpsubw   zmm0, zmm18, zmm27
       vmovups  zmm1, zmmword ptr [rdx+0x40]
       vextracti32x8 ymm2, zmm1, 1
       vpmovzxwd zmm2, zmm2
       vplzcntd zmm2, zmm2
       vpmovzxwd zmm1, zmm1
       vplzcntd zmm1, zmm1
       vpmovdw  zmm1, zmm1
       vpmovdw  zmm2, zmm2
       vinsertf64x4 zmm19, zmm19, ymm1, 0
       vinsertf64x4 zmm19, zmm19, ymm2, 1
       vpsubw   zmm1, zmm19, zmm27
       vmovups  zmm2, zmmword ptr [rdx+0x80]
       vextracti32x8 ymm3, zmm2, 1
       vpmovzxwd zmm3, zmm3
       vplzcntd zmm3, zmm3
       vpmovzxwd zmm2, zmm2
       vplzcntd zmm2, zmm2
       vpmovdw  zmm2, zmm2
       vpmovdw  zmm3, zmm3
       vinsertf64x4 zmm20, zmm20, ymm2, 0
       vinsertf64x4 zmm20, zmm20, ymm3, 1
       vpsubw   zmm2, zmm20, zmm27
       vmovups  zmm3, zmmword ptr [rdx+0xC0]
       vextracti32x8 ymm4, zmm3, 1
       vpmovzxwd zmm4, zmm4
       vplzcntd zmm4, zmm4
       vpmovzxwd zmm3, zmm3
       vplzcntd zmm3, zmm3
       vpmovdw  zmm3, zmm3
       vpmovdw  zmm4, zmm4
       vinsertf64x4 zmm21, zmm21, ymm3, 0
       vinsertf64x4 zmm21, zmm21, ymm4, 1
       vpsubw   zmm3, zmm21, zmm27
       vmovups  zmmword ptr [r9], zmm0
       vmovups  zmmword ptr [r9+0x40], zmm1
       vmovups  zmmword ptr [r9+0x80], zmm2
       vmovups  zmmword ptr [r9+0xC0], zmm3
       vmovups  zmm0, zmmword ptr [rdx+0x100]
       vextracti32x8 ymm1, zmm0, 1
       vpmovzxwd zmm2, zmm1
       vplzcntd zmm1, zmm2
       vpmovzxwd zmm0, zmm0
       vplzcntd zmm0, zmm0
       vpmovdw  zmm0, zmm0
       vpmovdw  zmm1, zmm1
       vinsertf64x4 zmm22, zmm22, ymm0, 0
       vinsertf64x4 zmm22, zmm22, ymm1, 1
       vpsubw   zmm0, zmm22, zmm27
       vmovups  zmm1, zmmword ptr [rdx+0x140]
       vextracti32x8 ymm2, zmm1, 1
       vpmovzxwd zmm2, zmm2
       vplzcntd zmm2, zmm2
       vpmovzxwd zmm1, zmm1
       vplzcntd zmm1, zmm1
       vpmovdw  zmm1, zmm1
       vpmovdw  zmm2, zmm2
       vinsertf64x4 zmm23, zmm23, ymm1, 0
       vinsertf64x4 zmm23, zmm23, ymm2, 1
       vpsubw   zmm1, zmm23, zmm27
       vmovups  zmm2, zmmword ptr [rdx+0x180]
       vextracti32x8 ymm3, zmm2, 1
       vpmovzxwd zmm3, zmm3
       vplzcntd zmm3, zmm3

G_M000_IG07:                ;; offset=0x0355
       vpmovzxwd zmm2, zmm2
       vplzcntd zmm2, zmm2
       vpmovdw  zmm2, zmm2
       vpmovdw  zmm3, zmm3
       vinsertf64x4 zmm24, zmm24, ymm2, 0
       vinsertf64x4 zmm24, zmm24, ymm3, 1
       vpsubw   zmm2, zmm24, zmm27
       vmovups  zmm3, zmmword ptr [rdx+0x1C0]
       vextracti32x8 ymm4, zmm3, 1
       vpmovzxwd zmm4, zmm4
       vplzcntd zmm4, zmm4
       vpmovzxwd zmm3, zmm3
       vplzcntd zmm3, zmm3
       vpmovdw  zmm3, zmm3
       vpmovdw  zmm4, zmm4
       vinsertf64x4 zmm25, zmm25, ymm3, 0
       vinsertf64x4 zmm25, zmm25, ymm4, 1
       vpsubw   zmm3, zmm25, zmm27
       vmovups  zmmword ptr [r9+0x100], zmm0
       vmovups  zmmword ptr [r9+0x140], zmm1
       vmovups  zmmword ptr [r9+0x180], zmm2
       vmovups  zmmword ptr [r9+0x1C0], zmm3
       add      rdx, 512
       add      r9, 512
       add      r8, -256
       jmp      G_M000_IG05
       align    [0 bytes for IG08]

G_M000_IG08:                ;; offset=0x03FD
       vmovups  zmm18, zmmword ptr [rdx]
       vextracti32x8 ymm19, zmm18, 1
       vpmovzxwd zmm19, zmm19
       vplzcntd zmm19, zmm19
       vpmovzxwd zmm18, zmm18
       vplzcntd zmm18, zmm18
       vpmovdw  zmm18, zmm18
       vpmovdw  zmm19, zmm19
       vinsertf64x4 zmm0, zmm0, ymm18, 0
       vinsertf64x4 zmm0, zmm0, ymm19, 1
       vpsubw   zmm18, zmm0, zmm27
       vmovups  zmm19, zmmword ptr [rdx+0x40]
       vextracti32x8 ymm20, zmm19, 1
       vpmovzxwd zmm20, zmm20
       vplzcntd zmm20, zmm20
       vpmovzxwd zmm19, zmm19
       vplzcntd zmm19, zmm19
       vpmovdw  zmm19, zmm19
       vpmovdw  zmm20, zmm20
       vinsertf64x4 zmm1, zmm1, ymm19, 0
       vinsertf64x4 zmm1, zmm1, ymm20, 1
       vpsubw   zmm19, zmm1, zmm27
       vmovups  zmm20, zmmword ptr [rdx+0x80]
       vextracti32x8 ymm21, zmm20, 1
       vpmovzxwd zmm21, zmm21
       vplzcntd zmm21, zmm21
       vpmovzxwd zmm20, zmm20
       vplzcntd zmm20, zmm20
       vpmovdw  zmm20, zmm20
       vpmovdw  zmm21, zmm21
       vinsertf64x4 zmm2, zmm2, ymm20, 0
       vinsertf64x4 zmm2, zmm2, ymm21, 1
       vpsubw   zmm20, zmm2, zmm27
       vmovups  zmm21, zmmword ptr [rdx+0xC0]
       vextracti32x8 ymm22, zmm21, 1
       vpmovzxwd zmm22, zmm22
       vplzcntd zmm22, zmm22
       vpmovzxwd zmm21, zmm21
       vplzcntd zmm21, zmm21
       vpmovdw  zmm21, zmm21
       vpmovdw  zmm22, zmm22
       vinsertf64x4 zmm3, zmm3, ymm21, 0
       vinsertf64x4 zmm3, zmm3, ymm22, 1
       vpsubw   zmm21, zmm3, zmm27
       vmovntdq zmmword ptr [r9], zmm18
       vmovntdq zmmword ptr [r9+0x40], zmm19
       vmovntdq zmmword ptr [r9+0x80], zmm20
       vmovntdq zmmword ptr [r9+0xC0], zmm21
       vmovups  zmm18, zmmword ptr [rdx+0x100]
       vextracti32x8 ymm19, zmm18, 1
       vpmovzxwd zmm20, zmm19
       vplzcntd zmm19, zmm20
       vpmovzxwd zmm18, zmm18
       vplzcntd zmm18, zmm18
       vpmovdw  zmm18, zmm18
       vpmovdw  zmm19, zmm19
       vinsertf64x4 zmm4, zmm4, ymm18, 0
       vinsertf64x4 zmm4, zmm4, ymm19, 1
       vpsubw   zmm18, zmm4, zmm27
       vmovups  zmm19, zmmword ptr [rdx+0x140]
       vextracti32x8 ymm20, zmm19, 1
       vpmovzxwd zmm20, zmm20
       vplzcntd zmm20, zmm20
       vpmovzxwd zmm19, zmm19
       vplzcntd zmm19, zmm19
       vpmovdw  zmm19, zmm19
       vpmovdw  zmm20, zmm20
       vinsertf64x4 zmm5, zmm5, ymm19, 0
       vinsertf64x4 zmm5, zmm5, ymm20, 1
       vpsubw   zmm19, zmm5, zmm27
       vmovups  zmm20, zmmword ptr [rdx+0x180]
       vextracti32x8 ymm21, zmm20, 1
       vpmovzxwd zmm21, zmm21
       vplzcntd zmm21, zmm21

G_M000_IG09:                ;; offset=0x05D5
       vpmovzxwd zmm20, zmm20
       vplzcntd zmm20, zmm20
       vpmovdw  zmm20, zmm20
       vpmovdw  zmm21, zmm21
       vinsertf64x4 zmm16, zmm16, ymm20, 0
       vinsertf64x4 zmm16, zmm16, ymm21, 1
       vpsubw   zmm20, zmm16, zmm27
       vmovups  zmm21, zmmword ptr [rdx+0x1C0]
       vextracti32x8 ymm22, zmm21, 1
       vpmovzxwd zmm22, zmm22
       vplzcntd zmm22, zmm22
       vpmovzxwd zmm21, zmm21
       vplzcntd zmm21, zmm21
       vpmovdw  zmm21, zmm21
       vpmovdw  zmm22, zmm22
       vinsertf64x4 zmm17, zmm17, ymm21, 0
       vinsertf64x4 zmm17, zmm17, ymm22, 1
       vpsubw   zmm21, zmm17, zmm27
       vmovntdq zmmword ptr [r9+0x100], zmm18
       vmovntdq zmmword ptr [r9+0x140], zmm19
       vmovntdq zmmword ptr [r9+0x180], zmm20
       vmovntdq zmmword ptr [r9+0x1C0], zmm21
       add      rdx, 512
       add      r9, 512
       add      r8, -256

G_M000_IG10:                ;; offset=0x0678
       cmp      r8, 256
       jae      G_M000_IG08

G_M000_IG11:                ;; offset=0x0685
       mov      rcx, rdx
       mov      rdx, r9
       xor      r10d, r10d
       mov      bword ptr [rsp], r10

G_M000_IG12:                ;; offset=0x0692
       mov      bword ptr [rsp+0x08], r10

G_M000_IG13:                ;; offset=0x0697
       mov      r10, r8
       lea      r8, [r10+0x1F]
       and      r8, -32
       mov      r9, r8
       shr      r9, 5
       cmp      r9, 8
       ja       G_M000_IG25

G_M000_IG14:                ;; offset=0x06B3
       cmp      r9d, 8
       ja       G_M000_IG25

G_M000_IG15:                ;; offset=0x06BD
       mov      r9d, r9d
       lea      r11, [reloc @RWD64]
       mov      r11d, dword ptr [r11+4*r9]
       lea      rbx, G_M000_IG02
       add      r11, rbx
       jmp      r11

G_M000_IG16:                ;; offset=0x06D8
       vmovups  zmm0, zmmword ptr [rcx+2*r8-0x200]
       vextracti32x8 ymm1, zmm0, 1
       vpmovzxwd zmm1, zmm1
       vplzcntd zmm1, zmm1
       vpmovzxwd zmm0, zmm0
       vplzcntd zmm0, zmm0
       vpmovdw  zmm0, zmm0
       vpmovdw  zmm1, zmm1
       vxorps   ymm2, ymm2, ymm2
       vinsertf64x4 zmm0, zmm2, ymm0, 0
       vinsertf64x4 zmm0, zmm0, ymm1, 1
       vpsubw   zmm0, zmm0, zmm27
       vmovups  zmmword ptr [rdx+2*r8-0x200], zmm0

G_M000_IG17:                ;; offset=0x072B
       vmovups  zmm0, zmmword ptr [rcx+2*r8-0x1C0]
       vextracti32x8 ymm1, zmm0, 1
       vpmovzxwd zmm1, zmm1
       vplzcntd zmm1, zmm1
       vpmovzxwd zmm0, zmm0
       vplzcntd zmm0, zmm0
       vpmovdw  zmm0, zmm0
       vpmovdw  zmm1, zmm1
       vxorps   ymm2, ymm2, ymm2
       vinsertf64x4 zmm0, zmm2, ymm0, 0
       vinsertf64x4 zmm0, zmm0, ymm1, 1
       vpsubw   zmm0, zmm0, zmm27
       vmovups  zmmword ptr [rdx+2*r8-0x1C0], zmm0

G_M000_IG18:                ;; offset=0x077E
       vmovups  zmm0, zmmword ptr [rcx+2*r8-0x180]
       vextracti32x8 ymm1, zmm0, 1
       vpmovzxwd zmm1, zmm1
       vplzcntd zmm1, zmm1
       vpmovzxwd zmm0, zmm0
       vplzcntd zmm0, zmm0
       vpmovdw  zmm0, zmm0
       vpmovdw  zmm1, zmm1
       vxorps   ymm2, ymm2, ymm2
       vinsertf64x4 zmm0, zmm2, ymm0, 0
       vinsertf64x4 zmm0, zmm0, ymm1, 1
       vpsubw   zmm0, zmm0, zmm27
       vmovups  zmmword ptr [rdx+2*r8-0x180], zmm0

G_M000_IG19:                ;; offset=0x07D1
       vmovups  zmm0, zmmword ptr [rcx+2*r8-0x140]
       vextracti32x8 ymm1, zmm0, 1
       vpmovzxwd zmm1, zmm1
       vplzcntd zmm1, zmm1
       vpmovzxwd zmm0, zmm0
       vplzcntd zmm0, zmm0
       vpmovdw  zmm0, zmm0
       vpmovdw  zmm1, zmm1
       vxorps   ymm2, ymm2, ymm2
       vinsertf64x4 zmm0, zmm2, ymm0, 0
       vinsertf64x4 zmm0, zmm0, ymm1, 1
       vpsubw   zmm0, zmm0, zmm27
       vmovups  zmmword ptr [rdx+2*r8-0x140], zmm0

G_M000_IG20:                ;; offset=0x0824
       vmovups  zmm0, zmmword ptr [rcx+2*r8-0x100]
       vextracti32x8 ymm1, zmm0, 1
       vpmovzxwd zmm1, zmm1
       vplzcntd zmm1, zmm1
       vpmovzxwd zmm0, zmm0
       vplzcntd zmm0, zmm0
       vpmovdw  zmm0, zmm0
       vpmovdw  zmm1, zmm1
       vxorps   ymm2, ymm2, ymm2
       vinsertf64x4 zmm0, zmm2, ymm0, 0
       vinsertf64x4 zmm0, zmm0, ymm1, 1
       vpsubw   zmm0, zmm0, zmm27
       vmovups  zmmword ptr [rdx+2*r8-0x100], zmm0

G_M000_IG21:                ;; offset=0x0877
       vmovups  zmm0, zmmword ptr [rcx+2*r8-0xC0]
       vextracti32x8 ymm1, zmm0, 1
       vpmovzxwd zmm1, zmm1
       vplzcntd zmm1, zmm1
       vpmovzxwd zmm0, zmm0
       vplzcntd zmm0, zmm0
       vpmovdw  zmm0, zmm0
       vpmovdw  zmm1, zmm1
       vxorps   ymm2, ymm2, ymm2
       vinsertf64x4 zmm0, zmm2, ymm0, 0
       vinsertf64x4 zmm0, zmm0, ymm1, 1
       vpsubw   zmm0, zmm0, zmm27
       vmovups  zmmword ptr [rdx+2*r8-0xC0], zmm0

G_M000_IG22:                ;; offset=0x08CA
       vmovups  zmm0, zmmword ptr [rcx+2*r8-0x80]
       vextracti32x8 ymm1, zmm0, 1
       vpmovzxwd zmm1, zmm1
       vplzcntd zmm1, zmm1
       vpmovzxwd zmm0, zmm0
       vplzcntd zmm0, zmm0
       vpmovdw  zmm0, zmm0
       vpmovdw  zmm1, zmm1
       vxorps   ymm2, ymm2, ymm2
       vinsertf64x4 zmm0, zmm2, ymm0, 0
       vinsertf64x4 zmm0, zmm0, ymm1, 1
       vpsubw   zmm0, zmm0, zmm27
       vmovups  zmmword ptr [rdx+2*r8-0x80], zmm0

G_M000_IG23:                ;; offset=0x091D
       vmovups  zmmword ptr [rdx+2*r10-0x40], zmm28

G_M000_IG24:                ;; offset=0x0925
       vmovups  zmmword ptr [rax], zmm26

G_M000_IG25:                ;; offset=0x092B
       vzeroupper
       add      rsp, 16
       pop      rbx
       ret

RWD00   dq      0010001000100010h, 0010001000100010h, 0010001000100010h, 0010001000100010h, 0010001000100010h, 0010001000100010h, 0010001000100010h, 0010001000100010h
RWD64   dd      000008C1h ; case G_M000_IG24
        dd      000008B9h ; case G_M000_IG23
        dd      00000866h ; case G_M000_IG22
        dd      00000813h ; case G_M000_IG21
        dd      000007C0h ; case G_M000_IG20
        dd      0000076Dh ; case G_M000_IG19
        dd      0000071Ah ; case G_M000_IG18
        dd      000006C7h ; case G_M000_IG17
        dd      00000674h ; case G_M000_IG16
; Total bytes of code 2356

alexcovington (Contributor, Author)


One more thing to highlight in the above codegen is that the current PR produces 9 instructions for each invocation of LeadingZeroCount. Using the widen+narrow approach produces 11 instructions.

The original loop is unrolled; I'm copying the codegen of just one invocation of LeadingZeroCount:

Current PR -- 9 total instructions

vmovups zmm3, zmmword ptr [rdx] 
vmovaps zmm4, zmm1 
vpslld zmm5, zmm3, 16 
vpord zmm5, zmm5, zmm4 
vplzcntd zmm5, zmm5 
vpord zmm3, zmm3, zmm4 
vplzcntd zmm3, zmm3 
vpslld zmm3, zmm3, 16 
vpord zmm3, zmm3, zmm5

Widen+Narrow -- 11 total instructions

vmovups zmm0, zmmword ptr [rdx] 
vextracti32x8 ymm1, zmm0, 1 
vpmovzxwd zmm2, zmm1 
vplzcntd zmm3, zmm2 
vpmovzxwd zmm4, zmm0 
vplzcntd zmm5, zmm4 
vpmovdw zmm5, zmm16 
vpmovdw zmm3, zmm17 
vinsertf64x4 zmm18, zmm18, ymm16, 0 
vinsertf64x4 zmm18, zmm18, ymm17, 1 
vpsubw zmm0, zmm18, zmm27

I'm still proposing the current PR as it produces faster and smaller codegen.

Comment on lines +43 to +53
Vector128<byte> lookupVectorLow = Vector128.Create((byte)8, 7, 6, 6, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4);
Vector128<byte> lookupVectorHigh = Vector128.Create((byte)3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0);
Vector128<byte> nibbleMask = Vector128.Create<byte>(0xF);
Vector128<byte> permuteMask = Vector128.Create<byte>(0x80);
Vector128<byte> lowNibble = x.AsByte() & nibbleMask;
Vector128<byte> highNibble = Sse2.ShiftRightLogical(x.AsInt32(), 4).AsByte() & nibbleMask;
Vector128<byte> nibbleSelectMask = Sse2.CompareEqual(highNibble, Vector128<byte>.Zero);
Vector128<byte> indexVector = Sse41.BlendVariable(highNibble, lowNibble, nibbleSelectMask) +
(~nibbleSelectMask & nibbleMask);
indexVector |= ~nibbleSelectMask & permuteMask;
return Avx512Vbmi.VL.PermuteVar16x8x2(lookupVectorLow, indexVector, lookupVectorHigh).As<byte, T>();
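
For context on the quoted snippet: the two 16-entry tables hold the leading-zero count of a byte whose high nibble is zero (indexed by the low nibble) and of a byte whose high nibble is nonzero (indexed by highNibble - 1); the blend and add steps build a permute index that selects the right table and entry per lane. A scalar sketch of the same lookup (helper and table names are hypothetical):

static class NibbleLookupSketch
{
    static readonly byte[] LookupLow  = { 8, 7, 6, 6, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4 };
    static readonly byte[] LookupHigh = { 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 };

    public static int LeadingZeroCountByte(byte value)
    {
        int highNibble = value >> 4;
        return highNibble == 0
            ? LookupLow[value & 0xF]      // high nibble is zero, so the count includes its 4 zero bits
            : LookupHigh[highNibble - 1]; // high nibble is nonzero, so the count is at most 3
    }
}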
@tannergooding (Member) commented Dec 4, 2024


Is this cheaper than:

Vector512<int> x32 = Avx512F.ConvertToVector512Int32(x.AsByte());
Vector512<int> lz = Avx512CD.LeadingZeroCount(x32);
return Avx512F.ConvertToVector128Byte(lz) - Vector128.Create(24);

alexcovington (Contributor, Author)


There is overhead from widening and unwidening.

For this case, widening gives a bimodal performance result. To verify, the same microbenchmark can be modified to stress this path specifically by using BufferLength=16.

Some runs look like this:

| Method           | Job        | Toolchain                                       | BufferLength | Mean      | Error     | StdDev    | Median    | Min       | Max       | Ratio | RatioSD | Allocated | Alloc Ratio |
|----------------- |----------- |------------------------------------------------ |------------- |----------:|----------:|----------:|----------:|----------:|----------:|------:|--------:|----------:|------------:|
| LeadingZeroCount | Job-RJAKMA | Current PR                                      | 16           |  2.676 ns | 0.0525 ns | 0.0491 ns |  2.680 ns |  2.610 ns |  2.754 ns |  1.00 |    0.03 |         - |          NA |
| LeadingZeroCount | Job-FSPMRZ | Widen                                           | 16           |  3.485 ns | 0.0365 ns | 0.0342 ns |  3.502 ns |  3.428 ns |  3.526 ns |  1.30 |    0.03 |         - |          NA |

Other runs look like this:

| Method           | Job        | Toolchain                                       | BufferLength | Mean      | Error     | StdDev    | Median    | Min       | Max       | Ratio | RatioSD | Allocated | Alloc Ratio |
|----------------- |----------- |------------------------------------------------ |------------- |----------:|----------:|----------:|----------:|----------:|----------:|------:|--------:|----------:|------------:|
| LeadingZeroCount | Job-MGUUAK | Current PR                                      | 16           |  2.683 ns | 0.0424 ns | 0.0396 ns |  2.695 ns |  2.616 ns |  2.733 ns |  1.00 |    0.02 |         - |          NA |
| LeadingZeroCount | Job-NBPOWJ | Widen                                           | 16           |  2.484 ns | 0.0334 ns | 0.0296 ns |  2.492 ns |  2.427 ns |  2.519 ns |  0.93 |    0.02 |         - |          NA |

I chose this version because it was more consistent.

Comment on lines +95 to +104
Vector256<byte> lookupVector =
Vector256.Create((byte)8, 7, 6, 6, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,
3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0);
Vector256<byte> nibbleMask = Vector256.Create<byte>(0xF);
Vector256<byte> lowNibble = x.AsByte() & nibbleMask;
Vector256<byte> highNibble = Avx2.ShiftRightLogical(x.AsInt32(), 4).AsByte() & nibbleMask;
Vector256<byte> nibbleSelectMask = Avx2.CompareEqual(highNibble, Vector256<byte>.Zero);
Vector256<byte> indexVector = Avx2.BlendVariable(highNibble, lowNibble, nibbleSelectMask) +
(~nibbleSelectMask & nibbleMask);
return Avx512Vbmi.VL.PermuteVar32x8(lookupVector, indexVector).As<byte, T>();
Member


Similar question as previous, but doing WidenLower/WidenUpper since it's 1024 bits total.

alexcovington (Contributor, Author)


Similar comment as the Vector128<byte> case.

There is overhead from widening and unwidening. It isn't as bad here, and both versions perform very similarly. This can be verified with BufferLength=32 to stress this path:

| Method           | Job        | Toolchain                                       | BufferLength | Mean      | Error     | StdDev    | Median    | Min       | Max       | Ratio | RatioSD | Allocated | Alloc Ratio |
|----------------- |----------- |------------------------------------------------ |------------- |----------:|----------:|----------:|----------:|----------:|----------:|------:|--------:|----------:|------------:|
| LeadingZeroCount | Job-RJAKMA | Current PR                                      | 32           |  2.682 ns | 0.0387 ns | 0.0362 ns |  2.702 ns |  2.614 ns |  2.724 ns |  1.00 |    0.02 |         - |          NA |
| LeadingZeroCount | Job-FSPMRZ | Widen                                           | 32           |  2.685 ns | 0.0312 ns | 0.0292 ns |  2.691 ns |  2.608 ns |  2.721 ns |  0.91 |    0.02 |         - |          NA |

Comment on lines +111 to +117
Vector256<uint> lowHalf = Vector256.Create((uint)0x0000FFFF);
Vector256<uint> x_bot16 = Avx2.Or(Avx2.ShiftLeftLogical(x.AsUInt32(), 16), lowHalf);
Vector256<uint> x_top16 = Avx2.Or(x.AsUInt32(), lowHalf);
Vector256<uint> lz_bot16 = Avx512CD.VL.LeadingZeroCount(x_bot16);
Vector256<uint> lz_top16 = Avx512CD.VL.LeadingZeroCount(x_top16);
Vector256<uint> lz_top16_shift = Avx2.ShiftLeftLogical(lz_top16, 16);
return Avx2.Or(lz_bot16, lz_top16_shift).AsUInt16().As<ushort, T>();
Member


Similar question as previous, widening to Vector512

alexcovington (Contributor, Author)


The widening/unwidening approach has similar performance in this case. This can be verified with BufferLength=16:

| Type                                      | Method           | Job        | Toolchain                                       | BufferLength | Mean      | Error     | StdDev    | Median    | Min       | Max       | Ratio | RatioSD | Allocated | Alloc Ratio |
|------------------------------------------ |----------------- |----------- |------------------------------------------------ |------------- |----------:|----------:|----------:|----------:|----------:|----------:|------:|--------:|----------:|------------:|
| Perf_BinaryIntegerTensorPrimitives<Int16> | LeadingZeroCount | Job-WWKLJQ | Current PR                                      | 16           |  2.485 ns | 0.0442 ns | 0.0414 ns |  2.496 ns |  2.410 ns |  2.530 ns |  1.00 |    0.02 |         - |          NA |
| Perf_BinaryIntegerTensorPrimitives<Int16> | LeadingZeroCount | Job-VFJFZO | Widen                                           | 16           |  2.474 ns | 0.0542 ns | 0.0507 ns |  2.495 ns |  2.402 ns |  2.529 ns |  1.00 |    0.03 |         - |          NA |

Comment on lines +157 to +163
Vector512<uint> lowHalf = Vector512.Create((uint)0x0000FFFF);
Vector512<uint> x_bot16 = Avx512F.Or(Avx512F.ShiftLeftLogical(x.AsUInt32(), 16), lowHalf);
Vector512<uint> x_top16 = Avx512F.Or(x.AsUInt32(), lowHalf);
Vector512<uint> lz_bot16 = Avx512CD.LeadingZeroCount(x_bot16);
Vector512<uint> lz_top16 = Avx512CD.LeadingZeroCount(x_top16);
Vector512<uint> lz_top16_shift = Avx512F.ShiftLeftLogical(lz_top16, 16);
return Avx512F.Or(lz_bot16, lz_top16_shift).AsUInt16().As<ushort, T>();
Member


Same question, doing WidenLower/WidenUpper

alexcovington (Contributor, Author)


The widening-unwidening performance difference is most obvious for this case.

| Type                                      | Method           | Job        | Toolchain                                       | BufferLength | Mean      | Error     | StdDev    | Median    | Min       | Max       | Ratio | RatioSD | Allocated | Alloc Ratio |
|------------------------------------------ |----------------- |----------- |------------------------------------------------ |------------- |----------:|----------:|----------:|----------:|----------:|----------:|------:|--------:|----------:|------------:|
| Perf_BinaryIntegerTensorPrimitives<Int16> | LeadingZeroCount | Job-WWKLJQ | Current PR                                      | 32           |  4.110 ns | 0.0621 ns | 0.0581 ns |  4.141 ns |  4.011 ns |  4.161 ns |  1.00 |    0.02 |         - |          NA |
| Perf_BinaryIntegerTensorPrimitives<Int16> | LeadingZeroCount | Job-VFJFZO | Widen                                           | 32           | 10.421 ns | 0.0738 ns | 0.0690 ns | 10.445 ns | 10.261 ns | 10.515 ns |  2.54 |    0.04 |         - |          NA |

Comment on lines +139 to +150
Vector512<byte> lookupVectorA =
Vector512.Create((byte)8, 7, 6, 6, 5, 5, 5, 5,
4, 4, 4, 4, 4, 4, 4, 4,
3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3,
2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2);
Vector512<byte> lookupVectorB = Vector512.Create((byte)1);
Vector512<byte> bit7ZeroMask = Avx512BW.CompareLessThan(x.AsByte(), Vector512.Create((byte)128));
return Avx512F.And(bit7ZeroMask, Avx512Vbmi.PermuteVar64x8x2(lookupVectorA, x.AsByte(), lookupVectorB)).As<byte, T>();
Member


I think this is the only one that shouldn't be simply Widen+Lzcnt. But it does warrant a comment elaborating on how the lookup works.

In particular, it isn't immediately obvious how PermuteVar64x8x2 operates, so it's worth elaborating that x is being used as an index where bit 6 selects the table, bits 5:0 select an entry within that table, and anything where bit 7 is set is zeroed.
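
To make that selection rule concrete, here is a scalar sketch of the lookup being described (the helper name is hypothetical; the table contents come from the quoted code):

using System.Numerics;

static class PermuteSketch
{
    // Bits 5:0 of the byte index into the selected 64-entry table, bit 6 picks lookupVectorA
    // (the counts for values 0..63) versus lookupVectorB (the constant 1 for values 64..127),
    // and the separate CompareLessThan/And in the quoted code zeroes lanes whose bit 7 is set.
    public static int LeadingZeroCountByteViaTables(byte value)
    {
        if ((value & 0x80) != 0) return 0; // value >= 128: count is 0, produced by bit7ZeroMask
        if ((value & 0x40) != 0) return 1; // value in 64..127: second table is all ones
        // value in 0..63: the lookupVectorA entry, i.e. the leading-zero count within a byte.
        return value == 0 ? 8 : BitOperations.LeadingZeroCount((uint)value) - 24;
    }
}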

alexcovington (Contributor, Author)


Makes sense. I've added a comment to better explain how x is being used as an index and how the intrinsic is choosing between the two lookup vectors.

@tannergooding merged commit 208b974 into dotnet:main on Dec 19, 2024. 83 checks passed.
Labels: area-System.Numerics.Tensors, community-contribution