Add vector support to System.Numerics.Tensors.TensorPrimitives.LeadingZeroCount for Byte and Int16 #110333
Conversation
Tagging subscribers to this area: @dotnet/area-system-numerics-tensors
Thanks! Seems reasonable to me, but @tannergooding should have a look, too.
Vector128<uint> lowHalf = Vector128.Create((uint)0x0000FFFF);
Vector128<uint> x_bot16 = Sse2.Or(Sse2.ShiftLeftLogical(x.AsUInt32(), 16), lowHalf);
Vector128<uint> x_top16 = Sse2.Or(x.AsUInt32(), lowHalf);
Vector128<uint> lz_bot16 = Avx512CD.VL.LeadingZeroCount(x_bot16);
Vector128<uint> lz_top16 = Avx512CD.VL.LeadingZeroCount(x_top16);
Vector128<uint> lz_top16_shift = Sse2.ShiftLeftLogical(lz_top16, 16);
return Sse2.Or(lz_bot16, lz_top16_shift).AsUInt16().As<ushort, T>();
Is this cheaper than:
Vector256<int> x32 = Avx2.ConvertToVector256Int32(x.AsUInt16());
Vector256<int> lz = Avx512CD.VL.LeadingZeroCount(x32);
return Avx512F.VL.ConvertToVector128UInt16(lz) - Vector128.Create(16);
Widening-unwidening has slightly worse performance here:
| Type | Method | Job | Toolchain | BufferLength | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|------------------------------------------ |----------------- |----------- |------------------------------------------------ |------------- |----------:|----------:|----------:|----------:|----------:|----------:|------:|--------:|----------:|------------:|
| Perf_BinaryIntegerTensorPrimitives<Int16> | LeadingZeroCount | Job-WWKLJQ | Current PR | 8 | 2.499 ns | 0.0319 ns | 0.0299 ns | 2.505 ns | 2.401 ns | 2.528 ns | 1.00 | 0.02 | - | NA |
| Perf_BinaryIntegerTensorPrimitives<Int16> | LeadingZeroCount | Job-VFJFZO | Widen | 8 | 2.633 ns | 0.0478 ns | 0.0447 ns | 2.650 ns | 2.570 ns | 2.676 ns | 1.05 | 0.02 | - | NA |
This comment (and the related ones below) is still pending a response.
The current pattern used looks significantly more expensive (in size, instruction count, and micro-ops) than the more naive method of widen + lzcnt + narrow + subtract
The current pattern used looks significantly more expensive (in size, instruction count, and micro-ops) than the more naive method of widen + lzcnt + narrow + subtract
I agree that it does look more expensive because of the increased instruction count; however, the benchmark shows that the current pattern is more performant because it avoids the overhead of widening+narrowing.
Another example, this one using BufferLength=3079 like in the original benchmark:
| Type | Method | Job | Toolchain | BufferLength | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|------------------------------------------ |----------------- |----------- |----------------------------------------------------------------------------------------------- |------------- |--------------:|------------:|------------:|--------------:|--------------:|--------------:|------:|--------:|----------:|------------:|
| Perf_BinaryIntegerTensorPrimitives<Int16> | LeadingZeroCount | Job-QIBRWQ | Current PR | 3079 | 66.700 ns | 0.3187 ns | 0.2981 ns | 66.761 ns | 66.319 ns | 67.168 ns | 1.00 | 0.01 | - | NA |
| Perf_BinaryIntegerTensorPrimitives<Int16> | LeadingZeroCount | Job-TEWLEG | Widen + LZCNT + Subtract + Narrow | 3079 | 133.049 ns | 2.4410 ns | 2.1639 ns | 132.240 ns | 131.039 ns | 138.318 ns | 1.99 | 0.03 | - | NA |
And looking at the codegen for this case, the current PR actually generates a smaller function than the widen+narrow suggestion.
This function contains the hot loop of the benchmark. The current PR generates a function with 1915 bytes of code, while Widen+Narrow generates a function with 2356 bytes of code:
Codegen - Current PR
; Assembly listing for method System.Numerics.Tensors.TensorPrimitives:<InvokeSpanIntoSpan>g__Vectorized512|105_3[short,short,System.Numerics.Tensors.TensorPrimitives+LeadingZeroCountOperator`1[short]](byref,byref,ulong) (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; FullOpts code
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 24 single block inlinees; 25 inlinees without PGO data
G_M000_IG01: ;; offset=0x0000
push rbx
sub rsp, 16
xor eax, eax
mov qword ptr [rsp+0x08], rax
mov qword ptr [rsp], rax
G_M000_IG02: ;; offset=0x0010
mov rax, rdx
vmovups zmm0, zmmword ptr [rcx]
vmovups zmm1, zmmword ptr [reloc @RWD00]
vmovaps zmm2, zmm1
vpslld zmm3, zmm0, 16
vpord zmm3, zmm3, zmm2
vplzcntd zmm3, zmm3
vpord zmm0, zmm0, zmm2
vplzcntd zmm0, zmm0
vpslld zmm0, zmm0, 16
vpord zmm0, zmm0, zmm3
vmovups zmm2, zmmword ptr [rcx+2*r8-0x40]
vmovaps zmm3, zmm1
vpslld zmm4, zmm2, 16
vpord zmm4, zmm4, zmm3
vplzcntd zmm4, zmm4
vpord zmm2, zmm2, zmm3
vplzcntd zmm2, zmm2
vpslld zmm2, zmm2, 16
vpord zmm2, zmm2, zmm4
cmp r8, 256
jbe G_M000_IG13
G_M000_IG03: ;; offset=0x009C
mov bword ptr [rsp+0x08], rcx
mov bword ptr [rsp], rax
mov rdx, rcx
mov r10, rax
mov r9, r10
test r9b, 1
sete r11b
movzx r11, r11b
test r11d, r11d
je SHORT G_M000_IG04
mov rdx, r10
and rdx, 63
neg rdx
add rdx, 64
shr rdx, 1
lea r9, [rdx+rdx]
add rcx, r9
add r9, r10
sub r8, rdx
mov rdx, rcx
G_M000_IG04: ;; offset=0x00E0
cmp r8, 0x20000
seta cl
movzx rcx, cl
test ecx, r11d
je G_M000_IG07
jmp G_M000_IG10
G_M000_IG05: ;; offset=0x00FB
vmovups zmm3, zmmword ptr [rdx]
vmovaps zmm4, zmm1
vpslld zmm5, zmm3, 16
vpord zmm5, zmm5, zmm4
vplzcntd zmm5, zmm5
vpord zmm3, zmm3, zmm4
vplzcntd zmm3, zmm3
vpslld zmm3, zmm3, 16
vpord zmm3, zmm3, zmm5
vmovups zmm4, zmmword ptr [rdx+0x40]
vmovaps zmm5, zmm1
vpslld zmm16, zmm4, 16
vpord zmm16, zmm16, zmm5
vplzcntd zmm16, zmm16
vpord zmm4, zmm4, zmm5
vplzcntd zmm4, zmm4
vpslld zmm4, zmm4, 16
vpord zmm4, zmm4, zmm16
vmovups zmm5, zmmword ptr [rdx+0x80]
vmovaps zmm16, zmm1
vpslld zmm17, zmm5, 16
vpord zmm17, zmm17, zmm16
vplzcntd zmm17, zmm17
vpord zmm5, zmm5, zmm16
vplzcntd zmm5, zmm5
vpslld zmm5, zmm5, 16
vpord zmm5, zmm5, zmm17
vmovups zmm16, zmmword ptr [rdx+0xC0]
vmovaps zmm17, zmm1
vpslld zmm18, zmm16, 16
vpord zmm18, zmm18, zmm17
vplzcntd zmm18, zmm18
vpord zmm16, zmm16, zmm17
vplzcntd zmm16, zmm16
vpslld zmm16, zmm16, 16
vpord zmm16, zmm16, zmm18
vmovups zmmword ptr [r9], zmm3
vmovups zmmword ptr [r9+0x40], zmm4
vmovups zmmword ptr [r9+0x80], zmm5
vmovups zmmword ptr [r9+0xC0], zmm16
vmovups zmm3, zmmword ptr [rdx+0x100]
vmovaps zmm4, zmm1
vpslld zmm5, zmm3, 16
vpord zmm5, zmm5, zmm4
vplzcntd zmm5, zmm5
vpord zmm3, zmm3, zmm4
vplzcntd zmm3, zmm3
vpslld zmm3, zmm3, 16
vpord zmm3, zmm3, zmm5
vmovups zmm4, zmmword ptr [rdx+0x140]
vmovaps zmm5, zmm1
vpslld zmm16, zmm4, 16
vpord zmm16, zmm16, zmm5
vplzcntd zmm16, zmm16
vpord zmm4, zmm4, zmm5
vplzcntd zmm4, zmm4
vpslld zmm4, zmm4, 16
vpord zmm4, zmm4, zmm16
vmovups zmm5, zmmword ptr [rdx+0x180]
vmovaps zmm16, zmm1
vpslld zmm17, zmm5, 16
vpord zmm17, zmm17, zmm16
vplzcntd zmm17, zmm17
vpord zmm5, zmm5, zmm16
vplzcntd zmm5, zmm5
vpslld zmm5, zmm5, 16
vpord zmm5, zmm5, zmm17
vmovups zmm16, zmmword ptr [rdx+0x1C0]
vmovaps zmm17, zmm1
vpslld zmm18, zmm16, 16
vpord zmm18, zmm18, zmm17
G_M000_IG06: ;; offset=0x02BE
vplzcntd zmm18, zmm18
vpord zmm16, zmm16, zmm17
vplzcntd zmm16, zmm16
vpslld zmm16, zmm16, 16
vpord zmm16, zmm16, zmm18
vmovups zmmword ptr [r9+0x100], zmm3
vmovups zmmword ptr [r9+0x140], zmm4
vmovups zmmword ptr [r9+0x180], zmm5
vmovups zmmword ptr [r9+0x1C0], zmm16
add rdx, 512
add r9, 512
add r8, -256
G_M000_IG07: ;; offset=0x030E
cmp r8, 256
jae G_M000_IG05
jmp G_M000_IG11
align [0 bytes for IG08]
G_M000_IG08: ;; offset=0x0320
vmovups zmm3, zmmword ptr [rdx]
vmovaps zmm4, zmm1
vpslld zmm5, zmm3, 16
vpord zmm5, zmm5, zmm4
vplzcntd zmm5, zmm5
vpord zmm3, zmm3, zmm4
vplzcntd zmm3, zmm3
vpslld zmm3, zmm3, 16
vpord zmm3, zmm3, zmm5
vmovups zmm4, zmmword ptr [rdx+0x40]
vmovaps zmm5, zmm1
vpslld zmm16, zmm4, 16
vpord zmm16, zmm16, zmm5
vplzcntd zmm16, zmm16
vpord zmm4, zmm4, zmm5
vplzcntd zmm4, zmm4
vpslld zmm4, zmm4, 16
vpord zmm4, zmm4, zmm16
vmovups zmm5, zmmword ptr [rdx+0x80]
vmovaps zmm16, zmm1
vpslld zmm17, zmm5, 16
vpord zmm17, zmm17, zmm16
vplzcntd zmm17, zmm17
vpord zmm5, zmm5, zmm16
vplzcntd zmm5, zmm5
vpslld zmm5, zmm5, 16
vpord zmm5, zmm5, zmm17
vmovups zmm16, zmmword ptr [rdx+0xC0]
vmovaps zmm17, zmm1
vpslld zmm18, zmm16, 16
vpord zmm18, zmm18, zmm17
vplzcntd zmm18, zmm18
vpord zmm16, zmm16, zmm17
vplzcntd zmm16, zmm16
vpslld zmm16, zmm16, 16
vpord zmm16, zmm16, zmm18
vmovntdq zmmword ptr [r9], zmm3
vmovntdq zmmword ptr [r9+0x40], zmm4
vmovntdq zmmword ptr [r9+0x80], zmm5
vmovntdq zmmword ptr [r9+0xC0], zmm16
vmovups zmm3, zmmword ptr [rdx+0x100]
vmovaps zmm4, zmm1
vpslld zmm5, zmm3, 16
vpord zmm5, zmm5, zmm4
vplzcntd zmm5, zmm5
vpord zmm3, zmm3, zmm4
vplzcntd zmm3, zmm3
vpslld zmm3, zmm3, 16
vpord zmm3, zmm3, zmm5
vmovups zmm4, zmmword ptr [rdx+0x140]
vmovaps zmm5, zmm1
vpslld zmm16, zmm4, 16
vpord zmm16, zmm16, zmm5
vplzcntd zmm16, zmm16
vpord zmm4, zmm4, zmm5
vplzcntd zmm4, zmm4
vpslld zmm4, zmm4, 16
vpord zmm4, zmm4, zmm16
vmovups zmm5, zmmword ptr [rdx+0x180]
vmovaps zmm16, zmm1
vpslld zmm17, zmm5, 16
vpord zmm17, zmm17, zmm16
vplzcntd zmm17, zmm17
vpord zmm5, zmm5, zmm16
vplzcntd zmm5, zmm5
vpslld zmm5, zmm5, 16
vpord zmm5, zmm5, zmm17
vmovups zmm16, zmmword ptr [rdx+0x1C0]
vmovaps zmm17, zmm1
vpslld zmm18, zmm16, 16
vpord zmm18, zmm18, zmm17
G_M000_IG09: ;; offset=0x04E3
vplzcntd zmm18, zmm18
vpord zmm16, zmm16, zmm17
vplzcntd zmm16, zmm16
vpslld zmm16, zmm16, 16
vpord zmm16, zmm16, zmm18
vmovntdq zmmword ptr [r9+0x100], zmm3
vmovntdq zmmword ptr [r9+0x140], zmm4
vmovntdq zmmword ptr [r9+0x180], zmm5
vmovntdq zmmword ptr [r9+0x1C0], zmm16
add rdx, 512
add r9, 512
add r8, -256
G_M000_IG10: ;; offset=0x0533
cmp r8, 256
jae G_M000_IG08
G_M000_IG11: ;; offset=0x0540
mov rcx, rdx
mov rdx, r9
xor r10d, r10d
mov bword ptr [rsp], r10
G_M000_IG12: ;; offset=0x054D
mov bword ptr [rsp+0x08], r10
G_M000_IG13: ;; offset=0x0552
mov r10, r8
lea r8, [r10+0x1F]
and r8, -32
mov r9, r8
shr r9, 5
cmp r9, 8
ja G_M000_IG26
G_M000_IG14: ;; offset=0x056E
cmp r9d, 8
ja G_M000_IG26
G_M000_IG15: ;; offset=0x0578
mov r9d, r9d
lea r11, [reloc @RWD64]
mov r11d, dword ptr [r11+4*r9]
lea rbx, G_M000_IG02
add r11, rbx
jmp r11
G_M000_IG16: ;; offset=0x0593
vmovups zmm3, zmmword ptr [rcx+2*r8-0x200]
vmovaps zmm4, zmm1
vpslld zmm5, zmm3, 16
vpord zmm5, zmm5, zmm4
vplzcntd zmm5, zmm5
vpord zmm3, zmm3, zmm4
vplzcntd zmm3, zmm3
vpslld zmm3, zmm3, 16
vpord zmm3, zmm3, zmm5
vmovups zmmword ptr [rdx+2*r8-0x200], zmm3
G_M000_IG17: ;; offset=0x05D5
vmovups zmm3, zmmword ptr [rcx+2*r8-0x1C0]
vmovaps zmm4, zmm1
vpslld zmm5, zmm3, 16
vpord zmm5, zmm5, zmm4
vplzcntd zmm5, zmm5
vpord zmm3, zmm3, zmm4
vplzcntd zmm3, zmm3
vpslld zmm3, zmm3, 16
vpord zmm3, zmm3, zmm5
vmovups zmmword ptr [rdx+2*r8-0x1C0], zmm3
G_M000_IG18: ;; offset=0x0617
vmovups zmm3, zmmword ptr [rcx+2*r8-0x180]
vmovaps zmm4, zmm1
vpslld zmm5, zmm3, 16
vpord zmm5, zmm5, zmm4
vplzcntd zmm5, zmm5
vpord zmm3, zmm3, zmm4
vplzcntd zmm3, zmm3
vpslld zmm3, zmm3, 16
vpord zmm3, zmm3, zmm5
vmovups zmmword ptr [rdx+2*r8-0x180], zmm3
G_M000_IG19: ;; offset=0x0659
vmovups zmm3, zmmword ptr [rcx+2*r8-0x140]
vmovaps zmm4, zmm1
vpslld zmm5, zmm3, 16
vpord zmm5, zmm5, zmm4
vplzcntd zmm5, zmm5
vpord zmm3, zmm3, zmm4
vplzcntd zmm3, zmm3
vpslld zmm3, zmm3, 16
vpord zmm3, zmm3, zmm5
vmovups zmmword ptr [rdx+2*r8-0x140], zmm3
G_M000_IG20: ;; offset=0x069B
vmovups zmm3, zmmword ptr [rcx+2*r8-0x100]
vmovaps zmm4, zmm1
vpslld zmm5, zmm3, 16
vpord zmm5, zmm5, zmm4
vplzcntd zmm5, zmm5
vpord zmm3, zmm3, zmm4
vplzcntd zmm3, zmm3
vpslld zmm3, zmm3, 16
vpord zmm3, zmm3, zmm5
vmovups zmmword ptr [rdx+2*r8-0x100], zmm3
G_M000_IG21: ;; offset=0x06DD
vmovups zmm3, zmmword ptr [rcx+2*r8-0xC0]
vmovaps zmm4, zmm1
vpslld zmm5, zmm3, 16
vpord zmm5, zmm5, zmm4
vplzcntd zmm5, zmm5
vpord zmm3, zmm3, zmm4
vplzcntd zmm3, zmm3
vpslld zmm3, zmm3, 16
vpord zmm3, zmm3, zmm5
vmovups zmmword ptr [rdx+2*r8-0xC0], zmm3
G_M000_IG22: ;; offset=0x071F
vmovups zmm3, zmmword ptr [rcx+2*r8-0x80]
vpslld zmm4, zmm3, 16
vpord zmm4, zmm4, zmm1
vplzcntd zmm4, zmm4
vpord zmm1, zmm3, zmm1
vplzcntd zmm1, zmm1
vpslld zmm1, zmm1, 16
vpord zmm1, zmm1, zmm4
vmovups zmmword ptr [rdx+2*r8-0x80], zmm1
G_M000_IG23: ;; offset=0x075B
vmovups zmmword ptr [rdx+2*r10-0x40], zmm2
G_M000_IG24: ;; offset=0x0763
vmovups zmmword ptr [rax], zmm0
G_M000_IG25: ;; offset=0x0769
vzeroupper
add rsp, 16
pop rbx
ret
G_M000_IG26: ;; offset=0x0772
vzeroupper
add rsp, 16
pop rbx
ret
RWD00 dq 0000FFFF0000FFFFh, 0000FFFF0000FFFFh, 0000FFFF0000FFFFh, 0000FFFF0000FFFFh, 0000FFFF0000FFFFh, 0000FFFF0000FFFFh, 0000FFFF0000FFFFh, 0000FFFF0000FFFFh
RWD64 dd 00000753h ; case G_M000_IG24
dd 0000074Bh ; case G_M000_IG23
dd 0000070Fh ; case G_M000_IG22
dd 000006CDh ; case G_M000_IG21
dd 0000068Bh ; case G_M000_IG20
dd 00000649h ; case G_M000_IG19
dd 00000607h ; case G_M000_IG18
dd 000005C5h ; case G_M000_IG17
dd 00000583h ; case G_M000_IG16
; Total bytes of code 1915
Codegen - Widen+narrow
; Assembly listing for method System.Numerics.Tensors.TensorPrimitives:<InvokeSpanIntoSpan>g__Vectorized512|105_3[short,short,System.Numerics.Tensors.TensorPrimitives+LeadingZeroCountOperator`1[short]](byref,byref,ulong) (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; FullOpts code
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 124 single block inlinees; 50 inlinees without PGO data
G_M000_IG01: ;; offset=0x0000
push rbx
sub rsp, 16
xor eax, eax
mov qword ptr [rsp+0x08], rax
mov qword ptr [rsp], rax
vxorps xmm0, xmm0, xmm0
vmovaps xmm1, xmm0
vmovaps xmm2, xmm0
vmovaps xmm3, xmm0
vmovaps xmm4, xmm0
vmovaps xmm5, xmm0
vmovaps xmm16, xmm0
vmovaps xmm17, xmm0
vmovaps xmm18, xmm0
vmovaps xmm19, xmm0
vmovaps xmm20, xmm0
vmovaps xmm21, xmm0
vmovaps xmm22, xmm0
vmovaps xmm23, xmm0
vmovaps xmm24, xmm0
vmovaps xmm25, xmm0
G_M000_IG02: ;; offset=0x0064
mov rax, rdx
vmovups zmm26, zmmword ptr [rcx]
vextracti32x8 ymm27, zmm26, 1
vpmovzxwd zmm27, zmm27
vplzcntd zmm27, zmm27
vpmovzxwd zmm26, zmm26
vplzcntd zmm26, zmm26
vpmovdw zmm26, zmm26
vpmovdw zmm27, zmm27
vxorps ymm28, ymm28, ymm28
vinsertf64x4 zmm26, zmm28, ymm26, 0
vinsertf64x4 zmm26, zmm26, ymm27, 1
vmovups zmm27, zmmword ptr [reloc @RWD00]
vpsubw zmm26, zmm26, zmm27
vmovups zmm28, zmmword ptr [rcx+2*r8-0x40]
vextracti32x8 ymm29, zmm28, 1
vpmovzxwd zmm29, zmm29
vplzcntd zmm29, zmm29
vpmovzxwd zmm28, zmm28
vplzcntd zmm28, zmm28
vpmovdw zmm28, zmm28
vpmovdw zmm29, zmm29
vxorps ymm30, ymm30, ymm30
vinsertf64x4 zmm28, zmm30, ymm28, 0
vinsertf64x4 zmm28, zmm28, ymm29, 1
vpsubw zmm28, zmm28, zmm27
cmp r8, 256
jbe G_M000_IG13
G_M000_IG03: ;; offset=0x0116
mov bword ptr [rsp+0x08], rcx
mov bword ptr [rsp], rax
mov rdx, rcx
mov r10, rax
mov r9, r10
test r9b, 1
sete r11b
movzx r11, r11b
test r11d, r11d
je SHORT G_M000_IG04
mov rdx, r10
and rdx, 63
neg rdx
add rdx, 64
shr rdx, 1
lea r9, [rdx+rdx]
add rcx, r9
add r9, r10
sub r8, rdx
mov rdx, rcx
G_M000_IG04: ;; offset=0x015A
cmp r8, 0x20000
seta cl
movzx rcx, cl
test ecx, r11d
jne G_M000_IG10
align [0 bytes for IG05]
G_M000_IG05: ;; offset=0x0170
cmp r8, 256
jb G_M000_IG11
G_M000_IG06: ;; offset=0x017D
vmovups zmm0, zmmword ptr [rdx]
vextracti32x8 ymm1, zmm0, 1
vpmovzxwd zmm2, zmm1
vplzcntd zmm3, zmm2
vpmovzxwd zmm4, zmm0
vplzcntd zmm5, zmm4
vpmovdw zmm5, zmm16
vpmovdw zmm3, zmm17
vinsertf64x4 zmm18, zmm18, ymm16, 0
vinsertf64x4 zmm18, zmm18, ymm17, 1
vpsubw zmm0, zmm18, zmm27
vmovups zmm1, zmmword ptr [rdx+0x40]
vextracti32x8 ymm2, zmm1, 1
vpmovzxwd zmm2, zmm2
vplzcntd zmm2, zmm2
vpmovzxwd zmm1, zmm1
vplzcntd zmm1, zmm1
vpmovdw zmm1, zmm1
vpmovdw zmm2, zmm2
vinsertf64x4 zmm19, zmm19, ymm1, 0
vinsertf64x4 zmm19, zmm19, ymm2, 1
vpsubw zmm1, zmm19, zmm27
vmovups zmm2, zmmword ptr [rdx+0x80]
vextracti32x8 ymm3, zmm2, 1
vpmovzxwd zmm3, zmm3
vplzcntd zmm3, zmm3
vpmovzxwd zmm2, zmm2
vplzcntd zmm2, zmm2
vpmovdw zmm2, zmm2
vpmovdw zmm3, zmm3
vinsertf64x4 zmm20, zmm20, ymm2, 0
vinsertf64x4 zmm20, zmm20, ymm3, 1
vpsubw zmm2, zmm20, zmm27
vmovups zmm3, zmmword ptr [rdx+0xC0]
vextracti32x8 ymm4, zmm3, 1
vpmovzxwd zmm4, zmm4
vplzcntd zmm4, zmm4
vpmovzxwd zmm3, zmm3
vplzcntd zmm3, zmm3
vpmovdw zmm3, zmm3
vpmovdw zmm4, zmm4
vinsertf64x4 zmm21, zmm21, ymm3, 0
vinsertf64x4 zmm21, zmm21, ymm4, 1
vpsubw zmm3, zmm21, zmm27
vmovups zmmword ptr [r9], zmm0
vmovups zmmword ptr [r9+0x40], zmm1
vmovups zmmword ptr [r9+0x80], zmm2
vmovups zmmword ptr [r9+0xC0], zmm3
vmovups zmm0, zmmword ptr [rdx+0x100]
vextracti32x8 ymm1, zmm0, 1
vpmovzxwd zmm2, zmm1
vplzcntd zmm1, zmm2
vpmovzxwd zmm0, zmm0
vplzcntd zmm0, zmm0
vpmovdw zmm0, zmm0
vpmovdw zmm1, zmm1
vinsertf64x4 zmm22, zmm22, ymm0, 0
vinsertf64x4 zmm22, zmm22, ymm1, 1
vpsubw zmm0, zmm22, zmm27
vmovups zmm1, zmmword ptr [rdx+0x140]
vextracti32x8 ymm2, zmm1, 1
vpmovzxwd zmm2, zmm2
vplzcntd zmm2, zmm2
vpmovzxwd zmm1, zmm1
vplzcntd zmm1, zmm1
vpmovdw zmm1, zmm1
vpmovdw zmm2, zmm2
vinsertf64x4 zmm23, zmm23, ymm1, 0
vinsertf64x4 zmm23, zmm23, ymm2, 1
vpsubw zmm1, zmm23, zmm27
vmovups zmm2, zmmword ptr [rdx+0x180]
vextracti32x8 ymm3, zmm2, 1
vpmovzxwd zmm3, zmm3
vplzcntd zmm3, zmm3
G_M000_IG07: ;; offset=0x0355
vpmovzxwd zmm2, zmm2
vplzcntd zmm2, zmm2
vpmovdw zmm2, zmm2
vpmovdw zmm3, zmm3
vinsertf64x4 zmm24, zmm24, ymm2, 0
vinsertf64x4 zmm24, zmm24, ymm3, 1
vpsubw zmm2, zmm24, zmm27
vmovups zmm3, zmmword ptr [rdx+0x1C0]
vextracti32x8 ymm4, zmm3, 1
vpmovzxwd zmm4, zmm4
vplzcntd zmm4, zmm4
vpmovzxwd zmm3, zmm3
vplzcntd zmm3, zmm3
vpmovdw zmm3, zmm3
vpmovdw zmm4, zmm4
vinsertf64x4 zmm25, zmm25, ymm3, 0
vinsertf64x4 zmm25, zmm25, ymm4, 1
vpsubw zmm3, zmm25, zmm27
vmovups zmmword ptr [r9+0x100], zmm0
vmovups zmmword ptr [r9+0x140], zmm1
vmovups zmmword ptr [r9+0x180], zmm2
vmovups zmmword ptr [r9+0x1C0], zmm3
add rdx, 512
add r9, 512
add r8, -256
jmp G_M000_IG05
align [0 bytes for IG08]
G_M000_IG08: ;; offset=0x03FD
vmovups zmm18, zmmword ptr [rdx]
vextracti32x8 ymm19, zmm18, 1
vpmovzxwd zmm19, zmm19
vplzcntd zmm19, zmm19
vpmovzxwd zmm18, zmm18
vplzcntd zmm18, zmm18
vpmovdw zmm18, zmm18
vpmovdw zmm19, zmm19
vinsertf64x4 zmm0, zmm0, ymm18, 0
vinsertf64x4 zmm0, zmm0, ymm19, 1
vpsubw zmm18, zmm0, zmm27
vmovups zmm19, zmmword ptr [rdx+0x40]
vextracti32x8 ymm20, zmm19, 1
vpmovzxwd zmm20, zmm20
vplzcntd zmm20, zmm20
vpmovzxwd zmm19, zmm19
vplzcntd zmm19, zmm19
vpmovdw zmm19, zmm19
vpmovdw zmm20, zmm20
vinsertf64x4 zmm1, zmm1, ymm19, 0
vinsertf64x4 zmm1, zmm1, ymm20, 1
vpsubw zmm19, zmm1, zmm27
vmovups zmm20, zmmword ptr [rdx+0x80]
vextracti32x8 ymm21, zmm20, 1
vpmovzxwd zmm21, zmm21
vplzcntd zmm21, zmm21
vpmovzxwd zmm20, zmm20
vplzcntd zmm20, zmm20
vpmovdw zmm20, zmm20
vpmovdw zmm21, zmm21
vinsertf64x4 zmm2, zmm2, ymm20, 0
vinsertf64x4 zmm2, zmm2, ymm21, 1
vpsubw zmm20, zmm2, zmm27
vmovups zmm21, zmmword ptr [rdx+0xC0]
vextracti32x8 ymm22, zmm21, 1
vpmovzxwd zmm22, zmm22
vplzcntd zmm22, zmm22
vpmovzxwd zmm21, zmm21
vplzcntd zmm21, zmm21
vpmovdw zmm21, zmm21
vpmovdw zmm22, zmm22
vinsertf64x4 zmm3, zmm3, ymm21, 0
vinsertf64x4 zmm3, zmm3, ymm22, 1
vpsubw zmm21, zmm3, zmm27
vmovntdq zmmword ptr [r9], zmm18
vmovntdq zmmword ptr [r9+0x40], zmm19
vmovntdq zmmword ptr [r9+0x80], zmm20
vmovntdq zmmword ptr [r9+0xC0], zmm21
vmovups zmm18, zmmword ptr [rdx+0x100]
vextracti32x8 ymm19, zmm18, 1
vpmovzxwd zmm20, zmm19
vplzcntd zmm19, zmm20
vpmovzxwd zmm18, zmm18
vplzcntd zmm18, zmm18
vpmovdw zmm18, zmm18
vpmovdw zmm19, zmm19
vinsertf64x4 zmm4, zmm4, ymm18, 0
vinsertf64x4 zmm4, zmm4, ymm19, 1
vpsubw zmm18, zmm4, zmm27
vmovups zmm19, zmmword ptr [rdx+0x140]
vextracti32x8 ymm20, zmm19, 1
vpmovzxwd zmm20, zmm20
vplzcntd zmm20, zmm20
vpmovzxwd zmm19, zmm19
vplzcntd zmm19, zmm19
vpmovdw zmm19, zmm19
vpmovdw zmm20, zmm20
vinsertf64x4 zmm5, zmm5, ymm19, 0
vinsertf64x4 zmm5, zmm5, ymm20, 1
vpsubw zmm19, zmm5, zmm27
vmovups zmm20, zmmword ptr [rdx+0x180]
vextracti32x8 ymm21, zmm20, 1
vpmovzxwd zmm21, zmm21
vplzcntd zmm21, zmm21
G_M000_IG09: ;; offset=0x05D5
vpmovzxwd zmm20, zmm20
vplzcntd zmm20, zmm20
vpmovdw zmm20, zmm20
vpmovdw zmm21, zmm21
vinsertf64x4 zmm16, zmm16, ymm20, 0
vinsertf64x4 zmm16, zmm16, ymm21, 1
vpsubw zmm20, zmm16, zmm27
vmovups zmm21, zmmword ptr [rdx+0x1C0]
vextracti32x8 ymm22, zmm21, 1
vpmovzxwd zmm22, zmm22
vplzcntd zmm22, zmm22
vpmovzxwd zmm21, zmm21
vplzcntd zmm21, zmm21
vpmovdw zmm21, zmm21
vpmovdw zmm22, zmm22
vinsertf64x4 zmm17, zmm17, ymm21, 0
vinsertf64x4 zmm17, zmm17, ymm22, 1
vpsubw zmm21, zmm17, zmm27
vmovntdq zmmword ptr [r9+0x100], zmm18
vmovntdq zmmword ptr [r9+0x140], zmm19
vmovntdq zmmword ptr [r9+0x180], zmm20
vmovntdq zmmword ptr [r9+0x1C0], zmm21
add rdx, 512
add r9, 512
add r8, -256
G_M000_IG10: ;; offset=0x0678
cmp r8, 256
jae G_M000_IG08
G_M000_IG11: ;; offset=0x0685
mov rcx, rdx
mov rdx, r9
xor r10d, r10d
mov bword ptr [rsp], r10
G_M000_IG12: ;; offset=0x0692
mov bword ptr [rsp+0x08], r10
G_M000_IG13: ;; offset=0x0697
mov r10, r8
lea r8, [r10+0x1F]
and r8, -32
mov r9, r8
shr r9, 5
cmp r9, 8
ja G_M000_IG25
G_M000_IG14: ;; offset=0x06B3
cmp r9d, 8
ja G_M000_IG25
G_M000_IG15: ;; offset=0x06BD
mov r9d, r9d
lea r11, [reloc @RWD64]
mov r11d, dword ptr [r11+4*r9]
lea rbx, G_M000_IG02
add r11, rbx
jmp r11
G_M000_IG16: ;; offset=0x06D8
vmovups zmm0, zmmword ptr [rcx+2*r8-0x200]
vextracti32x8 ymm1, zmm0, 1
vpmovzxwd zmm1, zmm1
vplzcntd zmm1, zmm1
vpmovzxwd zmm0, zmm0
vplzcntd zmm0, zmm0
vpmovdw zmm0, zmm0
vpmovdw zmm1, zmm1
vxorps ymm2, ymm2, ymm2
vinsertf64x4 zmm0, zmm2, ymm0, 0
vinsertf64x4 zmm0, zmm0, ymm1, 1
vpsubw zmm0, zmm0, zmm27
vmovups zmmword ptr [rdx+2*r8-0x200], zmm0
G_M000_IG17: ;; offset=0x072B
vmovups zmm0, zmmword ptr [rcx+2*r8-0x1C0]
vextracti32x8 ymm1, zmm0, 1
vpmovzxwd zmm1, zmm1
vplzcntd zmm1, zmm1
vpmovzxwd zmm0, zmm0
vplzcntd zmm0, zmm0
vpmovdw zmm0, zmm0
vpmovdw zmm1, zmm1
vxorps ymm2, ymm2, ymm2
vinsertf64x4 zmm0, zmm2, ymm0, 0
vinsertf64x4 zmm0, zmm0, ymm1, 1
vpsubw zmm0, zmm0, zmm27
vmovups zmmword ptr [rdx+2*r8-0x1C0], zmm0
G_M000_IG18: ;; offset=0x077E
vmovups zmm0, zmmword ptr [rcx+2*r8-0x180]
vextracti32x8 ymm1, zmm0, 1
vpmovzxwd zmm1, zmm1
vplzcntd zmm1, zmm1
vpmovzxwd zmm0, zmm0
vplzcntd zmm0, zmm0
vpmovdw zmm0, zmm0
vpmovdw zmm1, zmm1
vxorps ymm2, ymm2, ymm2
vinsertf64x4 zmm0, zmm2, ymm0, 0
vinsertf64x4 zmm0, zmm0, ymm1, 1
vpsubw zmm0, zmm0, zmm27
vmovups zmmword ptr [rdx+2*r8-0x180], zmm0
G_M000_IG19: ;; offset=0x07D1
vmovups zmm0, zmmword ptr [rcx+2*r8-0x140]
vextracti32x8 ymm1, zmm0, 1
vpmovzxwd zmm1, zmm1
vplzcntd zmm1, zmm1
vpmovzxwd zmm0, zmm0
vplzcntd zmm0, zmm0
vpmovdw zmm0, zmm0
vpmovdw zmm1, zmm1
vxorps ymm2, ymm2, ymm2
vinsertf64x4 zmm0, zmm2, ymm0, 0
vinsertf64x4 zmm0, zmm0, ymm1, 1
vpsubw zmm0, zmm0, zmm27
vmovups zmmword ptr [rdx+2*r8-0x140], zmm0
G_M000_IG20: ;; offset=0x0824
vmovups zmm0, zmmword ptr [rcx+2*r8-0x100]
vextracti32x8 ymm1, zmm0, 1
vpmovzxwd zmm1, zmm1
vplzcntd zmm1, zmm1
vpmovzxwd zmm0, zmm0
vplzcntd zmm0, zmm0
vpmovdw zmm0, zmm0
vpmovdw zmm1, zmm1
vxorps ymm2, ymm2, ymm2
vinsertf64x4 zmm0, zmm2, ymm0, 0
vinsertf64x4 zmm0, zmm0, ymm1, 1
vpsubw zmm0, zmm0, zmm27
vmovups zmmword ptr [rdx+2*r8-0x100], zmm0
G_M000_IG21: ;; offset=0x0877
vmovups zmm0, zmmword ptr [rcx+2*r8-0xC0]
vextracti32x8 ymm1, zmm0, 1
vpmovzxwd zmm1, zmm1
vplzcntd zmm1, zmm1
vpmovzxwd zmm0, zmm0
vplzcntd zmm0, zmm0
vpmovdw zmm0, zmm0
vpmovdw zmm1, zmm1
vxorps ymm2, ymm2, ymm2
vinsertf64x4 zmm0, zmm2, ymm0, 0
vinsertf64x4 zmm0, zmm0, ymm1, 1
vpsubw zmm0, zmm0, zmm27
vmovups zmmword ptr [rdx+2*r8-0xC0], zmm0
G_M000_IG22: ;; offset=0x08CA
vmovups zmm0, zmmword ptr [rcx+2*r8-0x80]
vextracti32x8 ymm1, zmm0, 1
vpmovzxwd zmm1, zmm1
vplzcntd zmm1, zmm1
vpmovzxwd zmm0, zmm0
vplzcntd zmm0, zmm0
vpmovdw zmm0, zmm0
vpmovdw zmm1, zmm1
vxorps ymm2, ymm2, ymm2
vinsertf64x4 zmm0, zmm2, ymm0, 0
vinsertf64x4 zmm0, zmm0, ymm1, 1
vpsubw zmm0, zmm0, zmm27
vmovups zmmword ptr [rdx+2*r8-0x80], zmm0
G_M000_IG23: ;; offset=0x091D
vmovups zmmword ptr [rdx+2*r10-0x40], zmm28
G_M000_IG24: ;; offset=0x0925
vmovups zmmword ptr [rax], zmm26
G_M000_IG25: ;; offset=0x092B
vzeroupper
add rsp, 16
pop rbx
ret
RWD00 dq 0010001000100010h, 0010001000100010h, 0010001000100010h, 0010001000100010h, 0010001000100010h, 0010001000100010h, 0010001000100010h, 0010001000100010h
RWD64 dd 000008C1h ; case G_M000_IG24
dd 000008B9h ; case G_M000_IG23
dd 00000866h ; case G_M000_IG22
dd 00000813h ; case G_M000_IG21
dd 000007C0h ; case G_M000_IG20
dd 0000076Dh ; case G_M000_IG19
dd 0000071Ah ; case G_M000_IG18
dd 000006C7h ; case G_M000_IG17
dd 00000674h ; case G_M000_IG16
; Total bytes of code 2356
One more thing to highlight in the above codegen is that the current PR produces 9 instructions for each invocation of LeadingZeroCount. Using the widen+narrow approach produces 11 instructions.
The original loop is unrolled, so I'm copying the codegen of just one invocation of LeadingZeroCount:
Current PR -- 9 total instructions
vmovups zmm3, zmmword ptr [rdx]
vmovaps zmm4, zmm1
vpslld zmm5, zmm3, 16
vpord zmm5, zmm5, zmm4
vplzcntd zmm5, zmm5
vpord zmm3, zmm3, zmm4
vplzcntd zmm3, zmm3
vpslld zmm3, zmm3, 16
vpord zmm3, zmm3, zmm5
Widen+Narrow -- 11 total instructions
vmovups zmm0, zmmword ptr [rdx]
vextracti32x8 ymm1, zmm0, 1
vpmovzxwd zmm2, zmm1
vplzcntd zmm3, zmm2
vpmovzxwd zmm4, zmm0
vplzcntd zmm5, zmm4
vpmovdw zmm5, zmm16
vpmovdw zmm3, zmm17
vinsertf64x4 zmm18, zmm18, ymm16, 0
vinsertf64x4 zmm18, zmm18, ymm17, 1
vpsubw zmm0, zmm18, zmm27
I'm still proposing the current PR as it produces faster and smaller codegen.
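For reference, here is a scalar restatement of the packing trick used in these snippets (an illustrative sketch, not code from the PR; the helper name LeadingZeroCountPair is made up). Each 32-bit lane holds two 16-bit elements, and OR-ing in a block of low ones lets a single 32-bit lzcnt produce the 16-bit count for either element:
using System.Numerics;
// Scalar illustration: one 32-bit lzcnt lane yields two 16-bit results.
// 'lane' packs two ushorts: low element in bits 0..15, high element in bits 16..31.
static (int lzLow, int lzHigh) LeadingZeroCountPair(uint lane)
{
    uint bot16 = (lane << 16) | 0x0000FFFF; // low element moved to the top; the ones below cap the count at 16
    uint top16 = lane | 0x0000FFFF;         // high element stays on top; the ones below cap the count at 16
    int lzLow = BitOperations.LeadingZeroCount(bot16);  // equals the 16-bit lzcnt of the low element
    int lzHigh = BitOperations.LeadingZeroCount(top16); // equals the 16-bit lzcnt of the high element
    // The vector code recombines these as lz_bot16 | (lz_top16 << 16) and reinterprets the lanes as ushorts.
    return (lzLow, lzHigh);
}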
Vector128<byte> lookupVectorLow = Vector128.Create((byte)8, 7, 6, 6, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4);
Vector128<byte> lookupVectorHigh = Vector128.Create((byte)3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0);
Vector128<byte> nibbleMask = Vector128.Create<byte>(0xF);
Vector128<byte> permuteMask = Vector128.Create<byte>(0x80);
Vector128<byte> lowNibble = x.AsByte() & nibbleMask;
Vector128<byte> highNibble = Sse2.ShiftRightLogical(x.AsInt32(), 4).AsByte() & nibbleMask;
Vector128<byte> nibbleSelectMask = Sse2.CompareEqual(highNibble, Vector128<byte>.Zero);
Vector128<byte> indexVector = Sse41.BlendVariable(highNibble, lowNibble, nibbleSelectMask) +
                              (~nibbleSelectMask & nibbleMask);
indexVector |= ~nibbleSelectMask & permuteMask;
return Avx512Vbmi.VL.PermuteVar16x8x2(lookupVectorLow, indexVector, lookupVectorHigh).As<byte, T>();
Is this cheaper than:
Vector512<int> x32 = Avx512F.ConvertToVector512Int32(x.AsByte());
Vector512<int> lz = Avx512CD.LeadingZeroCount(x32);
return Avx512F.ConvertToVector128Byte(lz) - Vector128.Create(24);
There is overhead when widening-unwidening.
For this case, the widening here gives a bimodal performance result. To verify, the same microbenchmark can be modified to stress this path specifically by using BufferLength=16.
Some runs look like this:
| Method | Job | Toolchain | BufferLength | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|----------------- |----------- |------------------------------------------------ |------------- |----------:|----------:|----------:|----------:|----------:|----------:|------:|--------:|----------:|------------:|
| LeadingZeroCount | Job-RJAKMA | Current PR | 16 | 2.676 ns | 0.0525 ns | 0.0491 ns | 2.680 ns | 2.610 ns | 2.754 ns | 1.00 | 0.03 | - | NA |
| LeadingZeroCount | Job-FSPMRZ | Widen | 16 | 3.485 ns | 0.0365 ns | 0.0342 ns | 3.502 ns | 3.428 ns | 3.526 ns | 1.30 | 0.03 | - | NA |
Other runs look like this:
| Method | Job | Toolchain | BufferLength | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|----------------- |----------- |------------------------------------------------ |------------- |----------:|----------:|----------:|----------:|----------:|----------:|------:|--------:|----------:|------------:|
| LeadingZeroCount | Job-MGUUAK | Current PR | 16 | 2.683 ns | 0.0424 ns | 0.0396 ns | 2.695 ns | 2.616 ns | 2.733 ns | 1.00 | 0.02 | - | NA |
| LeadingZeroCount | Job-NBPOWJ | Widen | 16 | 2.484 ns | 0.0334 ns | 0.0296 ns | 2.492 ns | 2.427 ns | 2.519 ns | 0.93 | 0.02 | - | NA |
I chose this version because it was more consistent.
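As an aside, a scalar sketch of the nibble-lookup idea behind this snippet (illustrative only; NibbleLzcnt and LeadingZeroCountByte are made-up names, not code from the PR). The leading-zero count of a byte comes from a 16-entry table of nibble counts, using the low nibble only when the high nibble is zero, which is what lookupVectorLow/lookupVectorHigh encode above:
// Leading-zero count of a 4-bit value, for nibble values 0..15.
static readonly byte[] NibbleLzcnt = { 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 };
static int LeadingZeroCountByte(byte b)
{
    int high = b >> 4;
    int low = b & 0xF;
    // High nibble zero: 4 leading zeros from the high nibble plus the low-nibble count
    // (lookupVectorLow: 8, 7, 6, 6, 5, ...). Otherwise only the high nibble matters
    // (lookupVectorHigh: 3, 2, 2, 1, ...).
    return high == 0 ? 4 + NibbleLzcnt[low] : NibbleLzcnt[high];
}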
Vector256<byte> lookupVector =
    Vector256.Create((byte)8, 7, 6, 6, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,
                     3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0);
Vector256<byte> nibbleMask = Vector256.Create<byte>(0xF);
Vector256<byte> lowNibble = x.AsByte() & nibbleMask;
Vector256<byte> highNibble = Avx2.ShiftRightLogical(x.AsInt32(), 4).AsByte() & nibbleMask;
Vector256<byte> nibbleSelectMask = Avx2.CompareEqual(highNibble, Vector256<byte>.Zero);
Vector256<byte> indexVector = Avx2.BlendVariable(highNibble, lowNibble, nibbleSelectMask) +
                              (~nibbleSelectMask & nibbleMask);
return Avx512Vbmi.VL.PermuteVar32x8(lookupVector, indexVector).As<byte, T>();
Similar question as previous, but doing WidenLower/WidenUpper since it's 1024 bits total.
Similar comment as the Vector128<byte> case.
There is overhead when widening-unwidening. It isn't as bad here, but both versions perform very similarly. Can be verified with BufferLength=32 to stress this path:
| Method | Job | Toolchain | BufferLength | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|----------------- |----------- |------------------------------------------------ |------------- |----------:|----------:|----------:|----------:|----------:|----------:|------:|--------:|----------:|------------:|
| LeadingZeroCount | Job-RJAKMA | Current PR | 32 | 2.682 ns | 0.0387 ns | 0.0362 ns | 2.702 ns | 2.614 ns | 2.724 ns | 1.00 | 0.02 | - | NA |
| LeadingZeroCount | Job-FSPMRZ | Widen | 32 | 2.685 ns | 0.0312 ns | 0.0292 ns | 2.691 ns | 2.608 ns | 2.721 ns | 0.91 | 0.02 | - | NA |
Vector256<uint> lowHalf = Vector256.Create((uint)0x0000FFFF);
Vector256<uint> x_bot16 = Avx2.Or(Avx2.ShiftLeftLogical(x.AsUInt32(), 16), lowHalf);
Vector256<uint> x_top16 = Avx2.Or(x.AsUInt32(), lowHalf);
Vector256<uint> lz_bot16 = Avx512CD.VL.LeadingZeroCount(x_bot16);
Vector256<uint> lz_top16 = Avx512CD.VL.LeadingZeroCount(x_top16);
Vector256<uint> lz_top16_shift = Avx2.ShiftLeftLogical(lz_top16, 16);
return Avx2.Or(lz_bot16, lz_top16_shift).AsUInt16().As<ushort, T>();
Similar question as previous, widening to Vector512
Widening-unwidening has similar performance in this case. Can be verified with BufferLength=16:
| Type | Method | Job | Toolchain | BufferLength | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|------------------------------------------ |----------------- |----------- |------------------------------------------------ |------------- |----------:|----------:|----------:|----------:|----------:|----------:|------:|--------:|----------:|------------:|
| Perf_BinaryIntegerTensorPrimitives<Int16> | LeadingZeroCount | Job-WWKLJQ | Current PR | 16 | 2.485 ns | 0.0442 ns | 0.0414 ns | 2.496 ns | 2.410 ns | 2.530 ns | 1.00 | 0.02 | - | NA |
| Perf_BinaryIntegerTensorPrimitives<Int16> | LeadingZeroCount | Job-VFJFZO | Widen | 16 | 2.474 ns | 0.0542 ns | 0.0507 ns | 2.495 ns | 2.402 ns | 2.529 ns | 1.00 | 0.03 | - | NA |
Vector512<uint> lowHalf = Vector512.Create((uint)0x0000FFFF);
Vector512<uint> x_bot16 = Avx512F.Or(Avx512F.ShiftLeftLogical(x.AsUInt32(), 16), lowHalf);
Vector512<uint> x_top16 = Avx512F.Or(x.AsUInt32(), lowHalf);
Vector512<uint> lz_bot16 = Avx512CD.LeadingZeroCount(x_bot16);
Vector512<uint> lz_top16 = Avx512CD.LeadingZeroCount(x_top16);
Vector512<uint> lz_top16_shift = Avx512F.ShiftLeftLogical(lz_top16, 16);
return Avx512F.Or(lz_bot16, lz_top16_shift).AsUInt16().As<ushort, T>();
Same question, doing WidenLower/WidenUpper
The widening-unwidening performance difference is most obvious for this case.
| Type | Method | Job | Toolchain | BufferLength | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|------------------------------------------ |----------------- |----------- |------------------------------------------------ |------------- |----------:|----------:|----------:|----------:|----------:|----------:|------:|--------:|----------:|------------:|
| Perf_BinaryIntegerTensorPrimitives<Int16> | LeadingZeroCount | Job-WWKLJQ | Current PR | 32 | 4.110 ns | 0.0621 ns | 0.0581 ns | 4.141 ns | 4.011 ns | 4.161 ns | 1.00 | 0.02 | - | NA |
| Perf_BinaryIntegerTensorPrimitives<Int16> | LeadingZeroCount | Job-VFJFZO | Widen | 32 | 10.421 ns | 0.0738 ns | 0.0690 ns | 10.445 ns | 10.261 ns | 10.515 ns | 2.54 | 0.04 | - | NA |
Vector512<byte> lookupVectorA =
    Vector512.Create((byte)8, 7, 6, 6, 5, 5, 5, 5,
                     4, 4, 4, 4, 4, 4, 4, 4,
                     3, 3, 3, 3, 3, 3, 3, 3,
                     3, 3, 3, 3, 3, 3, 3, 3,
                     2, 2, 2, 2, 2, 2, 2, 2,
                     2, 2, 2, 2, 2, 2, 2, 2,
                     2, 2, 2, 2, 2, 2, 2, 2,
                     2, 2, 2, 2, 2, 2, 2, 2);
Vector512<byte> lookupVectorB = Vector512.Create((byte)1);
Vector512<byte> bit7ZeroMask = Avx512BW.CompareLessThan(x.AsByte(), Vector512.Create((byte)128));
return Avx512F.And(bit7ZeroMask, Avx512Vbmi.PermuteVar64x8x2(lookupVectorA, x.AsByte(), lookupVectorB)).As<byte, T>();
I think this is the only one that shouldn't be simply Widen+Lzcnt. But it does warrant a comment elaborating on how the lookup works.
In particular, it isn't immediately obvious how PermuteVar64x8x2 operates, so elaborating that x is being used as an index where bit 6 selects the table, bits 5:0 select an index in the table, and anything where bit 7 is set is zeroed is goodness.
Makes sense. I've added a comment to better explain how x is being used as an index and how the intrinsic is choosing between the two lookup vectors.
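For readers following along, a sketch of what that indexing means in scalar terms (illustrative only; LookupA and LeadingZeroCountByteSketch are made-up names, and the actual comment in the PR may read differently). Each input byte acts as its own permute index: bit 6 chooses between the two lookup vectors, bits 5:0 pick the element, and bytes with bit 7 set are zeroed by the AND with bit7ZeroMask:
using System.Linq;
using System.Numerics;
// lookupVectorA as a flat array: the leading-zero counts of the byte values 0..63
// (8, 7, 6, 6, 5, 5, 5, 5, then eight 4s, sixteen 3s, and thirty-two 2s).
static readonly byte[] LookupA =
    Enumerable.Range(0, 64).Select(i => (byte)(BitOperations.LeadingZeroCount((uint)i) - 24)).ToArray();
static int LeadingZeroCountByteSketch(byte b) =>
    b >= 128 ? 0        // bit 7 set: the result is zeroed by ANDing with bit7ZeroMask
    : b >= 64 ? 1       // bit 6 set: the index selects lookupVectorB, which is all 1s
    : LookupA[b];       // bits 5:0 index into lookupVectorA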
This PR adds vector support for integer types of size Byte and Int16 to System.Numerics.Tensors.TensorPrimitives.LeadingZeroCount.
To verify there is a performance improvement, I ran against the existing microbenchmarks here. This does not currently include coverage for Int16, so I built a version locally that included short.
On my AMD64 system, I see the following improvements:
Baseline
Diff
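For context, a minimal usage sketch of the affected API (assuming the System.Numerics.Tensors package; the buffer contents are arbitrary):
using System;
using System.Numerics.Tensors;
short[] input = { 0, 1, 256, short.MaxValue, -1 };
short[] output = new short[input.Length];
// Element-wise leading-zero count; this PR adds vectorized Byte and Int16 paths.
TensorPrimitives.LeadingZeroCount<short>(input, output);
Console.WriteLine(string.Join(", ", output)); // 16, 15, 7, 1, 0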