Use `@simd` in `_vecdot` #603

tkoolen · 2019-04-30T05:47:21Z

Ref #512. Use @simd in _vecdot instead of manually unrolling and relying on the SLP vectorizer.

I noticed that the current implementation never uses 256-bit wide instructions (no ymm registers being used), while the @simd version does. Note that using @simd in conjunction with the statically known size of the matrix, the loop is fully unrolled up to a reasonable threshold even without using a generated function.

Benchmark 1

The naive benchmark; but I'm doubtful that this gives accurate results.

using StaticArrays
using LinearAlgebra
using BenchmarkTools

for n in [0 : 32; 100]
    # println(n)
    b = @benchmark dot(x, y) setup = begin
        x = rand(SVector{$n, Float64})
        y = rand(SVector{$n, Float64})
        # x = rand($n)
        # y = rand($n)
    end
    println(minimum(b.times))
end

Results:

`n`	`Vector`	before (ns)	after (ns)	Improvement
0	6.143	0.033	0.033	0%
1	10.051	1.504	1.548	-3%
2	10.531	1.744	1.549	11%
3	10.767	1.694	1.504	11%
4	11.516	1.651	1.663	-1%
5	11.728	1.654	1.700	-3%
6	12.198	1.943	1.804	7%
7	12.441	2.034	1.925	5%
8	12.914	2.358	1.734	26%
9	13.383	2.830	1.818	36%
10	13.868	3.257	1.925	41%
11	14.119	3.567	2.312	35%
12	14.601	4.034	2.164	46%
13	15.093	4.507	2.261	50%
14	15.765	4.912	2.484	49%
15	16.209	5.330	2.859	46%
16	12.557	5.743	2.190	62%
17	13.299	6.618	2.230	66%
18	14.031	7.334	2.759	62%
19	14.530	8.087	2.963	63%
20	15.095	8.212	2.887	65%
21	16.024	8.853	2.706	69%
22	16.920	9.457	3.354	65%
23	17.576	9.724	3.687	62%
24	18.084	10.118	3.150	69%
25	19.034	10.666	2.946	72%
26	19.785	11.177	3.728	67%
27	20.078	11.632	3.906	66%
28	20.881	12.372	4.025	67%
29	21.931	13.560	3.736	72%
30	22.603	14.221	4.060	71%
31	23.542	15.352	5.525	64%
32	13.158	16.121	5.030	69%
100	23.990	67.351	12.907	81%

The Vector column corresponds to using plain Vectors as a baseline. It should be noted that for n = 1 and n = 2, the code_native is the same before and after.

Benchmark 2

Here's perhaps a more realistic usage scenario:

using StaticArrays
using LinearAlgebra
using BenchmarkTools

for n in [0 : 32; 100]
    # println(n)
    vec_length_max = 100_000
    m = vec_length_max ÷ max(1, n)
    xs = [rand(SVector{n, Float64}) for _ = 1 : m]
    ys = [rand(SVector{n, Float64}) for _ = 1 : m]
    # xs = [rand(n) for _ = 1 : m]
    # ys = [rand(n) for _ = 1 : m]
    zs = zeros(m)
    b = @benchmark begin
        @inbounds $zs .= dot.($xs, $ys)
    end evals=1
    println(minimum(b.times) / m)
end

Results:

`n`	`Vector`	before (ns)	after (ns)	Improvement
0	6.143	0.140	0.139	0%
1	10.051	0.368	0.367	0%
2	10.531	0.599	0.583	3%
3	10.767	1.792	1.732	3%
4	11.516	2.269	2.213	2%
5	11.728	2.995	2.899	3%
6	12.198	3.502	2.160	38%
7	12.441	4.260	2.492	41%
8	12.914	3.671	3.572	3%
9	13.383	3.902	3.711	5%
10	13.868	4.384	4.093	7%
11	14.119	4.725	4.482	5%
12	14.601	5.188	5.074	2%
13	15.093	5.543	5.453	2%
14	15.765	6.303	6.013	5%
15	16.209	6.938	6.785	2%
16	12.557	7.364	7.564	-3%
17	13.299	7.484	7.338	2%
18	14.031	8.233	8.065	2%
19	14.530	8.948	8.922	0%
20	15.095	9.577	9.477	1%
21	16.024	10.023	9.836	2%
22	16.920	10.552	10.419	1%
23	17.576	11.297	10.843	4%
24	18.084	11.569	11.685	-1%
25	19.034	12.052	11.921	1%
26	19.785	12.598	12.459	1%
27	20.078	13.092	13.227	-1%
28	20.881	13.688	14.067	-3%
29	21.931	14.337	14.959	-4%
30	22.603	15.488	10.679	31%
31	23.542	16.047	13.199	18%
32	13.158	16.854	10.547	37%
100	23.990	70.075	41.048	41%

Note that n = 30 is the point at which the loop is no longer fully unrolled.

Example `code_native` difference (`n = 6`)

Before:

	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ linalg.jl:217 within `dot'
; │┌ @ linalg.jl:219 within `_vecdot'
; ││┌ @ linalg.jl:230 within `macro expansion'
; │││┌ @ linalg.jl:217 within `*'
	vmovsd	(%edi), %xmm0           ## xmm0 = mem[0],zero
	vmulsd	(%esi), %xmm0, %xmm0
	vmovsd	8(%edi), %xmm1          ## xmm1 = mem[0],zero
	vmulsd	8(%esi), %xmm1, %xmm1
	vmovsd	16(%edi), %xmm2         ## xmm2 = mem[0],zero
	vmulsd	16(%esi), %xmm2, %xmm2
; ││└└
; ││┌ @ float.jl:395 within `macro expansion'
	vaddsd	%xmm1, %xmm0, %xmm0
	vaddsd	%xmm2, %xmm0, %xmm0
; ││└
; ││┌ @ linalg.jl:230 within `macro expansion'
; │││┌ @ float.jl:399 within `*'
	vmovsd	24(%edi), %xmm1         ## xmm1 = mem[0],zero
	vmulsd	24(%esi), %xmm1, %xmm1
; │││└
; │││┌ @ float.jl:395 within `+'
	vaddsd	%xmm1, %xmm0, %xmm0
; │││└
; │││┌ @ float.jl:399 within `*'
	vmovupd	32(%edi), %xmm1
	vmulpd	32(%esi), %xmm1, %xmm1
; │││└
; │││┌ @ float.jl:395 within `+'
	vaddsd	%xmm1, %xmm0, %xmm0
	vpermilpd	$1, %xmm1, %xmm1 ## xmm1 = xmm1[1,0]
	vaddsd	%xmm1, %xmm0, %xmm0
; │└└└
	retl
	nopl	(%eax,%eax)
;

After:

	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ linalg.jl:217 within `dot'
; │┌ @ linalg.jl:222 within `_vecdot'
; ││┌ @ simdloop.jl:73 within `macro expansion' @ linalg.jl:223
; │││┌ @ linalg.jl:217 within `*'
	vmovupd	(%edi), %ymm0
	vmulpd	(%esi), %ymm0, %ymm0
; │││└
; │││┌ @ float.jl:395 within `+'
	vextractf128	$1, %ymm0, %xmm1
	vaddpd	%ymm1, %ymm0, %ymm0
	vhaddpd	%ymm0, %ymm0, %ymm0
; │││└
; │││ @ simdloop.jl:73 within `macro expansion' @ float.jl:399
	vmovsd	32(%edi), %xmm1         ## xmm1 = mem[0],zero
	vmovsd	40(%edi), %xmm2         ## xmm2 = mem[0],zero
	vfmadd231sd	32(%esi), %xmm1, %xmm0
	vfmadd231sd	40(%esi), %xmm2, %xmm0
; │└└
	vzeroupper
	retl
;

…he SLP vectorizer.

coveralls · 2019-04-30T06:13:10Z

Coverage decreased (-0.06%) to 81.114% when pulling 90ffc02 on tkoolen:tk/dot-simdloop into ec26f0a on JuliaArrays:master.

andyferris · 2019-04-30T06:18:47Z

src/linalg.jl

-    return quote
-        @_inline_meta
-        @inbounds return $expr
+    za = zero(a[1])


Have you checked if this still works for block vectors, SVector{Vector{<:Number}}? (Part of the advantage of unrolling in this case was that you don't need to start with a "zero" element).

Yes, that case is already being tested for:

StaticArrays.jl/test/linalg.jl

Line 120 in d3f739e

@test @inferred(dot(@SVector[[1,2],[3,4]], @SVector[[1,2],[3,4]])) === 30

Reverting to the 2 : prod(S) loop results in the same performance in benchmark 2 up to n = 30 (the point at which the loop is no longer unrolled). After that point, there's a significant performance drop compared to the PR branch (up to 2x for n = 32!).

KristofferC · 2019-04-30T07:04:14Z

Tangential ref: #494.

c42f

Looks good to me.

c42f · 2019-04-30T05:57:25Z

src/linalg.jl

-    expr = :(adjoint(a[1]) * b[1])
-    for j = 2:prod(S)
-        expr = :($expr + adjoint(a[$j]) * b[$j])
+        return zero(promote_op(*, eltype(a), eltype(b)))


Would this be better as zero(one(eltype(a)) * one(eltype(b)))?

I just copied this line over unmodified, but I'm happy to change it. How about the changes I pushed?

tkoolen · 2019-04-30T08:14:11Z

Tangential ref: #494.

Ah right, I'd forgotten about that PR, it's a little more than tangentially related. Well, this one's up to date with master, and I think the loop range change and getting rid of @generated are slight improvements.

c42f · 2019-04-30T08:32:40Z

Looks nice and clean to me, provided the optimizer is happy with that modification 👍

tkoolen · 2019-04-30T17:55:17Z

I'll leave this open for comments for another day and then I'll merge. Thanks for the quick reviews everybody!

After that, perhaps we can try the other reduction operations from #494 again using this approach.

tkoolen · 2019-04-30T21:28:52Z

I decided to expand the scope of this PR just a little bit to include bilinear_vecdot, since it's so similar. There was in fact a lot of code duplication, so I merged _vecdot with _bilinear_vecdot. I checked that benchmark results for dot are still the same.

In doing so, I was wondering whether (x, y) -> adjoint(x) * y shouldn't just be replaced with dot, given that that's what the generic dot implementation in Base does: https://github.com/JuliaLang/julia/blob/b4e1d0eef9c8f2fe3a6e43847bb96cceeac4f4fc/stdlib/LinearAlgebra/src/generic.jl#L758.

c42f · 2019-05-01T03:45:06Z

Beautiful!

c42f · 2019-05-01T03:45:48Z

CI errors seem related to latest julia, not this PR?

tkoolen · 2019-05-01T10:08:05Z

Yes, the cron job build of master also fails on Julia nightly.
https://travis-ci.org/JuliaArrays/StaticArrays.jl/builds/526388464

Use at-simd in _vecdot instead of manually unrolling and relying on t…

73eb5bc

…he SLP vectorizer.

andyferris reviewed Apr 30, 2019

View reviewed changes

c42f approved these changes Apr 30, 2019

View reviewed changes

Tweak for empty case.

3c9a2bc

Consolidate _bilinear_vecdot and _vecdot.

fefe5f3

tkoolen force-pushed the tk/dot-simdloop branch from f064d76 to fefe5f3 Compare April 30, 2019 21:20

Use dot(x, y) instead of adjoint(x) * y in _vecdot.

90ffc02

tkoolen merged commit 407c65f into JuliaArrays:master May 1, 2019

tkoolen deleted the tk/dot-simdloop branch May 1, 2019 19:39

c42f mentioned this pull request Jun 6, 2019

New Release 0.11.0 #616

Closed

c42f mentioned this pull request Aug 17, 2019

Bounds checking in dot #650

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use `@simd` in `_vecdot` #603

Use `@simd` in `_vecdot` #603

tkoolen commented Apr 30, 2019

coveralls commented Apr 30, 2019 •

edited

Loading

andyferris Apr 30, 2019

tkoolen Apr 30, 2019

tkoolen Apr 30, 2019

KristofferC commented Apr 30, 2019 •

edited

Loading

c42f left a comment

c42f Apr 30, 2019

tkoolen Apr 30, 2019

tkoolen commented Apr 30, 2019

c42f commented Apr 30, 2019

tkoolen commented Apr 30, 2019

tkoolen commented Apr 30, 2019

c42f commented May 1, 2019

c42f commented May 1, 2019

tkoolen commented May 1, 2019

Use @simd in _vecdot #603

Use @simd in _vecdot #603

Conversation

tkoolen commented Apr 30, 2019

Benchmark 1

Benchmark 2

Example code_native difference (n = 6)

coveralls commented Apr 30, 2019 • edited Loading

andyferris Apr 30, 2019

Choose a reason for hiding this comment

tkoolen Apr 30, 2019

Choose a reason for hiding this comment

tkoolen Apr 30, 2019

Choose a reason for hiding this comment

KristofferC commented Apr 30, 2019 • edited Loading

c42f left a comment

Choose a reason for hiding this comment

c42f Apr 30, 2019

Choose a reason for hiding this comment

tkoolen Apr 30, 2019

Choose a reason for hiding this comment

tkoolen commented Apr 30, 2019

c42f commented Apr 30, 2019

tkoolen commented Apr 30, 2019

tkoolen commented Apr 30, 2019

c42f commented May 1, 2019

c42f commented May 1, 2019

tkoolen commented May 1, 2019

Use `@simd` in `_vecdot` #603

Use `@simd` in `_vecdot` #603

Example `code_native` difference (`n = 6`)

coveralls commented Apr 30, 2019 •

edited

Loading

KristofferC commented Apr 30, 2019 •

edited

Loading