-
Notifications
You must be signed in to change notification settings - Fork 150
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use @simd
in _vecdot
#603
Use @simd
in _vecdot
#603
Conversation
…he SLP vectorizer.
src/linalg.jl
Outdated
return quote | ||
@_inline_meta | ||
@inbounds return $expr | ||
za = zero(a[1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you checked if this still works for block vectors, SVector{Vector{<:Number}}
? (Part of the advantage of unrolling in this case was that you don't need to start with a "zero" element).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that case is already being tested for:
StaticArrays.jl/test/linalg.jl
Line 120 in d3f739e
@test @inferred(dot(@SVector[[1,2],[3,4]], @SVector[[1,2],[3,4]])) === 30 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverting to the 2 : prod(S)
loop results in the same performance in benchmark 2 up to n = 30
(the point at which the loop is no longer unrolled). After that point, there's a significant performance drop compared to the PR branch (up to 2x for n = 32
!).
Tangential ref: #494. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
src/linalg.jl
Outdated
expr = :(adjoint(a[1]) * b[1]) | ||
for j = 2:prod(S) | ||
expr = :($expr + adjoint(a[$j]) * b[$j]) | ||
return zero(promote_op(*, eltype(a), eltype(b))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this be better as zero(one(eltype(a)) * one(eltype(b)))
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just copied this line over unmodified, but I'm happy to change it. How about the changes I pushed?
Ah right, I'd forgotten about that PR, it's a little more than tangentially related. Well, this one's up to date with master, and I think the loop range change and getting rid of |
Looks nice and clean to me, provided the optimizer is happy with that modification 👍 |
I'll leave this open for comments for another day and then I'll merge. Thanks for the quick reviews everybody! After that, perhaps we can try the other reduction operations from #494 again using this approach. |
f064d76
to
fefe5f3
Compare
I decided to expand the scope of this PR just a little bit to include In doing so, I was wondering whether |
Beautiful! |
CI errors seem related to latest julia, not this PR? |
Yes, the cron job build of master also fails on Julia nightly. |
Ref #512. Use
@simd
in_vecdot
instead of manually unrolling and relying on the SLP vectorizer.I noticed that the current implementation never uses 256-bit wide instructions (no
ymm
registers being used), while the@simd
version does. Note that using@simd
in conjunction with the statically known size of the matrix, the loop is fully unrolled up to a reasonable threshold even without using a generated function.Benchmark 1
The naive benchmark; but I'm doubtful that this gives accurate results.
Results:
n
Vector
The
Vector
column corresponds to using plainVector
s as a baseline. It should be noted that forn = 1
andn = 2
, thecode_native
is the same before and after.Benchmark 2
Here's perhaps a more realistic usage scenario:
Results:
n
Vector
Note that
n = 30
is the point at which the loop is no longer fully unrolled.Example
code_native
difference (n = 6
)Before:
After: