Change SIMD Loop from Fast to only reassoc/contract #49405
Conversation
Probably needs some benchmarks?
```cpp
(*K)->setHasAllowReassoc(true);
(*K)->setHasAllowContract(true);
```
Can use `setFastMathFlags` to set multiple flags in one go?
Can probably use `setHasNoSignedZeros()` as well. It can help when you initialize loop accumulators at zero.
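To illustrate the point (a minimal sketch of my own, not from this PR): without no-signed-zeros, the compiler cannot fold `0.0 + x` to `x`, because the two differ exactly when `x` is `-0.0`. That is why a reduction initialized at `zero(T)` keeps a redundant add unless `nsz` is set:

```julia
# Without no-signed-zeros, `0.0 + x` cannot be simplified to `x`:
# the two expressions differ precisely when x == -0.0.
x = -0.0
@assert (0.0 + x) === 0.0   # the sum is positive zero...
@assert x === -0.0          # ...but `x` itself is negative zero
```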
That might be going a bit too far, though I kind of want to just expose all of the fastmath flags separately. `@fastmath` implies too much, and most of it is unsafe and useless.
IMO no signed zeros is reasonable: they don't do much and they prevent a lot of useful transformations (e.g. `0 - x` to `-x`).

That said, I would generally favor just `reassoc`, and someone can manually add the other flags where needed.
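A quick illustration of that `0 - x` → `-x` point (my own sketch): the rewrite is invalid without `nsz` because it flips the sign of zero:

```julia
# `0 - x` and `-x` disagree when x == 0.0, so the rewrite
# requires the no-signed-zeros flag.
x = 0.0
@assert (0.0 - x) === 0.0   # subtraction yields +0.0
@assert (-x) === -0.0       # negation yields -0.0
```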
I did a very quick benchmark on A64FX with the function

```julia
function sumsimd(v)
    sum = zero(eltype(v))
    @simd for x in v
        sum += x
    end
    return sum
end
```

and got pretty much the same results. On 5da8d5f (the latest build I have easy access to):

```julia
julia> @btime sumsimd(x) setup=(x=randn(Float16, 1_000_000));
  42.070 μs (0 allocations: 0 bytes)

julia> @btime sumsimd(x) setup=(x=randn(Float32, 1_000_000));
  79.340 μs (0 allocations: 0 bytes)

julia> @btime sumsimd(x) setup=(x=randn(Float64, 1_000_000));
  189.532 μs (0 allocations: 0 bytes)
```

This PR:

```julia
julia> @btime sumsimd(x) setup=(x=randn(Float16, 1_000_000));
  42.020 μs (0 allocations: 0 bytes)

julia> @btime sumsimd(x) setup=(x=randn(Float32, 1_000_000));
  79.041 μs (0 allocations: 0 bytes)

julia> @btime sumsimd(x) setup=(x=randn(Float64, 1_000_000));
  185.452 μs (0 allocations: 0 bytes)
```

The difference in the generated LLVM IR is

```llvm
  %26 = fadd fast <vscale x 2 x double> %vec.phi, %wide.load
  %27 = fadd fast <vscale x 2 x double> %vec.phi14, %wide.load17
  %28 = fadd fast <vscale x 2 x double> %vec.phi15, %wide.load18
  %29 = fadd fast <vscale x 2 x double> %vec.phi16, %wide.load19
  %index.next = add nuw i64 %index, %10
  %30 = icmp eq i64 %index.next, %n.vec
  br i1 %30, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
  %bin.rdx = fadd fast <vscale x 2 x double> %27, %26
  %bin.rdx20 = fadd fast <vscale x 2 x double> %28, %bin.rdx
  %bin.rdx21 = fadd fast <vscale x 2 x double> %29, %bin.rdx20
  %31 = call fast double @llvm.vector.reduce.fadd.nxv2f64(double -0.000000e+00, <vscale x 2 x double> %bin.rdx21)
```

vs

```llvm
  %26 = fadd reassoc contract <vscale x 2 x double> %vec.phi, %wide.load
  %27 = fadd reassoc contract <vscale x 2 x double> %vec.phi14, %wide.load17
  %28 = fadd reassoc contract <vscale x 2 x double> %vec.phi15, %wide.load18
  %29 = fadd reassoc contract <vscale x 2 x double> %vec.phi16, %wide.load19
  %index.next = add nuw i64 %index, %10
  %30 = icmp eq i64 %index.next, %n.vec
  br i1 %30, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
  %bin.rdx = fadd reassoc contract <vscale x 2 x double> %27, %26
  %bin.rdx20 = fadd reassoc contract <vscale x 2 x double> %28, %bin.rdx
  %bin.rdx21 = fadd reassoc contract <vscale x 2 x double> %29, %bin.rdx20
  %31 = call reassoc contract double @llvm.vector.reduce.fadd.nxv2f64(double -0.000000e+00, <vscale x 2 x double> %bin.rdx21)
```

as expected. At least in this specific case it doesn't look like the full `fast` set of flags makes a difference.
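For anyone wondering why the vectorizer needs `reassoc` at all here: the four partial accumulators (`%vec.phi`, `%vec.phi14`, ...) change the order of the additions, and floating-point addition is not associative. A small self-contained illustration (my own, not from the PR):

```julia
# Reassociation can change results because FP addition is not
# associative: grouping decides whether the small term survives.
a, b, c = 1.0e308, -1.0e308, 1.0
@assert (a + b) + c == 1.0   # left-to-right: a and b cancel first
@assert a + (b + c) == 0.0   # reassociated: c is absorbed into b
```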
With only `reassoc` I get the same timings. For reference, the IR is

```llvm
  %23 = fadd reassoc <vscale x 2 x double> %vec.phi, %wide.load
  %24 = fadd reassoc <vscale x 2 x double> %vec.phi9, %wide.load12
  %25 = fadd reassoc <vscale x 2 x double> %vec.phi10, %wide.load13
  %26 = fadd reassoc <vscale x 2 x double> %vec.phi11, %wide.load14
  %index.next = add nuw i64 %index, %7
  %27 = icmp eq i64 %index.next, %n.vec
  br i1 %27, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
  %bin.rdx = fadd reassoc <vscale x 2 x double> %24, %23
  %bin.rdx15 = fadd reassoc <vscale x 2 x double> %25, %bin.rdx
  %bin.rdx16 = fadd reassoc <vscale x 2 x double> %26, %bin.rdx15
  %28 = call reassoc double @llvm.vector.reduce.fadd.nxv2f64(double -0.000000e+00, <vscale x 2 x double> %bin.rdx16)
```

Sounds like `contract` isn't making a difference in this case.
Contract would only matter for a loop that implements the equivalent of `prod`.
@vchuravy I'm not sure about `prod`, but contract would make a difference for a vector dot product.
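To make the dot-product point concrete (my own sketch): `contract` permits `x*y + z` to become a single fused `fma`, which rounds once instead of twice, so results can legitimately differ:

```julia
# fma(x, y, z) rounds once; x*y + z rounds the product first,
# so contraction can change the low bits of a dot product.
x = 1.0 + 2.0^-27
@assert x * x - 1.0 == 2.0^-26                  # product rounded before subtracting
@assert fma(x, x, -1.0) == 2.0^-26 + 2.0^-54    # exact low bits survive the fma
@assert x * x - 1.0 != fma(x, x, -1.0)
```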
@giordano if you only use `reassoc`, what does a dot product look like?
Which is the same as what I get now. For the dot product:

```julia
julia> using BenchmarkTools

julia> function dotsimd(x::AbstractVector{T}, y::AbstractVector{T}) where {T}
           sum = zero(T)
           @simd for idx in eachindex(x, y)
               sum += x[idx] * y[idx]
           end
           return sum
       end
dotsimd (generic function with 1 method)
```
This PR (with `contract`) and only `reassoc` give me the same timings for `dotsimd`. Likewise for

```julia
julia> function prodsimd(v)
           prod = one(eltype(v))
           @simd for x in v
               prod *= x
           end
           return prod
       end
prodsimd (generic function with 1 method)
```

the timings with this PR and with only `reassoc` match (unless I'm missing something or I misunderstood?).
However, "in the real world" outside of microbenchmarks, FMAs will be at least slightly better in theory.
As shown by the fact that I had to change the tests. On `master` I get

```llvm
  %25 = fmul contract <4 x double> %wide.load, %wide.load15
  %26 = fmul contract <4 x double> %wide.load12, %wide.load16
  %27 = fmul contract <4 x double> %wide.load13, %wide.load17
  %28 = fmul contract <4 x double> %wide.load14, %wide.load18
  %29 = fadd fast <4 x double> %vec.phi, %25
  %30 = fadd fast <4 x double> %vec.phi9, %26
  %31 = fadd fast <4 x double> %vec.phi10, %27
  %32 = fadd fast <4 x double> %vec.phi11, %28
  %index.next = add nuw i64 %index, 16
  %33 = icmp eq i64 %index.next, %n.vec
  br i1 %33, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
  %bin.rdx = fadd fast <4 x double> %30, %29
  %bin.rdx19 = fadd fast <4 x double> %31, %bin.rdx
  %bin.rdx20 = fadd fast <4 x double> %32, %bin.rdx19
  %34 = call fast double @llvm.vector.reduce.fadd.v4f64(double -0.000000e+00, <4 x double> %bin.rdx20)
```

with this PR

```llvm
  %25 = fmul <4 x double> %wide.load, %wide.load15
  %26 = fmul <4 x double> %wide.load12, %wide.load16
  %27 = fmul <4 x double> %wide.load13, %wide.load17
  %28 = fmul <4 x double> %wide.load14, %wide.load18
  %29 = fadd reassoc <4 x double> %vec.phi, %25
  %30 = fadd reassoc <4 x double> %vec.phi9, %26
  %31 = fadd reassoc <4 x double> %vec.phi10, %27
  %32 = fadd reassoc <4 x double> %vec.phi11, %28
  %index.next = add nuw i64 %index, 16
  %33 = icmp eq i64 %index.next, %n.vec
  br i1 %33, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
  %bin.rdx = fadd reassoc <4 x double> %30, %29
  %bin.rdx19 = fadd reassoc <4 x double> %31, %bin.rdx
  %bin.rdx20 = fadd reassoc <4 x double> %32, %bin.rdx19
  %34 = call reassoc double @llvm.vector.reduce.fadd.v4f64(double -0.000000e+00, <4 x double> %bin.rdx20)
```

The good news is that fused muladd in native code can be recovered with an explicit `muladd`:

```llvm
  %25 = fmul contract <4 x double> %wide.load, %wide.load15
  %26 = fmul contract <4 x double> %wide.load12, %wide.load16
  %27 = fmul contract <4 x double> %wide.load13, %wide.load17
  %28 = fmul contract <4 x double> %wide.load14, %wide.load18
  %29 = fadd reassoc contract <4 x double> %vec.phi, %25
  %30 = fadd reassoc contract <4 x double> %vec.phi9, %26
  %31 = fadd reassoc contract <4 x double> %vec.phi10, %27
  %32 = fadd reassoc contract <4 x double> %vec.phi11, %28
  %index.next = add nuw i64 %index, 16
  %33 = icmp eq i64 %index.next, %n.vec
  br i1 %33, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
  %bin.rdx = fadd reassoc contract <4 x double> %30, %29
  %bin.rdx19 = fadd reassoc contract <4 x double> %31, %bin.rdx
  %bin.rdx20 = fadd reassoc contract <4 x double> %32, %bin.rdx19
  %34 = call reassoc contract double @llvm.vector.reduce.fadd.v4f64(double -0.000000e+00, <4 x double> %bin.rdx20)
```

I presume because the `contract` flag is then set on the `fmul`.
This is likely due to the mulAdd pass checking for it (line 90 in 1cc10a6).
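For reference, the explicit spelling on the Julia side is `muladd`, which opts into contraction independently of the `@simd` fast-math flags. A hypothetical variant of the earlier `dotsimd` (the function name is mine, not from the PR):

```julia
# Writing the reduction with an explicit `muladd` allows a fused
# multiply-add (or a mul+add fallback) regardless of loop flags.
function dotsimd_muladd(x::AbstractVector{T}, y::AbstractVector{T}) where {T}
    s = zero(T)
    @simd for i in eachindex(x, y)
        s = muladd(x[i], y[i], s)
    end
    return s
end

@assert dotsimd_muladd([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]) == 32.0
```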
Require `contract` only instead of full `fast` flag.
Okay, let's update the docs for `@simd` then.
👍
LinearAlgebra tests on aarch64 (both Linux and Darwin) are failing at `stdlib/LinearAlgebra/test/addmul.jl` line 192 (1cc10a6).
Maybe the latest change fixes that? But yeah, it's kind of annoying that they are so brittle.
Ok, yes, relaxing the fused muladd requirement seems to have solved the LinearAlgebra tests:

```diff
diff --git a/test/llvmpasses/loopinfo.jl b/test/llvmpasses/loopinfo.jl
index 20a450f6f1..ca2b307b3d 100644
--- a/test/llvmpasses/loopinfo.jl
+++ b/test/llvmpasses/loopinfo.jl
@@ -29,10 +29,10 @@ function simdf(X)
         acc += x
 # CHECK: call void @julia.loopinfo_marker(), {{.*}}, !julia.loopinfo [[LOOPINFO:![0-9]+]]
 # LOWER-NOT: llvm.mem.parallel_loop_access
-# LOWER: fadd reassoc contract double
+# LOWER: fadd reassoc {{(contract )?}}double
 # LOWER-NOT: call void @julia.loopinfo_marker()
 # LOWER: br {{.*}}, !llvm.loop [[LOOPID:![0-9]+]]
-# FINAL: fadd reassoc contract <{{(vscale x )?}}{{[0-9]+}} x double>
+# FINAL: fadd reassoc {{(contract )?}}<{{(vscale x )?}}{{[0-9]+}} x double>
     end
     acc
 end
@@ -46,7 +46,7 @@ function simdf2(X)
 # CHECK: call void @julia.loopinfo_marker(), {{.*}}, !julia.loopinfo [[LOOPINFO2:![0-9]+]]
 # LOWER: llvm.mem.parallel_loop_access
 # LOWER-NOT: call void @julia.loopinfo_marker()
-# LOWER: fadd reassoc contract double
+# LOWER: fadd reassoc {{(contract )?}}double
 # LOWER: br {{.*}}, !llvm.loop [[LOOPID2:![0-9]+]]
     end
     acc
diff --git a/test/llvmpasses/simdloop.ll b/test/llvmpasses/simdloop.ll
index e4f46f8f2b..7b4f538fc9 100644
--- a/test/llvmpasses/simdloop.ll
+++ b/test/llvmpasses/simdloop.ll
@@ -37,7 +37,7 @@ loop:
 ; CHECK: llvm.mem.parallel_loop_access
   %aval = load double, double *%aptr
   %nextv = fsub double %v, %aval
-; CHECK: fsub reassoc contract double %v, %aval
+; CHECK: fsub reassoc {{(contract )?}}double %v, %aval
   %nexti = add i64 %i, 1
   call void @julia.loopinfo_marker(), !julia.loopinfo !3
   %done = icmp sgt i64 %nexti, 500
@@ -56,7 +56,7 @@ loop:
   %aptr = getelementptr double, double *%a, i64 %i
   %aval = load double, double *%aptr
   %nextv = fsub double %v, %aval
-; CHECK: fsub reassoc contract double %v, %aval
+; CHECK: fsub reassoc {{(contract )?}}double %v, %aval
   %nexti = add i64 %i, 1
   call void @julia.loopinfo_marker(), !julia.loopinfo !2
   %done = icmp sgt i64 %nexti, 500
```
Edit: never mind — I had fetched the local branch to the latest version of this PR, but somehow I still had the patch which removed the `contract` requirement applied.
Personally, I still don't think this is the right decision.

Meanwhile, there will absolutely be cases (although not many) where someone requires that operations are not contracted.
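One concrete example of code that relies on operations *not* being reordered (my own illustration): compensated (Kahan) summation, whose correction term is algebraically zero and would fold away under `reassoc`:

```julia
# The correction `(t - s) - y` is algebraically zero, so a
# reassociating compiler may delete it; evaluated faithfully in FP,
# it recovers the rounding error of the addition.
s, y = 1.0, 2.0^-60
t = s + y
c = (t - s) - y
@assert t == 1.0            # y was entirely absorbed by rounding...
@assert c == -(2.0^-60)     # ...yet the error is recovered exactly
```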
Note that the benchmarks in #49405 (comment) were run before relaxing the requirement for fusing muladd; now I get timings which recover the same performance as before this PR.
```diff
@@ -100,7 +100,7 @@ The object iterated over in a `@simd for` loop should be a one-dimensional range
 By using `@simd`, you are asserting several properties of the loop:
 
 * It is safe to execute iterations in arbitrary or overlapping order, with special consideration for reduction variables.
-* Floating-point operations on reduction variables can be reordered, possibly causing different results than without `@simd`.
+* Floating-point operations on reduction variables can be reordered or contracted, possibly causing different results than without `@simd`.
```
It's actually slightly broader, I think: it's the entire reduction chain, not just the reduction operations themselves.
Any suggestions for the wording?
Addresses #49387