
Change SIMD Loop from Fast to only reassoc/contract #49405

Merged: 8 commits into master on Jun 28, 2023

Conversation

@vchuravy (Member):

Addresses #49387

@giordano (Contributor) left a comment:

Probably needs some benchmarks?

Comment on lines +126 to +127
(*K)->setHasAllowReassoc(true);
(*K)->setHasAllowContract(true);
Contributor:

Can use setFastMathFlags to set multiple flags in one go?

Contributor:

Can probably use setHasNoSignedZeros() as well. It can help when you initialize loops at zero.

Member:

That might be going a bit too far, though I kind of want to just expose all of the fastmath flags separately. @fastmath implies too much and most of it is unsafe and useless.

Member:

IMO no signed zeros is reasonable: signed zeros don't do much, and they prevent a lot of useful transformations (e.g. rewriting 0 - x as -x).
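
A quick REPL illustration (not from the thread, just to show why that rewrite needs the flag):

julia> 0.0 - 0.0
0.0

julia> -(0.0)
-0.0

so folding 0 - x into -x flips the sign of the result when x == 0.0, which is only allowed if the compiler may assume signed zeros don't matter.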

@mikmoore (Contributor) commented Apr 18, 2023:

"reassoc" is clearly indicated by the help?> @simd documentation, but "contract" is not. If "contract" is included, the documentation should be updated to indicate so.

That said, I would generally favor just "reassoc" and someone can manually add @fastmath or muladd for additional optimizations where they want.

@giordano (Contributor):

> Probably needs some benchmarks?

I did a very quick benchmark on A64FX with the function

function sumsimd(v)
    sum = zero(eltype(v))
    @simd for x in v
        sum += x
    end
    return sum
end

and got pretty much the same results. On 5da8d5f (the latest build I have easy access to):

julia> @btime sumsimd(x) setup=(x=randn(Float16, 1_000_000));
  42.070 μs (0 allocations: 0 bytes)

julia> @btime sumsimd(x) setup=(x=randn(Float32, 1_000_000));
  79.340 μs (0 allocations: 0 bytes)

julia> @btime sumsimd(x) setup=(x=randn(Float64, 1_000_000));
  189.532 μs (0 allocations: 0 bytes)

This PR:

julia> @btime sumsimd(x) setup=(x=randn(Float16, 1_000_000));
  42.020 μs (0 allocations: 0 bytes)

julia> @btime sumsimd(x) setup=(x=randn(Float32, 1_000_000));
  79.041 μs (0 allocations: 0 bytes)

julia> @btime sumsimd(x) setup=(x=randn(Float64, 1_000_000));
  185.452 μs (0 allocations: 0 bytes)

The difference in code_llvm is

  %26 = fadd fast <vscale x 2 x double> %vec.phi, %wide.load
  %27 = fadd fast <vscale x 2 x double> %vec.phi14, %wide.load17
  %28 = fadd fast <vscale x 2 x double> %vec.phi15, %wide.load18
  %29 = fadd fast <vscale x 2 x double> %vec.phi16, %wide.load19
  %index.next = add nuw i64 %index, %10
  %30 = icmp eq i64 %index.next, %n.vec
  br i1 %30, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
  %bin.rdx = fadd fast <vscale x 2 x double> %27, %26
  %bin.rdx20 = fadd fast <vscale x 2 x double> %28, %bin.rdx
  %bin.rdx21 = fadd fast <vscale x 2 x double> %29, %bin.rdx20
  %31 = call fast double @llvm.vector.reduce.fadd.nxv2f64(double -0.000000e+00, <vscale x 2 x double> %bin.rdx21)

vs

  %26 = fadd reassoc contract <vscale x 2 x double> %vec.phi, %wide.load
  %27 = fadd reassoc contract <vscale x 2 x double> %vec.phi14, %wide.load17
  %28 = fadd reassoc contract <vscale x 2 x double> %vec.phi15, %wide.load18
  %29 = fadd reassoc contract <vscale x 2 x double> %vec.phi16, %wide.load19
  %index.next = add nuw i64 %index, %10
  %30 = icmp eq i64 %index.next, %n.vec
  br i1 %30, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
  %bin.rdx = fadd reassoc contract <vscale x 2 x double> %27, %26
  %bin.rdx20 = fadd reassoc contract <vscale x 2 x double> %28, %bin.rdx
  %bin.rdx21 = fadd reassoc contract <vscale x 2 x double> %29, %bin.rdx20
  %31 = call reassoc contract double @llvm.vector.reduce.fadd.nxv2f64(double -0.000000e+00, <vscale x 2 x double> %bin.rdx21)                                                                                    

as expected. At least in this specific case it doesn't look like the full set of fast flags makes a difference compared to only reassoc + contract, which is good news, I guess.

@giordano (Contributor):

With only reassoc:

julia> @btime sumsimd(x) setup=(x=randn(Float16, 1_000_000));
  41.820 μs (0 allocations: 0 bytes)

julia> @btime sumsimd(x) setup=(x=randn(Float32, 1_000_000));
  79.131 μs (0 allocations: 0 bytes)

julia> @btime sumsimd(x) setup=(x=randn(Float64, 1_000_000));
  184.232 μs (0 allocations: 0 bytes)

For reference, code_llvm:

  %23 = fadd reassoc <vscale x 2 x double> %vec.phi, %wide.load
  %24 = fadd reassoc <vscale x 2 x double> %vec.phi9, %wide.load12
  %25 = fadd reassoc <vscale x 2 x double> %vec.phi10, %wide.load13
  %26 = fadd reassoc <vscale x 2 x double> %vec.phi11, %wide.load14
  %index.next = add nuw i64 %index, %7
  %27 = icmp eq i64 %index.next, %n.vec
  br i1 %27, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
  %bin.rdx = fadd reassoc <vscale x 2 x double> %24, %23
  %bin.rdx15 = fadd reassoc <vscale x 2 x double> %25, %bin.rdx
  %bin.rdx16 = fadd reassoc <vscale x 2 x double> %26, %bin.rdx15
  %28 = call reassoc double @llvm.vector.reduce.fadd.nxv2f64(double -0.000000e+00, <vscale x 2 x double> %bin.rdx16)                                                                                             

Sounds like reassoc alone would be enough to retain the performance speedup of @simd at least in this simple case.

@vchuravy (Member, Author):

Contract would only matter for a loop that implements the equivalent of prod

@simonbyrne (Contributor):

@vchuravy I'm not sure about prod, but contract would make a difference for a vector dot product.

@simonbyrne (Contributor):

@giordano if you only use reassoc, what does sumsimd([-0.0]) give?

@giordano (Contributor):

julia> sumsimd([-0.0])
0.0

which is the same as what I get now on master.
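
(For reference, a plausible explanation: the accumulator starts at zero(eltype(v)) == +0.0, and under the default rounding mode

julia> 0.0 + (-0.0)
0.0

so with only reassoc the reduction still returns +0.0 here; no-signed-zeros isn't needed for this case.)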

@giordano (Contributor) commented May 28, 2023:

> but contract would make a difference for a vector dot product.

julia> using BenchmarkTools

julia> function dotsimd(x::AbstractVector{T}, y::AbstractVector{T}) where {T}
           sum = zero(T)
           @simd for idx in eachindex(x, y)
               sum += x[idx] * y[idx]
           end
           return sum
       end
dotsimd (generic function with 1 method)

5da8d5f

julia> @btime dotsimd(x, y) setup=(T=Float16; N=1_000_000; x=randn(T, N); y=randn(T, N));
  60.581 μs (0 allocations: 0 bytes)

julia> @btime dotsimd(x, y) setup=(T=Float32; N=1_000_000; x=randn(T, N); y=randn(T, N));
  132.881 μs (0 allocations: 0 bytes)

julia> @btime dotsimd(x, y) setup=(T=Float64; N=1_000_000; x=randn(T, N); y=randn(T, N));
  299.743 μs (0 allocations: 0 bytes)

This PR (with reassoc and contract):

julia> @btime dotsimd(x, y) setup=(T=Float16; N=1_000_000; x=randn(T, N); y=randn(T, N));
  100.391 μs (0 allocations: 0 bytes)

julia> @btime dotsimd(x, y) setup=(T=Float32; N=1_000_000; x=randn(T, N); y=randn(T, N));
  140.891 μs (0 allocations: 0 bytes)

julia> @btime dotsimd(x, y) setup=(T=Float64; N=1_000_000; x=randn(T, N); y=randn(T, N));
  317.133 μs (0 allocations: 0 bytes)

Only reassoc:

julia> @btime dotsimd(x, y) setup=(T=Float16; N=1_000_000; x=randn(T, N); y=randn(T, N));
  99.511 μs (0 allocations: 0 bytes)

julia> @btime dotsimd(x, y) setup=(T=Float32; N=1_000_000; x=randn(T, N); y=randn(T, N));
  137.811 μs (0 allocations: 0 bytes)

julia> @btime dotsimd(x, y) setup=(T=Float64; N=1_000_000; x=randn(T, N); y=randn(T, N));
  304.733 μs (0 allocations: 0 bytes)

> Contract would only matter for a loop that implements the equivalent of prod

julia> function prodsimd(v)
           prod = one(eltype(v))
           @simd for x in v
               prod *= x
           end
           return prod
       end
prodsimd (generic function with 1 method)

5da8d5f

julia> @btime prodsimd(x) setup=(x=randn(Float16, 1_000_000));
  162.721 μs (0 allocations: 0 bytes)

julia> @btime prodsimd(x) setup=(x=randn(Float32, 1_000_000));
  323.323 μs (0 allocations: 0 bytes)

julia> @btime prodsimd(x) setup=(x=randn(Float64, 1_000_000));
  658.906 μs (0 allocations: 0 bytes)

PR:

julia> @btime prodsimd(x) setup=(x=randn(Float16, 1_000_000));
  162.621 μs (0 allocations: 0 bytes)

julia> @btime prodsimd(x) setup=(x=randn(Float32, 1_000_000));
  322.863 μs (0 allocations: 0 bytes)

julia> @btime prodsimd(x) setup=(x=randn(Float64, 1_000_000));
  659.456 μs (0 allocations: 0 bytes)

Only reassoc:

julia> @btime prodsimd(x) setup=(x=randn(Float16, 1_000_000));
  162.572 μs (0 allocations: 0 bytes)

julia> @btime prodsimd(x) setup=(x=randn(Float32, 1_000_000));
  323.513 μs (0 allocations: 0 bytes)

julia> @btime prodsimd(x) setup=(x=randn(Float64, 1_000_000));
  658.036 μs (0 allocations: 0 bytes)

(unless I'm missing something or misunderstood?)

@chriselrod (Contributor) commented May 29, 2023:

  1. Dot products are generally going to be memory bound when they're too big to fit in cache, probably even on the A64FX.
  2. Sticking with x86-64 machines: on Intel we generally get 2 loads/cycle and 2 FMAs/cycle. Given that a dot product needs 2 loads per FMA, 2 loads + 1 mul + 1 add still gives the same theoretical throughput.
  3. This is even more true on AMD Ryzen (at least Zen 3 and Zen 4; I don't know about older), where we get the same throughput for each of (a) loads, (b) FMAs/muls, and (c) adds. So on these, even if we had double the number of loads, mul + add would yield the same throughput as needing only FMAs.

However, "in the real world" outside of microbenchmarks, fmas will be at least slightly better in theory.
Fewer instructions to decode, less code to occupy the instruction cache, fewer uops to occupy the uop cache, less overall use of execution units and thus better sharing with another thread running on the same core...

Review thread on test/llvmpasses/fmf.jl (outdated, resolved)
@giordano (Contributor):

As shown by the fact that I had to change the fmf.jl test to make it pass, the main drawback of this PR (even with reassoc+contract) is that multiplication and addition don't fuse automatically. LLVM IR of dotf on x86_64 master:

  %25 = fmul contract <4 x double> %wide.load, %wide.load15
  %26 = fmul contract <4 x double> %wide.load12, %wide.load16
  %27 = fmul contract <4 x double> %wide.load13, %wide.load17
  %28 = fmul contract <4 x double> %wide.load14, %wide.load18
  %29 = fadd fast <4 x double> %vec.phi, %25
  %30 = fadd fast <4 x double> %vec.phi9, %26
  %31 = fadd fast <4 x double> %vec.phi10, %27
  %32 = fadd fast <4 x double> %vec.phi11, %28
  %index.next = add nuw i64 %index, 16
  %33 = icmp eq i64 %index.next, %n.vec
  br i1 %33, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
  %bin.rdx = fadd fast <4 x double> %30, %29
  %bin.rdx19 = fadd fast <4 x double> %31, %bin.rdx
  %bin.rdx20 = fadd fast <4 x double> %32, %bin.rdx19
  %34 = call fast double @llvm.vector.reduce.fadd.v4f64(double -0.000000e+00, <4 x double> %bin.rdx20)

with this PR

  %25 = fmul <4 x double> %wide.load, %wide.load15
  %26 = fmul <4 x double> %wide.load12, %wide.load16
  %27 = fmul <4 x double> %wide.load13, %wide.load17
  %28 = fmul <4 x double> %wide.load14, %wide.load18
  %29 = fadd reassoc <4 x double> %vec.phi, %25
  %30 = fadd reassoc <4 x double> %vec.phi9, %26
  %31 = fadd reassoc <4 x double> %vec.phi10, %27
  %32 = fadd reassoc <4 x double> %vec.phi11, %28
  %index.next = add nuw i64 %index, 16
  %33 = icmp eq i64 %index.next, %n.vec
  br i1 %33, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
  %bin.rdx = fadd reassoc <4 x double> %30, %29
  %bin.rdx19 = fadd reassoc <4 x double> %31, %bin.rdx
  %bin.rdx20 = fadd reassoc <4 x double> %32, %bin.rdx19
  %34 = call reassoc double @llvm.vector.reduce.fadd.v4f64(double -0.000000e+00, <4 x double> %bin.rdx20)

The good news is that fused muladd in native code can be recovered with an explicit muladd call, which results in the following LLVM IR:

  %25 = fmul contract <4 x double> %wide.load, %wide.load15
  %26 = fmul contract <4 x double> %wide.load12, %wide.load16
  %27 = fmul contract <4 x double> %wide.load13, %wide.load17
  %28 = fmul contract <4 x double> %wide.load14, %wide.load18
  %29 = fadd reassoc contract <4 x double> %vec.phi, %25
  %30 = fadd reassoc contract <4 x double> %vec.phi9, %26
  %31 = fadd reassoc contract <4 x double> %vec.phi10, %27
  %32 = fadd reassoc contract <4 x double> %vec.phi11, %28
  %index.next = add nuw i64 %index, 16
  %33 = icmp eq i64 %index.next, %n.vec
  br i1 %33, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
  %bin.rdx = fadd reassoc contract <4 x double> %30, %29
  %bin.rdx19 = fadd reassoc contract <4 x double> %31, %bin.rdx
  %bin.rdx20 = fadd reassoc contract <4 x double> %32, %bin.rdx19
  %34 = call reassoc contract double @llvm.vector.reduce.fadd.v4f64(double -0.000000e+00, <4 x double> %bin.rdx20)

I presume because muladd internally enables contract.
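
For reference, such a muladd-based variant could look like this (a sketch only, not necessarily the exact code behind the IR above; dotsimd_muladd is an illustrative name):

function dotsimd_muladd(x::AbstractVector{T}, y::AbstractVector{T}) where {T}
    sum = zero(T)
    @simd for idx in eachindex(x, y)
        # muladd permits (but does not require) fusing the multiply into the add,
        # which is what reintroduces the `contract` flag on this part of the chain
        sum = muladd(x[idx], y[idx], sum)
    end
    return sum
end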

@vchuravy (Member, Author):

This is likely due to the mulAdd pass checking for fast and not contract:

if (!I.isFast())

Require `contract` only instead of full `fast` flag.
@vchuravy (Member, Author):

Okay, let's update the docs for @simd to state that we perform contract + reassoc, and then I think this should be ready?

@giordano (Contributor):

> Okay, let's update the docs for @simd to state that we perform contract + reassoc

👍

> and then I think this should be ready?

LinearAlgebra tests on aarch64 (both Linux and Darwin) are failing at

@test collect(returned_mat) ≈ α * Ac * Bc + β * Cc rtol=rtol

It's a bit concerning they're relying so much on a performance optimisation.

@gbaraldi (Member):

Maybe the latest change fixes that? But yeah it's kind of annoying that they are so brittle.

@giordano (Contributor) commented May 29, 2023:

Ok, yes, relaxing the fused-muladd requirement seems to have fixed the LinearAlgebra tests, but now I have this problem: the llvmpasses tests fail for me on aarch64 unless I apply this patch:

diff --git a/test/llvmpasses/loopinfo.jl b/test/llvmpasses/loopinfo.jl
index 20a450f6f1..ca2b307b3d 100644
--- a/test/llvmpasses/loopinfo.jl
+++ b/test/llvmpasses/loopinfo.jl
@@ -29,10 +29,10 @@ function simdf(X)
         acc += x
 # CHECK: call void @julia.loopinfo_marker(), {{.*}}, !julia.loopinfo [[LOOPINFO:![0-9]+]]
 # LOWER-NOT: llvm.mem.parallel_loop_access
-# LOWER: fadd reassoc contract double
+# LOWER: fadd reassoc {{(contract )?}}double
 # LOWER-NOT: call void @julia.loopinfo_marker()
 # LOWER: br {{.*}}, !llvm.loop [[LOOPID:![0-9]+]]
-# FINAL: fadd reassoc contract <{{(vscale x )?}}{{[0-9]+}} x double>
+# FINAL: fadd reassoc {{(contract )?}}<{{(vscale x )?}}{{[0-9]+}} x double>
     end
     acc
 end
@@ -46,7 +46,7 @@ function simdf2(X)
 # CHECK: call void @julia.loopinfo_marker(), {{.*}}, !julia.loopinfo [[LOOPINFO2:![0-9]+]]
 # LOWER: llvm.mem.parallel_loop_access
 # LOWER-NOT: call void @julia.loopinfo_marker()
-# LOWER: fadd reassoc contract double
+# LOWER: fadd reassoc {{(contract )?}}double
 # LOWER: br {{.*}}, !llvm.loop [[LOOPID2:![0-9]+]]
     end
     acc
diff --git a/test/llvmpasses/simdloop.ll b/test/llvmpasses/simdloop.ll
index e4f46f8f2b..7b4f538fc9 100644
--- a/test/llvmpasses/simdloop.ll
+++ b/test/llvmpasses/simdloop.ll
@@ -37,7 +37,7 @@ loop:
 ; CHECK: llvm.mem.parallel_loop_access
   %aval = load double, double *%aptr
   %nextv = fsub double %v, %aval
-; CHECK: fsub reassoc contract double %v, %aval
+; CHECK: fsub reassoc {{(contract )?}}double %v, %aval
   %nexti = add i64 %i, 1
   call void @julia.loopinfo_marker(), !julia.loopinfo !3
   %done = icmp sgt i64 %nexti, 500
@@ -56,7 +56,7 @@ loop:
   %aptr = getelementptr double, double *%a, i64 %i
   %aval = load double, double *%aptr
   %nextv = fsub double %v, %aval
-; CHECK: fsub reassoc contract double %v, %aval
+; CHECK: fsub reassoc {{(contract )?}}double %v, %aval
   %nexti = add i64 %i, 1
   call void @julia.loopinfo_marker(), !julia.loopinfo !2
   %done = icmp sgt i64 %nexti, 500

It isn't quite clear to me why contract is gone, and also why only on aarch64.

Edit: never mind, I had fetched the latest version of this PR into my local branch, but somehow I still had the patch which removed contract applied, which I thought I had stashed.

@giordano added the compiler:simd (instruction-level vectorization) and compiler:llvm (issues that relate to LLVM) labels on May 29, 2023.
@mikmoore (Contributor) commented May 29, 2023:

Personally, I still don't think the decision to contract should be coupled to @simd. I'll reiterate my pitch and then shut up about it, as this PR is still a huge improvement over the current behavior.

contract is probably the most boring fastmath flag as the best it can ever do is turn 2 instructions into 1. It can never enable SIMD that was otherwise impossible (except maybe in an edge case where it dodges a register spill?) or eliminate a code block. The fact that contract couldn't accelerate a dot product in the above benchmarks (which was a surprise to me and could change in a different benchmark) further diminishes the performance arguments for it.

Meanwhile, contract is the only flag that can currently be enabled independently: muladd is easy to add manually, @fastmath is even easier (if one can suffer the extra flags, or if we get #49890), and both can be used in combination with @simd. But there is still no way (or plan for a way) to disable flags.

There will absolutely be cases (although not many cases) where someone requires that operations do not contract. Complex multiplication comes to mind, where implementing it with contract can cause imag(x*x') != 0.
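
A minimal sketch of that failure mode, using an explicit muladd to stand in for what contraction may do (the values are chosen only for illustration):

a, b = 1/3, 1/7             # real and imaginary parts of x
b * a - a * b               # == 0.0: both products round identically, so imag(x*x') cancels exactly
muladd(b, a, -(a * b))      # may be a tiny nonzero value: if the multiply fuses, its product is not
                            # rounded before the subtraction, so the cancellation is no longer exact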

P.S.
Maybe this should get a news item? Some people may observe performance regressions. It would be nice (and may avoid some bug reports) to note this change and suggest that the previous behavior and performance can be recovered via @fastmath @simd.
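
For instance, a news entry could point at something like this (a sketch; sumfast is just an illustrative name):

function sumfast(v)
    acc = zero(eltype(v))
    # @fastmath rewrites the loop body to use the fast-math intrinsics,
    # restoring the full set of flags that @simd alone no longer applies
    @fastmath @simd for x in v
        acc += x
    end
    return acc
end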

@giordano (Contributor):

> The fact that contract couldn't accelerate a dot product in the above benchmarks (which was a surprise to me and could change in a different benchmark) further diminishes the performance arguments for it.

Note that the benchmarks in #49405 (comment) were taken before relaxing the requirement for fusing muladd; now I get

julia> @btime dotsimd(x, y) setup=(T=Float16; N=1_000_000; x=randn(T, N); y=randn(T, N));
  60.271 μs (0 allocations: 0 bytes)

julia> @btime dotsimd(x, y) setup=(T=Float32; N=1_000_000; x=randn(T, N); y=randn(T, N));
  135.201 μs (0 allocations: 0 bytes)

julia> @btime dotsimd(x, y) setup=(T=Float64; N=1_000_000; x=randn(T, N); y=randn(T, N));
  297.843 μs (0 allocations: 0 bytes)

which recovers the same performance as on master for Float16.

@giordano marked this pull request as ready for review on May 29, 2023.
@@ -100,7 +100,7 @@ The object iterated over in a `@simd for` loop should be a one-dimensional range
 By using `@simd`, you are asserting several properties of the loop:

 * It is safe to execute iterations in arbitrary or overlapping order, with special consideration for reduction variables.
-* Floating-point operations on reduction variables can be reordered, possibly causing different results than without `@simd`.
+* Floating-point operations on reduction variables can be reordered or contracted, possibly causing different results than without `@simd`.
@vchuravy (Member, Author):

It's actually slightly broader, I think: it's the entire reduction chain, not just the reduction operations themselves.

Contributor:

Any suggestions for the wording?

@vchuravy added the "merge me" label (PR is reviewed; merge when all tests are passing) on Jun 27, 2023.
@vtjnash merged commit 663c58d into master on Jun 28, 2023.
@vtjnash deleted the vc/simd_fastmath branch on Jun 28, 2023.
@pchintalapudi removed the "merge me" label on Jun 28, 2023.