@avx seems not using multi threads with user-defined types #242

islent · 2021-04-17T10:20:15Z

islent
Apr 17, 2021

Hello everyone! I was trying to use @avx in for loops on my user-defined types. Unfortunately, I found LoopVectorization is not vectorizing these types for some reason. Here's my MWE (minimum working example):

using LoopVectorization
using ThreadsX
using BenchmarkTools

# MWE of user-defined type
struct Point{T}
    x::T
    y::T
end

import Base.:(+)
+(a::Point, b::Point) = Point(a.x + b.x, a.y + b.y)

import Base.zero
zero(::Type{Point{T}}) where T = Point(zero(T), zero(T))
zero(::Point{T}) where T = Point(zero(T), zero(T))

# Generate data
N = 10^7
PointArray = [Point(rand(), rand()) for i in 1:N];

function sumavx(a)
    out = zero(eltype(a))
    @avxt for i in eachindex(a)
        out += a[i]
    end
    out
end

## Benchmark on array of Point
# Single thread (CPU occupation ~ 15%)
@btime sum($PointArray)

# 8 threads (CPU occupation ~ 90%), less than 2x faster
@btime ThreadsX.sum($PointArray)

#! Even slower!!! not using sufficient threads (CPU occupation is around 15%)
@btime sumavx($PointArray)



## Now benchmark array of internal type
FloatArray = [rand() for i in 1:3*N];

# Single thread (CPU occupation ~ 15%)
@btime sum($FloatArray)

# 8 threads (CPU occupation ~ 90%), less than 2x faster
@btime ThreadsX.sum($FloatArray)

#! 8 threads (CPU occupation ~ 90%), less than 2x faster
@btime sumavx($FloatArray)


versioninfo()

Some of the important outputs are:

julia> # Single thread (CPU occupation ~ 15%)
julia> @btime sum($PointArray)
  7.279 ms (0 allocations: 0 bytes)
Point{Float64}(4.999986420886718e6, 5.0008962081742305e6)

julia> # 8 threads (CPU occupation ~ 90%), less than 2x faster
julia> @btime ThreadsX.sum($PointArray)
  4.310 ms (509 allocations: 39.53 KiB)
Point{Float64}(4.99998642088672e6, 5.00089620817423e6)

julia> #! Even slower!!! not using sufficient threads (CPU occupation is around 15%)
julia> @btime sumavx($PointArray)
  9.787 ms (0 allocations: 0 bytes)
Point{Float64}(4.999986420887153e6, 5.000896208173763e6)



julia> # Single thread (CPU occupation ~ 15%)
julia> @btime sum($FloatArray)
  10.407 ms (0 allocations: 0 bytes)
1.4998079878852893e7

julia> # 8 threads (CPU occupation ~ 90%), less than 2x faster
julia> @btime ThreadsX.sum($FloatArray)
  6.582 ms (510 allocations: 37.58 KiB)
1.4998079878852889e7

julia> #! 8 threads (CPU occupation ~ 90%), less than 2x faster
julia> @btime sumavx($FloatArray)
  6.489 ms (0 allocations: 0 bytes)
1.49980798788529e7

julia> versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Xeon(R) W-10885M CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_CUDA_USE_BINARYBUILDER = false
  JULIA_DEPOT_PATH = E:\.julia
  JULIA_NUM_THREADS = 8
  JULIA_EDITOR = code

In the case of user-defined type, CPUs are not fully occupied. It looks like LoopVectorization is not performing any optimization on Point type:

julia> function sumavx_manual(a)
           out = zero(eltype(a))
           @avx thread = 8 for i in eachindex(a)
               out += a[i]
           end
           out
       end
sumavx_manual (generic function with 1 method)

julia> @btime sumavx_manual($PointArray)
  9.770 ms (0 allocations: 0 bytes)
Point{Float64}(4.999986420887153e6, 5.000896208173763e6)

I have noticed some discussions related:

https://juliasimd.github.io/LoopVectorization.jl/latest/devdocs/evaluating_loops/
How to handle user defined functions? #55
However, I still have no idea of the best way to handle user-defined types in LoopVectorization. I would appreciate more novice-friendly examples in the documentation :D.

Answered by chriselrod

Apr 17, 2021

LoopVectorization only understands primitive element types, i.e. Union{Base.HWReal,Bool}:

julia> Union{Base.HWReal,Bool}
Union{Bool, Float32, Float64, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8}

and strided memory layouts.
It uses the function LoopVectorization.check_args to check arguments for compatibility, so if you're curious whether LoopVectorization will work with a type, you can try:

julia> LoopVectorization.check_args(PointArray)
false

If it returns false, that means it will not work and will instead simply apply @inbounds @fastmath to your loops and hope that works out.

In this case, it doesn't. Compare @inbounds @fastmath with @inbounds @simd:

julia> using LoopVect…

View full answer

chriselrod · 2021-04-17T17:23:44Z

chriselrod
Apr 17, 2021
Maintainer

LoopVectorization only understands primitive element types, i.e. Union{Base.HWReal,Bool}:

julia> Union{Base.HWReal,Bool}
Union{Bool, Float32, Float64, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8}

and strided memory layouts.
It uses the function LoopVectorization.check_args to check arguments for compatibility, so if you're curious whether LoopVectorization will work with a type, you can try:

julia> LoopVectorization.check_args(PointArray)
false

If it returns false, that means it will not work and will instead simply apply @inbounds @fastmath to your loops and hope that works out.

In this case, it doesn't. Compare @inbounds @fastmath with @inbounds @simd:

julia> using LoopVectorization

julia> using ThreadsX

julia> using BenchmarkTools

julia> # MWE of user-defined type
       struct Point{T}
           x::T
           y::T
       end

julia> import Base.:(+)

julia> +(a::Point, b::Point) = Point(a.x + b.x, a.y + b.y)
+ (generic function with 257 methods)

julia> import Base.zero

julia> zero(::Type{Point{T}}) where T = Point(zero(T), zero(T))
zero (generic function with 30 methods)

julia> zero(::Point{T}) where T = Point(zero(T), zero(T))
zero (generic function with 31 methods)

julia> # Generate data
       N = 10^7
10000000

julia> PointArray = [Point(rand(), rand()) for i in 1:N];

julia> function sumavx(a)
           out = zero(eltype(a))
           @avxt for i in eachindex(a)
               out += a[i]
           end
           out
       end
sumavx (generic function with 1 method)

julia> ## Benchmark on array of Point
       # Single thread (CPU occupation ~ 15%)
       @btime sum($PointArray)
  2.568 ms (0 allocations: 0 bytes)
Point{Float64}(5.000077086921011e6, 4.998879288595926e6)

julia> # 8 threads (CPU occupation ~ 90%), less than 2x faster
       @btime ThreadsX.sum($PointArray)
  2.671 ms (250 allocations: 17.50 KiB)
Point{Float64}(5.0000770869210055e6, 4.998879288595921e6)

julia> #! Even slower!!! not using sufficient threads (CPU occupation is around 15%)
       @btime sumavx($PointArray)
  9.375 ms (0 allocations: 0 bytes)
Point{Float64}(5.0000770869211e6, 4.998879288595839e6)

julia> ## Now benchmark array of internal type
       FloatArray = [rand() for i in 1:3*N];

julia> # Single thread (CPU occupation ~ 15%)
       @btime sum($FloatArray)
  5.347 ms (0 allocations: 0 bytes)
1.5000468678765366e7

julia> # 8 threads (CPU occupation ~ 90%), less than 2x faster
       @btime ThreadsX.sum($FloatArray)
  3.990 ms (249 allocations: 16.48 KiB)
1.5000468678765357e7

julia> #! 8 threads (CPU occupation ~ 90%), less than 2x faster
       @btime sumavx($FloatArray)
  3.955 ms (0 allocations: 0 bytes)
1.5000468678765424e7

julia> versioninfo()
Julia Version 1.7.0-DEV.925
Commit 5e93c29dde* (2021-04-14 21:41 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin20.3.0)
  CPU: Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cyclone)
Environment:
  JULIA_NUM_THREADS = auto

julia> Threads.nthreads()
4

julia> function sumfastmath(a)
           out = zero(eltype(a))
           @inbounds @fastmath for i in eachindex(a)
               out += a[i]
           end
           out
       end
sumfastmath (generic function with 1 method)

julia> function sumsimd(a)
           out = zero(eltype(a))
           @inbounds @simd for i in eachindex(a)
               out += a[i]
           end
           out
       end
sumsimd (generic function with 1 method)

julia> @btime sumfastmath($PointArray)
  9.375 ms (0 allocations: 0 bytes)
Point{Float64}(5.0000770869211e6, 4.998879288595839e6)

julia> @btime sumsimd($PointArray)
  2.541 ms (0 allocations: 0 bytes)
Point{Float64}(5.00007708692092e6, 4.998879288595935e6)

The fastmath version is (as predicted) exactly the same fast as the @avxt version. I also checked the assembly and confirmed that it is indeed identical, and also that it is not vectorized.
But the @simd version did vectorize.
This means that in the short term, it may be worth adding @simd to the fallback loop to help cases like this.
That'd be a good first issue (and would require minimal knowledge about the rest of LoopVectorization).

I intend to eventually fix this by taking advantage of the abstract interpreter to essentially deconstruct user defined types into their primitive components, but it will be many months before I have the time to do this.

So how to work around it?
The README gives an example using StructArrays.jl. That'd be a great approach if you can take it.
If not (e.g., if your inputs have to be an Array; you probably don't want to have to convert between the two!), an alternative is to use reinterpret.
Starting with Julia 1.6, you can do something like:

function sumavx2(a::Array{Point{T}}) where {T}
    x = zero(T)
    y = zero(T)
    b = reinterpret(reshape, T, a)
    @avxt for i in axes(b,2)
        x += b[1,i]
        y += b[2,i]
    end
    Point(x, y)
end

Not using threads is still faster on the M1 (this computer).

Note that for loops like this, memory bandwidth is the most important thing.
The CPU core itself spends most of its time doing nothing but waiting for data to show up. The M1 has much higher memory bandwidth than x86 CPUs, which is why my single core benchmark was so much faster here than yours.
It also seems like a single core has full access to the memory bandwidth, and hence extra cores did not help the M1: throwing more compute ability at a non-compute limited problem doesn't help.

But on x86 CPUs, I normally do find it helps. Aside from having lower bandwidth overall, it also seems like individual cores only have access to a fraction of the total available. So I normally find that x86 CPUs do get faster when adding more cores to memory bound tasks, but at a rate much lower than the actual number of cores used.
I'm not really sure about this / need to find some material to read (maybe I haven't looked at the appropriate parts of Intel's optimization manuals, or need to reread Agner Fog's manuals).

The reinterpret thing is of course awkward to do, as is breaking apart StructArrays so that LoopVectorization can see just the primitive operations. It'll be great to have LoopVectorization do that automatically, but like I said it'll be a while before that's possible.

2 replies

islent Apr 18, 2021
Author

Thank you for your help and patience. StructArrays works:

using StructArrays
PSA = StructArray(PointArray)

# use Base.sum
function sumSA_sum(a)
    return Point(sum(a.x), sum(a.y))
end

# Manually accumulate
function sumSA_manual(a)
    out = zero(first(a))
    x = out.x
    y = out.y
    for i in eachindex(a)
        @inbounds x += a.x[i]
        @inbounds y += a.y[i]
    end
    return Point(x,y)
end

# use ThreadsX
function sumSA_thread(a)
    return Point(ThreadsX.sum(a.x),ThreadsX.sum(a.y))
end

function sumSA_avx(a)
    out = zero(first(a))
    x = out.x
    y = out.y
    @avxt for i in eachindex(a)
        x += a.x[i]
        y += a.y[i]
    end
    return Point(x,y)
end

@btime sumSA_sum($PSA)
@btime sumSA_manual($PSA)
@btime sumSA_thread($PSA)
@btime sumSA_avx($PSA)

With N = 10^8 data points, and 8 threads, the result is really great:

julia> @btime sumSA_sum($PSA)
  73.526 ms (0 allocations: 0 bytes)
Point{Float64}(4.999687934504418e7, 4.9999620116163395e7)

julia> @btime sumSA_manual($PSA)
  101.894 ms (0 allocations: 0 bytes)
Point{Float64}(4.9996879345044665e7, 4.999962011616576e7)

julia> @btime sumSA_thread($PSA)
  45.160 ms (1025 allocations: 75.31 KiB)
Point{Float64}(4.999687934504417e7, 4.99996201161634e7)

julia> @btime sumSA_avx($PSA)
  44.795 ms (0 allocations: 0 bytes)
Point{Float64}(4.999687934504421e7, 4.9999620116163395e7)

In SIMD cases, LoopVectorization performs much better than ThreadsX:

using Base.Threads

function check(a)
    out = 0
    for i in eachindex(a)
        out += a.x[i] > 0.5
    end
    out
end

function check_inbound(a)
    out = 0
    for i in eachindex(a)
        @inbounds out += a.x[i] > 0.5
    end
    out
end

function check_inbound_atomic(a)
    out = Atomic{Int}(0)
    Threads.@threads for i in eachindex(a)
        @inbounds atomic_add!(out, Int(a.x[i] > 0.5))
    end
    out
end

function check_thread(a)
    ThreadsX.sum(i>0.5 for i in a.x)
end

function check_avx(a)
    out = 0
    @avxt for i in eachindex(a)
        out += a.x[i] > 0.5
    end
    out
end

@btime check($PSA)
@btime check_inbound($PSA)
@btime check_inbound_atomic($PSA)
@btime check_thread($PSA)
@btime check_avx($PSA)

julia> @btime check($PSA)
  71.235 ms (0 allocations: 0 bytes)
49995562

julia> @btime check_inbound($PSA)        # Unexpected much faster!
  36.148 ms (0 allocations: 0 bytes)
49995562

julia> @btime check_inbound_atomic($PSA) # Unexpected slow
  1.463 s (42 allocations: 4.17 KiB)
Atomic{Int64}(49995562)

julia> @btime check_thread($PSA)         # Allocations are slowing it down
  47.321 ms (511 allocations: 37.61 KiB)
49995562

julia> @btime check_avx($PSA)            # Returning `Float64`, not a big problem?
  22.282 ms (0 allocations: 0 bytes)
4.9995562e7

chriselrod Apr 18, 2021
Maintainer

Yeah, it shouldn't be returning Float64 there. I need to make it smarter about promotion; it sees Float64 are involved and therefore promotes everything to that.

FWIW, these are timings I get on a 7980XE (18 cores/36 threads), using PointerArray:

julia> @btime ThreadsX.sum($PointArray)
  1.784 ms (4121 allocations: 288.47 KiB)
Point{Float64}(4.998396863358341e6, 4.999648806282576e6)

julia> @btime sumavx2($PointArray)
  1.586 ms (0 allocations: 0 bytes)
Point{Float64}(4.998396863358339e6, 4.999648806282582e6)

Using PSA:

julia> @btime sumSA_sum($PSA)
  9.674 ms (0 allocations: 0 bytes)
Point{Float64}(4.998396863358342e6, 4.999648806282576e6)

julia> @btime sumSA_manual($PSA)
  12.659 ms (0 allocations: 0 bytes)
Point{Float64}(4.998396863358457e6, 4.99964880628344e6)

julia> @btime sumSA_thread($PSA)
  1.828 ms (8247 allocations: 545.12 KiB)
Point{Float64}(4.998396863358341e6, 4.999648806282576e6)

julia> @btime sumSA_avx($PSA)
  1.595 ms (0 allocations: 0 bytes)
Point{Float64}(4.998396863358343e6, 4.999648806282575e6)

julia> @btime ThreadsX.sum($PSA)
  1.887 ms (4120 allocations: 304.41 KiB)
Point{Float64}(4.998396863358341e6, 4.999648806282576e6)

julia> @btime sum($PSA)
  8.881 ms (0 allocations: 0 bytes)
Point{Float64}(4.998396863358342e6, 4.999648806282576e6)

Note that the CPU actually has to do less work with the StructArrays when SIMD, but because memory bandwidth is almost all that matters, we don't see much benefit.

Also, note that adding @simd, @fastmath, or @avx would make your manual sum match the built in sum.
To SIMD a sum, you need to be able to change the order the sums are performed.
To SIMD, it needs to do multiple at a time, adding batches to separate accumulators.
But the sum as written means to do them one after the other to the same accumulator. Hence some annotation is needed to give the compiler permission to change the order.

No annotations are needed for integers though, like in the check example, because changing the order of integer operations cannot change the answer, hence the compiler doesn't need your permission to do it.

check results:

julia> @btime check($PSA)
  8.054 ms (0 allocations: 0 bytes)
4997001

julia> @btime check_inbound($PSA)
  4.344 ms (0 allocations: 0 bytes)
4997001

julia> @btime check_inbound_atomic($PSA)
  137.086 ms (183 allocations: 16.17 KiB)
Atomic{Int64}(4997001)

julia> @btime check_thread($PSA)
  2.040 ms (4123 allocations: 272.55 KiB)
4997001

julia> @btime check_avx($PSA)
  629.274 μs (0 allocations: 0 bytes)
4.997001e6

As for why the atomic version was slow: you don't want the threads to have to talk to each other. Communicating memory and synchronizing between threads is much slower than the actual computations.
Try something like this instead:

function check_inbound_threads(a)
    # The 64 should be 128 on an Apple M1. Could just hardcore 128, or use ` VectorizationBase.cache_linesize()`
    out = zeros(Int, 64 ÷ sizeof(Int), Threads.nthreads())
    Threads.@threads for i in eachindex(a)
        @inbounds out[1,Threads.threadid()] += Int(a.x[i] > 0.5)
    end
    sum(view(out, 1, :))
end
@btime check_inbound_threads($PSA)

The idea is to use a separate accumulator per thread, and also to place them far enough apart in memory to avoid false sharing, but that might not actually be necessary because the compiler should hoist the load/store to out out of the loop. Meaning, the compiler should create code only storing to out once per thread, instead of actually loading, adding, and storing like out code says. Loading/adding/storing on every iteration would be very slow.
I get

julia> @btime check_inbound_threads($PSA)
  731.921 μs (182 allocations: 18.50 KiB)
4997001

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

@avx seems not using multi threads with user-defined types #242

{{title}}

Replies: 1 comment 2 replies

{{title}}

{{title}}

{{title}}

Select a reply

@avx seems not using multi threads with user-defined types #242

islent Apr 17, 2021

Replies: 1 comment · 2 replies

chriselrod Apr 17, 2021 Maintainer

islent Apr 18, 2021 Author

chriselrod Apr 18, 2021 Maintainer

islent
Apr 17, 2021

Replies: 1 comment 2 replies

chriselrod
Apr 17, 2021
Maintainer

islent Apr 18, 2021
Author

chriselrod Apr 18, 2021
Maintainer