# @avx does not seem to use multiple threads with user-defined types #242
Hello everyone! I was trying to use `@avxt` with a user-defined type:

```julia
using LoopVectorization
using ThreadsX
using BenchmarkTools
# MWE of user-defined type
struct Point{T}
    x::T
    y::T
end
import Base.:(+)
+(a::Point, b::Point) = Point(a.x + b.x, a.y + b.y)
import Base.zero
zero(::Type{Point{T}}) where T = Point(zero(T), zero(T))
zero(::Point{T}) where T = Point(zero(T), zero(T))
# Generate data
N = 10^7
PointArray = [Point(rand(), rand()) for i in 1:N];
function sumavx(a)
    out = zero(eltype(a))
    @avxt for i in eachindex(a)
        out += a[i]
    end
    out
end
## Benchmark on array of Point
# Single thread (CPU occupation ~ 15%)
@btime sum($PointArray)
# 8 threads (CPU occupation ~ 90%), less than 2x faster
@btime ThreadsX.sum($PointArray)
#! Even slower!!! not using sufficient threads (CPU occupation is around 15%)
@btime sumavx($PointArray)
## Now benchmark array of internal type
FloatArray = [rand() for i in 1:3*N];
# Single thread (CPU occupation ~ 15%)
@btime sum($FloatArray)
# 8 threads (CPU occupation ~ 90%), less than 2x faster
@btime ThreadsX.sum($FloatArray)
#! 8 threads (CPU occupation ~ 90%), less than 2x faster
@btime sumavx($FloatArray)
versioninfo()
```

Some of the important outputs are:

```julia
julia> # Single thread (CPU occupation ~ 15%)
julia> @btime sum($PointArray)
7.279 ms (0 allocations: 0 bytes)
Point{Float64}(4.999986420886718e6, 5.0008962081742305e6)
julia> # 8 threads (CPU occupation ~ 90%), less than 2x faster
julia> @btime ThreadsX.sum($PointArray)
4.310 ms (509 allocations: 39.53 KiB)
Point{Float64}(4.99998642088672e6, 5.00089620817423e6)
julia> #! Even slower!!! not using sufficient threads (CPU occupation is around 15%)
julia> @btime sumavx($PointArray)
9.787 ms (0 allocations: 0 bytes)
Point{Float64}(4.999986420887153e6, 5.000896208173763e6)
julia> # Single thread (CPU occupation ~ 15%)
julia> @btime sum($FloatArray)
10.407 ms (0 allocations: 0 bytes)
1.4998079878852893e7
julia> # 8 threads (CPU occupation ~ 90%), less than 2x faster
julia> @btime ThreadsX.sum($FloatArray)
6.582 ms (510 allocations: 37.58 KiB)
1.4998079878852889e7
julia> #! 8 threads (CPU occupation ~ 90%), less than 2x faster
julia> @btime sumavx($FloatArray)
6.489 ms (0 allocations: 0 bytes)
1.49980798788529e7
julia> versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Xeon(R) W-10885M CPU @ 2.40GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
JULIA_CUDA_USE_BINARYBUILDER = false
JULIA_DEPOT_PATH = E:\.julia
JULIA_NUM_THREADS = 8
JULIA_EDITOR = code
```

In the case of the user-defined type, the CPUs are not fully occupied. It looks like `@avxt` is only using a single thread. I also tried setting the number of threads explicitly:

```julia
julia> function sumavx_manual(a)
           out = zero(eltype(a))
           @avx thread = 8 for i in eachindex(a)
               out += a[i]
           end
           out
       end
sumavx_manual (generic function with 1 method)
julia> @btime sumavx_manual($PointArray)
9.770 ms (0 allocations: 0 bytes)
Point{Float64}(4.999986420887153e6, 5.000896208173763e6)
```

I have noticed some related discussions:
---
LoopVectorization only understands primitive element types, i.e.

```julia
julia> Union{Base.HWReal,Bool}
Union{Bool, Float32, Float64, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8}
```

and strided memory layouts. It uses the function `LoopVectorization.check_args` to check arguments for compatibility, so if you're curious whether LoopVectorization will work with a type, you can try:

```julia
julia> LoopVectorization.check_args(PointArray)
false
```

If it returns `false`, that means it will not work and will instead simply apply `@inbounds @fastmath` to your loops and hope that works out. In this case, it doesn't. Compare `@inbounds @fastmath` with `@inbounds @simd`:

```julia
julia> using LoopVectorization
julia> using ThreadsX
julia> using BenchmarkTools
julia> # MWE of user-defined type
struct Point{T}
x::T
y::T
end
julia> import Base.:(+)
julia> +(a::Point, b::Point) = Point(a.x + b.x, a.y + b.y)
+ (generic function with 257 methods)
julia> import Base.zero
julia> zero(::Type{Point{T}}) where T = Point(zero(T), zero(T))
zero (generic function with 30 methods)
julia> zero(::Point{T}) where T = Point(zero(T), zero(T))
zero (generic function with 31 methods)
julia> # Generate data
N = 10^7
10000000
julia> PointArray = [Point(rand(), rand()) for i in 1:N];
julia> function sumavx(a)
           out = zero(eltype(a))
           @avxt for i in eachindex(a)
               out += a[i]
           end
           out
       end
sumavx (generic function with 1 method)
julia> ## Benchmark on array of Point
# Single thread (CPU occupation ~ 15%)
@btime sum($PointArray)
2.568 ms (0 allocations: 0 bytes)
Point{Float64}(5.000077086921011e6, 4.998879288595926e6)
julia> # 8 threads (CPU occupation ~ 90%), less than 2x faster
@btime ThreadsX.sum($PointArray)
2.671 ms (250 allocations: 17.50 KiB)
Point{Float64}(5.0000770869210055e6, 4.998879288595921e6)
julia> #! Even slower!!! not using sufficient threads (CPU occupation is around 15%)
@btime sumavx($PointArray)
9.375 ms (0 allocations: 0 bytes)
Point{Float64}(5.0000770869211e6, 4.998879288595839e6)
julia> ## Now benchmark array of internal type
FloatArray = [rand() for i in 1:3*N];
julia> # Single thread (CPU occupation ~ 15%)
@btime sum($FloatArray)
5.347 ms (0 allocations: 0 bytes)
1.5000468678765366e7
julia> # 8 threads (CPU occupation ~ 90%), less than 2x faster
@btime ThreadsX.sum($FloatArray)
3.990 ms (249 allocations: 16.48 KiB)
1.5000468678765357e7
julia> #! 8 threads (CPU occupation ~ 90%), less than 2x faster
@btime sumavx($FloatArray)
3.955 ms (0 allocations: 0 bytes)
1.5000468678765424e7
julia> versioninfo()
Julia Version 1.7.0-DEV.925
Commit 5e93c29dde* (2021-04-14 21:41 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin20.3.0)
CPU: Apple M1
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, cyclone)
Environment:
JULIA_NUM_THREADS = auto
julia> Threads.nthreads()
4
julia> function sumfastmath(a)
           out = zero(eltype(a))
           @inbounds @fastmath for i in eachindex(a)
               out += a[i]
           end
           out
       end
sumfastmath (generic function with 1 method)
julia> function sumsimd(a)
           out = zero(eltype(a))
           @inbounds @simd for i in eachindex(a)
               out += a[i]
           end
           out
       end
sumsimd (generic function with 1 method)
julia> @btime sumfastmath($PointArray)
9.375 ms (0 allocations: 0 bytes)
Point{Float64}(5.0000770869211e6, 4.998879288595839e6)
julia> @btime sumsimd($PointArray)
2.541 ms (0 allocations: 0 bytes)
Point{Float64}(5.00007708692092e6, 4.998879288595935e6)
```

The fastmath version is (as predicted) exactly as fast as the `@avxt` version.

I intend to eventually fix this by taking advantage of the abstract interpreter to essentially deconstruct user-defined types into their primitive components, but it will be many months before I have the time to do this.

So how to work around it?

```julia
function sumavx2(a::Array{Point{T}}) where {T}
    x = zero(T)
    y = zero(T)
    b = reinterpret(reshape, T, a)
    @avxt for i in axes(b,2)
        x += b[1,i]
        y += b[2,i]
    end
    Point(x, y)
end
```

Not using threads is still faster on the M1 (this computer). Note that for loops like this, memory bandwidth is the most important thing. But on x86 CPUs, I normally do find that threading helps: aside from having lower bandwidth overall, it also seems like individual x86 cores only have access to a fraction of the total bandwidth available. So I normally find that x86 CPUs do get faster when adding more cores to memory-bound tasks, but at a rate much lower than the actual number of cores used.
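To put a rough number on the bandwidth point, here is a back-of-the-envelope estimate based on the timings above; the arithmetic is mine, not from the original reply:

```julia
# 10^7 Point{Float64} values at 16 bytes each is about 160 MB read per sum.
# The single-threaded @simd sum above took roughly 2.54 ms on the M1, so:
bytes_read = 10^7 * sizeof(Point{Float64})   # 160_000_000 bytes
bytes_read / 2.541e-3 / 1e9                  # ≈ 63 GB/s effective read bandwidth
```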
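As a sanity check on the `reinterpret` workaround, one can verify that the reinterpreted view passes `check_args`. The snippet below is a sketch of such a check, assuming the definitions above are loaded; it is not part of the original reply:

```julia
# The reinterpreted view has a primitive element type (Float64) and a strided
# layout, so LoopVectorization should accept it, unlike the original PointArray.
b = reinterpret(reshape, Float64, PointArray)   # 2×N matrix: row 1 = x, row 2 = y
LoopVectorization.check_args(b)                 # expected to return true
LoopVectorization.check_args(PointArray)        # returns false, as shown above
sumavx2(PointArray)                             # the workaround applied to the data
```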