Call-site splatting optimization #13359

Open
timholy opened this issue Sep 29, 2015 · 9 comments
Labels: performance Must go faster

@timholy (Member) commented Sep 29, 2015

Splatting has a known penalty, but today I looked into it a bit more carefully and I wonder if there's an easy fix for part of the problem. For those who don't like @generated functions, this might be a good opportunity to reduce their numbers, since I think my major use for them now is to avoid the splatting penalty.
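For anyone less familiar with the term, splatting here just means spreading a tuple into individual positional arguments at a call site. A minimal illustration (plain Julia, independent of the benchmark below):

xs = (1, 2, 3)
B = rand(3, 3, 3)
B[xs...]                   # equivalent to B[1, 2, 3]
f(args...) = length(args)  # a varargs method
f(xs...)                   # splats the tuple, so args === (1, 2, 3) and this returns 3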

First the demo:

# bar1 stays out-of-line while bar2 may be inlined; the "a" callers below splat
# at the call site, and the "b" callers use @generated functions to spell the
# indices out explicitly.
@noinline function bar1(A, xs...)
    A[xs...]
end

@inline function bar2(A, xs...)
    A[xs...]
end

function call_bar1a(A, n, xs...)
    s = zero(eltype(A))
    for i = 1:n
        s += bar1(A, xs...)
    end
    s
end

function call_bar2a(A, n, xs...)
    s = zero(eltype(A))
    for i = 1:n
        s += bar2(A, xs...)
    end
    s
end

@generated function call_bar1b(A, n, xs...)
    xargs = [:(xs[$d]) for d = 1:length(xs)]
    quote
        s = zero(eltype(A))
        for i = 1:n
            s += bar1(A, $(xargs...))
        end
        s
    end
end

@generated function call_bar2b(A, n, xs...)
    xargs = [:(xs[$d]) for d = 1:length(xs)]
    quote
        s = zero(eltype(A))
        for i = 1:n
            s += bar2(A, $(xargs...))
        end
        s
    end
end

A = rand(3,3,3)
call_bar1a(A, 1, 1, 2, 3)
call_bar2a(A, 1, 1, 2, 3)
call_bar1b(A, 1, 1, 2, 3)
call_bar2b(A, 1, 1, 2, 3)

@time 1
@time call_bar1a(A, 10^6, 1, 2, 3)
@time call_bar2a(A, 10^6, 1, 2, 3)
@time call_bar1b(A, 10^6, 1, 2, 3)
@time call_bar2b(A, 10^6, 1, 2, 3)

Results:

julia> include("/tmp/test_splat.jl")
  0.000001 seconds (3 allocations: 144 bytes)
  0.760468 seconds (7.00 M allocations: 137.329 MB, 3.72% gc time)
  0.761979 seconds (7.00 M allocations: 137.329 MB, 3.62% gc time)
  0.563945 seconds (5.00 M allocations: 106.812 MB, 4.30% gc time)
  0.003080 seconds (6 allocations: 192 bytes)

You can see there's an absolutely enormous, deadly penalty for any kind of splatting. Since I write a lot of code that has to work in arbitrary dimensions, it's a major contributor to why I tend to write so many @generated functions.

Now here's the fun part: look at the difference in @code_typed for call_bar2a and call_bar2b (as a screenshot so you can see the colors):
[screenshot: colorized @code_typed output for call_bar2a and call_bar2b]

I think the only difference is top(getfield) vs Base.getfield. (EDIT: I deleted the line number annotations to reduce the size of this diff.)
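For reference, the same comparison can be reproduced without a screenshot by calling code_typed directly (argument types taken from the warm-up calls above; this is just one way of looking at it, assuming the same session as the demo):

t_a = code_typed(call_bar2a, (Array{Float64,3}, Int, Int, Int, Int))
t_b = code_typed(call_bar2b, (Array{Float64,3}, Int, Int, Int, Int))
println(t_a[1])  # typed AST for the plain splatting caller
println(t_b[1])  # typed AST for the @generated caller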

@vtjnash (Member) commented Oct 4, 2015

"deadly penalty for any kind of splatting"

Dispatch/codegen is doing a bit of a bait-and-switch and using significantly less memory for the regular function, since it has compiled one version for xs::Tuple{Int, Vararg{Int}} instead of trying to super-specialize it (as @generated requires it to do). Regardless, the calling convention for the varargs function is making a number of fundamental mistakes that significantly increase the amount of copying and boxing over what it should require.
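A small illustration of the signature point, using the definitions from the demo above (which specialization actually gets compiled is an implementation detail):

# One compiled method instance with a Vararg signature covers Int-splat calls
# of any arity, because the concrete argument tuples are all subtypes of it:
Tuple{Int,Int,Int} <: Tuple{Int,Vararg{Int}}      # true
Tuple{Int,Int,Int,Int} <: Tuple{Int,Vararg{Int}}  # also true
# An @generated function, by contrast, must be specialized on the exact
# concrete argument types, e.g. Tuple{Int,Int,Int} itself.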

@ViralBShah added the performance Must go faster label on Oct 4, 2015
@timholy (Member, Author) commented Oct 4, 2015

If erasing those mistakes would narrow this gap, that alone would be a big advance.

There are a few cases where you need specialization, but we might get that through #11242.

@timholy (Member, Author) commented Oct 4, 2015

Very interesting: I just discovered that there are circumstances where things aren't so pessimistic. I've long thought that a great milestone would be to come up with an implementation of sub2ind that didn't rely on @generated functions (see #10337 for discussion, where a serious attempt was made to do without them).

However, with aggressive use of @inline, we now seem to be there:

@inline sb2ind(dims, I::Tuple) = sb2ind(dims, I...)
@inline sb2ind(dims, I::Integer...) = _sb2ind(dims, 1, 1, I...)
_sb2ind(::Tuple{}, dimsprod, indx) = indx
@inline function _sb2ind(dims, dimsprod, indx, i::Integer, I::Integer...)
    d = dims[1]
    _sb2ind(Base.tail(dims), dimsprod*d, indx+(i-1)*dimsprod, I...)
end

# Here's what we're benchmarking against. Let's make sure there's no splatting penalty.
@generated function Base.sub2ind{N}(dims, I::CartesianIndex{N})
    iexprs = [:(I[$d]) for d = 1:N]
    meta = Expr(:meta, :inline)
    quote
        $meta
        sub2ind(dims, $(iexprs...))
    end
end

function run_sub2ind(dims)
    s = 0
    for I in CartesianRange(dims)
        s += sub2ind(dims, I)
    end
    s
end

function run_sb2ind(dims)
    s = 0
    for I in CartesianRange(dims)
        s += sb2ind(dims, I.I)
    end
    s
end

dims = (100, 100, 50)
run_sub2ind(dims)
run_sb2ind(dims)

println("Warm up @time:")
@time 1
println("sub2ind (@generated functions):")
@time run_sub2ind(dims)
println("sb2ind (lispy-version):")
@time run_sb2ind(dims)

Results on my machine:

julia> include("sb2ind.jl")
Warm up @time:
  0.000001 seconds (3 allocations: 144 bytes)
sub2ind (@generated functions):
  0.001094 seconds (5 allocations: 176 bytes)
sb2ind (lispy-version):
  0.001093 seconds (5 allocations: 176 bytes)
125000250000

Note that you can't get rid of any of these @inlines without a performance hit, although getting rid of the first one has very minor impact. (The other two are huge.)
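To make the recursion concrete, here is how the lispy version unrolls for a small case (hand-traced from the definitions above; indices assumed in-bounds):

# sb2ind((3,3,3), (1,2,3)) expands roughly as
#   _sb2ind((3,3,3), 1, 1, 1, 2, 3)
#   _sb2ind((3,3),   3, 1 + (1-1)*1, 2, 3)   # indx = 1,  dimsprod = 3
#   _sb2ind((3,),    9, 1 + (2-1)*3, 3)      # indx = 4,  dimsprod = 9
#   _sb2ind((),     27, 4 + (3-1)*9)         # indx = 22
sb2ind((3,3,3), (1,2,3))  # == 22, the same as sub2ind((3,3,3), 1, 2, 3)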

This is really good news! This could actually shake things up a lot, if it generalized to more cases (like the one that started this post). CC @Jutho, @mbauman.

@Jutho (Contributor) commented Oct 4, 2015

I thought we also had a non-@generated version that showed no allocation for N up to around 6, but the problem was always going beyond that. I haven't analysed the behaviour with this code, but a quick test shows:

# dims = (6,6,6,6,6,6)
sub2ind (@generated functions):
  0.000129 seconds (5 allocations: 176 bytes)
sb2ind (lispy-version):
  0.000129 seconds (5 allocations: 176 bytes)

# dims = (7,7,7,7,7,7,7)
sub2ind (@generated functions):
  0.241855 seconds (2.47 M allocations: 113.089 MB, 2.38% gc time)
sb2ind (lispy-version):
  0.002786 seconds (5 allocations: 176 bytes)

# dims = (8,8,8,8,8,8,8,8)
sub2ind (@generated functions):
 15.243509 seconds (150.99 M allocations: 10.250 GB, 3.94% gc time)
sb2ind (lispy-version):
 15.378129 seconds (167.77 M allocations: 11.500 GB, 4.39% gc time)

I was surprised, though, to see the @generated definition start allocating at a lower dimensionality than the lispy one.

@yuyichao (Contributor) commented Oct 5, 2015

Is this a dup of #5402? If so, should we close that one in favor of this (or the other way around)?

@timholy (Member, Author) commented Oct 6, 2015

This issue is about call-site splatting; that one seems focused on declaration-site splatting.
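To make the distinction concrete (illustrative names only, not from either issue):

# Declaration-site splatting: the *method* is declared with varargs.
g(xs...) = sum(xs)

# Call-site splatting: a tuple is spread into positional arguments at the call.
t = (1, 2, 3)
g(t...)       # splats t at the call site
g(1, 2, 3)    # the same call without splatting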

@timholy (Member, Author) commented Oct 6, 2015

@Jutho, I hadn't remembered that we'd gone up to such high dimensionality (see #10337 (comment)); but that's a complicated thread, so I may be misreading it.

I suspect this is just the MAX constants defined in inference.jl

(dratted tabbing and key binding messed up the original version of this)

@timholy (Member, Author) commented Feb 6, 2016

I can't find the link, but I recall that @eschnett discovered another example, which I seem to remember was something like this:

julia> @inline myadd(x,y) = x+y
myadd (generic function with 1 method)

julia> @inline myadd(x,y,z,a...) = myadd(x+y,z,a...)
myadd (generic function with 2 methods)

julia> @code_native myadd(1,2)
        .text
Filename: none
Source line: 1
        pushq   %rbp
        movq    %rsp, %rbp
Source line: 1
        addq    %rsi, %rdi
        movq    %rdi, %rax
        popq    %rbp
        ret

julia> @code_native myadd(1,2,3)
        .text
Filename: none
Source line: 1
        pushq   %rbp
        movq    %rsp, %rbp
Source line: 1
        pushq   %rbx
        subq    $24, %rsp
        movq    $2, -32(%rbp)
        movabsq $jl_pgcstack, %rbx
        movq    (%rbx), %rax
        movq    %rax, -24(%rbp)
        leaq    -32(%rbp), %rax
        movq    %rax, (%rbx)
        movq    $0, -16(%rbp)
        movq    (%rsi), %rcx
        movq    8(%rsi), %rax
        movq    (%rax), %rdi
Source line: 1
        movabsq $jl_box_int64, %rax
        addq    (%rcx), %rdi
Source line: 1
        movq    16(%rsi), %rcx
Source line: 1
        addq    (%rcx), %rdi
        callq   *%rax
        movq    -24(%rbp), %rcx
        movq    %rcx, (%rbx)
        addq    $24, %rsp
        popq    %rbx
        popq    %rbp
        ret

julia> myadd3(x,y,z) = (x+y)+z
myadd3 (generic function with 1 method)

julia> @code_native myadd3(1,2,3)
        .text
Filename: none
Source line: 1
        pushq   %rbp
        movq    %rsp, %rbp
Source line: 1
        addq    %rsi, %rdi
        leaq    (%rdi,%rdx), %rax
        popq    %rbp
        ret

Interestingly, @code_typed is almost identical for the two:

julia> @code_typed myadd3(1,2,3)
1-element Array{Any,1}:
 :($(Expr(:lambda, Any[:x,:y,:z], Any[Any[Any[:x,Int64,0],Any[:y,Int64,0],Any[:z,Int64,0]],Any[],Any[],Any[]], :(begin  # none, line 1:
        return (Base.box)(Base.Int,(Base.add_int)((Base.box)(Base.Int,(Base.add_int)(x::Int64,y::Int64)),z::Int64))
    end::Int64))))

julia> @code_typed myadd(1,2,3)
1-element Array{Any,1}:
 :($(Expr(:lambda, Any[:x,:y,:z,:(a::Any...)], Any[Any[Any[:x,Int64,0],Any[:y,Int64,0],Any[:z,Int64,0],Any[:a,Tuple{},0]],Any[],Any[],Any[]], :(begin 
        $(Expr(:meta, :inline)) # none, line 1:
        return (Base.box)(Base.Int,(Base.add_int)((Base.box)(Base.Int,(Base.add_int)(x::Int64,y::Int64)),z::Int64))
    end::Int64))))
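One way to narrow down where the boxing comes from (a suggestion, not something from the original discussion): compare the LLVM IR as well, since the jl_box_int64 call visible in the native code for myadd(1,2,3) should also appear there if the boxing is introduced by codegen / the varargs calling convention rather than by inference:

@code_llvm myadd(1, 2, 3)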

@stevengj (Member) commented:

I added a version of Tim's benchmark to BaseBenchmarks.

@stevengj removed the potential benchmark label on Oct 19, 2016