Call-site splatting optimization #13359

Open
timholy opened this issue Sep 29, 2015 · 9 comments
Labels: performance Must go faster

@timholy (Member) commented Sep 29, 2015

Splatting has a known penalty, but today I looked into it a bit more carefully and I wonder if there's an easy fix for part of the problem. For those who don't like @generated functions, this might be a good opportunity to reduce their numbers, since I think my major use for them now is to avoid the splatting penalty.
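For anyone less familiar with the term, splatting here just means spreading a tuple into individual positional arguments at a call site. A minimal illustration (plain Julia, independent of the benchmark below):

xs = (1, 2, 3)
B = rand(3, 3, 3)
B[xs...]                   # equivalent to B[1, 2, 3]
f(args...) = length(args)  # a varargs method
f(xs...)                   # splats the tuple, so args === (1, 2, 3) and this returns 3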

First the demo:

# bar1 stays out-of-line while bar2 may be inlined; the "a" callers below splat
# at the call site, and the "b" callers use @generated functions to spell the
# indices out explicitly.
@noinline function bar1(A, xs...)
    A[xs...]
end

@inline function bar2(A, xs...)
    A[xs...]
end

function call_bar1a(A, n, xs...)
    s = zero(eltype(A))
    for i = 1:n
        s += bar1(A, xs...)
    end
    s
end

function call_bar2a(A, n, xs...)
    s = zero(eltype(A))
    for i = 1:n
        s += bar2(A, xs...)
    end
    s
end

@generated function call_bar1b(A, n, xs...)
    xargs = [:(xs[$d]) for d = 1:length(xs)]
    quote
        s = zero(eltype(A))
        for i = 1:n
            s += bar1(A, $(xargs...))
        end
        s
    end
end

@generated function call_bar2b(A, n, xs...)
    xargs = [:(xs[$d]) for d = 1:length(xs)]
    quote
        s = zero(eltype(A))
        for i = 1:n
            s += bar2(A, $(xargs...))
        end
        s
    end
end

A = rand(3,3,3)
call_bar1a(A, 1, 1, 2, 3)
call_bar2a(A, 1, 1, 2, 3)
call_bar1b(A, 1, 1, 2, 3)
call_bar2b(A, 1, 1, 2, 3)

@time 1
@time call_bar1a(A, 10^6, 1, 2, 3)
@time call_bar2a(A, 10^6, 1, 2, 3)
@time call_bar1b(A, 10^6, 1, 2, 3)
@time call_bar2b(A, 10^6, 1, 2, 3)

Results:

julia> include("/tmp/test_splat.jl")
  0.000001 seconds (3 allocations: 144 bytes)
  0.760468 seconds (7.00 M allocations: 137.329 MB, 3.72% gc time)
  0.761979 seconds (7.00 M allocations: 137.329 MB, 3.62% gc time)
  0.563945 seconds (5.00 M allocations: 106.812 MB, 4.30% gc time)
  0.003080 seconds (6 allocations: 192 bytes)

You can see there's an absolutely enormous, deadly penalty for any kind of splatting. Since I write a lot of code that has to work in arbitrary dimensions, it's a major contributor to why I tend to write so many @generated functions.

Now here's the fun part: look at the difference in @code_typed for call_bar2a and call_bar2b (as a screenshot so you can see the colors):
[screenshot: colorized @code_typed output for call_bar2a and call_bar2b]

I think the only difference is top(getfield) vs Base.getfield. (EDIT: I deleted the line number annotations to reduce the size of this diff.)
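For reference, the same comparison can be reproduced without a screenshot by calling code_typed directly (argument types taken from the warm-up calls above; this is just one way of looking at it, assuming the same session as the demo):

t_a = code_typed(call_bar2a, (Array{Float64,3}, Int, Int, Int, Int))
t_b = code_typed(call_bar2b, (Array{Float64,3}, Int, Int, Int, Int))
println(t_a[1])  # typed AST for the plain splatting caller
println(t_b[1])  # typed AST for the @generated caller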

@vtjnash (Member) commented Oct 4, 2015

"deadly penalty for any kind of splatting"

Dispatch/codegen is doing a bit of a bait-and-switch and using significantly less memory for the regular function, since it has compiled one version for xs::Tuple{Int, Vararg{Int}} instead of trying to super-specialize it (as @generated requires it to do). Regardless, the calling convention for the varargs function is making a number of fundamental mistakes that significantly increase the amount of copying and boxing over what it should require.
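A small illustration of the signature point, using the definitions from the demo above (which specialization actually gets compiled is an implementation detail):

# One compiled method instance with a Vararg signature covers Int-splat calls
# of any arity, because the concrete argument tuples are all subtypes of it:
Tuple{Int,Int,Int} <: Tuple{Int,Vararg{Int}}      # true
Tuple{Int,Int,Int,Int} <: Tuple{Int,Vararg{Int}}  # also true
# An @generated function, by contrast, must be specialized on the exact
# concrete argument types, e.g. Tuple{Int,Int,Int} itself.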

@ViralBShah added the performance Must go faster label on Oct 4, 2015
@timholy (Member, Author) commented Oct 4, 2015

If erasing those mistakes would narrow this gap, that alone would be a big advance.

There are a few cases where you need specialization, but we might get that through #11242.

@timholy (Member, Author) commented Oct 4, 2015

Very interesting: I just discovered that there are circumstances where things aren't so pessimistic. I've long thought that a great milestone would be to come up with an implementation of sub2ind that didn't rely on @generated functions (see #10337 for discussion, where a serious attempt was made to do without them).

However, with aggressive use of @inline, we now seem to be there:

@inline sb2ind(dims, I::Tuple) = sb2ind(dims, I...)
@inline sb2ind(dims, I::Integer...) = _sb2ind(dims, 1, 1, I...)
_sb2ind(::Tuple{}, dimsprod, indx) = indx
@inline function _sb2ind(dims, dimsprod, indx, i::Integer, I::Integer...)
    d = dims[1]
    _sb2ind(Base.tail(dims), dimsprod*d, indx+(i-1)*dimsprod, I...)
end

# Here's what we're benchmarking against. Let's make sure there's no splatting penalty.
@generated function Base.sub2ind{N}(dims, I::CartesianIndex{N})
    iexprs = [:(I[$d]) for d = 1:N]
    meta = Expr(:meta, :inline)
    quote
        $meta
        sub2ind(dims, $(iexprs...))
    end
end

function run_sub2ind(dims)
    s = 0
    for I in CartesianRange(dims)
        s += sub2ind(dims, I)
    end
    s
end

function run_sb2ind(dims)
    s = 0
    for I in CartesianRange(dims)
        s += sb2ind(dims, I.I)
    end
    s
end

dims = (100, 100, 50)
run_sub2ind(dims)
run_sb2ind(dims)

println("Warm up @time:")
@time 1
println("sub2ind (@generated functions):")
@time run_sub2ind(dims)
println("sb2ind (lispy-version):")
@time run_sb2ind(dims)

Results on my machine:

julia> include("sb2ind.jl")
Warm up @time:
  0.000001 seconds (3 allocations: 144 bytes)
sub2ind (@generated functions):
  0.001094 seconds (5 allocations: 176 bytes)
sb2ind (lispy-version):
  0.001093 seconds (5 allocations: 176 bytes)
125000250000

Note that you can't get rid of any of these @inlines without a performance hit, although getting rid of the first one has very minor impact. (The other two are huge.)
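To make the recursion concrete, here is how the lispy version unrolls for a small case (hand-traced from the definitions above; indices assumed in-bounds):

# sb2ind((3,3,3), (1,2,3)) expands roughly as
#   _sb2ind((3,3,3), 1, 1, 1, 2, 3)
#   _sb2ind((3,3),   3, 1 + (1-1)*1, 2, 3)   # indx = 1,  dimsprod = 3
#   _sb2ind((3,),    9, 1 + (2-1)*3, 3)      # indx = 4,  dimsprod = 9
#   _sb2ind((),     27, 4 + (3-1)*9)         # indx = 22
sb2ind((3,3,3), (1,2,3))  # == 22, the same as sub2ind((3,3,3), 1, 2, 3)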

This is really good news! This could actually shake things up a lot, if it generalized to more cases (like the one that started this post). CC @Jutho, @mbauman.

@Jutho (Contributor) commented Oct 4, 2015

I thought we also had a non-@generated version that showed no allocation for N up to around 6, but the problem was always going beyond that. I haven't analysed the behaviour with this code, but a quick test shows:

# dims = (6,6,6,6,6,6)
sub2ind (@generated functions):
  0.000129 seconds (5 allocations: 176 bytes)
sb2ind (lispy-version):
  0.000129 seconds (5 allocations: 176 bytes)

# dims = (7,7,7,7,7,7,7)
sub2ind (@generated functions):
  0.241855 seconds (2.47 M allocations: 113.089 MB, 2.38% gc time)
sb2ind (lispy-version):
  0.002786 seconds (5 allocations: 176 bytes)

# dims = (8,8,8,8,8,8,8,8)
sub2ind (@generated functions):
 15.243509 seconds (150.99 M allocations: 10.250 GB, 3.94% gc time)
sb2ind (lispy-version):
 15.378129 seconds (167.77 M allocations: 11.500 GB, 4.39% gc time)

I was surprised, though, to see the @generated definition start allocating at a lower dimensionality than the lispy one.

@yuyichao (Contributor) commented Oct 5, 2015

Is this a dup of #5402? If so, should we close that one in favor of this (or the other way around)?

@timholy (Member, Author) commented Oct 6, 2015

This issue is about call-site splatting; that one seems focused on declaration-site splatting.
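To make the distinction concrete (illustrative names only, not from either issue):

# Declaration-site splatting: the *method* is declared with varargs.
g(xs...) = sum(xs)

# Call-site splatting: a tuple is spread into positional arguments at the call.
t = (1, 2, 3)
g(t...)       # splats t at the call site
g(1, 2, 3)    # the same call without splatting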

@timholy (Member, Author) commented Oct 6, 2015

@Jutho, I hadn't remembered that we'd gone up to such high dimensionality (see #10337 (comment)); but that's a complicated thread, so I may be misreading it.

I suspect this is just the MAX constants defined in inference.jl

(dratted tabbing and key binding messed up the original version of this)

@timholy (Member, Author) commented Feb 6, 2016

I can't find the link, but I recall that @eschnett discovered another example, which I seem to remember was something like this:

julia> @inline myadd(x,y) = x+y
myadd (generic function with 1 method)

julia> @inline myadd(x,y,z,a...) = myadd(x+y,z,a...)
myadd (generic function with 2 methods)

julia> @code_native myadd(1,2)
        .text
Filename: none
Source line: 1
        pushq   %rbp
        movq    %rsp, %rbp
Source line: 1
        addq    %rsi, %rdi
        movq    %rdi, %rax
        popq    %rbp
        ret

julia> @code_native myadd(1,2,3)
        .text
Filename: none
Source line: 1
        pushq   %rbp
        movq    %rsp, %rbp
Source line: 1
        pushq   %rbx
        subq    $24, %rsp
        movq    $2, -32(%rbp)
        movabsq $jl_pgcstack, %rbx
        movq    (%rbx), %rax
        movq    %rax, -24(%rbp)
        leaq    -32(%rbp), %rax
        movq    %rax, (%rbx)
        movq    $0, -16(%rbp)
        movq    (%rsi), %rcx
        movq    8(%rsi), %rax
        movq    (%rax), %rdi
Source line: 1
        movabsq $jl_box_int64, %rax
        addq    (%rcx), %rdi
Source line: 1
        movq    16(%rsi), %rcx
Source line: 1
        addq    (%rcx), %rdi
        callq   *%rax
        movq    -24(%rbp), %rcx
        movq    %rcx, (%rbx)
        addq    $24, %rsp
        popq    %rbx
        popq    %rbp
        ret

julia> myadd3(x,y,z) = (x+y)+z
myadd3 (generic function with 1 method)

julia> @code_native myadd3(1,2,3)
        .text
Filename: none
Source line: 1
        pushq   %rbp
        movq    %rsp, %rbp
Source line: 1
        addq    %rsi, %rdi
        leaq    (%rdi,%rdx), %rax
        popq    %rbp
        ret

Interestingly, @code_typed is almost identical for the two:

julia> @code_typed myadd3(1,2,3)
1-element Array{Any,1}:
 :($(Expr(:lambda, Any[:x,:y,:z], Any[Any[Any[:x,Int64,0],Any[:y,Int64,0],Any[:z,Int64,0]],Any[],Any[],Any[]], :(begin  # none, line 1:
        return (Base.box)(Base.Int,(Base.add_int)((Base.box)(Base.Int,(Base.add_int)(x::Int64,y::Int64)),z::Int64))
    end::Int64))))

julia> @code_typed myadd(1,2,3)
1-element Array{Any,1}:
 :($(Expr(:lambda, Any[:x,:y,:z,:(a::Any...)], Any[Any[Any[:x,Int64,0],Any[:y,Int64,0],Any[:z,Int64,0],Any[:a,Tuple{},0]],Any[],Any[],Any[]], :(begin 
        $(Expr(:meta, :inline)) # none, line 1:
        return (Base.box)(Base.Int,(Base.add_int)((Base.box)(Base.Int,(Base.add_int)(x::Int64,y::Int64)),z::Int64))
    end::Int64))))
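One way to narrow down where the boxing comes from (a suggestion, not something from the original discussion): compare the LLVM IR as well, since the jl_box_int64 call visible in the native code for myadd(1,2,3) should also appear there if the boxing is introduced by codegen / the varargs calling convention rather than by inference:

@code_llvm myadd(1, 2, 3)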

@stevengj (Member) commented:

I added a version of Tim's benchmark to BaseBenchmarks.

@stevengj removed the potential benchmark label on Oct 19, 2016