
Tuple-dialect has performance consequences (anonymous functions?) #17126

Closed
timholy opened this issue Jun 26, 2016 · 10 comments
Labels
performance Must go faster

Comments

timholy commented Jun 26, 2016

This may be #15276, but it seems that issue has various gradations of challenge, so perhaps additional examples are useful.

This contains test code implementing the equivalent of map(s->1:s, size(A)) using 3 different strategies: one based on ntuple, several variants of map, and one that manually assembles the tuples. On my laptop, the variation in performance of these methods appears to be >5x; presumably, we'd prefer that they all be the same (and all good!). Naturally, the best of these is also the most laborious (the manual method), and it beats all the others by a factor of 2.

This is admittedly nitty-gritty microoptimization, but was necessary to get #16260 to pass nanosoldier; as I work on the cleanup (#16973), I thought I'd better take the time to document some of the challenges.

# Comparing 3 different implementations of indices
# indices1: calling "ntuple" (a local implementation you can play with)
@inline indices1{T,N}(A::AbstractArray{T,N}) = myntuple(d->1:size(A,d), Val{N})
# indices2: using "map" (a local implementation you can play with)
# Note this is largely academic, since you'd rather define indices2(A)
# in terms of indices2(A, d). Informative nonetheless.
@inline indices2a(A) = mymap_a(s->1:s, size(A))
@inline indices2b(A) = mymap_b(s->1:s, size(A))
@inline indices2c(A) = mymap_c(s->1:s, size(A))
# indices3: directly unrolling the function
@inline indices3a(A) = _indices3a((), A)
@inline _indices3a{T,N}(out::NTuple{N}, A::AbstractArray{T,N}) = out
@inline _indices3a(out, A) = _indices3a((out..., 1:size(A, length(out)+1)), A)
@inline indices3b(A) = _indices3b(size(A))
@inline _indices3b(sz::Tuple) = _indices3b(sz...)
@inline _indices3b() = ()
@inline _indices3b(s::Integer, sz...) = (1:s, _indices3b(sz...)...)

# Like ntuple, but without any checking
@inline myntuple{F,N}(f::F, v::Type{Val{N}}) = _myntuple((), f, v)
@inline _myntuple{F,N}(out::NTuple{N}, f::F, ::Type{Val{N}}) = out
@inline _myntuple{F,N}(out, f::F, v::Type{Val{N}}) = _myntuple((out..., f(length(out)+1)), f, v)

# Like map
@inline mymap_a(f, t) = (f(t[1]), mymap_a(f, Base.tail(t))...)
@inline mymap_a(f, ::Tuple{}) = ()

# Alternative map
@inline mymap_b(f, t) = (_mymap_b(f, t...)...,)
@inline _mymap_b(f) = ()
@inline _mymap_b(f, h, t...) = (f(h), _mymap_b(f, t...)...)

# Another alternative map
@inline mymap_c(f, t) = _mymap_c((), f, t...)
@inline _mymap_c(out, f) = out
@inline _mymap_c(out, f, h, t...) = _mymap_c((out..., f(h)), f, t...)

A = rand(2,3,4,5)
using BenchmarkTools
println("ntuple implementation")
@show @benchmark indices1($A)
println("map implementations")
@show @benchmark indices2a($A)
@show @benchmark indices2b($A)
@show @benchmark indices2c($A)
println("manually-inlined implementation")
@show @benchmark indices3a($A)
@show @benchmark indices3b($A)

Results:

julia> include("/tmp/dialect.jl")
ntuple implementation
@benchmark(indices1($(Expr(:$, :A)))) = BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     999
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  16.00 bytes
  allocs estimate:  1
  minimum time:     10.00 ns (0.00% GC)
  median time:      11.00 ns (0.00% GC)
  mean time:        11.82 ns (5.84% GC)
  maximum time:     1.32 μs (97.95% GC)
map implementations
@benchmark(indices2a($(Expr(:$, :A)))) = BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1000
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     5.00 ns (0.00% GC)
  median time:      5.00 ns (0.00% GC)
  mean time:        5.01 ns (0.00% GC)
  maximum time:     17.00 ns (0.00% GC)
@benchmark(indices2b($(Expr(:$, :A)))) = BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1000
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     4.00 ns (0.00% GC)
  median time:      4.00 ns (0.00% GC)
  mean time:        4.03 ns (0.00% GC)
  maximum time:     17.00 ns (0.00% GC)
@benchmark(indices2c($(Expr(:$, :A)))) = BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1000
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     4.00 ns (0.00% GC)
  median time:      4.00 ns (0.00% GC)
  mean time:        4.05 ns (0.00% GC)
  maximum time:     24.00 ns (0.00% GC)
manually-inlined implementation
@benchmark(indices3a($(Expr(:$, :A)))) = BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1000
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     2.00 ns (0.00% GC)
  median time:      2.00 ns (0.00% GC)
  mean time:        2.01 ns (0.00% GC)
  maximum time:     13.00 ns (0.00% GC)
@benchmark(indices3b($(Expr(:$, :A)))) = BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1000
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     5.00 ns (0.00% GC)
  median time:      5.00 ns (0.00% GC)
  mean time:        5.02 ns (0.00% GC)
  maximum time:     19.00 ns (0.00% GC)
Jutho commented Jun 26, 2016

In mymap_a, the recursive call should probably be mymap_a instead of plain map. Does this affect the timing?

timholy commented Jun 27, 2016

Great catch as always, @Jutho. It doesn't seem to affect the timing, though; the map call was there because the code was copy/pasted from Base's map, so the two were essentially identical. But it certainly would have been confusing for anyone making modifications and trying to discover the cause, so thanks. EDIT: I've edited the code above to fix this error.

@kshyatt added the performance label Jun 27, 2016
timholy commented Jun 29, 2016

Worth noting that making the top-level functions @pure gives them all the performance of indices3. I'm a bit unclear about whether this would be considered an abuse of @pure (I suspect the answer is "yes," since I think it's intended to be reserved for type-computations).
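
For concreteness, this is the kind of annotation meant above (illustrative only, with a hypothetical name; per the discussion below, @pure turned out not to be safe here):

# Hypothetical sketch: marking the top-level entry point @pure lets inference
# constant-fold the whole tuple computation. NOT safe in general (see below).
Base.@pure indices2a_pure(A) = mymap_a(s->1:s, size(A))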

andyferris commented:

(I suspect the answer is "yes," since I think it's intended to be reserved for type-computations).

Hmm... That sounds a bit strict.

The threads I read defined "pure" as any function the programmer is happy to let the compiler run at any time (once at compile time, more than once at run time, etc.) without worrying about the consequences for program logic; the specific implementation of the compiler may affect speed and inference, but not the program's output.
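
To make that definition concrete, a hypothetical pair of examples: under this reading,

tuple_len(t::Tuple) = length(t)   # depends only on its argument; safe to run any time
stamp() = time_ns()               # depends on *when* it runs; not pure

the first could be evaluated once at compile time without changing program output, while the second could not.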

andyferris commented:

Follow-up: obviously, having it be a function of types makes it the clearest candidate for running at compile time, but inference now tracks many constant values too...

timholy commented Jul 1, 2016

I guess part of my thinking was that it would be nice to mark ntuple and map as pure, but an uncontrolled function argument makes that impossible. That concern doesn't apply to indices, since the function is known to be free of side effects.

Would be great to be able to say "do the expansion at compile time, but save the evaluation of the user function for run time" (naturally, @generated functions allow this).
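
A minimal sketch of that pattern (gmap is a hypothetical name, not a proposal for Base): in a @generated function the body runs at compile time with the arguments bound to their types, so the tuple traversal can be unrolled from the tuple's length, while the generated expression still calls f at run time.

# Inside the body, t is the *type* of the tuple, so its length is known at
# compile time; the returned expression is the fully unrolled
# (f(t[1]), f(t[2]), ..., f(t[N])) evaluated at run time.
@generated function gmap(f, t::Tuple)
    N = length(t.parameters)
    return Expr(:tuple, [:(f(t[$i])) for i = 1:N]...)
end

With this, gmap(s->1:s, size(A)) expands to a straight tuple expression with no recursion or splatting left at run time.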

blakejohnson commented:

I think you are safe to mark indices as @pure.

timholy commented Aug 1, 2016

Based on discussion in #17729, @pure isn't safe here. But in 6001540 I discovered that the source of the variation was our implementation of size (lack of sufficient @inline), NOT the particular way the tuple manipulations were written. That's reassuring, though wow, what a huge waste of time over a missing @inline statement or two...

ntuple still performs very badly compared to the rest, but that can be a separate issue.
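
To illustrate the failure mode with a hypothetical array type (not the actual commit): if the size method at the bottom of the call chain doesn't inline, none of the tuple code layered on top of it can collapse, no matter how carefully that tuple code is written.

# Hypothetical wrapper type; the @inline on size is the load-bearing part.
# Without it, the indices* chains above may fail to collapse to straight-line code.
immutable MyArray{T,N} <: AbstractArray{T,N}
    data::Array{T,N}
end
@inline Base.size(A::MyArray) = size(A.data)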

@timholy closed this as completed Aug 1, 2016
KristofferC commented Aug 1, 2016

It feels like we (well, you, @timholy) keep finding cases where adding explicit @inlines significantly improves performance. Instead of playing whack-a-mole, doesn't this just mean that our inlining heuristics could be improved? Is there a common case (lispy tuple-style coding) that we could teach the inliner to reason about better?

timholy commented Aug 1, 2016

I would love to see improvements in our inlining heuristics. I don't understand them at all, so that's not an area I've looked at or plan to look at.

I try to be fairly conservative about not adding @inline unless I'm sure it makes a difference, but in some cases this comes back to bite me.
