Perf: Specialize fn in rmap and rmaptype #80

Merged
merged 4 commits into main from kp/spec-rmap
Aug 3, 2021

Conversation

kpamnany
Contributor

Also avoid use of map on NamedTuples as it doesn't specialize f.

@kpamnany requested a review from simonbyrne July 23, 2021 20:55
Also avoid use of `map` on `NamedTuple`s as it doesn't specialize `f`.
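
For context, the pattern being applied is roughly the following minimal sketch (not the actual diff; the real `rmap`/`rmaptype` methods in the repo may differ):

# Sketch only: force specialization on the function argument with a
# `where {F}` type parameter, and rebuild NamedTuples explicitly instead of
# calling `map` on them (Base's NamedTuple `map` does not specialize `f`).
rmap(fn::F, X) where {F} = fn(X)
rmap(fn::F, X::Tuple) where {F} = map(x -> rmap(fn, x), X)
rmap(fn::F, X::NamedTuple{names}) where {F, names} =
    NamedTuple{names}(rmap(fn, Tuple(X)))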
@jakebolewski
Contributor

Since this is low-level and called quite often, can we start to add some microbenchmarks to show that this is an actual improvement and doesn't regress in the future with different compiler heuristics?
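
Such a microbenchmark could be as simple as the following sketch using BenchmarkTools (the input shape here is made up for illustration):

using BenchmarkTools

# Hypothetical microbenchmark: allocations reported for a call like this are
# a telltale sign of runtime dispatch in the recursive map.
X = (y = (x1 = 1.0, x2 = 2.0),)
@btime rmap(x -> x^3, $X)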

@kpamnany
Contributor Author

Since this is low-level and called quite often, can we start to add some microbenchmarks to show that this is an actual improvement and doesn't regress in the future with different compiler heuristics?

This would be nice, but I'm not sure how to replicate the problem with a microbenchmark. The profile below demonstrates it:

[profiler screenshot]

See the `jl_apply_generic` calls, which indicate runtime dispatch.

However, the compiler is handling most of the code snippets I've tried properly; I haven't found a simple way to make it give up as above yet. E.g.:

julia> @code_warntype map(x -> x^3, (x1 = 1, x2 = 2))
Variables
  #self#::Core.Const(map)
  f::Core.Const(var"#35#36"())
  nt::NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}
  nts::Tuple{}

Body::NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}
1 ─ %1  = Core.tuple(nt)::Tuple{NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}}
│   %2  = Core._apply_iterate(Base.iterate, Base.same_names, %1, nts)::Core.Const(true)
│   %3  = !%2::Core.Const(false)
└──       goto #3 if not %3
2 ─       Core.Const(:(Base.ArgumentError("Named tuple names do not match.")))
└──       Core.Const(:(Base.throw(%5)))
3 ┄ %7  = Core.apply_type(Base.NamedTuple, $(Expr(:static_parameter, 1)))::Core.Const(NamedTuple{(:x1, :x2), T} where T<:Tuple)
│   %8  = Core.tuple(f)::Core.Const((var"#35#36"(),))
│   %9  = Core.tuple(nt)::Tuple{NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}}
│   %10 = Core._apply_iterate(Base.iterate, Core.tuple, %9, nts)::Tuple{NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}}
│   %11 = Base.map(Base.Tuple, %10)::Tuple{Tuple{Int64, Int64}}
│   %12 = Core._apply_iterate(Base.iterate, Base.map, %8, %11)::Tuple{Int64, Int64}
│   %13 = (%7)(%12)::NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}
└──       return %13

I'm open to ideas.

@jakebolewski
Contributor

jakebolewski commented Jul 27, 2021

I see. The issue I have is that there are really only two changes that are significant (lines 22 and 24 in the diff); the others are just syntactic and don't do anything.

It might be better to just force-inline this chain of method calls to get somewhat predictable performance/behavior from the compiler; that would be a better solution IMO.

@kpamnany
Contributor Author

kpamnany commented Jul 27, 2021

I see. The issue I have is that there are really only two changes that are significant (lines 22 and 24 in the diff); the others are just syntactic and don't do anything.

My understanding is that specializing the function (with the `where F`) is usually all you need to do, so these other changes are not just syntactic. It is because `map` on `NamedTuple`s does not do this (for latency reasons) that lines 22 and 24 are necessary.
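
For reference, Base's `map` over `NamedTuple`s looks roughly like this (paraphrased, consistent with the `@code_warntype` output above; the exact source differs across Julia versions). Note that `f` carries no `::F` type parameter, so the compiler is free not to specialize on it:

# Paraphrase of Base's namedtuple.jl method, not a verbatim copy.
function map(f, nt::NamedTuple{names}, nts::NamedTuple...) where names
    # `f` is deliberately left unconstrained (a latency trade-off), so this
    # call chain may fall back to runtime dispatch on `f`.
    NamedTuple{names}(map(f, map(Tuple, (nt, nts...))...))
end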

It might be better to just force-inline this chain of method calls to get somewhat predictable performance/behavior from the compiler; that would be a better solution IMO.

Compiler inlining decisions use a cost function (see JuliaLang/julia#27857), and inlining can only be forced if the calling signature is concrete (see JuliaLang/julia#29258). I'm not sure what that means for something like `fn`, so I'm not sure how to force inlining.

Edit: I'm not seeing the issue with this fix. What do you not like about it?

@jakebolewski
Contributor

There is no issue with it; it just seems very finicky to rely on these specialization changes to trigger inlining. You can't look at `@code_typed` either, because it always specializes, so it might be misleading. Since this is such a low-level perf issue (and will be called everywhere), it would be nice to come up with a way to enforce (to the degree possible) that these methods won't be a surprise performance trap.

@jakebolewski
Contributor

jakebolewski commented Jul 27, 2021

The `where {F}` only matters for specialization when the function `F` is not called in the method (just passed through). That happens in some places here, but the bottom methods in the call stack will be specialized even without that form. But is this really why there is a generic fallback in the flux example?

@kpamnany
Contributor Author

@vchuravy: thoughts please?

@vchuravy

Also summoning @aviatesk. I went looking through the compiler to find the place where this decision is made, but I couldn't find it. Cthulhu and profiling both agreed, and it is worthwhile to remember that inference is heuristic and sometimes context-sensitive. We want to expose inference remarks in Cthulhu, which will hopefully shed some more light on this. It might be related to recursion detection (wasn't one definition of `rmap` calling `rmap` in a closure?).

Broadly speaking, `f(g) = g()` is not guaranteed to specialize (and I have heard Jeff say that we will deoptimize that more often than not), and the form `f(g::G) where G = g()` is the preferred one if you require something to specialize. I am not sure why `map(f, ::NT)` is not using that form, but I would assume that is to reduce the latency impact.
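
Spelled out, the two forms being contrasted are (illustrative only):

# Whether the first form specializes on `g` is left to compiler heuristics
# (and may be deoptimized); the second form forces a fresh specialization
# for each concrete type of `g`.
call_heuristic(g) = g()
call_forced(g::G) where {G} = g()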

@jakebolewski
Contributor

jakebolewski commented Jul 28, 2021

Thanks for the info @vchuravy

Broadly speaking, `f(g) = g()` is not guaranteed to specialize

Makes sense (given the general move towards better compile times), but the compiler docs might need to be updated if this is increasingly the case (the dev docs may already cover it).

I guess my point would be to make this optimization more robust/obvious by just `@inline`-ing the call stack, which should basically force specialization. Since this is mapping core operations (`+`, `-`, `*`, etc.) over `Field` objects, this is probably the behavior we want to enforce to the degree possible (specialize & inline).
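
Concretely, that suggestion might look like the sketch below (hypothetical; whether `@inline` plus forced specialization is the right fix here is exactly what is being discussed):

# Hypothetical: an inlining hint on the hot recursive method, combined with
# forced specialization on `fn`. `@inline` is a hint, not a guarantee.
@inline rmap(fn::F, X::Tuple) where {F} = map(x -> rmap(fn, x), X)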

@aviatesk

Also summoning @aviatesk.

Okay, I will tell you what I know ;)

As @vchuravy said, these problems relate to the recursion-detection heuristic within type inference.
"Recursion" here means recursive method calls within a static call graph; it doesn't mean recursive calls of the same generic function. Roughly speaking, if the same method call signature appears again within its own call graph, type inference may give up, because otherwise it cannot guarantee its own termination (and that can cause problems for the succeeding optimizations).
So `@code_warntype map(x -> x^3, (x1 = 1, x2 = 2))` might not be the best test target: it certainly recurses into the same generic function `map`, but the method calls within the call graph are all different.

Rather, I suggest using `rmap(x -> x^3, (y = (x1 = 1, x2 = 2),))` as the simplest target here, because it includes the same recursive method call in the call stack.
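
The shape of the recursion being exercised is roughly the following (a hypothetical reconstruction of the pre-PR definition, inferred from the JETTest trace below):

# Hypothetical pre-PR shape: the closure `x -> rmap(fn, x)` passed to `map`
# makes the same method signature reappear inside its own call graph, which
# is what trips inference's recursion limit.
rmap(fn, X::NamedTuple) = map(x -> rmap(fn, x), X)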

Now you can use Cthulhu or JETTest.jl, whichever you like, and you will find some optimization failures happening within the previous implementations of `rmap`.

If you use Cthulhu, you will see some callsites are annotated as `< limited >`, which basically means inference gives up on the recursive callsite.

julia> @descend optimize=false rmap(x -> x^3, (y = (x1 = 1, x2 = 2),))

# after descending into a bit of madness ...
│ ─ %-1  = invoke map(::var"#15#16"{var"#19#20"},::Tuple{NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}})::Tuple{NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}}
1 ─      nothing::Core.Const(nothing)
│   @ tuple.jl:220 within `map`
│   %2 = Base.getindex(t, 1)::NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}
│   %3 = (f)(%2)::NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}
│   %4 = Core.tuple(%3)::Tuple{NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}}
└──      return %4
Select a call to descend into or ↩ to ascend. [q]uit. [b]ookmark.
Toggles: [o]ptimize, [w]arn, [v]erbose printing for warntype code, [d]ebuginfo, [i]nlining costs, [s]yntax highlight for Source/LLVM/Native.
Show: [S]ource code, [A]ST, [L]LVM IR, [N]ative code
Actions: [E]dit source code, [R]evise and redisplay
Advanced: dump [P]arams cache.
   %2  = < constprop > getindex(::Tuple{NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}},::Core.Const(1))::NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}
   %3  = < limited > #15(::NamedTuple{(:x1, :x2), Tuple{Int64, Int64}})::NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}

I recommend you use JETTest, because it's automated and can be used in your test suite:

# before this PR
julia> @test_nodispatch rmap(x -> x^3, (y = (x1 = 1, x2 = 2),))
Dispatch Test Failed at REPL[24]:1
  Expression: #= REPL[24]:1 =# JETTest.@test_nodispatch rmap((x->begin
                x ^ 3
            end), (y = (x1 = 1, x2 = 2),))
  ═════ 4 possible errors found ═════
  ┌ @ REPL[17]:1 Main.map(#15, X)
  │┌ @ namedtuple.jl:208 Base.map(Core.tuple(f), Base.map(Base.Tuple, Core.tuple(Core.tuple(nt), nts...))...)
  ││┌ @ tuple.jl:220 f(Base.getindex(t, 1))
  │││┌ @ REPL[17]:1 Main.rmap(Core.getfield(#self#, :fn), x)
  ││││┌ @ REPL[17]:1 Main.map(#15, X)
  │││││┌ @ namedtuple.jl:204 map(::var"#15#16"{var"#23#24"}, ::NamedTuple{(:x1, :x2), Tuple{Int64, Int64}})
  ││││││ failed to optimize: map(::var"#15#16"{var"#23#24"}, ::NamedTuple{(:x1, :x2), Tuple{Int64, Int64}})
  │││││└─────────────────────
  ││││┌ @ REPL[17]:1 rmap(::var"#23#24", ::NamedTuple{(:x1, :x2), Tuple{Int64, Int64}})
  │││││ failed to optimize: rmap(::var"#23#24", ::NamedTuple{(:x1, :x2), Tuple{Int64, Int64}})
  ││││└──────────────
  │││┌ @ REPL[17]:1 (::var"#15#16"{var"#23#24"})(::NamedTuple{(:x1, :x2), Tuple{Int64, Int64}})
  ││││ failed to optimize: (::var"#15#16"{var"#23#24"})(::NamedTuple{(:x1, :x2), Tuple{Int64, Int64}})
  │││└──────────────
  ││┌ @ tuple.jl:220 map(::var"#15#16"{var"#23#24"}, ::Tuple{NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}})
  │││ failed to optimize: map(::var"#15#16"{var"#23#24"}, ::Tuple{NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}})
  ││└────────────────
  
ERROR: There was an error during testing

# after this PR
julia> @test_nodispatch rmap(x -> x^3, (y = (x1 = 1, x2 = 2),))
Test Passed
  Expression: #= REPL[31]:1 =# JETTest.@test_nodispatch rmap((x->begin
                x ^ 3
            end), (y = (x1 = 1, x2 = 2),))

In JETTest, "failed to optimize" usually corresponds to `< limited >` callsites in Cthulhu.

My understanding is that specializing the function (with the `where F`) is usually all you need to do

As @jakebolewski explained above, the optimization benefit of this PR does not come from this; it seems to come from the other functional changes, i.e. converting the outputs of the recursive calls of `rmap` into `NamedTuple`s, which vastly helps the compiler infer the types of the recursive method calls.
So you have already done a very good job on that part, and I think you can just strip away the `where F` parts?
I confirmed that JETTest passes on the simplest target example described above even if those `where` annotations are stripped away.

@jakebolewski
Contributor

jakebolewski commented Jul 28, 2021

Awesome write-up @aviatesk; thanks for that breakdown and for the demonstration of JETTest.jl (which I didn't know about)! It is definitely something we can and will integrate into the ClimaCore test suite; this will be hugely useful for spotting regressions.

Adding JETTest checks is probably more useful than microbenchmarks at this stage.
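
A check along those lines could be as small as the following sketch (reusing the invocation shown above; the `@testset` wrapper is just standard `Test` usage):

using Test, JETTest

@testset "rmap avoids runtime dispatch" begin
    # Fails if JET detects runtime dispatch or an optimization failure
    # anywhere in the call graph of this invocation.
    @test_nodispatch rmap(x -> x^3, (y = (x1 = 1, x2 = 2),))
end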

@simonbyrne
Member

Thanks @aviatesk: JETTest is very nice!

@simonbyrne
Member

bors r+

@bors
Contributor

bors bot commented Aug 3, 2021

bors bot merged commit 0f6e922 into main Aug 3, 2021
bors bot deleted the kp/spec-rmap branch August 3, 2021 20:15