Perf: Specialize fn in rmap and rmaptype #80

Merged
merged 4 commits into main from kp/spec-rmap
Aug 3, 2021

Conversation

kpamnany
Contributor

Also avoid use of map on NamedTuples as it doesn't specialize f.

@kpamnany requested a review from simonbyrne July 23, 2021 20:55
Also avoid use of `map` on `NamedTuple`s as it doesn't specialize `f`.
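
For context, the pattern being applied is roughly the following minimal sketch (not the actual diff; the real `rmap`/`rmaptype` methods in the repo may differ):

# Sketch only: force specialization on the function argument with a
# `where {F}` type parameter, and rebuild NamedTuples explicitly instead of
# calling `map` on them (Base's NamedTuple `map` does not specialize `f`).
rmap(fn::F, X) where {F} = fn(X)
rmap(fn::F, X::Tuple) where {F} = map(x -> rmap(fn, x), X)
rmap(fn::F, X::NamedTuple{names}) where {F, names} =
    NamedTuple{names}(rmap(fn, Tuple(X)))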
@jakebolewski
Contributor

Since this is low-level and called quite often, can we start to add some microbenchmarks to show that this is an actual improvement and doesn't regress in the future with different compiler heuristics?
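
Such a microbenchmark could be as simple as the following sketch using BenchmarkTools (the input shape here is made up for illustration):

using BenchmarkTools

# Hypothetical microbenchmark: allocations reported for a call like this are
# a telltale sign of runtime dispatch in the recursive map.
X = (y = (x1 = 1.0, x2 = 2.0),)
@btime rmap(x -> x^3, $X)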

@kpamnany
Contributor Author

Since this is low-level and called quite often, can we start to add some microbenchmarks to show that this is an actual improvement and doesn't regress in the future with different compiler heuristics?

This would be nice, but I'm not sure how to replicate the problem with a microbenchmark. The profile below demonstrates it:

[profiler screenshot]

See the `jl_apply_generic` calls, which indicate runtime dispatch.

However, the compiler is handling most of the code snippets I've tried properly; I haven't found a simple way to make it give up as above yet. E.g.:

julia> @code_warntype map(x -> x^3, (x1 = 1, x2 = 2))
Variables
  #self#::Core.Const(map)
  f::Core.Const(var"#35#36"())
  nt::NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}
  nts::Tuple{}

Body::NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}
1 ─ %1  = Core.tuple(nt)::Tuple{NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}}
│   %2  = Core._apply_iterate(Base.iterate, Base.same_names, %1, nts)::Core.Const(true)
│   %3  = !%2::Core.Const(false)
└──       goto #3 if not %3
2 ─       Core.Const(:(Base.ArgumentError("Named tuple names do not match.")))
└──       Core.Const(:(Base.throw(%5)))
3 ┄ %7  = Core.apply_type(Base.NamedTuple, $(Expr(:static_parameter, 1)))::Core.Const(NamedTuple{(:x1, :x2), T} where T<:Tuple)
│   %8  = Core.tuple(f)::Core.Const((var"#35#36"(),))
│   %9  = Core.tuple(nt)::Tuple{NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}}
│   %10 = Core._apply_iterate(Base.iterate, Core.tuple, %9, nts)::Tuple{NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}}
│   %11 = Base.map(Base.Tuple, %10)::Tuple{Tuple{Int64, Int64}}
│   %12 = Core._apply_iterate(Base.iterate, Base.map, %8, %11)::Tuple{Int64, Int64}
│   %13 = (%7)(%12)::NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}
└──       return %13

I'm open to ideas.

@jakebolewski
Contributor

jakebolewski commented Jul 27, 2021

I see. The issue I have is that there are really only two changes that are significant (lines 22 and 24 in the diff); the others are just syntactic and don't do anything.

It might be better to just force-inline this chain of method calls to get somewhat predictable performance/behavior from the compiler; that would be a better solution IMO.

@kpamnany
Contributor Author

kpamnany commented Jul 27, 2021

I see. The issue I have is that there are really only two changes that are significant (lines 22 and 24 in the diff); the others are just syntactic and don't do anything.

My understanding is that specializing the function (with the `where F`) is usually all you need to do, so these other changes are not just syntactic. It is because `map` on `NamedTuple`s does not do this (for latency reasons) that lines 22 and 24 are necessary.
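
For reference, Base's `map` over `NamedTuple`s looks roughly like this (paraphrased, consistent with the `@code_warntype` output above; the exact source differs across Julia versions). Note that `f` carries no `::F` type parameter, so the compiler is free not to specialize on it:

# Paraphrase of Base's namedtuple.jl method, not a verbatim copy.
function map(f, nt::NamedTuple{names}, nts::NamedTuple...) where names
    # `f` is deliberately left unconstrained (a latency trade-off), so this
    # call chain may fall back to runtime dispatch on `f`.
    NamedTuple{names}(map(f, map(Tuple, (nt, nts...))...))
end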

It might be better to just force-inline this chain of method calls to get somewhat predictable performance/behavior from the compiler; that would be a better solution IMO.

Compiler inlining decisions use a cost function (see JuliaLang/julia#27857), and inlining can only be forced if the calling signature is concrete (see JuliaLang/julia#29258). I'm not sure what that means for something like `fn`, so I'm not sure how to force inlining.

Edit: I'm not seeing the issue with this fix. What do you not like about it?

@jakebolewski
Contributor

There is no issue with it; it just seems very finicky to rely on these specialization changes to trigger inlining. You can't look at `@code_typed` either, because it always specializes, so it might be misleading. Since this is such a low-level perf issue (and will be called everywhere), it would be nice to come up with a way to enforce (to the degree possible) that these methods won't be a surprise performance trap.

@jakebolewski
Contributor

jakebolewski commented Jul 27, 2021

The `where {F}` only matters for specialization when the function `F` is not called in the method (just passed through). That happens in some places here, but the bottom methods in the call stack will be specialized even without that form. But is this really why there is a generic fallback in the flux example?

@kpamnany
Contributor Author

@vchuravy: thoughts please?

@vchuravy

Also summoning @aviatesk. I went looking through the compiler to find the place where this decision is made, but I couldn't find it. Cthulhu and profiling both agreed, and it is worthwhile to remember that inference is heuristic and sometimes context-sensitive. We want to expose inference remarks in Cthulhu, which will hopefully shed some more light on this. It might be related to recursion detection (wasn't one definition of `rmap` calling `rmap` in a closure?).

Broadly speaking, `f(g) = g()` is not guaranteed to specialize (and I have heard Jeff say that we will deoptimize that more often than not), and the form `f(g::G) where G = g()` is the preferred one if you require something to specialize. I am not sure why `map(f, ::NT)` is not using that form, but I would assume that is to reduce the latency impact.
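
Spelled out, the two forms being contrasted are (illustrative only):

# Whether the first form specializes on `g` is left to compiler heuristics
# (and may be deoptimized); the second form forces a fresh specialization
# for each concrete type of `g`.
call_heuristic(g) = g()
call_forced(g::G) where {G} = g()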

@jakebolewski
Contributor

jakebolewski commented Jul 28, 2021

Thanks for the info @vchuravy

Broadly speaking, `f(g) = g()` is not guaranteed to specialize

Makes sense (given the general move towards better compile times), but the compiler docs might need to be updated if this is increasingly the case (the dev docs may already cover it).

I guess my point would be to make this optimization more robust/obvious by just `@inline`-ing the call stack, which should basically force specialization. Since this is mapping core operations (`+`, `-`, `*`, etc.) over `Field` objects, this is probably the behavior we want to enforce to the degree possible (specialize & inline).
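
Concretely, that suggestion might look like the sketch below (hypothetical; whether `@inline` plus forced specialization is the right fix here is exactly what is being discussed):

# Hypothetical: an inlining hint on the hot recursive method, combined with
# forced specialization on `fn`. `@inline` is a hint, not a guarantee.
@inline rmap(fn::F, X::Tuple) where {F} = map(x -> rmap(fn, x), X)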

@aviatesk

Also summoning @aviatesk.

Okay, I will tell you what I know ;)

As @vchuravy said, these problems relate to the recursion-detection heuristic within type inference.
"Recursion" here means recursive method calls within a static call graph; it doesn't mean recursive calls of the same generic function. Roughly speaking, if the same method call signature appears again within its own call graph, type inference may give up, because otherwise it cannot guarantee its own termination (and that can cause problems for the succeeding optimizations).
So `@code_warntype map(x -> x^3, (x1 = 1, x2 = 2))` might not be the best test target: it certainly recurses into the same generic function `map`, but the method calls within the call graph are all different.

Rather, I suggest using `rmap(x -> x^3, (y = (x1 = 1, x2 = 2),))` as the simplest target here, because it includes the same recursive method call in the call stack.
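
The shape of the recursion being exercised is roughly the following (a hypothetical reconstruction of the pre-PR definition, inferred from the JETTest trace below):

# Hypothetical pre-PR shape: the closure `x -> rmap(fn, x)` passed to `map`
# makes the same method signature reappear inside its own call graph, which
# is what trips inference's recursion limit.
rmap(fn, X::NamedTuple) = map(x -> rmap(fn, x), X)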

Now you can use Cthulhu or JETTest.jl, whichever you like, and you will find some optimization failures happening within the previous implementations of `rmap`.

If you use Cthulhu, you will see some callsites are annotated as `< limited >`, which basically means inference gives up on the recursive callsite.

julia> @descend optimize=false rmap(x -> x^3, (y = (x1 = 1, x2 = 2),))

# after descending into a bit of madness ...
│ ─ %-1  = invoke map(::var"#15#16"{var"#19#20"},::Tuple{NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}})::Tuple{NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}}
1 ─      nothing::Core.Const(nothing)
│   @ tuple.jl:220 within `map`
│   %2 = Base.getindex(t, 1)::NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}
│   %3 = (f)(%2)::NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}
│   %4 = Core.tuple(%3)::Tuple{NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}}
└──      return %4
Select a call to descend into or ↩ to ascend. [q]uit. [b]ookmark.
Toggles: [o]ptimize, [w]arn, [v]erbose printing for warntype code, [d]ebuginfo, [i]nlining costs, [s]yntax highlight for Source/LLVM/Native.
Show: [S]ource code, [A]ST, [L]LVM IR, [N]ative code
Actions: [E]dit source code, [R]evise and redisplay
Advanced: dump [P]arams cache.
   %2  = < constprop > getindex(::Tuple{NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}},::Core.Const(1))::NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}
   %3  = < limited > #15(::NamedTuple{(:x1, :x2), Tuple{Int64, Int64}})::NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}

I recommend you use JETTest, because it's automated and can be used in your test suite:

# before this PR
julia> @test_nodispatch rmap(x -> x^3, (y = (x1 = 1, x2 = 2),))
Dispatch Test Failed at REPL[24]:1
  Expression: #= REPL[24]:1 =# JETTest.@test_nodispatch rmap((x->begin
                x ^ 3
            end), (y = (x1 = 1, x2 = 2),))
  ═════ 4 possible errors found ═════
  ┌ @ REPL[17]:1 Main.map(#15, X)
  │┌ @ namedtuple.jl:208 Base.map(Core.tuple(f), Base.map(Base.Tuple, Core.tuple(Core.tuple(nt), nts...))...)
  ││┌ @ tuple.jl:220 f(Base.getindex(t, 1))
  │││┌ @ REPL[17]:1 Main.rmap(Core.getfield(#self#, :fn), x)
  ││││┌ @ REPL[17]:1 Main.map(#15, X)
  │││││┌ @ namedtuple.jl:204 map(::var"#15#16"{var"#23#24"}, ::NamedTuple{(:x1, :x2), Tuple{Int64, Int64}})
  ││││││ failed to optimize: map(::var"#15#16"{var"#23#24"}, ::NamedTuple{(:x1, :x2), Tuple{Int64, Int64}})
  │││││└─────────────────────
  ││││┌ @ REPL[17]:1 rmap(::var"#23#24", ::NamedTuple{(:x1, :x2), Tuple{Int64, Int64}})
  │││││ failed to optimize: rmap(::var"#23#24", ::NamedTuple{(:x1, :x2), Tuple{Int64, Int64}})
  ││││└──────────────
  │││┌ @ REPL[17]:1 (::var"#15#16"{var"#23#24"})(::NamedTuple{(:x1, :x2), Tuple{Int64, Int64}})
  ││││ failed to optimize: (::var"#15#16"{var"#23#24"})(::NamedTuple{(:x1, :x2), Tuple{Int64, Int64}})
  │││└──────────────
  ││┌ @ tuple.jl:220 map(::var"#15#16"{var"#23#24"}, ::Tuple{NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}})
  │││ failed to optimize: map(::var"#15#16"{var"#23#24"}, ::Tuple{NamedTuple{(:x1, :x2), Tuple{Int64, Int64}}})
  ││└────────────────
  
ERROR: There was an error during testing

# after this PR
julia> @test_nodispatch rmap(x -> x^3, (y = (x1 = 1, x2 = 2),))
Test Passed
  Expression: #= REPL[31]:1 =# JETTest.@test_nodispatch rmap((x->begin
                x ^ 3
            end), (y = (x1 = 1, x2 = 2),))

In JETTest, "failed to optimize" usually corresponds to `< limited >` callsites in Cthulhu.

My understanding is that specializing the function (with the `where F`) is usually all you need to do

As @jakebolewski explained above, the optimization benefit of this PR does not come from this; it seems to come from the other functional changes, i.e. converting the outputs of the recursive calls of `rmap` into `NamedTuple`s, which vastly helps the compiler infer the types of the recursive method calls.
So you have already done a very good job on that part, and I think you can just strip away the `where F` parts?
I confirmed that JETTest passes on the simplest target example described above even if those `where` annotations are stripped away.

@jakebolewski
Contributor

jakebolewski commented Jul 28, 2021

Awesome write-up @aviatesk; thanks for that breakdown and for the demonstration of JETTest.jl (which I didn't know about)! It is definitely something we can and will integrate into the ClimaCore test suite; this will be hugely useful for spotting regressions.

Adding JETTest checks is probably more useful than microbenchmarks at this stage.
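
A check along those lines could be as small as the following sketch (reusing the invocation shown above; the `@testset` wrapper is just standard `Test` usage):

using Test, JETTest

@testset "rmap avoids runtime dispatch" begin
    # Fails if JET detects runtime dispatch or an optimization failure
    # anywhere in the call graph of this invocation.
    @test_nodispatch rmap(x -> x^3, (y = (x1 = 1, x2 = 2),))
end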

@simonbyrne
Member

Thanks @aviatesk: JETTest is very nice!

@simonbyrne
Member

bors r+

@bors
Contributor

bors bot commented Aug 3, 2021

bors bot merged commit 0f6e922 into main Aug 3, 2021
bors bot deleted the kp/spec-rmap branch August 3, 2021 20:15