Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prolong2mortars! with P4estMesh allocates sometimes #628

Closed
Tracked by #720
efaulhaber opened this issue Jun 2, 2021 · 7 comments · Fixed by #778
Closed
Tracked by #720

prolong2mortars! with P4estMesh allocates sometimes #628

efaulhaber opened this issue Jun 2, 2021 · 7 comments · Fixed by #778
Labels
performance We are greedy

Comments

@efaulhaber
Copy link
Member

The following has been done on the efaulhaber:p4est-non-conforming branch, which will be merged in #618.

When I start the REPL and execute

using Trixi

trixi_include("examples/2d/elixir_advection_p4est_non_conforming_flag.jl")

the timers report:

prolong2mortars            106   7.42ms  32.8%  70.0μs   6.29MiB  94.1%  60.8KiB

The benchmark code

using BenchmarkTools

u = Trixi.wrap_array(copy(sol.u[end]), mesh, equations, solver, semi.cache);
@benchmark Trixi.prolong2mortars!($semi.cache, $u, $mesh, $equations, $solver.mortar, $solver)

reports:

BenchmarkTools.Trial: 
  memory estimate:  60.75 KiB
  allocs estimate:  1584
  --------------
  minimum time:     65.900 μs (0.00% GC)
  median time:      67.700 μs (0.00% GC)
  mean time:        75.865 μs (9.76% GC)
  maximum time:     8.376 ms (98.95% GC)
  --------------
  samples:          10000
  evals/sample:     1

Now, I go to solvers/dg_p4est/dg_2d.jl and remove @threaded from

  size_ = (nnodes(dg), nnodes(dg))

  @threaded for mortar in eachmortar(dg, cache)
    small_indices = node_indices[1, mortar]
    large_indices = node_indices[2, mortar]

and execute the example again.
Then, I add @threaded again, so the code is identical to the one I benchmarked before.
Running the exact same benchmarks with the exact same code again now yields:

prolong2mortars            106   1.09ms  5.30%  10.3μs     0.00B  0.00%    0.00B

and

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.225 μs (0.00% GC)
  median time:      7.550 μs (0.00% GC)
  mean time:        7.599 μs (0.00% GC)
  maximum time:     13.600 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     4

What is going on here?

@efaulhaber efaulhaber added the performance We are greedy label Jun 2, 2021
@ranocha
Copy link
Member

ranocha commented Jun 2, 2021

I have observed similar things. Introducing our own @timed helped already quite a bit. I guess the compiler sees more code when using @threaded and gives up earlier. When you remove @threaded, the compiler sees less code, optimizes it as desired, and caches the result. Thus, you see good performance. Seems to be related to JuliaLang/julia#35800, JuliaLang/julia#32552, and JuliaLang/julia#41740.

@efaulhaber
Copy link
Member Author

Is there any workaround for now?

@ranocha
Copy link
Member

ranocha commented Jun 2, 2021

Not really...

@ranocha
Copy link
Member

ranocha commented Jun 4, 2021

Request my review on your related PR when you feel ready for that and I can try to have a look to speed up critical parts

@ranocha
Copy link
Member

ranocha commented Aug 12, 2021

Some analysis using Julia v1.6.2:

julia> using ProfileView, Trixi

julia> trixi_include(joinpath(examples_dir(), "p4est_2d_dgsem", "elixir_advection_nonconforming_unstructured.jl"),
                     save_solution=TrivialCallback(), save_restart=TrivialCallback()) # second run, after compilation
[...]
 ───────────────────────────────────────────────────────────────────────────────
            Trixi.jl                    Time                   Allocations      
                                ──────────────────────   ───────────────────────
        Tot / % measured:           26.5ms / 96.9%           14.7MiB / 100%     

 Section                ncalls     time   %tot     avg     alloc   %tot      avg
 ───────────────────────────────────────────────────────────────────────────────
 rhs!                      126   23.6ms  92.1%   188μs   14.3MiB  97.5%   116KiB
   prolong2mortars         126   14.1ms  54.8%   112μs   14.2MiB  97.1%   116KiB
   volume integral         126   2.67ms  10.4%  21.2μs     0.00B  0.00%    0.00B
   ~rhs!~                  126   2.45ms  9.55%  19.4μs   56.1KiB  0.37%     456B
   interface flux          126   1.29ms  5.02%  10.2μs     0.00B  0.00%    0.00B
   mortar flux             126   1.21ms  4.74%  9.64μs     0.00B  0.00%    0.00B
   boundary flux           126    553μs  2.16%  4.39μs     0.00B  0.00%    0.00B
   prolong2interfaces      126    528μs  2.06%  4.19μs     0.00B  0.00%    0.00B
   surface integral        126    418μs  1.63%  3.32μs     0.00B  0.00%    0.00B
   Jacobian                126    263μs  1.03%  2.09μs     0.00B  0.00%    0.00B
   reset ∂u/∂t             126    101μs  0.40%   804ns     0.00B  0.00%    0.00B
   prolong2boundaries      126   80.2μs  0.31%   636ns     0.00B  0.00%    0.00B
   source terms            126   2.22μs  0.01%  17.7ns     0.00B  0.00%    0.00B
 analyze solution            2   1.65ms  6.44%   826μs    382KiB  2.55%   191KiB
 calculate dt               26    378μs  1.47%  14.5μs     0.00B  0.00%    0.00B
 ───────────────────────────────────────────────────────────────────────────────

Thus, we see the allocations indicating a type instability.

However, @code_warntype is fine.

julia> u_ode = copy(sol.u[end]); du_ode = similar(u_ode); u = Trixi.wrap_array(u_ode, semi); du = Trixi.wrap_array(du_ode, semi); Trixi.rhs!(du_ode, u_ode, semi, 0.0)

julia> @code_warntype Trixi.prolong2mortars!(
           (semi.cache),
           (u),
           (mesh),
           (equations),
           (solver.mortar),
           (solver.surface_integral),
           (solver))
Variables
  #self#::Core.Const(Trixi.prolong2mortars!)
  cache::NamedTuple{(:elements, :interfaces, :boundaries, :mortars, :fstar_upper_threaded, :fstar_lower_threaded, :u_threaded), Tuple{Trixi.P4estElementContainer{2, Float64, Float64, 3, 4, 5}, Trixi.P4estInterfaceContainer{2, Float64, 4}, Trixi.P4estBoundaryContainer{2, Float64, 3}, Trixi.P4estMortarContainer{2, Float64, 3, 5}, Vector{StaticArrays.MMatrix{1, 4, Float64, 4}}, Vector{StaticArrays.MMatrix{1, 4, Float64, 4}}, Vector{StaticArrays.MMatrix{1, 4, Float64, 4}}}}                                                                                           
  u::StrideArraysCore.PtrArray{Tuple{Static.StaticInt{1}, Static.StaticInt{4}, Static.StaticInt{4}, Int64}, (true, true, true, true), Float64, 4, 1, 0, (1, 2, 3, 4), Tuple{Static.StaticInt{8}, Static.StaticInt{8}, Static.StaticInt{32}, Static.StaticInt{128}}, NTuple{4, Static.StaticInt{1}}}                                                
  mesh::P4estMesh{2, Float64, Ptr{P4est.LibP4est.p4est}, 4, 4}
  equations::LinearScalarAdvectionEquation2D{Float64}
  mortar_l2::Trixi.LobattoLegendreMortarL2{Float64, 4, Matrix{Float64}, Matrix{Float64}}
  surface_integral::Core.Const(SurfaceIntegralWeakForm{FluxLaxFriedrichs{typeof(max_abs_speed_naive)}}(FluxLaxFriedrichs(max_abs_speed_naive)))                                                                                   
  dg::DGSEM{LobattoLegendreBasis{Float64, 4, SVector{4, Float64}, Matrix{Float64}, Matrix{Float64}, Matrix{Float64}}, Trixi.LobattoLegendreMortarL2{Float64, 4, Matrix{Float64}, Matrix{Float64}}, SurfaceIntegralWeakForm{FluxLaxFriedrichs{typeof(max_abs_speed_naive)}}, VolumeIntegralWeakForm}                                                
  @_9::Union{Nothing, Tuple{Int64, Int64}}
  630::Trixi.P4estMortarContainer{2, Float64, 3, 5}
  index_range::Base.OneTo{Int64}
  node_indices::Matrix{Tuple{Symbol, Symbol}}
  element_ids::Matrix{Int64}
  @_14::Union{Nothing, Tuple{Int64, Int64}}
  @_15::Int64
  @_16::Int64
  @_17::Union{Nothing, Tuple{Int64, Int64}}
  @_18::Int64
  @_19::Int64
  mortar::Int64
  u_upper::SubArray{Float64, 2, Array{Float64, 5}, Tuple{Int64, Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Int64}, false}                                                                               
  u_lower::SubArray{Float64, 2, Array{Float64, 5}, Tuple{Int64, Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Int64}, false}                                                                               
  element::Int64
  j_large::Int64
  i_large::Int64
  j_large_step::Int64
  j_large_start::Int64
  i_large_step::Int64
  i_large_start::Int64
  large_indices::Tuple{Symbol, Symbol}
  u_buffer::StaticArrays.MMatrix{1, 4, Float64, 4}
  j_small_step::Int64
  j_small_start::Int64
  i_small_step::Int64
  i_small_start::Int64
  small_indices::Tuple{Symbol, Symbol}
  @_37::Union{Nothing, Tuple{Int64, Int64}}
  position::Int64
  j_small::Int64
  i_small::Int64
  @_41::Union{Nothing, Tuple{Int64, Int64}}
  i@_42::Int64
  v@_43::Int64
  @_44::Union{Nothing, Tuple{Int64, Int64}}
  i@_45::Int64
  v@_46::Int64

Body::Nothing
1 ──        (630 = Base.getproperty(cache, :mortars))
│    %2   = Base.getproperty(UnPack, :unpack)::Core.Const(UnPack.unpack)
│    %3   = 630::Trixi.P4estMortarContainer{2, Float64, 3, 5}%4   = Core.apply_type(Trixi.Val, :element_ids)::Core.Const(Val{:element_ids})
│    %5   = (%4)()::Core.Const(Val{:element_ids}())
│           (element_ids = (%2)(%3, %5))
│    %7   = Base.getproperty(UnPack, :unpack)::Core.Const(UnPack.unpack)
│    %8   = 630::Trixi.P4estMortarContainer{2, Float64, 3, 5}%9   = Core.apply_type(Trixi.Val, :node_indices)::Core.Const(Val{:node_indices})
│    %10  = (%9)()::Core.Const(Val{:node_indices}())
│           (node_indices = (%7)(%8, %10))
│           630
│           (index_range = Trixi.eachnode(dg))
│    %14  = Trixi.eachmortar(dg, cache)::Base.OneTo{Int64}
│           (@_9 = Base.iterate(%14))
│    %16  = (@_9 === nothing)::Bool%17  = Base.not_int(%16)::Bool
└───        goto #19 if not %17
2 ┄─        Core.NewvarNode(:(@_14))
│           Core.NewvarNode(:(@_15))
│           Core.NewvarNode(:(@_16))
│           Core.NewvarNode(:(u_upper))
│           Core.NewvarNode(:(u_lower))
│           Core.NewvarNode(:(element))
│           Core.NewvarNode(:(j_large))
│           Core.NewvarNode(:(i_large))
│           Core.NewvarNode(:(j_large_step))
│           Core.NewvarNode(:(j_large_start))
│           Core.NewvarNode(:(i_large_step))
│           Core.NewvarNode(:(i_large_start))
│           Core.NewvarNode(:(large_indices))
│           Core.NewvarNode(:(u_buffer))
│    %33  = @_9::Tuple{Int64, Int64}::Tuple{Int64, Int64}
│           (mortar = Core.getfield(%33, 1))
│    %35  = Core.getfield(%33, 2)::Int64
│           (small_indices = Base.getindex(node_indices, 1, mortar))
│    %37  = Base.getindex(small_indices, 1)::Symbol%38  = Trixi.index_to_start_step(%37, index_range::Core.Const(Base.OneTo(4)))::Tuple{Int64, Int64}%39  = Base.indexed_iterate(%38, 1)::Core.PartialStruct(Tuple{Int64, Int64}, Any[Int64, Core.Const(2)])
│           (i_small_start = Core.getfield(%39, 1))
│           (@_19 = Core.getfield(%39, 2))
│    %42  = Base.indexed_iterate(%38, 2, @_19::Core.Const(2))::Core.PartialStruct(Tuple{Int64, Int64}, Any[Int64, Core.Const(3)])                                                                                                 
│           (i_small_step = Core.getfield(%42, 1))
│    %44  = Base.getindex(small_indices, 2)::Symbol%45  = Trixi.index_to_start_step(%44, index_range::Core.Const(Base.OneTo(4)))::Tuple{Int64, Int64}%46  = Base.indexed_iterate(%45, 1)::Core.PartialStruct(Tuple{Int64, Int64}, Any[Int64, Core.Const(2)])
│           (j_small_start = Core.getfield(%46, 1))
│           (@_18 = Core.getfield(%46, 2))
│    %49  = Base.indexed_iterate(%45, 2, @_18::Core.Const(2))::Core.PartialStruct(Tuple{Int64, Int64}, Any[Int64, Core.Const(3)])                                                                                                 
│           (j_small_step = Core.getfield(%49, 1))
│    %51  = (1:2)::Core.Const(1:2)
│           (@_17 = Base.iterate(%51))
│    %53  = (@_17::Core.Const((1, 1)) === nothing)::Core.Const(false)
│    %54  = Base.not_int(%53)::Core.Const(true)
└───        goto #11 if not %54
3 ┄─ %56  = @_17::Tuple{Int64, Int64}::Tuple{Int64, Int64}
│           (position = Core.getfield(%56, 1))
│    %58  = Core.getfield(%56, 2)::Int64
│           (i_small = i_small_start)
│           (j_small = j_small_start)
│           (element = Base.getindex(element_ids, position, mortar))
│    %62  = Trixi.eachnode(dg)::Core.Const(Base.OneTo(4))
│           (@_37 = Base.iterate(%62))
│    %64  = (@_37::Core.Const((1, 1)) === nothing)::Core.Const(false)
│    %65  = Base.not_int(%64)::Core.Const(true)
└───        goto #9 if not %65
4 ┄─ %67  = @_37::Tuple{Int64, Int64}::Tuple{Int64, Int64}
│           (i@_42 = Core.getfield(%67, 1))
│    %69  = Core.getfield(%67, 2)::Int64%70  = Trixi.eachvariable(equations)::Core.Const(Base.OneTo(1))
│           (@_41 = Base.iterate(%70))
│    %72  = (@_41::Core.Const((1, 1)) === nothing)::Core.Const(false)
│    %73  = Base.not_int(%72)::Core.Const(true)
└───        goto #7 if not %73
5 ── %75  = @_41::Core.Const((1, 1))::Core.Const((1, 1))
│           (v@_43 = Core.getfield(%75, 1))
│    %77  = Core.getfield(%75, 2)::Core.Const(1)
│    %78  = Base.getindex(u, v@_43::Core.Const(1), i_small, j_small, element)::Float64%79  = Base.getproperty(cache, :mortars)::Trixi.P4estMortarContainer{2, Float64, 3, 5}%80  = Base.getproperty(%79, :u)::Array{Float64, 5}%81  = v@_43::Core.Const(1)::Core.Const(1)
│    %82  = position::Int64%83  = i@_42::Int64
│           Base.setindex!(%80, %78, 1, %81, %82, %83, mortar)
│           (@_41 = Base.iterate(%70, %77))
│    %86  = (@_41::Core.Const(nothing) === nothing)::Core.Const(true)
│    %87  = Base.not_int(%86)::Core.Const(false)
└───        goto #7 if not %87
6 ──        Core.Const(:(goto %75))
7 ┄─        (i_small = i_small + i_small_step)
│           (j_small = j_small + j_small_step)
│           (@_37 = Base.iterate(%62, %69))
│    %93  = (@_37 === nothing)::Bool%94  = Base.not_int(%93)::Bool
└───        goto #9 if not %94
8 ──        goto #4
9 ┄─        (@_17 = Base.iterate(%51, %58))
│    %98  = (@_17 === nothing)::Bool%99  = Base.not_int(%98)::Bool
└───        goto #11 if not %99
10 ─        goto #3
11%102 = Base.getproperty(cache, :u_threaded)::Vector{StaticArrays.MMatrix{1, 4, Float64, 4}}%103 = Base.getproperty(Trixi.Threads, :threadid)::Core.Const(Base.Threads.threadid)
│    %104 = (%103)()::Int64
│           (u_buffer = Base.getindex(%102, %104))
│           (large_indices = Base.getindex(node_indices, 2, mortar))
│    %107 = Base.getindex(large_indices, 1)::Symbol%108 = Trixi.index_to_start_step(%107, index_range::Core.Const(Base.OneTo(4)))::Tuple{Int64, Int64}%109 = Base.indexed_iterate(%108, 1)::Core.PartialStruct(Tuple{Int64, Int64}, Any[Int64, Core.Const(2)])
│           (i_large_start = Core.getfield(%109, 1))
│           (@_16 = Core.getfield(%109, 2))
│    %112 = Base.indexed_iterate(%108, 2, @_16::Core.Const(2))::Core.PartialStruct(Tuple{Int64, Int64}, Any[Int64, Core.Const(3)])
│           (i_large_step = Core.getfield(%112, 1))
│    %114 = Base.getindex(large_indices, 2)::Symbol%115 = Trixi.index_to_start_step(%114, index_range::Core.Const(Base.OneTo(4)))::Tuple{Int64, Int64}%116 = Base.indexed_iterate(%115, 1)::Core.PartialStruct(Tuple{Int64, Int64}, Any[Int64, Core.Const(2)])
│           (j_large_start = Core.getfield(%116, 1))
│           (@_15 = Core.getfield(%116, 2))
│    %119 = Base.indexed_iterate(%115, 2, @_15::Core.Const(2))::Core.PartialStruct(Tuple{Int64, Int64}, Any[Int64, Core.Const(3)])
│           (j_large_step = Core.getfield(%119, 1))
│           (i_large = i_large_start)
│           (j_large = j_large_start)
│           (element = Base.getindex(element_ids, 3, mortar))
│    %124 = Trixi.eachnode(dg)::Core.Const(Base.OneTo(4))
│           (@_14 = Base.iterate(%124))
│    %126 = (@_14::Core.Const((1, 1)) === nothing)::Core.Const(false)
│    %127 = Base.not_int(%126)::Core.Const(true)
└───        goto #17 if not %127
12%129 = @_14::Tuple{Int64, Int64}::Tuple{Int64, Int64}
│           (i@_45 = Core.getfield(%129, 1))
│    %131 = Core.getfield(%129, 2)::Int64%132 = Trixi.eachvariable(equations)::Core.Const(Base.OneTo(1))
│           (@_44 = Base.iterate(%132))
│    %134 = (@_44::Core.Const((1, 1)) === nothing)::Core.Const(false)
│    %135 = Base.not_int(%134)::Core.Const(true)
└───        goto #15 if not %135
13%137 = @_44::Core.Const((1, 1))::Core.Const((1, 1))
│           (v@_46 = Core.getfield(%137, 1))
│    %139 = Core.getfield(%137, 2)::Core.Const(1)
│    %140 = Base.getindex(u, v@_46::Core.Const(1), i_large, j_large, element)::Float64
│           Base.setindex!(u_buffer, %140, v@_46::Core.Const(1), i@_45)
│           (@_44 = Base.iterate(%132, %139))
│    %143 = (@_44::Core.Const(nothing) === nothing)::Core.Const(true)
│    %144 = Base.not_int(%143)::Core.Const(false)
└───        goto #15 if not %144
14 ─        Core.Const(:(goto %137))
15 ┄        (i_large = i_large + i_large_step)
│           (j_large = j_large + j_large_step)
│           (@_14 = Base.iterate(%124, %131))
│    %150 = (@_14 === nothing)::Bool%151 = Base.not_int(%150)::Bool
└───        goto #17 if not %151
16 ─        goto #12
17%154 = Base.getproperty(cache, :mortars)::Trixi.P4estMortarContainer{2, Float64, 3, 5}%155 = Base.getproperty(%154, :u)::Array{Float64, 5}
│           (u_lower = Trixi.view(%155, 2, Trixi.:(:), 1, Trixi.:(:), mortar))
│    %157 = u_lower::Core.PartialStruct(SubArray{Float64, 2, Array{Float64, 5}, Tuple{Int64, Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Int64}, false}, Any[Array{Float64, 5}, Core.PartialStruct(Tuple{Int64, Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Int64}, Any[Core.Const(2), Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Int64]), Core.Const(0), Core.Const(0)])::Core.PartialStruct(SubArray{Float64, 2, Array{Float64, 5}, Tuple{Int64, Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Int64}, false}, Any[Array{Float64, 5}, Core.PartialStruct(Tuple{Int64, Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Int64}, Any[Core.Const(2), Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Int64]), Core.Const(0), Core.Const(0)])
│    %158 = Base.getproperty(mortar_l2, :forward_lower)::Matrix{Float64}
│           Trixi.multiply_dimensionwise!(%157, %158, u_buffer)
│    %160 = Base.getproperty(cache, :mortars)::Trixi.P4estMortarContainer{2, Float64, 3, 5}%161 = Base.getproperty(%160, :u)::Array{Float64, 5}
│           (u_upper = Trixi.view(%161, 2, Trixi.:(:), 2, Trixi.:(:), mortar))
│    %163 = u_upper::Core.PartialStruct(SubArray{Float64, 2, Array{Float64, 5}, Tuple{Int64, Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Int64}, false}, Any[Array{Float64, 5}, Core.PartialStruct(Tuple{Int64, Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Int64}, Any[Core.Const(2), Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Int64]), Core.Const(0), Core.Const(0)])::Core.PartialStruct(SubArray{Float64, 2, Array{Float64, 5}, Tuple{Int64, Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Int64}, false}, Any[Array{Float64, 5}, Core.PartialStruct(Tuple{Int64, Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Int64}, Any[Core.Const(2), Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Int64]), Core.Const(0), Core.Const(0)])
│    %164 = Base.getproperty(mortar_l2, :forward_upper)::Matrix{Float64}
│           Trixi.multiply_dimensionwise!(%163, %164, u_buffer)
│           (@_9 = Base.iterate(%14, %35))
│    %167 = (@_9 === nothing)::Bool%168 = Base.not_int(%167)::Bool
└───        goto #19 if not %168
18 ─        goto #2
19return Trixi.nothing

Digging deeper, profiling shows dynamic dispatch:

julia> function doit(n, du, u, semi)
           for _ in 1:n
               Trixi.prolong2mortars!(
                   (semi.cache),
                   (u),
                   (semi.mesh),
                   (semi.equations),
                   (semi.solver.mortar),
                   (semi.solver.surface_integral),
                   (semi.solver))
           end
       end
doit (generic function with 1 method)

julia> doit(10^3, du, u, semi)

julia> @profview doit(10^5, du, u, semi)

This gives
profiling

The relevant lines are

to_indices(A, inds, I::Tuple{Any, Vararg{Any}}) =
    (@_inline_meta; (to_index(A, I[1]), to_indices(A, _maybetail(inds), tail(I))...))

in indices.jl,

@inline to_indices(A, inds, I::Tuple{Colon, Vararg{Any}}) =
    (uncolon(inds, I), to_indices(A, _maybetail(inds), tail(I))...)

in multidimensional.jl, and

    SubArray(IndexStyle(viewindexing(indices), IndexStyle(parent)), parent, ensure_indexable(indices), index_dimsum(indices...))

in subarray.jl.jl.

These lines are called from

view(cache.mortars.u, 2, :, 1, :, mortar)

in Trixi.jl.

If I restart Julia and force compilation of the relevant method of view before calling the code from Trixi.jl, everything is fine:

julia> using Trixi

julia> let
           u = zeros(2, 2, 2, 2, 2)
           u_upper = view(u, 2, :, 1, :, 1)
           nothing
       end

julia> trixi_include(joinpath(examples_dir(), "p4est_2d_dgsem", "elixir_advection_nonconforming_unstructured.jl"),
                     save_solution=TrivialCallback(), save_restart=TrivialCallback()) # again the second run
[...]
 ───────────────────────────────────────────────────────────────────────────────
            Trixi.jl                    Time                   Allocations      
                                ──────────────────────   ───────────────────────
        Tot / % measured:           10.7ms / 93.4%            257KiB / 81.2%    

 Section                ncalls     time   %tot     avg     alloc   %tot      avg
 ───────────────────────────────────────────────────────────────────────────────
 rhs!                      126   8.30ms  82.8%  65.9μs   56.1KiB  26.9%     456B
   volume integral         126   2.21ms  22.0%  17.5μs     0.00B  0.00%    0.00B
   ~rhs!~                  126   1.45ms  14.4%  11.5μs   56.1KiB  26.9%     456B
   prolong2mortars         126   1.08ms  10.7%  8.55μs     0.00B  0.00%    0.00B
   interface flux          126    981μs  9.78%  7.78μs     0.00B  0.00%    0.00B
   mortar flux             126    936μs  9.33%  7.43μs     0.00B  0.00%    0.00B
   boundary flux           126    481μs  4.80%  3.82μs     0.00B  0.00%    0.00B
   prolong2interfaces      126    452μs  4.51%  3.59μs     0.00B  0.00%    0.00B
   surface integral        126    345μs  3.44%  2.74μs     0.00B  0.00%    0.00B
   Jacobian                126    243μs  2.43%  1.93μs     0.00B  0.00%    0.00B
   reset ∂u/∂t             126   77.9μs  0.78%   618ns     0.00B  0.00%    0.00B
   prolong2boundaries      126   53.5μs  0.53%   425ns     0.00B  0.00%    0.00B
   source terms            126   2.74μs  0.03%  21.8ns     0.00B  0.00%    0.00B
 analyze solution            2   1.32ms  13.2%   661μs    152KiB  73.1%  76.2KiB
 calculate dt               26    402μs  4.01%  15.5μs     0.00B  0.00%    0.00B
 ───────────────────────────────────────────────────────────────────────────────

There are no allocations anymore in prolong2mortars! and profiling shows that the dynamic dispatch is gone.

@sloede
Copy link
Member

sloede commented Aug 12, 2021

[...] There are no allocations anymore in prolong2mortars! and profiling shows that the dynamic dispatch is gone.

Very impressive analysis, thanks for sharing your step-by-step process with all details! What is your conclusion what we can do about it?

@ranocha
Copy link
Member

ranocha commented Aug 12, 2021

What is your conclusion what we can do about it?

#778

@ranocha ranocha mentioned this issue Jul 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance We are greedy
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants