Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance regression up to 20% when updating from Julia v1.10.4 to v1.11.0-rc1 #55009

Open
benegee opened this issue Jul 3, 2024 · 26 comments
Assignees
Labels
performance Must go faster regression Regression in behavior compared to a previous version regression 1.11 Regression in the 1.11 release

Comments

@benegee
Copy link

benegee commented Jul 3, 2024

We have recently started testing julia 1.11 in our projects Trixi.jl and TrixiParticles.jl. We observed performance regressions up to 20%. This has been confirmed on laptops (Apple M2, M3), a workstation (AMD Ryzen Threadripper 3990X), and the JUWELS HPC system at Jülich supercomputing centre (AMD EPYC Rome 7402 CPU).

@sloede has put together instructions and a MWE here: https://github.com/trixi-framework/performance-regression-julia-v1.10-vs-v1.11
It boils down to adding Trixi.jl and running BenchmarkTools for an isolated function in Trixi.jl.

using Trixi
using BenchmarkTools

equations = CompressibleEulerEquations3D(1.4)

volume_flux = flux_ranocha
solver = DGSEM(polydeg=3, surface_flux=flux_ranocha,
               volume_integral=VolumeIntegralFluxDifferencing(volume_flux))

coordinates_min = (0.0, 0.0, 0.0)
coordinates_max = (2.0, 2.0, 2.0)

trees_per_dimension = (4, 4, 4)

mesh = P4estMesh(trees_per_dimension, polydeg=1,
                 coordinates_min=coordinates_min, coordinates_max=coordinates_max,
                 periodicity=true, initial_refinement_level=1)

semi = SemidiscretizationHyperbolic(mesh, equations, initial_condition_convergence_test,
                                    solver)

u_ode = compute_coefficients(0.0, semi)
du_ode = similar(u_ode)

@benchmark Trixi.rhs!($du_ode, $u_ode, $semi, 0.0)

For this code we get on our workstation a median time of 6.079 ms with julia v1.10.4 vs. 7.200 ms with v1.11.0-rc1.

@sloede @ranocha @vchuravy @efaulhaber @LasNikas

@jishnub
Copy link
Contributor

jishnub commented Jul 3, 2024

Would it be possible to profile this to identify the point of regression?

@benegee benegee changed the title Performance regression when updating from Julia v1.10.4 to v1.11.0-rc1 Performance regression up to 20% when updating from Julia v1.10.4 to v1.11.0-rc1 Jul 3, 2024
@benegee
Copy link
Author

benegee commented Jul 3, 2024

Thanks for picking this up!

For a start, running @profile on 1000 iterations of rhs! and using PProf gives these flamegraphs (top: 1.10.4, bottom: 1.11-rc1)
1 10 4vs1 11

We can of course dig deeper.

@gbaraldi
Copy link
Member

gbaraldi commented Jul 3, 2024

The difference looks to be some getindex calls that now appear in the flamegraph.

@benegee
Copy link
Author

benegee commented Jul 3, 2024

Yes! For 1.11 we have checkbounds beneath these calls, which make up a fair amount of samples. For 1.10 they are not captured.

@KristofferC
Copy link
Member

Would be good to get a MWE that gets a bit closer to those suspected functions.

@gbaraldi
Copy link
Member

gbaraldi commented Jul 3, 2024

The issue is not the checkbounds btw. There's a whole getindex call that seems to be here. It might be Memory{T} bottom is 1.11
image
image

@ranocha ranocha added performance Must go faster regression Regression in behavior compared to a previous version labels Jul 3, 2024
@ranocha
Copy link
Member

ranocha commented Jul 3, 2024

Yeah, it looks indeed like the calls to get_contravariant_vector (get_contrav...) before get_node_vars and getindex after get_node_vars are much more expensive on Julia v1.11...

For reference: this code is called from

https://github.com/trixi-framework/Trixi.jl/blob/b4eef6d87d1e5a737ab31ad9c1e5bad293a19b4d/src/solvers/dgsem_structured/dg_3d.jl#L143-L156

get_contravariant_vector is defined in

https://github.com/trixi-framework/Trixi.jl/blob/b4eef6d87d1e5a737ab31ad9c1e5bad293a19b4d/src/solvers/dgsem_structured/dg.jl#L26-L29

@gbaraldi
Copy link
Member

gbaraldi commented Jul 3, 2024

So there are two commits that might deserve testing here. 9aa7980 as reported in #53158 (comment) and #51720

@oscardssmith
Copy link
Member

I do think we might need to actually do the initialized memory zeroing so we can revert that change. it has caused a lot of regressions.

@sloede
Copy link
Contributor

sloede commented Jul 4, 2024

Would be good to get a MWE that gets a bit closer to those suspected functions.

After running the code posted by @benegee above, you can directly benchmark get_contravariant_vector by executing the following lines:

@unpack contravariant_vectors = semi.cache.elements

ii = j = k = element = 1

@benchmark Trixi.get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)

On our Rocinante (AMD Ryzen), this gives me the following:

v1.10.4

julia> @benchmark Trixi.get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min  max):  3.930 ns  32.081 ns  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     4.050 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.061 ns ±  0.309 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

            █ ▂
  ▂▂▁▂▂▁▁▁▁▃█▄█▇▃▅▅▂▂▂▂▁▂▁▁▂▁▁▂▁▁▁▁▂▂▂▂▂▁▁▁▂▂▁▂▂▁▂▂▂▂▁▂▂▂▂▁▂ ▂
  3.93 ns        Histogram: frequency by time        4.47 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

v1.11.0-rc1

julia> @benchmark Trixi.get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min  max):  7.147 ns  15.917 ns  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     7.167 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.186 ns ±  0.183 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █▃
  ▂██▂▂▅▃▆▂▂▂▁▁▁▂▂▁▂▂▁▁▁▂▁▁▁▁▁▂▁▂▁▁▁▁▁▁▂▁▂▁▂▁▂▁▁▂▂▁▂▂▁▁▁▁▂▁▂ ▂
  7.15 ns        Histogram: frequency by time        7.64 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

@sloede
Copy link
Contributor

sloede commented Jul 5, 2024

@gbaraldi @oscardssmith thanks for your input! It would be great if you could continue to help with the analysis of the root cause and to understand what needs to be fixed on the Julia side, as we do not have the expertise for something this close to the metal.

I know everyone has a full plate, but with this already being v1.11 RC1, on our side multiple research groups are slowly getting into panic mode 😅. Losing the performance boost of the now-defunct --check-bounds=no was already hard to swallow, but a 20-30% drop in serial performance would be near catastrophic for us in the HPC world.

@vchuravy
Copy link
Member

vchuravy commented Jul 5, 2024

Julia 1.10 will likely become an LTS release once 1.11 has been released. We will of course see if we can fix this in a patch release, but performance regression are not release blockers.

@sgaure
Copy link

sgaure commented Jul 5, 2024

I've bisected this to 909bcea, the big Memory{T} PR.
There seems to be a problem with optimizing getindex with the new Memory{T}-based arrays. Some of the performance difference, but not all, seems to go away by rewriting get_contravariant_vector slightly.

@inline function get_contravariant_vector(index, contravariant_vectors, indices...)
    cv = @view(contravariant_vectors[:, index, indices...])
    SVector(ntuple(dim->cv[dim], Val(ndims(contravariant_vectors) - 3)))
end

@vtjnash
Copy link
Member

vtjnash commented Jul 5, 2024

Seems mostly like a typo there too, since in the original code @inline(dim->contravariant_vectors[dim, index, indices...]),, the @inbounds does not apply to any call expression. Though it was good to see that the code previously worked well without needing to manually override the correctness of the bounds checking.

@sloede
Copy link
Contributor

sloede commented Jul 7, 2024

I've bisected this to 909bcea, the big Memory{T} PR. There seems to be a problem with optimizing getindex with the new Memory{T}-based arrays. Some of the performance difference, but not all, seems to go away by rewriting get_contravariant_vector slightly.

@inline function get_contravariant_vector(index, contravariant_vectors, indices...)
    cv = @view(contravariant_vectors[:, index, indices...])
    SVector(ntuple(dim->cv[dim], Val(ndims(contravariant_vectors) - 3)))
end

@sgaure Thanks a lot for taking a look. Your implementation does indeed recover some of the performance in v1.11.0-rc1. Interestingly, with it the v1.10.4 implementation is as fast as the v1.11.0-rc1 implementation, i.e., while it does make v1.11.0-rc1 faster, it actually slows down execution on v1.10.4.

On our Rocinante (AMD Ryzen Threadripper):

v1.10.4

julia> @benchmark Trixi.get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.930 ns … 20.820 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.030 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.054 ns ±  0.217 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                        █ ▆ ▅ ▃    ▅ ▂                   ▁ ▁ ▁
  ▃▁▅▁▆▁▃▁▁▁▁▁▁▁▁▁▁▁▁█▄▁█▁█▁█▁█▇▁█▁█▁█▁▄▁▄▁▁▇▁▃▁▄▁▃▁▁▄▁▄▁█▁█ █
  3.93 ns      Histogram: log(frequency) by time     4.19 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sgaure_get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  5.780 ns … 16.460 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.780 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.795 ns ±  0.183 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █
  █▄▁▁▁▁▁▁▁▁▁▁▁▁█▃▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  5.78 ns        Histogram: frequency by time        5.82 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

v1.11.0-rc1

julia> @benchmark Trixi.get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  7.147 ns … 17.998 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.157 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.172 ns ±  0.195 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄█                                                         ▁
  ██▁█▁▃▁▃▁▁▁▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃▁▁▃▁▁▁▃▁▃▁▁▄▁▄▁▃▁▄▁▄▁▁▅ █
  7.15 ns      Histogram: log(frequency) by time     7.47 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sgaure_get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  5.770 ns … 12.080 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.800 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.819 ns ±  0.158 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▁  █    ▁
  ▃█▁▄█▁▇▁▄█▁▅▂▂▁▁▂▂▁▁▁▁▂▁▂▂▁▂▁▂▁▁▁▂▁▁▁▂▁▁▁▂▁▂▁▂▁▁▂▁▁▂▁▂▂▁▁▂ ▂
  5.77 ns        Histogram: frequency by time        6.13 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

It thus seems that v1.10.4 is still able to optimize more of the code away in the original implementation, and v1.11.0-rc1 is not able to fully catch up. It's a start, but it does explain why v1.11.0-rc1 is not able to fully optimize the code yet, and it would be nice to also get the missing ~50% of performance back.

Furthermore, it would be nice to also understand why getindex starts showing up in the profile for v1.11.0-rc1.

@sloede
Copy link
Contributor

sloede commented Jul 7, 2024

Another interesting observation:

@sgaure I took your implementation for get_contravariant_vector and applied it also to get_node_vars:

@inline function sgaure_get_node_vars(u, equations, solver::DG, indices...)
    uv = @view(u[:, indices...])
    SVector(ntuple(idx->uv[idx], Val(nvariables(equations))))
end

with

u = Trixi.wrap_array(u_ode, semi)

ii = j = k = element = 1

@benchmark Trixi.get_node_vars($u, $equations, $solver, $ii, $j, $k, $element)

@benchmark sgaure_get_node_vars($u, $equations, $solver, $ii, $j, $k, $element)

it turns out that v1.10.4 and v1.11.0-rc1 are again equally fast and, to my surprise, both are faster than our original implementation on v1.10.4.

On our Rocinante (AMD Ryzen Threadripper):

v1.10.4

julia> @benchmark Trixi.get_node_vars($u, $equations, $solver, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.930 ns … 14.730 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.930 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.943 ns ±  0.188 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █
  █▁▁▁▁▁▆▂▁▁▁▁▂▂▁▁▁▁▁▂▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  3.93 ns        Histogram: frequency by time        4.02 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sgaure_get_node_vars($u, $equations, $solver, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.240 ns … 22.801 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.240 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.251 ns ±  0.220 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █
  █▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  3.24 ns        Histogram: frequency by time        3.26 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

v1.11.0-rc1

julia> @benchmark Trixi.get_node_vars($u, $equations, $solver, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.930 ns … 48.041 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.930 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.946 ns ±  0.457 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                   ▂
  █▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▃ ▂
  3.93 ns        Histogram: frequency by time        3.94 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sgaure_get_node_vars($u, $equations, $solver, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.240 ns … 21.401 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.240 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.252 ns ±  0.214 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █
  █▁▄▁▂▁▂▁▁▁▁▁▁▂▁▂▁▂▁▁▂▁▂▁▂▁▂▁▁▁▁▂▁▁▁▂▁▁▁▁▁▁▂▁▂▁▂▁▁▂▁▂▁▂▁▂▁▂ ▂
  3.24 ns        Histogram: frequency by time         3.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now, what's up with that? Seeing this, maybe you could take another look at get_contravariant_vector to massage it to getting the original performance back, @sgaure?

@gbaraldi
Copy link
Member

gbaraldi commented Jul 7, 2024

The proper fix to this is to revive https://reviews.llvm.org/D136218 and make it correct

@benegee
Copy link
Author

benegee commented Jul 8, 2024

Confirming @sloede 's observations: With the alternative implementation of get_node_vars we see an improvement both with 1.10.4 and 1.11.0-rc1. For get_contravariant_vector we see an improvement with 1.11.0-rc1 but a slowdown with 1.10.4.

In the end, the alternative implementation recovers some of the performance in v1.11.0-rc1, but unfortunately the overall performance for the higher-level functions rhs! or flux_differencing_kernel! is still best with 1.10.4 and the original implementation.

@j-fu
Copy link

j-fu commented Jul 12, 2024

... seeing a similar regression in VoronoiFVM.jl nonlinear stationary solver examples, with KrylovJL_GMRES & ILUZero.ilu0 preconditioner.

@sloede
Copy link
Contributor

sloede commented Jul 16, 2024

@gbaraldi @oscardssmith Thanks a lot for the very informative and cordial discussions about performance at JuliaCon ♥️.

Did you, by any change, make progress regarding the proposed LLVM patch and if yes, is it already testable somewhere (preferably with pre-built binaries)?

@gbaraldi
Copy link
Member

Am working on it. (it's not super trivial since it's basically a general store sinking pass). Given that, I would consider it not release blocking (So that the release gets out and we may find more issues). I believe it's a very self contained thing so we can backport it quite easily.

@sloede
Copy link
Contributor

sloede commented Aug 27, 2024

With v1.11.0-rc3 being released, I reran the benchmarks above.

TL;DR No overall performance improvements from v1.11.0-rc1 to v1.11.0-rc3 😢

Each result is the minimum time obtained from using @benchmark ... as posted above (in ms for rhs! and in ns for the rest):

Overall

  v1.10.4 v1.11.0-rc1 v1.11.0-rc3 1.10 → 1.11-rc1 1.10 → 1.11-rc3
Trixi.rhs! [ms] 6.039 7.031 6.997 16% 16%
Trixi.get_contravariant_vector [ns] 3.940 7.147 7.147 81% 81%
sgaure_get_contravariant_vector [ns] 5.430 5.770 5.310 6% -2%
Trixi.get_node_vars [ns] 3.420 3.320 3.320 -3% -3%
sgaure_get_node_vars [ns] 3.010 2.550 2.780 -15% -8%

As written above, the overall times have, unfortunately, not improved at all. However, the sgaure_get_contravariant_vector implementation is now faster than before (now also faster than v1.10), which is great. Unfortunately, the sgaure_get_node_vars has a small regression compared to v1.11-rc1 (though still faster than v1.10.4)

At the moment it still looks like we should go with the implementations of @sgaure, but an overall performance regression of 16% is still very sad.

@gbaraldi @oscardssmith Did you manage to fix the known LLVM-related (?) performance regression we talked about at JuliaCon during the subsequent hackathon? And if yes, should/would this have already been reflected in the numbers for rc3 above?

@gbaraldi
Copy link
Member

I haven't yet sorry. The LLVM thing is more involved than I had imagined earlier and basically requires writing a full pass.

@sloede
Copy link
Contributor

sloede commented Aug 27, 2024

Thanks for keeping at it! I'll keep my fingers crossed 🤞

@sloede
Copy link
Contributor

sloede commented Sep 28, 2024

Is there a chance that rc4 might have fixed some of the performance regressions reported here?

@svchb
Copy link

svchb commented Sep 28, 2024

@sloede We have tested it with RC4 its still about 20% slower than 1.10.5. RC2 though was almost 30% slower than 1.10.4.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster regression Regression in behavior compared to a previous version regression 1.11 Regression in the 1.11 release
Projects
None yet
Development

No branches or pull requests