Performance regression up to 20% when updating from Julia v1.10.4 to v1.11.0-rc1 #55009
Would it be possible to profile this to identify the point of regression?

The difference looks to be some `getindex` calls that now appear in the flame graph.

Yes! For 1.11 we have …

Would be good to get a MWE that gets a bit closer to those suspected functions.
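A standalone sketch of such a micro-MWE, independent of Trixi, might look like the following. All names and array sizes here are made up; the kernel only mimics the suspected access pattern (gathering a small `SVector` from a strided slice of a larger array):

```julia
# Hypothetical micro-benchmark mimicking the suspected access pattern;
# sizes and the function name `extract` are illustrative, not from Trixi.
using BenchmarkTools, StaticArrays

data = rand(3, 3, 4, 4, 4, 8)   # (dim, index, i, j, k, element), sizes arbitrary
i, j, k, element = 1, 2, 3, 4

# Gather one 3-vector from a strided slice of the array.
extract(a, index, i, j, k, element) =
    SVector(ntuple(dim -> a[dim, index, i, j, k, element], Val(3)))

@benchmark extract($data, 1, $i, $j, $k, $element)
```

Running this kind of kernel on both v1.10 and v1.11 would isolate the regression from the rest of the solver.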
Yeah, it does indeed look like the calls to … For reference: this code is called from …
So there are two commits that might deserve testing here: 9aa7980, as reported in #53158 (comment), and #51720.
I do think we might need to actually do the initialized memory zeroing so we can revert that change. It has caused a lot of regressions.
After running the code posted by @benegee above, you can directly benchmark `get_contravariant_vector`:

```julia
@unpack contravariant_vectors = semi.cache.elements
ii = j = k = element = 1
@benchmark Trixi.get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)
```

On our Rocinante (AMD Ryzen), this gives me the following:

**v1.10.4**

```
julia> @benchmark Trixi.get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.930 ns … 32.081 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.050 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.061 ns ± 0.309 ns   ┊ GC (mean ± σ):  0.00% ± 0.00%

           █ ▂
  ▂▂▁▂▂▁▁▁▁▃█▄█▇▃▅▅▂▂▂▂▁▂▁▁▂▁▁▂▁▁▁▁▂▂▂▂▂▁▁▁▂▂▁▂▂▁▂▂▂▂▁▂▂▂▂▁▂ ▂
  3.93 ns        Histogram: frequency by time        4.47 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

**v1.11.0-rc1**

```
julia> @benchmark Trixi.get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  7.147 ns … 15.917 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.167 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.186 ns ± 0.183 ns   ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃
  ▂██▂▂▅▃▆▂▂▂▁▁▁▂▂▁▂▂▁▁▁▂▁▁▁▁▁▂▁▂▁▁▁▁▁▁▂▁▂▁▂▁▂▁▁▂▂▁▂▂▁▁▁▁▂▁▂ ▂
  7.15 ns        Histogram: frequency by time        7.64 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
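To narrow down where the extra ~3 ns go, it may help to compare the code generated for this call on both Julia versions. This is a generic inspection recipe, not something from the thread; extra scalar loads or bounds checks in the newer output would point at the regressed `getindex` path:

```julia
# Compare emitted IR/machine code between v1.10.4 and v1.11.0-rc1
# (run in the same session where `contravariant_vectors` etc. are set up).
using InteractiveUtils

@code_llvm debuginfo=:none Trixi.get_contravariant_vector(
    1, contravariant_vectors, ii, j, k, element)
@code_native debuginfo=:none Trixi.get_contravariant_vector(
    1, contravariant_vectors, ii, j, k, element)
```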
@gbaraldi @oscardssmith thanks for your input! It would be great if you could continue to help analyze the root cause and understand what needs to be fixed on the Julia side, as we do not have the expertise for something this close to the metal. I know everyone has a full plate, but with this already being v1.11 RC1, multiple research groups on our side are slowly getting into panic mode 😅. Losing the performance boost of the now-defunct …
Julia 1.10 will likely become an LTS release once 1.11 has been released. We will of course see if we can fix this in a patch release, but performance regressions are not release blockers.
I've bisected this to 909bcea, the big …
Seems mostly like a typo there too, since in the original code …
@sgaure Thanks a lot for taking a look. Your implementation does indeed recover some of the performance in v1.11.0-rc1. Interestingly, with it the v1.10.4 timing matches the v1.11.0-rc1 timing, i.e., while it makes v1.11.0-rc1 faster, it actually slows down execution on v1.10.4. On our Rocinante (AMD Ryzen Threadripper):

**v1.10.4**

```
julia> @benchmark Trixi.get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.930 ns … 20.820 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.030 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.054 ns ± 0.217 ns   ┊ GC (mean ± σ):  0.00% ± 0.00%

  █ ▆ ▅   ▃ ▅ ▂ ▁ ▁   ▁
  ▃▁▅▁▆▁▃▁▁▁▁▁▁▁▁▁▁▁▁█▄▁█▁█▁█▁█▇▁█▁█▁█▁▄▁▄▁▁▇▁▃▁▄▁▃▁▁▄▁▄▁█▁█ █
  3.93 ns      Histogram: log(frequency) by time     4.19 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sgaure_get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  5.780 ns … 16.460 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.780 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.795 ns ± 0.183 ns   ┊ GC (mean ± σ):  0.00% ± 0.00%

  █
  █▄▁▁▁▁▁▁▁▁▁▁▁▁█▃▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  5.78 ns        Histogram: frequency by time        5.82 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

**v1.11.0-rc1**

```
julia> @benchmark Trixi.get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  7.147 ns … 17.998 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.157 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.172 ns ± 0.195 ns   ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄█ ▁
  ██▁█▁▃▁▃▁▁▁▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃▁▁▃▁▁▁▃▁▃▁▁▄▁▄▁▃▁▄▁▄▁▁▅ █
  7.15 ns      Histogram: log(frequency) by time     7.47 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sgaure_get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  5.770 ns … 12.080 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.800 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.819 ns ± 0.158 ns   ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁ █   ▁
  ▃█▁▄█▁▇▁▄█▁▅▂▂▁▁▂▂▁▁▁▁▂▁▂▂▁▂▁▂▁▁▁▂▁▁▁▂▁▁▁▂▁▂▁▂▁▁▂▁▁▂▁▂▂▁▁▂ ▂
  5.77 ns        Histogram: frequency by time        6.13 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

It thus seems that v1.10.4 is still able to optimize more of the code away in the original implementation, and v1.11.0-rc1 is not able to fully catch up. It's a start, but it does not explain why v1.11.0-rc1 is not able to fully optimize the code yet, and it would be nice to also get the missing ~50% of performance back. Furthermore, it would be nice to also understand why …
Another interesting observation: @sgaure I took your implementation for … and wrote

```julia
@inline function sgaure_get_node_vars(u, equations, solver::DG, indices...)
    uv = @view(u[:, indices...])
    SVector(ntuple(idx -> uv[idx], Val(nvariables(equations))))
end
```

With

```julia
u = Trixi.wrap_array(u_ode, semi)
ii = j = k = element = 1
@benchmark Trixi.get_node_vars($u, $equations, $solver, $ii, $j, $k, $element)
@benchmark sgaure_get_node_vars($u, $equations, $solver, $ii, $j, $k, $element)
```

it turns out that v1.10.4 and v1.11.0-rc1 are again equally fast and, to my surprise, both are faster than our original implementation on v1.10.4. On our Rocinante (AMD Ryzen Threadripper):

**v1.10.4**

```
julia> @benchmark Trixi.get_node_vars($u, $equations, $solver, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.930 ns … 14.730 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.930 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.943 ns ± 0.188 ns   ┊ GC (mean ± σ):  0.00% ± 0.00%

  █
  █▁▁▁▁▁▆▂▁▁▁▁▂▂▁▁▁▁▁▂▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  3.93 ns        Histogram: frequency by time        4.02 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sgaure_get_node_vars($u, $equations, $solver, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.240 ns … 22.801 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.240 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.251 ns ± 0.220 ns   ┊ GC (mean ± σ):  0.00% ± 0.00%

  █
  █▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  3.24 ns        Histogram: frequency by time        3.26 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

**v1.11.0-rc1**

```
julia> @benchmark Trixi.get_node_vars($u, $equations, $solver, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.930 ns … 48.041 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.930 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.946 ns ± 0.457 ns   ┊ GC (mean ± σ):  0.00% ± 0.00%

  █ ▂
  █▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▃ ▂
  3.93 ns        Histogram: frequency by time        3.94 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sgaure_get_node_vars($u, $equations, $solver, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.240 ns … 21.401 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.240 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.252 ns ± 0.214 ns   ┊ GC (mean ± σ):  0.00% ± 0.00%

  █
  █▁▄▁▂▁▂▁▁▁▁▁▁▂▁▂▁▂▁▁▂▁▂▁▂▁▂▁▁▁▁▂▁▁▁▂▁▁▁▁▁▁▂▁▂▁▂▁▁▂▁▂▁▂▁▂▁▂ ▂
  3.24 ns        Histogram: frequency by time        3.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

Now, what's up with that? Seeing this, maybe you could take another look at …
The proper fix for this is to revive https://reviews.llvm.org/D136218 and make it correct.
Confirming @sloede's observations with the alternative implementation of … In the end, the alternative implementation recovers some of the performance in v1.11.0-rc1, but unfortunately the overall performance for the higher-level functions …
... seeing a similar regression in VoronoiFVM.jl nonlinear stationary solver examples, with KrylovJL_GMRES and an ILUZero.ilu0 preconditioner.
@gbaraldi @oscardssmith Thanks a lot for the very informative and cordial discussions about performance at JuliaCon. Did you, by any chance, make progress regarding the proposed LLVM patch, and if yes, is it already testable somewhere (preferably with pre-built binaries)?
I'm working on it (it's not super trivial, since it's basically a general store-sinking pass). Given that, I would consider it not release-blocking (so that the release gets out and we may find more issues). I believe it's a very self-contained thing, so we can backport it quite easily.
With v1.11.0-rc3 released, I reran the benchmarks above. TL;DR: no overall performance improvements from v1.11.0-rc1 to v1.11.0-rc3 😢 Each result is the minimum time obtained from using …

**Overall**
As written above, the overall times have, unfortunately, not improved at all. However, the … At the moment it still looks like we should go with the implementations of @sgaure, but an overall performance regression of 16% is still very sad. @gbaraldi @oscardssmith Did you manage to fix the known LLVM-related (?) performance regression we talked about at JuliaCon during the subsequent hackathon? And if yes, should this already be reflected in the numbers for rc3 above?
I haven't yet, sorry. The LLVM thing is more involved than I had imagined and basically requires writing a full pass.

Thanks for keeping at it! I'll keep my fingers crossed 🤞

Is there a chance that rc4 might have fixed some of the performance regressions reported here?

@sloede We have tested it with RC4; it's still about 20% slower than v1.10.5. RC2, though, was almost 30% slower than v1.10.4.
We have recently started testing Julia 1.11 in our projects Trixi.jl and TrixiParticles.jl. We observed performance regressions of up to 20%. This has been confirmed on laptops (Apple M2, M3), a workstation (AMD Ryzen Threadripper 3990X), and the JUWELS HPC system at the Jülich Supercomputing Centre (AMD EPYC Rome 7402 CPU).
@sloede has put together instructions and a MWE here: https://github.com/trixi-framework/performance-regression-julia-v1.10-vs-v1.11
It boils down to adding Trixi.jl and running BenchmarkTools for an isolated function in Trixi.jl.
For this code, we get a median time of 6.079 ms with Julia v1.10.4 vs. 7.200 ms with v1.11.0-rc1 on our workstation.
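In outline, the isolated benchmark follows the usual BenchmarkTools pattern. The setup names below (`semi`, `u_ode`, `du_ode`) follow common Trixi conventions and are assumptions here; the exact elixir and function are in the linked repository:

```julia
using BenchmarkTools, Trixi

# ... run the elixir from the linked MWE repository first, which yields
# the semidiscretization `semi` and the state vector `u_ode` ...

du_ode = similar(u_ode)  # preallocated output, so the kernel itself is allocation-free
@benchmark Trixi.rhs!($du_ode, $u_ode, $semi, 0.0)
```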
@sloede @ranocha @vchuravy @efaulhaber @LasNikas