
Huge compile-time regression in beta/nightly #91128

Closed
bnjbvr opened this issue Nov 22, 2021 · 13 comments · Fixed by #92110
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-compiletime Issue: Problems and improvements with respect to compile times. P-critical Critical priority perf-regression Performance regression. regression-from-stable-to-beta Performance or correctness regression from stable to beta. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@bnjbvr
Contributor

bnjbvr commented Nov 22, 2021

We have a serious Rust compile-time regression on an internal (unfortunately closed-source) crate, affecting both the beta and nightly channels.

Before, compiling the crate took around 1 minute 30 seconds in automation. Now, on the nightly channel, it seems to take around an hour.

Running with RUSTFLAGS="-Z new-llvm-pass-manager=no" makes the regression disappear on the nightly channel, so the new pass manager seems to be at fault here.

I've tried compiling with a rustc build from #89830, but it didn't make the regression disappear, so it might not be the same inlining issue.

Thanks a bunch to @lqd, who suggested disabling the pass manager and trying the above patch. Pinging LLVM people: @nikic @Mark-Simulacrum. Of course I'm happy to run any other diagnostic and try any tool that could help figure out what's going on here.

@jyn514 jyn514 added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. perf-regression Performance regression. regression-from-stable-to-beta Performance or correctness regression from stable to beta. labels Nov 22, 2021
@rustbot rustbot added the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label Nov 22, 2021
@jyn514 jyn514 added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Nov 22, 2021
@apiraino
Contributor

apiraino commented Nov 22, 2021

Assigning priority as discussed in the Zulip thread of the Prioritization Working Group.

We assign P-critical to raise the flag for the team. Further investigation will probably need some way to reproduce.

@rustbot label -I-prioritize +P-critical

@rustbot rustbot added P-critical Critical priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Nov 22, 2021
@bnjbvr
Contributor Author

bnjbvr commented Nov 23, 2021

I've managed to reduce our use case to a small file that reproduces the issue; all the code is accessible at https://github.com/EmbarkStudios/rustc-compile-time-regress:

  • on stable, it takes around 26 seconds to compile on my machine with --release
  • on beta and nightly, it takes around 2 minutes to compile with --release

There are multiple variants of the Shim trait in our code base, in some cases with many more functions that use the same patterns as those shown in this example. I am almost certain that it's the compilation of all these traits and impls that somehow causes the very long compile times, hence the reduced test case.
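For illustration only (the actual reduced test case is in the repository linked above; all names here are hypothetical), the shape of the pattern is roughly a trait whose default methods build iterator adapter chains, multiplied across many impls:

```rust
// Hypothetical sketch of the pattern described above, not the actual
// reduced test case (which is in the linked repository). A trait whose
// default methods build chains of iterator adapters, instantiated for
// many types, hands LLVM a lot of monomorphized adapter code.
trait Shim {
    fn process(&self, input: &[u32]) -> Vec<u32> {
        input
            .iter()
            .map(|x| x.wrapping_add(1))
            .filter(|x| x % 2 == 0)
            .map(|x| x.wrapping_mul(3))
            .collect()
    }
}

// Many impls multiply the amount of monomorphized code.
struct A;
struct B;
impl Shim for A {}
impl Shim for B {}
```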

Note: I've also tried the build from #89830 again, and the compile time is still the same, so it's definitely not the same issue.

nagisa added a commit to nagisa/rust that referenced this issue Nov 24, 2021
@nikic
Contributor

nikic commented Nov 24, 2021

Top passes per -time-passes (columns: user time, system time, user+system, wall time, pass name):

  423.0093 ( 33.2%)   0.5118 ( 33.9%)  423.5211 ( 33.2%)  423.8182 ( 33.2%)  ModuleInlinerWrapperPass
  423.0046 ( 33.2%)   0.5112 ( 33.9%)  423.5159 ( 33.2%)  423.8129 ( 33.2%)  DevirtSCCRepeatedPass
  370.2776 ( 29.0%)   0.2598 ( 17.2%)  370.5374 ( 29.0%)  370.8072 ( 29.0%)  CorrelatedValuePropagationPass
  23.3253 (  1.8%)   0.0002 (  0.0%)  23.3255 (  1.8%)  23.3373 (  1.8%)  GVNPass
   6.0481 (  0.5%)   0.1486 (  9.9%)   6.1967 (  0.5%)   6.1992 (  0.5%)  InlinerPass
   6.1613 (  0.5%)   0.0029 (  0.2%)   6.1642 (  0.5%)   6.1672 (  0.5%)  InstCombinePass
   4.7685 (  0.4%)   0.0080 (  0.5%)   4.7765 (  0.4%)   4.7780 (  0.4%)  SROAPass
   3.0350 (  0.2%)   0.0041 (  0.3%)   3.0391 (  0.2%)   3.0403 (  0.2%)  ADCEPass

Most time is spent in CVP, though the real problem is that the NewPM produces a huge function: "_ZN102_$LT$core..iter..adapters..map..Map$LT$I$C$F$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$9size_hint17haa64424fbc292ec1E" ends up with 750k or so instructions.

Most likely it's another case of catastrophic inlining, though not the same as the other one (as the mentioned patch indeed doesn't fix this case).
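For context on why size_hint specifically blows up: iterator adapters like Map delegate size_hint to the inner iterator, so fully inlining a deep adapter stack flattens every layer into one function. A simplified sketch of that delegation pattern (MyMap is a stand-in, not the actual libcore source):

```rust
// Simplified sketch of the delegation pattern (not the actual libcore
// source): each adapter's size_hint just forwards to the inner iterator,
// so inlining a stack of N adapters flattens N layers into one function.
struct MyMap<I, F> {
    iter: I,
    f: F,
}

impl<B, I: Iterator, F: FnMut(I::Item) -> B> Iterator for MyMap<I, F> {
    type Item = B;

    fn next(&mut self) -> Option<B> {
        self.iter.next().map(&mut self.f)
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        // Pure delegation: the fully inlined body of a deep stack is the
        // concatenation of every layer's size_hint.
        self.iter.size_hint()
    }
}
```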

@kdy1
Contributor

kdy1 commented Nov 29, 2021

This also affects swc; there's an issue for it (swc-project/swc#2872).

With the workaround (new pass manager disabled):

===-------------------------------------------------------------------------===
  Total Execution Time: 14.6342 seconds (14.6144 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  ---Instr---  --- Name ---
   2.5938 ( 18.8%)   0.0213 (  2.5%)   2.6151 ( 17.9%)   2.6148 ( 17.9%)  17898317626  Function Integration/Inlining
   1.4954 ( 10.8%)   0.0053 (  0.6%)   1.5008 ( 10.3%)   1.5005 ( 10.3%)  17085737226  MemCpy Optimization
   0.8417 (  6.1%)   0.0065 (  0.8%)   0.8482 (  5.8%)   0.8477 (  5.8%)  6363534522  SROA #2
   0.6932 (  5.0%)   0.0054 (  0.6%)   0.6986 (  4.8%)   0.6982 (  4.8%)  4464950446  Combine redundant instructions #2
   0.5760 (  4.2%)   0.0057 (  0.7%)   0.5817 (  4.0%)   0.5811 (  4.0%)  4107118695  Global Value Numbering
   0.5173 (  3.7%)   0.0046 (  0.6%)   0.5219 (  3.6%)   0.5214 (  3.6%)  3496755636  Combine redundant instructions #4
   0.4593 (  3.3%)   0.0047 (  0.6%)   0.4640 (  3.2%)   0.4633 (  3.2%)  2996247269  Combine redundant instructions #3
   0.4480 (  3.2%)   0.0048 (  0.6%)   0.4528 (  3.1%)   0.4523 (  3.1%)  2818690217  Combine redundant instructions #5
   0.4087 (  3.0%)   0.0047 (  0.6%)   0.4134 (  2.8%)   0.4131 (  2.8%)  3581923261  Dead Store Elimination
   0.2232 (  1.6%)   0.0053 (  0.6%)   0.2284 (  1.6%)   0.2280 (  1.6%)  1827801629  Combine redundant instructions
   0.2236 (  1.6%)   0.0046 (  0.5%)   0.2281 (  1.6%)   0.2278 (  1.6%)  1707889189  Memory SSA
   0.1726 (  1.3%)   0.0530 (  6.3%)   0.2256 (  1.5%)   0.2256 (  1.5%)  864823098  Interprocedural Sparse Conditional Constant Propagation
   0.2089 (  1.5%)   0.0048 (  0.6%)   0.2137 (  1.5%)   0.2132 (  1.5%)  1552468131  Early CSE w/ MemorySSA
   0.0974 (  0.7%)   0.1058 ( 12.6%)   0.2033 (  1.4%)   0.2033 (  1.4%)  1119829288  Global Variable Optimizer
   0.1913 (  1.4%)   0.0047 (  0.6%)   0.1960 (  1.3%)   0.1956 (  1.3%)  1338609401  Jump Threading

Without the workaround:

  10.0674 ( 30.6%)   2.4567 ( 30.9%)  12.5241 ( 30.6%)  12.5284 ( 30.6%)  80168985859  ModuleInlinerWrapperPass
   9.9170 ( 30.1%)   2.3687 ( 29.7%)  12.2857 ( 30.0%)  12.2897 ( 30.1%)  78943942431  DevirtSCCRepeatedPass
   2.7391 (  8.3%)   0.5879 (  7.4%)   3.3269 (  8.1%)   3.3273 (  8.1%)  19748806913  InlinerPass
   2.0915 (  6.3%)   0.3909 (  4.9%)   2.4824 (  6.1%)   2.4811 (  6.1%)  15760601093  InstCombinePass
   0.6599 (  2.0%)   0.4267 (  5.4%)   1.0867 (  2.7%)   1.0869 (  2.7%)  6916625691  BlockFrequencyAnalysis
   0.8143 (  2.5%)   0.0207 (  0.3%)   0.8350 (  2.0%)   0.8351 (  2.0%)  8673702465  MemCpyOptPass
   0.5102 (  1.5%)   0.3107 (  3.9%)   0.8208 (  2.0%)   0.8205 (  2.0%)  5318224460  BranchProbabilityAnalysis
   0.6374 (  1.9%)   0.0592 (  0.7%)   0.6966 (  1.7%)   0.6956 (  1.7%)  5198860878  SROA
   0.5529 (  1.7%)   0.0574 (  0.7%)   0.6103 (  1.5%)   0.6099 (  1.5%)  4223120654  GVN
   0.4074 (  1.2%)   0.0916 (  1.1%)   0.4990 (  1.2%)   0.4980 (  1.2%)  3360041371  EarlyCSEPass
   0.1873 (  0.6%)   0.2731 (  3.4%)   0.4605 (  1.1%)   0.4604 (  1.1%)  3700728437  AAManager

kdy1 added a commit to kdy1/swc that referenced this issue Nov 29, 2021
@sanxiyn sanxiyn added the I-compiletime Issue: Problems and improvements with respect to compile times. label Nov 30, 2021
@pnkfelix
Member

pnkfelix commented Dec 2, 2021

As a workaround, we have disabled the NewPM by default in the 1.57 release.

Can someone confirm if this compile-time regression is indeed resolved on 1.57?

@bnjbvr
Contributor Author

bnjbvr commented Dec 2, 2021

FWIW, I tested with the pre-release and the problem had disappeared. I don't know if much changed between the pre-release and the release, though. Sorry, I forgot to confirm here when the pre-release happened.

(Hi Felix!)

@apiraino
Contributor

apiraino commented Dec 2, 2021

Discussed in the T-compiler meeting on Zulip; linking this to the progress of #91190.

@nikic
Contributor

nikic commented Dec 8, 2021

A slightly cleaned up IR reproducer (still large): https://gist.github.com/nikic/d66abc8901a21594d0798d169eb9d725

I've investigated this in a bit more detail. First, a note on the general design of the inliner, which is not very robust. LLVM uses a bottom-up inliner: inlining starts at leaf functions, and a local inlining decision is made for each call. After the leaves have been inlined, the function grows, and once it gets large enough it will no longer be inlined into its callees due to cost modelling. This is what generally prevents runaway inlining. However, it requires that the callee is already "maximally inlined": if we have X calls Y calls Z and decide not to inline Z into Y, but then after inlining Y into X decide that we now also want to inline Z into X, we can get into these exponential inlining situations. An obvious case where the callee is not maximally inlined is with SCCs (presumably the reason why the inliner attempts to inline already-inlined callees at all), where there is no "bottom" to start from. The inliner addresses this with a combination of inlining history and doing a round of inlining over all functions in an SCC before reprocessing newly inlined calls, so that function size increases in a balanced way across the SCC.
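As a hypothetical Rust-level illustration of the X/Y/Z situation described above (all names invented for the example):

```rust
// Hypothetical illustration of the X -> Y -> Z shape described above.
// Bottom-up, the inliner first decides z() is too large to inline into
// y(); y() therefore stays small and *is* inlined into x(). The call to
// z() then reappears at a fresh call site inside x(), where the local
// cost model may decide differently -- the premise that a callee is
// already "maximally inlined" no longer holds.
fn z(data: &mut [u64]) {
    // Stand-in for a body large enough to fail the inline threshold.
    for v in data.iter_mut() {
        *v = v.wrapping_mul(0x9E37_79B9_7F4A_7C15).rotate_left(31);
    }
}

fn y(data: &mut [u64]) {
    z(data); // not inlined here: z is too expensive at this call site
}

pub fn x(data: &mut [u64]) {
    y(data); // y is tiny, so it is inlined -- re-exposing the call to z
}
```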

In this particular case, we run into a different manifestation of this issue: Z is not inlined into Y because the call to Z is considered a cold callsite based on BFI information. But after inlining into X the callsite of Z becomes non-cold. This obviously doesn't make sense and is the result of precision loss. I'm not completely clear on why things degenerate as much as they do though.

Fixing the BFI issue (possibly through some rescaling during inlining) should address this particular issue, but it's rather concerning how fragile the inliner is.

@nikic
Contributor

nikic commented Dec 9, 2021

Okay, here's the part that I was missing before (the BFI issue compounds on this, but this is really the root problem): the inlining advisor has a concept of "deferred inlining", where it decides not to inline a call into the caller because it determines that it would be more profitable to inline the caller into its own callers instead. Specifically, this can happen if the caller is an internal function, which means there is a very large cost bonus if it can be removed entirely. So if we can show that, by not inlining into this caller, the caller itself will be inlined into its callers and then dropped entirely, we may defer.

Now, this explicitly goes against the bottom-up inlining approach and causes predictable issues: If you have a tree of calls for which inlining gets deferred, then on reaching the top-level function (where inlining is no longer deferred), we can end up inlining the whole tree, because each individual inlining decision will now look profitable.

For example, if we run https://gist.github.com/nikic/ec18c736e69d29614ff268e2977fc491 through opt -inline, we'll end up flattening the whole call tree and inlining some 2000 calls to @foo().
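A hypothetical Rust analogue of the call-tree shape in that gist (the gist itself is LLVM IR; names here are invented):

```rust
// Hypothetical Rust analogue of the call tree in the gist above (the
// gist itself is LLVM IR). Each private level calls the level below
// twice, so the tree reaches 2^N leaf calls; with deferred inlining,
// every level defers upward, and instead of being cut off bottom-up the
// whole tree is flattened at once into the public root.
fn foo(x: u64) -> u64 {
    x.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407)
}

fn level1(x: u64) -> u64 { foo(x) ^ foo(x.rotate_left(1)) }
fn level2(x: u64) -> u64 { level1(x) ^ level1(x.rotate_left(2)) }
fn level3(x: u64) -> u64 { level2(x) ^ level2(x.rotate_left(3)) }
// ...eleven such levels give roughly the 2000 leaf calls mentioned above.

pub fn root(x: u64) -> u64 { level3(x) }
```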

@pnkfelix
Member

pnkfelix commented Dec 9, 2021

@rustbot assign @nikic

@nikic
Contributor

nikic commented Dec 9, 2021

I ran an experiment to disable deferred inlining entirely (#91703) with results at https://perf.rust-lang.org/compare.html?start=3b263ceb5cb89b6d53b5a03b47ec447c3a7f7765&end=07378cd9e76010238f64ea03d1219774eb60510d. The result looks promising as far as rustc is concerned, in that compile-times drop, and there is not much impact on check/incr-unchanged builds (our proxy for run-time). The worst regression there is "token-stream-stress check" at 1.3% and otherwise it's mostly mildly positive.

Unfortunately it doesn't seem like this can be disabled without patching LLVM. At this point I'm pretty convinced that deferred inlining is broken on a fundamental level and should be removed entirely, and any regressions arising from that addressed in some other way. However, I suspect that upstream is going to disagree with that :)

For an extra fun test case, https://gist.github.com/nikic/1262b5f7d27278e1b34a190ae10947f5 is less than 100 lines of IR, and produces about 500,000 lines of IR when run through opt -inline :)

@nikic
Contributor

nikic commented Dec 14, 2021

Upstream review is https://reviews.llvm.org/D115497.
