rustc_codegen_ssa: tune codegen scheduling to reduce memory usage #81736
Conversation
For better throughput during parallel processing by LLVM, we used to sort CGUs largest to smallest. This would lead to better thread utilization by, for example, preventing a large CGU from being processed last and having only one LLVM thread working while the rest remained idle. However, this strategy would lead to high memory usage, as it meant the LLVM-IR for all of the largest CGUs would be resident in memory at once. Instead, we can compromise by ordering CGUs such that the largest and smallest are first, second largest and smallest are next, etc. If there are large size variations, this can reduce memory usage significantly.
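To make the ordering concrete, here's a minimal standalone sketch of the interleaving described above (not the actual rustc_codegen_ssa code; `Cgu` and `size_estimate` are just stand-ins):

```rust
/// Toy stand-in for a codegen unit with a size estimate.
struct Cgu {
    name: String,
    size_estimate: usize,
}

/// Order CGUs so the largest and smallest come first, the second largest
/// and second smallest next, and so on, instead of a plain
/// largest-to-smallest sort.
fn interleave_largest_smallest(mut cgus: Vec<Cgu>) -> Vec<Cgu> {
    // Start from a descending sort by estimated size.
    cgus.sort_by(|a, b| b.size_estimate.cmp(&a.size_estimate));

    let mut result = Vec::with_capacity(cgus.len());
    let mut iter = cgus.into_iter();
    // Alternately take from the front (largest remaining) and the back
    // (smallest remaining) of the sorted list.
    loop {
        match iter.next() {
            Some(big) => result.push(big),
            None => break,
        }
        match iter.next_back() {
            Some(small) => result.push(small),
            None => break,
        }
    }
    result
}

fn main() {
    let cgus: Vec<Cgu> = [9usize, 1, 7, 3, 5]
        .iter()
        .enumerate()
        .map(|(i, &size)| Cgu { name: format!("cgu{}", i), size_estimate: size })
        .collect();
    // Prints the size estimates in the order 9, 1, 7, 3, 5.
    for cgu in interleave_largest_smallest(cgus) {
        println!("{}: {}", cgu.name, cgu.size_estimate);
    }
}
```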
r? @davidtwco (rust-highfive has picked a reviewer for you, use r? to override)
@rustbot label T-compiler I-compilemem
@bors try @rust-timer queue
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf
⌛ Trying commit 29711d8 with merge 0d4c73fdf92ce3daf07991bde0444eb0b5d8ae9b...
☀️ Try build successful - checks-actions
Queued 0d4c73fdf92ce3daf07991bde0444eb0b5d8ae9b with parent e708cbd, future comparison URL.
Finished benchmarking try commit (0d4c73fdf92ce3daf07991bde0444eb0b5d8ae9b): comparison url. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below. Importantly, though, if the results of this run are non-neutral, do not roll this PR up -- it will mask other regressions or improvements in the roll up. @bors rollup=never
The effect of this change depends entirely on the CGU size distribution, so not seeing an improvement across the board is expected. Also, ignore the keccak-debug and keccak-opt RSS stats. They vary +/- 15% from run to run.

Results are more consistent for non-incremental full builds. This change benefits crates with large variations in CGU sizes, and my guess is that full builds produce larger variations. It seems reasonable that they would, because they do more CGU merging than incremental builds. Supposing all CGUs are roughly the same size to start, then unless the number of CGUs is just right, merging causes us to have one set of CGUs of size N and another of size 2N. But even if they don't start out the same size, the merging process can get us to a point where the CGUs do end up being roughly the same size, and we end up in the same place from there.

On my system, this change reduces peak memory usage while compiling rustc_middle by a whopping 500MB, for both incremental and non-incremental. That crate has some very outsized CGUs, so the change is particularly helpful there.
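To make the merging argument above concrete with made-up numbers: suppose a full build starts with 20 CGUs of roughly size N and merges down to 16, combining the two smallest CGUs at each step. The four merges produce 4 CGUs of size 2N and leave 12 of size N -- exactly the kind of size variation this ordering takes advantage of.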
@bors r+ Pretty nice improvements! I don't see any egregious comptime regressions; in particular, it's reassuring that the bootstrap wall time is -0.1% overall.
📌 Commit 29711d8 has been approved by nagisa
🌲 The tree is currently closed for pull requests below priority 1000. This pull request will be tested once the tree is reopened.
Thanks @nagisa! One thing I didn't address with this change is the fact that as we add compiled CGUs to the optimization queue, LLVM threads pick the larger CGUs to work on first. I suspect this doesn't affect memory usage much, because even if we processed them in codegen order, the LLVM threads would finish with the small CGUs quickly, and we'd end up with the largest CGUs being processed concurrently anyway. But it's something to experiment with.

Edit: I think there's more room for improvement than I initially thought. After all, this change doesn't directly address the issue mentioned above (chewing through small CGUs quickly so that we still end up with the largest CGUs in memory). It can help by delaying the introduction of large CGUs for a bit, which sometimes means that a large CGU being processed gets finished with and dropped before another large CGU is codegen'd. But I'm sure we can do better than that in many cases.
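As an illustration of that "larger CGUs get picked first" behavior (a toy sketch, not the actual queue in rustc_codegen_ssa), a max-heap keyed on the size estimate gives exactly this policy:

```rust
use std::collections::BinaryHeap;

/// Toy work item: a codegen'd CGU waiting for LLVM optimization.
/// The derived Ord compares `size_estimate` first, so BinaryHeap::pop
/// always returns the largest pending CGU.
#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct PendingCgu {
    size_estimate: usize,
    name: String,
}

fn main() {
    let mut queue = BinaryHeap::new();
    for (name, size) in [("a", 10usize), ("b", 200), ("c", 50)] {
        queue.push(PendingCgu { size_estimate: size, name: name.to_string() });
    }
    // A free LLVM thread takes "b" (size 200) first, then "c", then "a".
    while let Some(item) = queue.pop() {
        println!("optimize {} (size {})", item.name, item.size_estimate);
    }
}
```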
This is fascinating. I did some work years ago on optimal scheduling, and bin packing is an easy and often-used solution. I'm intrigued by the premise of this PR. Consequently, I wrote a little program to explore various scheduling schemes to see how well they might work while also evaluating memory pressure. But to match my model to this work, I need to know some data.
@SunnyWar, thanks for the bin-packing reference. I assumed there had to be some well-studied formalization of the problem or related problems. My hope was that the simple approach in this PR would improve the situation until I or someone could put the time into a better approach.
The number of CPUs or hyperthreads on the system. No consideration of the amount of system memory is made, so as you can imagine, this has potential to cause problems on high CPU count systems without a lot of memory.
We have two size estimates that we base scheduling decisions on. The first is the number of statements in the MIR of the CGU. The second is the time it takes to codegen the MIR into the initial, unoptimized LLVM-IR. We have the first estimate for all CGUs before we start scheduling. The second estimate is only available as we go along codegening CGUs to LLVM-IR. I don't know what's typical. It can vary wildly from crate to crate and depend on compilation mode.
Depends on the mode of compilation: typically 16 for non-incremental builds and up to 256 for incremental builds. By default, that means 16 for release builds and up to 256 for debug builds. Also, things are slightly more complex than this: there are different kinds of work items that can be in the queue, there are different phases in the process, etc. I'm just getting familiar with the area, TBH.
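To illustrate the first size estimate mentioned above (the MIR statement count), here's a minimal sketch; the types are stand-ins, not rustc's:

```rust
/// Stand-in for a monomorphized item's MIR body.
struct MirBody {
    statement_count: usize,
}

/// Stand-in for a codegen unit: a group of MIR bodies.
struct CodegenUnit {
    bodies: Vec<MirBody>,
}

/// Cheap, pre-codegen size estimate: total MIR statements in the CGU.
/// This is known before any LLVM-IR exists, so it can drive the initial
/// codegen ordering.
fn pre_codegen_size_estimate(cgu: &CodegenUnit) -> usize {
    cgu.bodies.iter().map(|body| body.statement_count).sum()
}

fn main() {
    let cgu = CodegenUnit {
        bodies: vec![
            MirBody { statement_count: 120 },
            MirBody { statement_count: 35 },
        ],
    };
    assert_eq!(pre_codegen_size_estimate(&cgu), 155);
}
```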
☀️ Test successful - checks-actions
@tgnottingham I did a quick check assuming only 8 threads with 256 tasks. The "size" of the task was evenly distributed 1-256. The model also assumes the runtime of a task is directly proportional to its size.
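For anyone who wants to experiment with this kind of model, here's a rough, self-contained sketch along the same lines (assumptions, which may differ from the program described above: a fixed number of workers, each task goes to the earliest-free worker, runtime proportional to size, and "memory" counted as the total size of in-flight tasks):

```rust
/// Greedy list scheduling: hand each task, in the given order, to the worker
/// that frees up earliest. Task runtime is proportional to its size, and
/// "memory" is the total size of tasks in flight. Returns (makespan, peak memory).
fn simulate(order: &[u64], threads: usize) -> (u64, u64) {
    // Each worker: (time it becomes free, size of the task it is running).
    let mut workers: Vec<(u64, u64)> = vec![(0, 0); threads];
    let mut peak_mem = 0u64;
    for &size in order {
        // Pick the worker that becomes free earliest; the task starts then.
        let (idx, &(start, _)) = workers
            .iter()
            .enumerate()
            .min_by_key(|(_, (free_at, _))| *free_at)
            .unwrap();
        workers[idx] = (start + size, size);
        // Total size of tasks still running at the moment this one starts.
        let in_flight: u64 = workers
            .iter()
            .filter(|(free_at, _)| *free_at > start)
            .map(|(_, s)| s)
            .sum();
        peak_mem = peak_mem.max(in_flight);
    }
    let makespan = workers.iter().map(|(free_at, _)| *free_at).max().unwrap();
    (makespan, peak_mem)
}

fn main() {
    // 256 tasks of sizes 1..=256, largest first (the old ordering).
    let largest_first: Vec<u64> = (1..=256).rev().collect();
    let (time, mem) = simulate(&largest_first, 8);
    println!("largest first: makespan {}, peak in-flight size {}", time, mem);

    // Largest/smallest interleaving (the ordering this PR switches to).
    let mut remaining = largest_first.clone(); // already sorted descending
    let mut interleaved = Vec::with_capacity(remaining.len());
    while !remaining.is_empty() {
        interleaved.push(remaining.remove(0)); // largest remaining
        if let Some(smallest) = remaining.pop() {
            interleaved.push(smallest); // smallest remaining
        }
    }
    let (time, mem) = simulate(&interleaved, 8);
    println!("interleaved:   makespan {}, peak in-flight size {}", time, mem);
}
```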
@SunnyWar, awesome! If you'd like to make your simulation more accurate, feel free to ask more questions, or you can go straight to the source.

One thing I neglected to mention is that we actually have more codegen'd CGUs in memory than there are LLVM threads running, so that when an LLVM thread finishes optimizing one CGU, there's already another codegen'd CGU ready for it to work on in the queue. Long story short, we basically ramp up to keeping a few more codegen'd CGUs in memory than there are LLVM threads.

By the way, I have a change up for review (#81538) that will show the CGU cost estimates. If that lands, it will be easy for you to see real cost estimate distributions by compiling crates with the nightly compiler.
By the way, here's a contrived example that shows how profitable work in this area could be. Suppose we have 2 CPUs, 2 jobs of size 8N, 16 jobs of size N, and we ignore a ton of details. We used to schedule the work something like this:
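Roughly, something like this (not to scale; each bracket is one job, time runs left to right):

```
CPU 1: [      8N      ][N][N][N][N][N][N][N][N]
CPU 2: [      8N      ][N][N][N][N][N][N][N][N]
```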
This utilizes CPUs ideally, and so minimizes runtime (supposing the high memory usage doesn't have detrimental effects). But it maximizes memory usage by working on the largest CGUs concurrently. We'll call the peak memory cost for this schedule 16N. We could have utilized CPUs ideally, while minimizing memory usage, with this schedule:
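Again roughly:

```
CPU 1: [      8N      ][      8N      ]
CPU 2: [N][N][N][N][N][N][N][N][N][N][N][N][N][N][N][N]
```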
The peak memory cost here is 9N, a 44% reduction in memory usage versus the first approach. Of course, we can't always minimize memory usage and maximize throughput. Any solution to this scheduling problem will need to make good tradeoffs. For example, if we can reduce memory usage by 30% at a 1% cost to runtime, it's probably the right tradeoff. This is especially true when it enables us to parallelize more (both within rustc and across multiple rustcs spawned by cargo) without risking swapping or thrashing, because then we get a runtime improvement too.