LclVars and throughput #8715
cc @AndyAyersMS @CarolEidt @dotnet/jit-contrib
Thanks for writing this up! Overall, I agree with your thoughts and direction.
I actually recently experimented with reusing the array and index LclVars created during the expansion of GT_INDEX.

FWIW, the more I work with the LclVar code, the less convinced I am that it is going to be worth the trouble to handle block-local LclVars as "tracked" LclVars that do not receive a name in the usual namespace. Though it's probably worth doing some experimentation to be sure, I think that we may be better off exploring ways to simply increase the tracked LclVar limit.
Yes. With the ref-count phase experiment referenced above, the only parts of the JIT that need to maintain ref counts are those that run between the ref count phase and liveness and those that run after the final liveness pass. This amounts to:
I think I asked Brian this before, and forgot what he said, but it's not clear to me what the obligations are in the current jit when a tree is deemed unnecessary and is disposed. Certainly there are places that carefully go and decrement ref counts, but I'm sure this step gets missed in places and sometimes it is only conditionally needed.

In past compilers I've worked on there was a checkable discipline about disposing IR, basically by requiring some kind of dispose action that maintained invariants, kept track of the number of nodes disposed and modified the disposed IR in ways that made asserts or faults likely if it was somehow still accessed. Then at phase boundaries the "live IR" could be walked and some basic math ((live-at-start + new - disposed) == currently-live?) would tell you if some IR had not been cleaned up properly or leaked. (This was extendable in various interesting ways, e.g. the number of volatile ops in the live IR should generally be an invariant, with a few exceptions, so there can be special dispose/create disciplines around things like that.)

If disposing is conceptually a no-op then this at least serves as a check that IR is not leaking; if disposing carries some obligations then this is a way to ensure that those are being properly addressed and that incrementally maintained information is likely more accurate.

In our case it appears ref counts are sparsely read, so on demand makes more sense. If ref count maintenance is the only obligation when disposing a tree, then removing that and having no obligations on dispose would be a nice improvement. Whether we want to go to the trouble of having a placeholder dispose operation that operates in checked modes to give us leak detection and/or dead tree marking or other checking is then a separable question.
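To make that discipline concrete, here is a minimal sketch, with entirely hypothetical names and structures (not RyuJIT code), of a checked dispose operation plus the phase-boundary math:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical IR node; in checked builds, disposal poisons it so that any
// later access is likely to assert or fault.
struct IRNode {
    int  oper;       // -1 is used as a poison value below
    bool disposed = false;
};

class IRAllocator {
    std::size_t created_  = 0;
    std::size_t disposed_ = 0;

public:
    IRNode* create(int oper) {
        ++created_;
        return new IRNode{oper};
    }

    // Conceptually a no-op, but it maintains the disposed count and marks
    // the node so that stray uses can be detected.
    void dispose(IRNode* node) {
        assert(!node->disposed && "double dispose");
        node->disposed = true;
        node->oper     = -1;
        ++disposed_;
    }

    // At a phase boundary, walk the live IR, count the reachable nodes, and
    // check the basic math. (In practice the counters would be snapshotted
    // and reset at each boundary.)
    void checkPhaseBoundary(std::size_t liveAtStart, std::size_t currentlyLive) const {
        assert(liveAtStart + created_ - disposed_ == currentlyLive &&
               "IR leaked or was not disposed properly");
    }
};
```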
Yes, that's why this is an outstanding issue (and not yet already done). Once I've eliminated the separate […]. That still, however, leaves the remaining issue of splitting intervals for lclVars that are live across blocks, which will still have value.
Aside from ref count maintenance, we may want to correct side-effect flags for ancestor nodes when removing an HIR tree, but that is the only other obligation of which I am aware. Currently the ref-count maintenance and side-effect propagation are handled in separate walks if either is performed (which is not always the case). LIR is only obligated to perform ref-count maintenance, and attempts to encapsulate this in […].
Another data point to consider here is that before trying the ref count phase approach, I first wrote a ref count checker. I quickly abandoned this approach after it became clear that, due to the open-coded nature of the IR modifications performed by the JIT, it was going to be a rather arduous task to locate all of the various sites that need to dispose trees. IMO an approach that allows us to have no obligations upon disposal is likely to be best for both throughput (no need for extra IR walks) and maintainability (no question about what to do when throwing away a tree).
Do you also plan on building intervals only for tracked LclVars?
Yes, in fact what I was doing when I experimented with this previously is that for tree temps no intervals are built at all. For the not-live-across-blocks lclVars, one would presumably reset their preferencing info when starting a new block.
The LclVar counts are there mainly for the legacy JIT backend. In the legacy JIT they are used to sort the LclVars and decide which locals are to be tracked and which ones are untracked; the coloring register allocator also allocates the LclVars with the highest weight first. There is an assert that fires if you ever decrement a count below zero, as that obviously indicates that some kind of error in ref counting occurred. But there is nothing that prevents over-counting, and in fact we use over-counting deliberately to boost the internal short-lifetime temps so that they are more likely to be tracked and enregistered. Since we have these ref counts, I have suggested that they be used by LSRA when a heuristic needs to make a choice between two or more LclVars.
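As a rough illustration of that bookkeeping, here is a minimal sketch (hypothetical names, not the actual JIT code) of the assert-on-underflow and the deliberate over-counting boost:

```cpp
#include <cassert>

// Hypothetical, simplified local descriptor.
struct LclVarDsc {
    unsigned lvRefCnt    = 0;  // raw appearance count
    unsigned lvRefCntWtd = 0;  // appearance count weighted by block frequency
};

void incRefCnt(LclVarDsc& dsc, unsigned blockWeight) {
    dsc.lvRefCnt    += 1;
    dsc.lvRefCntWtd += blockWeight;
}

void decRefCnt(LclVarDsc& dsc, unsigned blockWeight) {
    // Decrementing below zero obviously indicates a bookkeeping error.
    assert(dsc.lvRefCnt >= 1 && dsc.lvRefCntWtd >= blockWeight);
    dsc.lvRefCnt    -= 1;
    dsc.lvRefCntWtd -= blockWeight;
}

// Nothing prevents over-counting, and it can be used deliberately: extra,
// artificial appearances make a short-lifetime temp sort higher, so it is
// more likely to be tracked and enregistered.
void boostInternalTemp(LclVarDsc& dsc, unsigned blockWeight) {
    incRefCnt(dsc, blockWeight);
    incRefCnt(dsc, blockWeight);
}
```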
Wouldn't it be possible to reduce the number of lclvars used by GT_INDEX's expansion by making GT_ARR_BOUNDS_CHECK return the index it checks? That way the original index would be used only once and no new lclvar would be needed for it. This would also slightly simplify the IR by avoiding the need for a GT_COMMA node. Granted, it's not a trivial change...
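To visualize the suggestion, here is a toy sketch of the two shapes. The node names echo RyuJIT operators, but the trees are simplified illustrations (the real GT_INDEX expansion also deals with temps for the array object, exception behavior, etc.):

```cpp
#include <cstdio>

// Tiny toy tree, just to print the two shapes; "op" holds a GenTree-style
// operator name. This is not the real GenTree API.
struct Node {
    const char* op;
    Node*       op1;
    Node*       op2;
};

Node* mk(const char* op, Node* a = nullptr, Node* b = nullptr) {
    return new Node{op, a, b};
}

void dump(const Node* n, int indent = 0) {
    if (n == nullptr) return;
    std::printf("%*s%s\n", indent, "", n->op);
    dump(n->op1, indent + 2);
    dump(n->op2, indent + 2);
}

int main() {
    // Today (roughly): the index is stored to a fresh LclVar so that both the
    // bounds check and the address computation can use it, and a GT_COMMA
    // sequences the check before the load.
    Node* today = mk("COMMA",
        mk("ASG", mk("LCL_VAR tmp"), mk("LCL_VAR i")),
        mk("COMMA",
            mk("ARR_BOUNDS_CHECK", mk("LCL_VAR tmp"), mk("ARR_LENGTH", mk("LCL_VAR a"))),
            mk("IND", mk("ADD", mk("LCL_VAR a"),
                                mk("MUL", mk("LCL_VAR tmp"), mk("CNS elemSize"))))));

    // Proposed: a value-producing bounds check returns the index it checked,
    // so the index appears exactly once, and neither the temp nor the COMMA
    // is needed.
    Node* proposed = mk("IND",
        mk("ADD", mk("LCL_VAR a"),
                  mk("MUL", mk("ARR_BOUNDS_CHECK", mk("LCL_VAR i"),
                                                   mk("ARR_LENGTH", mk("LCL_VAR a"))),
                            mk("CNS elemSize"))));

    dump(today);
    std::printf("----\n");
    dump(proposed);
    return 0;
}
```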
Consider for 2.1. |
Data from #7281, which has partially corrected ref counts, indicates that there's also a code size win here, mainly from reducing the local offsets. It should also give us a somewhat saner basis for using counts to help direct register allocation, since the counts today are somewhat inflated.
Would love to get to this but don't see it wrapping up within the 2.1 timeframe as it is a fairly disruptive change (lots of codegen impact). So pushing it back to Future. |
I dusted off Pat's changes and squashed / rebased to a recent master commit. Updated bits here: master...AndyAyersMS:LclVarRefCountPhase2 Doesn't quite work out of the box, as we try decrementing some ref count below zero; likely some new count maintenance has popped into the code. Locally I've commented out that assert, but logically we should now never need to decrement ref counts. Looking at the changes, I think perhaps more is needed. What we end up with (in optimized builds) has some odd characteristics, as the ref counts and weighted counts are still looked at before they're "correctly" computed, in a number of places (and arguably these early counts are now even less correct than before):
So if the goal is to not create, consume, or maintain counts early on in the jit, we have to look at all these cases and find alternatives. The post-computation count logic that looks at counts and weights is also sensitive, and we see unexpected diffs. Typically the old counts and weights were somewhat inflated and the new counts and weights are lower. For example, the RA is sensitive to the weights it sees for vars that represent register parameters. In the old code these vars appear to be specially handled by incrementing counts/weights in key places (or perhaps there are implicit appearances of these vars at entry and at some calls that the current ref count recomputation doesn't account for...). So now they often have less weight and get spilled immediately on entry when it appears they did not need or deserve spilling. X64 jit diffs on a checked corelib show:
So there is a fairly long tail of regressions to sort through. I haven't yet tried to verify the TP numbers or look at correctness or minopts impact. Will post more when I have it.
I'm filing this issue to capture some of the various issues that have popped up recently as I've been experimenting with LclVars and their relationship to JIT throughput. This issue is going to assume basic familiarity with the RyuJIT IR and its terminology as described in the overview. Apologies if I ramble a bit :)
Basics
The fundamental characteristics that distinguish a LclVar from an SDSU temp are that it may be multiply-defined, multiply-used, and live-in or live-out of a block. The liveness of a LclVar is particularly important: the JIT uses the liveness information for blocks and LclVars in order to perform SSA numbering, register allocation, and precise GC reporting, as well as a handful of miscellaneous smaller optimizations. Accurate liveness information comes at a cost, however: in order to calculate the live-in and live-out sets for each block, the JIT must run live variable analysis on the function being compiled.
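For reference, the analysis in question is the classic backward dataflow over per-block use/def sets. A minimal sketch over simplified blocks and fixed-width bitsets (not the JIT's actual types) might look like this:

```cpp
#include <bitset>
#include <vector>

constexpr size_t MaxTracked = 512;  // stands in for the tracked-LclVar limit
using VarSet = std::bitset<MaxTracked>;

struct Block {
    VarSet use, def;          // upward-exposed uses and definitions
    VarSet liveIn, liveOut;
    std::vector<Block*> succs;
};

// Iterate to a fixed point:
//   liveOut(B) = union of liveIn(S) for each successor S
//   liveIn(B)  = use(B) | (liveOut(B) & ~def(B))
void computeLiveness(std::vector<Block*>& blocks) {
    bool changed = true;
    while (changed) {
        changed = false;
        // Walking blocks in reverse order converges faster for typical CFGs.
        for (auto it = blocks.rbegin(); it != blocks.rend(); ++it) {
            Block* b = *it;
            VarSet out;
            for (Block* s : b->succs) {
                out |= s->liveIn;
            }
            VarSet in = b->use | (out & ~b->def);
            if (in != b->liveIn || out != b->liveOut) {
                b->liveIn  = in;
                b->liveOut = out;
                changed = true;
            }
        }
    }
}
```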
Tracked and Untracked LclVars
In order for the JIT to calculate liveness information for a LclVar, it must be able to correctly determine the points at which it is defined and used. There are a number of limitations that prevent the JIT from doing so for certain LclVars, the most common of which is that the LclVar is address-exposed (with better alias analysis, this limitation would be surmountable). I will refer to the set of LclVars for which the JIT is capable of determining def and use information as "trackable LclVars".
The most direct costs of a tracked LclVar are time spent in the liveness calculation and space occupied in the dense bitsets used to represent the set of live tracked LclVars at various points (most importantly at block entries and exits). In order to manage both these direct costs and the indirect costs of the phases that consume liveness information, the JIT limits the number of trackable LclVars that are actually tracked. This limit is currently set at 512, which was found to be the highest value that did not adversely affect throughput. When the JIT is not optimizing, it goes even further: rather than calculating accurate liveness information for tracked LclVars, it simply assumes that they are live at all points.
The JIT decides which of the trackable LclVars to track by sorting the LclVar table roughly by weighted appearance count (see the comparison functions for details) and then marking any LclVars after the 512th as untracked. The sorted LclVar table is also used to guide the dataflow CSE pass's register pressure heuristics, as well as to determine the order in which parameters are considered for enregistration by the register allocator.
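A minimal sketch of that selection scheme, with hypothetical names standing in for the JIT's actual sort and its 512-LclVar limit:

```cpp
#include <algorithm>
#include <vector>

constexpr unsigned lclMAX_TRACKED = 512;  // stands in for the 512 limit

struct LclVarDsc {
    unsigned lvRefCntWtd;  // weighted appearance count
    bool     lvTrackable;  // e.g. false when address-exposed
    bool     lvTracked;
    unsigned lvVarIndex;   // dense index into liveness bitsets when tracked
};

void assignTrackedIndices(std::vector<LclVarDsc*>& sortedTable) {
    // Sort roughly by weighted appearance count, heaviest first. The real
    // comparison functions also break ties using other criteria.
    std::sort(sortedTable.begin(), sortedTable.end(),
              [](const LclVarDsc* a, const LclVarDsc* b) {
                  return a->lvRefCntWtd > b->lvRefCntWtd;
              });

    // Everything past the limit (and everything untrackable) is untracked.
    unsigned trackedCount = 0;
    for (LclVarDsc* dsc : sortedTable) {
        dsc->lvTracked = dsc->lvTrackable && (trackedCount < lclMAX_TRACKED);
        if (dsc->lvTracked) {
            dsc->lvVarIndex = trackedCount++;
        }
    }
}
```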
General LclVar Costs
The description above briefly mentions two of the primary throughput costs of a LclVar, whether tracked or untracked:

- Its raw and weighted ref counts (LclVarDsc::lvRefCnt and LclVarDsc::lvRefCntWtd, respectively) must be calculated and maintained
- The LclVar table must be sorted in order to select the set of tracked LclVars

The former cost is paid first by lvaMarkLocalVars. This phase walks the function and bumps a LclVar's ref counts at each appearance; a sketch of such a walk follows below. Once this phase has run, each successive phase must in principle ensure that it maintains accurate LclVar ref counts. In the general case, any time a tree is removed from or added to the function, it must be walked and LclVar reference counts adjusted accordingly. In practice, it is not always clear when this is necessary, so ref counts are often rather inaccurate. These inaccuracies can adversely affect the quality of the generated code, especially in cases where there are more than 512 LclVars, as LclVars that are important to track may be pushed out of the tracked set. Furthermore, as ref counts are used in the register allocator's spill heuristic, inaccurate counts may cause less-than-ideal allocation.
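Here is a minimal sketch of such a counting walk over simplified, hypothetical structures; if counts were only ever produced by this walk (rerun on demand), later phases would have no maintenance obligation:

```cpp
#include <utility>
#include <vector>

// Hypothetical, simplified IR: each node optionally references a local.
struct Node {
    bool     isLclVar;
    unsigned lclNum;  // valid when isLclVar is true
    Node*    op1;
    Node*    op2;
};

struct LclVarDsc {
    unsigned lvRefCnt    = 0;  // raw appearance count
    unsigned lvRefCntWtd = 0;  // appearance count weighted by block frequency
};

static void bumpCounts(std::vector<LclVarDsc>& lvaTable, const Node* n, unsigned weight) {
    if (n == nullptr) return;
    if (n->isLclVar) {
        lvaTable[n->lclNum].lvRefCnt    += 1;
        lvaTable[n->lclNum].lvRefCntWtd += weight;
    }
    bumpCounts(lvaTable, n->op1, weight);
    bumpCounts(lvaTable, n->op2, weight);
}

// Zero all counts, then walk every tree once, bumping counts at each
// appearance; each tree carries the weight of its containing block.
void markLocalVars(std::vector<LclVarDsc>& lvaTable,
                   const std::vector<std::pair<const Node*, unsigned>>& treesWithBlockWeight) {
    for (auto& dsc : lvaTable) {
        dsc.lvRefCnt    = 0;
        dsc.lvRefCntWtd = 0;
    }
    for (const auto& entry : treesWithBlockWeight) {
        bumpCounts(lvaTable, entry.first, entry.second);
    }
}
```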
The latter cost is paid first by lvaMarkLocalVars and then again any time the JIT deems it important to recalculate the set of tracked LclVars.

Decreasing the Throughput Cost of LclVars
Given the above, here are some ideas that seem worth exploring.
category:throughput
theme:jit-coding-style
skill-level:expert
cost:large