-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable QJFL and OSR by default for x64 #61934
Conversation
Change these default values when the jit targets x64: * COMPlus_TC_QuickJitForLoops=1 * COMPlus_TC_OnStackReplacement=1 The upshot is that on x64 more methods will be jitted at Tier0, and we will rely on OSR to get out of long-running Tier0 methods. Other architectures continue to use the old behavior for now, as OSR is not yet supported outside of x64.
Tagging subscribers to this area: @JulieLeeMSFT Issue DetailsChange these default values when the jit targets x64:
The upshot is that on x64 more methods will be jitted at Tier0, and Other architectures continue to use the old behavior for now, as
|
cc @dotnet/jit-contrib Will be running various stress legs to try and spot areas where we still need work. |
/azp run runtime-coreclr libraries-pgo |
/azp run runtime-coreclr pgo |
Azure Pipelines successfully started running 1 pipeline(s). |
1 similar comment
Azure Pipelines successfully started running 1 pipeline(s). |
The pgo legs have expected failures, so will have to interpret results manually. Had planned to run jitstress too but that seems to be in bad shape right now. |
Runtime PGO failures are all "expected". But libraries PGO has some failures that will need investigating. A number of parallel tests fail across all the pgo modes.
|
Simple repro using System;
using System.Threading;
using System.Threading.Tasks;
class X
{
public static void RunSimpleParallelForeachAddTest_Array(int count, out int o_counter, out int o_expectCounter)
{
var data = new int[count];
int expectCounter = 0;
for (int i = 0; i < data.Length; i++)
{
data[i] = i;
expectCounter = unchecked(expectCounter + i);
}
int counter = 0;
// run inside of a separate task mgr to isolate impacts to other tests.
Task t = Task.Run(
delegate
{
Parallel.ForEach(data, (x) => Interlocked.Add(ref counter, x));
});
t.Wait();
o_counter = counter;
o_expectCounter = expectCounter;
}
public static void Main()
{
int counter;
int expectCounter;
RunSimpleParallelForeachAddTest_Array(100, out counter, out expectCounter);
Console.WriteLine($"got {counter} expected {expectCounter}");
}
} This is the same as the libraries test but with smaller data size. Run with a variant of OSR stress to force OSR to happen even at lower iteration counts:
The results are wrong and vary from run to run
Two methods get rejitted with OSR:
I've looked at the first one in some detail and it seems ok. So likely the issue is in the second one. |
The With that fixed, I'm running into another issue with OSR in the wider set of parallel tests) where the scratch BB is ending up as BBJ_COND. |
Found when trying to enable OSR by default. * Explicitly initalize the OSR step variable. * Prevent `fgOptimizeUncondBranchToSimpleCond` from changing the scratch entry BB to have conditional flow.
/azp run runtime-coreclr libraries-pgo |
Azure Pipelines successfully started running 1 pipeline(s). |
We recently had a conversation (on GitHub, I believe) where you pointed out that the scratch entry BB could be BBJ_COND for OSR. Are you saying now that doing so creates a bad condition and so the scratch BB should NOT ever be BBJ_COND? |
You mean this? I wasn't specific. The other option we support now (for OSR) is |
There's at least one more issue to fix -- now that OSR + PGO does instrumentation + optimization (see #61453), we may end up putting an instrumentation probe into the detached |
Repro for the case noted above. ;; with this PR
using System;
using System.Runtime.CompilerServices;
class X
{
static int s;
static int N;
public static void F(int[] a)
{
for (int j = 0; j < N; j++)
{
for (int i = 0; i < a.Length; i++)
{
s -= a[i];
}
}
}
public static void T(bool p, int[] a)
{
if (p)
{
for (int j = 0; j < N; j++)
{
for (int i = 0; i < a.Length; i++)
{
s += a[i];
}
}
F(a);
}
}
[MethodImpl(MethodImplOptions.NoInlining)]
public static int Main()
{
int[] a = new int[1000];
N = 100;
s = 100;
a[3] = 33;
a[997] = 67;
T(true, a);
return s;
}
} results in
|
Not clear yet how to fix the case above -- trying to detect and handle this during instrumentation seems iffy. And the return block has multiple predecessors, so "moving" the probe is not quite the right fix. |
When both OSR and PGO are enabled, the jit will add PGO probes to OSR methods. And if the OSR method also has a tail call, the jit must take care to not add block probes to any return block reachable from possible tail call blocks. Instead, instrumentation should create copies of the return block probe in each return block predecessor (possibly splitting critical edges to make this viable). Because all this happens early on, there are no pred lists. The analysis leverages cheap preds instead, which means it needs to handle cases where a given pred has multiple pred list entries. And it must also be aware that the OSR method's actual flowgraph is a subgraph of the full initial graph. This came up while scouting what it would take to enable OSR by default. See dotnet#61934.
When both OSR and PGO are enabled, the jit will add PGO probes to OSR methods. And if the OSR method also has a tail call, the jit must take care to not add block probes to any return block reachable from possible tail call blocks. Instead, instrumentation should create copies of the return block probe in each return block predecessor (possibly splitting critical edges to make this viable). Because all this happens early on, there are no pred lists. The analysis leverages cheap preds instead, which means it needs to handle cases where a given pred has multiple pred list entries. And it must also be aware that the OSR method's actual flowgraph is a subgraph of the full initial graph. This came up while scouting what it would take to enable OSR by default. See #61934.
Merged to pick up #62263 and fixes for jit stress. Will kick off another round of stress testing. |
Two Pri0 tests failing with known issue #62285. |
/azp run runtime-coreclr libraries-pgo |
Azure Pipelines successfully started running 1 pipeline(s). |
Time to trigger jit stress. This also has some known failures so I expect it to fail and will have to dig through and see if there's anything novel. |
/azp run runtime-coreclr jitstress |
Azure Pipelines successfully started running 1 pipeline(s). |
Jitstress failures are the "Known" issues with create span. Libraries PGO looks good. |
/azp run runtime-coreclr pgo |
Azure Pipelines successfully started running 1 pipeline(s). |
Going to run GC stress but it has a lot of failures from #62067. So also expect it to fail and will have to sort through the results. |
/azp run runtime-coreclr gcstress0x3-gcstress0xc |
Azure Pipelines successfully started running 1 pipeline(s). |
Merge to pick up latest round of fixes. Net change is now just updating two config switches. |
Runtime failures are #62285. |
Things are looking good from correctness standpoint. Still need to look at perf and startup. |
For perf, first looking at benchmark games on windows as run in the perf repo (via BDN).
and
and
Generally speaking, for benchmark perf we'd expect to see very similar numbers. Benchmarking (via BDN) should be measuring Tier1 code, so the fact that we may also generate OSR methods shouldn't matter. So will drill into these. |
For So, two questions:
For (1) -- running with a checked jit so that we're collecting block counts and forcing everything through tier0, we see For future reference here's exactly how to do this.
That saves the PGO data to a file in text format -- and in the file we find for `Bench':
So the method was called 19 times, and the innermost loop executed (on average) 1660144/19 = 87,376 iterations per call. For (2) -- the OSR method body code is very similar to that for the optimized version. Profiling shows almost no time spent in Most of the time is in But per profiling, we see quite a bit more time in these methods in the OSR run. BDN decides to do two calls to
Note in the default that the time per iteration drops quite a bit 155 -> 100 when going from one to two calls. My guess is that this is the main factor in the perf difference because doing two calls per iteration alters how the benchmark interacts with GC (it is allocation intensive) and somehow, on average, this improves performance. My homegrown ETL analysis shows something similar. The main discrepancy is time within the runtime itself.
This decision on how many invocations per iteration can be fixed with
where note the median times for the two are now fairly close. The per iteration times for the OSR version show that the Tier1 code for
So the mean value above for OSR shows a blend of Tier0, OSR and Tier1 times. Despite all that the final iterations for OSR are consistently a bit slower (~5%) than the default iterations. Not clear why. But it seems the predominant factor here is BDN's strategy changing because |
For
The Tier1 method is loaded in the middle of iteration 11, so gets called for iteration 12. The "slow" iteration times after that point are similar to those seen by the default config (perhaps a tiny bit better). So naturally, the question is why. All the time here is spent in First, the Tier1 and full opt version are identical. The OSR version just omits the initial but contains the rest of the loops. We do less cloning in the OSR version. That seems to lead to somewhat better register allocation, though unclear if that accounts for all of the perf improvement. Sorting out why things are faster for OSR is going to take some time. |
Running some other segments of the full suite at random:
At this point it seems clear that there are some tests -- almost certainly the ones that do a lot of looping internally, possibly restricted to the subset where the looping methods aren't called a lot -- where the BDN results will reflect the performance of the OSR versions. And the OSR version can be faster or slower in ways that will be hard to predict. By and large all these should be cases where the benchmark strategy isn't sufficient to get us to Tier1 reliably. In the past for this subset it did not matter, as QJFL=0 ensured these methods were exempt from tiering. I'm going to try and do larger sweeps and get a ballpark estimate as to how many tests are likely impacted in this way. |
I hope dotnet/performance infrastructure will be in a good state when this is merged to catch all improvements/regressions |
I won't merge unless things are in a good state. |
Draft Pull Request was automatically closed for inactivity. Please let us know if you'd like to reopen it. |
Change these default values when the jit targets x64:
The upshot is that on x64 more methods will be jitted at Tier0, and
we will rely on OSR to get out of long-running Tier0 methods.
Other architectures continue to use the old behavior for now, as
OSR is not yet supported outside of x64.