JIT: Remove BBJ_NONE #94239

Merged: 20 commits merged into dotnet:main on Nov 28, 2023

Conversation

@amanasifkhalid (Member)

Next step for #93020, per conversation on #93772. Replacing BBJ_NONE with BBJ_ALWAYS to the next block helps limit our use of implicit fall-through (though we still expect BBJ_COND to fall through when its false branch is taken; #93772 should eventually address this).

I've added a small peephole optimization to skip emitting unconditional branches to the next block during codegen.
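
Here is a minimal, self-contained sketch of the idea (a toy model only, not the actual JIT code; the simplified BasicBlock type and emitBlockJump function are invented for illustration): a BBJ_ALWAYS needs a branch instruction only when its target is not the lexically next block, unless something like the EH-related BBF_KEEP_BBJ_ALWAYS flag (or a hot/cold section split) forces the jump to be kept.

    #include <cstdio>

    // Toy model: names loosely mirror the real JIT (BBJ_ALWAYS, BBF_KEEP_BBJ_ALWAYS).
    enum JumpKind { BBJ_ALWAYS, BBJ_RETURN };
    const unsigned BBF_KEEP_BBJ_ALWAYS = 0x1; // e.g. the jump is required for EH

    struct BasicBlock
    {
        int         num;
        JumpKind    kind;
        BasicBlock* jumpDest; // jump target (meaningful for BBJ_ALWAYS here)
        BasicBlock* next;     // lexically next block in the layout
        unsigned    flags;

        bool JumpsToNext() const { return jumpDest == next; }
    };

    // Emit a branch for 'block' only when it is really needed.
    void emitBlockJump(const BasicBlock* block)
    {
        if (block->kind != BBJ_ALWAYS)
        {
            return;
        }

        const bool skipJump = block->JumpsToNext() && ((block->flags & BBF_KEEP_BBJ_ALWAYS) == 0);
        if (skipJump)
        {
            printf("BB%02d: falls into BB%02d, no jmp emitted\n", block->num, block->next->num);
        }
        else
        {
            printf("BB%02d: jmp BB%02d\n", block->num, block->jumpDest->num);
        }
    }

    int main()
    {
        BasicBlock b2 = {2, BBJ_RETURN, nullptr, nullptr, 0};
        BasicBlock b1 = {1, BBJ_ALWAYS, &b2, &b2, 0}; // BBJ_ALWAYS straight to the next block
        emitBlockJump(&b1);                           // prints the "no jmp emitted" case
        return 0;
    }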

@dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Oct 31, 2023
@ghost assigned amanasifkhalid Oct 31, 2023
@ghost commented Oct 31, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@amanasifkhalid (Member Author)

Failures look like #91757.

@amanasifkhalid marked this pull request as ready for review November 2, 2023 20:05
@amanasifkhalid (Member Author)

CC @dotnet/jit-contrib, @AndyAyersMS PTAL. I tried to rein in the asmdiffs as much as possible without adding too many weird edge cases. The code size increases in the libraries_tests.run... collections for FullOpts are pretty dramatic, though for what it's worth, the JIT seems to be justifying these increases with improved PerfScores. Here's the PerfScore diff for this collection when targeting Windows ARM64:

Found 72 files with textual diffs.

Summary of Perf Score diffs:
(Lower is better)

Total PerfScoreUnits of base: 632533.3599999999
Total PerfScoreUnits of diff: 459753.63
Total PerfScoreUnits of delta: -172779.73 (-27.32 % of base)
Total relative delta: -19.98
    diff is an improvement.
    relative diff is an improvement.


Top file regressions (PerfScoreUnits):
       13.10 : 618160.dasm (45.96% of base)
       12.40 : 562710.dasm (40.91% of base)
       12.40 : 153205.dasm (40.90% of base)

Top file improvements (PerfScoreUnits):
    -6109.65 : 121208.dasm (-34.90% of base)
    -6100.66 : 308960.dasm (-34.96% of base)
    -6098.06 : 26795.dasm (-34.95% of base)
    -6094.56 : 72962.dasm (-34.94% of base)
    -6094.06 : 372595.dasm (-34.94% of base)
    -6093.96 : 267789.dasm (-34.94% of base)
    -6093.86 : 123496.dasm (-34.94% of base)
    -6093.86 : 622898.dasm (-34.94% of base)
    -6093.21 : 246705.dasm (-34.84% of base)
    -6093.21 : 169013.dasm (-34.84% of base)
    -6086.91 : 94037.dasm (-34.82% of base)
    -6080.10 : 356732.dasm (-35.18% of base)
    -6079.46 : 500179.dasm (-34.89% of base)
    -6079.30 : 68364.dasm (-35.17% of base)
    -6079.30 : 115234.dasm (-35.17% of base)
    -6069.86 : 623747.dasm (-34.85% of base)
    -6066.10 : 187066.dasm (-35.12% of base)
    -6065.55 : 347667.dasm (-35.22% of base)
    -6064.80 : 120021.dasm (-35.12% of base)
    -6064.80 : 103264.dasm (-35.12% of base)

72 total files with Perf Score differences (69 improved, 3 regressed), 20 unchanged.

@AndyAyersMS (Member)

Diffs

Very interesting. I would not have expected massive code size improvements from something like this, and I'd like to understand this aspect a bit better (especially the min opts cases). Can we pick a few examples for case studies?

I will need some time to go through the changes -- will try to get you a first pass later today.

@jakobbotsch (Member) commented Nov 2, 2023

CC @dotnet/jit-contrib, @AndyAyersMS PTAL. I tried to rein in the asmdiffs as much as possible without adding too many weird edge cases. The code size increases in the libraries_tests.run... collections for FullOpts are pretty dramatic, though for what it's worth, the JIT seems to be justifying these increases with improved PerfScores. Here's the PerfScore diff for this collection when targeting Windows ARM64: [...]

What is this from? You need to pass -metrics PerfScore to superpmi.py to do a correct PerfScore measurement. Analyzing the example diffs produced for a normal asmdiffs run will not give the right results (I would expect way more than 72 files). Even so, the results may not be very insightful, since I think they would include PerfScore diffs in MinOpts contexts.

For this change: it looks like the "jump to next BB" optimization is kicking in a few places in MinOpts. I don't see an easy way to make the behavior the same as before, but we should ensure that this doesn't impact debugging.

@amanasifkhalid (Member Author)

You need to pass -metrics PerfScore to superpmi.py to do a correct PerfScore measurement. Analyzing the example diffs produced for a normal asmdiffs run will not give the right results (I would expect way more than 72 files).

Ah, thanks for catching that. I'm rerunning the PerfScore measurement now; will update with the results here.

For this change: it looks like the "jump to next BB" optimization is kicking in a few places in MinOpts.

By "jump to next BB optimization," do you mean the peephole optimization for jumping to the next block during codegen, or one of the flowgraph optimizations (like fgCompactBlocks, which will compact a BBJ_ALWAYS to the next block, or fgOptimizeBranchToNext, etc)?

@jakobbotsch (Member)

By "jump to next BB optimization," do you mean the peephole optimization for jumping to the next block during codegen, or one of the flowgraph optimizations (like fgCompactBlocks, which will compact a BBJ_ALWAYS to the next block, or fgOptimizeBranchToNext, etc)?

I meant the peephole optimization, e.g. I assume that's the cause of diffs like:
[asm diff screenshot]

We should just make sure we don't lose the ability to place breakpoints on "closing braces", for example (though I wouldn't expect that). As Andy mentioned, understanding these diffs would be a good idea. Also, any idea what's causing TP regressions in asp.net and benchmarks.run_pgo MinOpts?

@amanasifkhalid (Member Author) commented Nov 3, 2023

I'd like to understand this aspect a bit better (especially the min opts cases). Can we pick a few examples for case studies?

Sure thing. There are a couple methods in the libraries_tests.run... collection with big size regressions. For example, System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this (Instrumented Tier1) increased in size from 8741 bytes to 14283 bytes (+5542 bytes, or 63.40% of the base size). The instruction count increased from 1867 to 3249. However, the PerfScore decreased from 13300.68 to 8990.12, so the JIT thinks this increase is worth it. Looking at the JIT dumps, the biggest diffs are during loop cloning, where the JIT is now much more aggressive for this method: The baseline JIT didn't clone any loops, while the diff JIT cloned 4 loops. This cloning increased the number of basic blocks from 236 to 454. It seems that in the baseline JIT, Compiler::optCanOptimizeByLoopCloning bails out pretty early, getting only a few statements deep into each loop. Here's a dump snippet for the baseline JIT:

Considering loop L00 to clone for optimizations.
Checking loop L00 for optimization candidates (GDV tests)
...GDV considering [000111]
------------------------------------------------------------
Considering loop L01 to clone for optimizations.
Checking loop L01 for optimization candidates (GDV tests)
...GDV considering [000786]
...GDV considering [000761]
------------------------------------------------------------
Considering loop L02 to clone for optimizations.
Checking loop L02 for optimization candidates (GDV tests)
...GDV considering [001478]
...GDV considering [001486]
...GDV considering [001519]
...GDV considering [002400]
...GDV considering [002404]
...GDV considering [004228]
...GDV considering [004229]
... right form for type test with local V156
... but not invariant
...GDV considering [004300]
...GDV considering [002429]
...GDV considering [001464]
------------------------------------------------------------
Considering loop L03 to clone for optimizations.
Checking loop L03 for optimization candidates (GDV tests)
...GDV considering [001127]
------------------------------------------------------------
Considering loop L04 to clone for optimizations.
Checking loop L04 for optimization candidates (GDV tests)
...GDV considering [001150]
------------------------------------------------------------

Loops cloned: 0
Loops statically optimized: 0

I don't fully understand the requirements for deciding to clone a loop, but I'm guessing some slightly different decisions made in Compiler::fgUpdateFlowGraph enabled the diff JIT to do cloning. Some cloned loops are pretty large -- one clone added over 100 new blocks (I'm guessing we don't consider loop size when cloning, or the tolerance for size increases is pretty high?). Also @jakobbotsch I haven't investigated the TP diffs for MinOpts yet, but I hypothesize this ambitious loop cloning is to blame for the TP regression for libraries_tests.run... with FullOpts.

In this case, do we trust the PerfScore improvements, or does this amount of cloning seem extreme? I'll update shortly with an example of a code size improvement in MinOpts.

@amanasifkhalid (Member Author)

I took a look at an example similar to the one Jakob screenshotted above; see MicroBenchmarks.Serializers.DataGenerator:Generate[int]():int (Tier0) in benchmarks.run_pgo.windows.arm64.checked. For the baseline JIT, this method is 28 instructions (112 bytes) long, with a PerfScore of 38.20. For the diff JIT, this method is 7 instructions (28 bytes) long, with a PerfScore of 8.80. By percent decreases, this method was one of the best improvements in the collection, with a 75% decrease in size. Taking a look at the JIT dumps, this method has a lot of BBJ_COND -> BBJ_RETURN chains. Here's a snippet:

--------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    lp [IL range]     [jump]
--------------------------------------------------------------------------
BB01 [0000]  1                             1       [000..01B)-> BB03 ( cond )                     
BB02 [0001]  1       BB01                  1       [01B..026)        (return)                     
BB03 [0002]  1       BB01                  1       [026..041)-> BB05 ( cond )                     
BB04 [0003]  1       BB03                  1       [041..04C)        (return)                     
BB05 [0004]  1       BB03                  1       [04C..067)-> BB07 ( cond )                     
BB06 [0005]  1       BB05                  1       [067..072)        (return)

The conditions of these BBJ_COND blocks get folded into constants, and we're able to convert them into BBJ_ALWAYS blocks. Here's what they look like after:

BB01 [0000]  1                             1       [000..01B)-> BB03 (always)
BB02 [0001]  0                             1       [01B..026)        (return)
BB03 [0002]  1       BB01                  1       [026..041)-> BB05 (always)
BB04 [0003]  0                             1       [041..04C)        (return)
BB05 [0004]  1       BB03                  1       [04C..067)-> BB07 (always)
BB06 [0005]  0                             1       [067..072)        (return)

The BBJ_RETURN blocks are no longer reachable, since their previous blocks cannot fall through into them. So during the post-import phase, these blocks are removed. Now, the block list looks like this:

BB01 [0000]  1                             1       [000..01B)-> BB03 (always)
BB03 [0002]  1       BB01                  1       [026..041)-> BB05 (always)
BB05 [0004]  1       BB03                  1       [04C..067)-> BB07 (always) 
BB07 [0006]  1       BB05                  1       [072..08D)-> BB09 (always)

Normally, the JIT would convert these jumps to the next block into BBJ_NONE via fgOptimizeBranchToNext in the layout optimization phase, but that phase isn't run in MinOpts, so by the time we get to code gen, these blocks are still BBJ_ALWAYS. The new peephole optimization kicks in, removing all the unnecessary jumps. So aside from that optimization, the actual flowgraph behavior doesn't seem to differ at all between the baseline and diff JITs; we just never had the chance to convert these BBJ_ALWAYS blocks to BBJ_NONE in MinOpts for these scenarios. The top code size improvements by raw and percent decrease for this collection are all Tier 0, so this pattern of not being able to optimize unnecessary BBJ_ALWAYS away with fgOptimizeBranchToNext in MinOpts seems to be fixed by the peephole optimization. So as Jakob alluded to, this optimization seems responsible for many of the improvements in MinOpts.

I imagine it would be easy to disable this optimization in MinOpts the same way we disable the layout optimization phase, if we find it interferes with debugging unoptimized code.

@AndyAyersMS (Member)

In this case, do we trust the perfScore improvements, or does this amount of cloning seem extreme? I'll update shortly with an example of a code size improvement in MinOpts

I've seen cloning "unexpectedly" kick in from changes like this. It is something we'll just have to tolerate for now.

It would be helpful to see how much of the code growth comes in methods where more cloning happens, and whether aside from that there are other things going on that would be worth understanding.

However it may be a little bit painful to gather this data. For instance you could add output to the jit's disasm footer line indicating the number of cloned loops, then update jit-analyze or similar to parse this and aggregate data separately for methods where # of clones matches vs those where # of clones differs...

I imagine it would be easy to disable this optimization in MinOpts the same way we disable the layout optimization phase, if we find it interferes with debugging unoptimized code.

For now you should try and see if you can get to zero diffs (or close) for MinOpts; we can always come back later to see if this optimization can be safely enabled in modes where we are generally (and intentionally) not optimizing the code. For instance, we might want to turn it on for Tier0 but not MinOpts or Debuggable Code.

@BruceForstall (Member)

However it may be a little bit painful to gather this data. For instance you could add output to the jit's disasm footer line indicating the number of cloned loops, then update jit-analyze or similar to parse this and aggregate data separately for methods where # of clones matches vs those where # of clones differs...

DOTNET_JitTimeLogCsv=* includes cloned loops as one of the columns in the (per-function) output.

@amanasifkhalid (Member Author)

DOTNET_JitTimeLogCsv=* includes cloned loops as one of the columns in the (per-function) output.

Thanks for pointing this out. I'll share some metrics on loop cloning for the collections with the biggest size regressions for FullOpts.

For now you should try and see if you can get to zero diffs (or close) for MinOpts; we can always come back later to see if this optimization can be safely enabled in modes where we are generally (and intentionally) not optimizing the code.

Sure thing, I'll try disabling the optimization for MinOpts. Note that while disabling it will probably reduce or remove the size improvements, I expect us to get some pretty big diffs in the opposite direction for MinOpts, as all the blocks that used to be BBJ_NONE will now have a branch instruction at the end. So either way, I don't think we'll be close to zero diffs.

@AndyAyersMS (Member)

I expect us to get some pretty big diffs in the opposite direction for MinOpts,

Interesting ... so we seemingly have lost something important by removing BBJ_NONE. I wonder if we have anything else lying around that could help us figure out when to materialize the jump for MinOpts.

One idea is to look at the associated IL offset info. If the block's end IL offset is valid, differs from the end IL offset of the block's last statement, and also differs from the next block's start IL offset, then the jump may be significant for source-level debugging, and so it should result in some instruction (though perhaps emitting a nop would be sufficient).
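
Roughly, as a sketch (the struct and field names below are placeholders for whatever the real debug-info plumbing exposes, not actual JIT APIs):

    #include <cstdint>

    // Toy model of the heuristic described above.
    const uint32_t BAD_IL_OFFSET = UINT32_MAX; // "no valid offset" sentinel

    struct BlockDebugInfo
    {
        uint32_t blockEndOffs;       // IL offset recorded at the end of the block
        uint32_t lastStmtEndOffs;    // end IL offset of the block's last statement
        uint32_t nextBlockStartOffs; // start IL offset of the next block
    };

    // True when the branch (or at least a nop) should be materialized in MinOpts
    // so the block-end IL offset remains a distinct place to set a breakpoint.
    bool jumpIsSignificantForDebugging(const BlockDebugInfo& di)
    {
        return (di.blockEndOffs != BAD_IL_OFFSET) &&
               (di.blockEndOffs != di.lastStmtEndOffs) &&
               (di.blockEndOffs != di.nextBlockStartOffs);
    }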

If we can't get this to zero diff for MinOpts/Debuggable code, we'll have to verify behavior with the debugger tests.

I suppose we could also try and look at the debug info we generate for some of these methods (say, pass --debuginfo to superpmi.py asmdiffs)... while it won't match up exactly before/after, every IL offset we used to report should still be reported. So perhaps a simple check that the number of offset records is the same would be sufficient. I don't recall how smart the SPMI debug info differ is; it may already flag this case.

@@ -737,7 +737,9 @@ void CodeGen::genCodeForBBlist()
 {
     // Peephole optimization: If this block jumps to the next one, skip emitting the jump
     // (unless we are jumping between hot/cold sections, or if we need the jump for EH reasons)
-    const bool skipJump = block->JumpsToNext() && !(block->bbFlags & BBF_KEEP_BBJ_ALWAYS) &&
+    // (Skip this optimization in MinOpts)
+    const bool skipJump = !compiler->opts.MinOpts() && block->JumpsToNext() &&
Member

You probably want compiler->optimizationsEnabled() here.

@AndyAyersMS (Member) commented Nov 4, 2023

[screenshots comparing original and revised codegen]

Seems like there is more than just this peephole involved, as x64 is still smaller than it was.

@AndyAyersMS (Member) left a comment

Left some notes.

Not sure about some of the changes to fgReorderBlocks but that method is complex enough that I'd need to walk through some cases to understand it better.

I still think the goal here for now should be to try and minimize diffs. If there are opportunities to improve, we can do those as follow-ups. We should pick a few relatively small / simple methods for case studies and make sure we understand what is leading to the diffs there.

Happy to help with this.

// If that happens, make sure a NOP is emitted as the last instruction in the block.
emitNop = true;
break;

Member

Did we ever hit the case in the old code where we had BBJ_NONE on the last block?

Member Author

I added an assert(false) to that case in the old code to see if I could get it to hit during a SuperPMI replay, and it never hit across all collections. Also in the new code, I added an assert that BBJ_ALWAYS has a jump before trying to emit the jump, so that we never have a BBJ_ALWAYS that "falls into" nothing at the end of the block list -- that also never hit.

@@ -3157,7 +3155,7 @@ unsigned Compiler::fgMakeBasicBlocks(const BYTE* codeAddr, IL_OFFSET codeSize, F

jmpDist = (sz == 1) ? getI1LittleEndian(codeAddr) : getI4LittleEndian(codeAddr);

-if ((jmpDist == 0) && (opcode == CEE_BR || opcode == CEE_BR_S) && opts.DoEarlyBlockMerging())
+if ((jmpDist == 0) && (jmpKind == BBJ_ALWAYS) && opts.DoEarlyBlockMerging())
Member

Does this handle the CEE_LEAVE cases like the old code did?

Suspect you may want to change this back to what it was before.

Member Author

I think so, since jmpKind is set to BBJ_ALWAYS only when the opcode is CEE_BR or CEE_BR_S, though I can change this back for simplicity.

@@ -1432,19 +1428,6 @@ void Compiler::fgDebugCheckTryFinallyExits()
}
}
}
Member

Can you also update the comment above this code since case (d) is no longer possible? Instead of compacting (e) and (f) maybe just do something like

                // ~~(d) via a fallthrough to an empty block to (b)~~ [no longer possible]

@@ -1472,7 +1455,7 @@ void Compiler::fgDebugCheckTryFinallyExits()
block->bbNum, succBlock->bbNum);
}

-allTryExitsValid = allTryExitsValid & thisExitValid;
+allTryExitsValid = allTryExitsValid && thisExitValid;
Member

Nit: since these bools are almost always true, & is possibly a bit more efficient than &&.

const bool fallThroughIsTruePred = BlockSetOps::IsMember(this, jti.m_truePreds, jti.m_fallThroughPred->bbNum);
const bool predJumpsToNext = jti.m_fallThroughPred->KindIs(BBJ_ALWAYS) && jti.m_fallThroughPred->JumpsToNext();
Member

This code can likely be simplified quite a bit now that there are no implicit fall throughs. That is, there is no longer any reason to set jti.m_fallThroughPred to true.

Can you leave a todo comment here and maybe a note in the meta-issue that we should revisit this?

Member Author

Sure thing.


// TODO: Now that block has a jump to bNext,
// can we relax this requirement?
assert(!fgInDifferentRegions(block, bNext));
Member

Probably not, unless the block whose IR is moving is empty or can't cause an exception.

@@ -2750,6 +2733,13 @@ void Compiler::optRedirectBlock(BasicBlock* blk, BlockToBlockMap* redirectMap, R
break;

case BBJ_ALWAYS:
// Fall-through successors are assumed correct and are not modified
Member

Is this new logic really necessary?

Member Author

I think so. That comment is copied from the notes in the doc comment for Compiler::optRedirectBlock; I added this check in to emulate the no-op behavior it previously had for BBJ_NONE. If I remove it, we hit assert(h->HasJumpTo(t) || !h->KindIs(BBJ_ALWAYS)) in Compiler::optCanonicalizeLoopCore.

Member

The logic seems wrong with this added code here -- I would expect optRedirectBlock to always redirect a BBJ_ALWAYS based on the map. The behavior now doesn't match the documentation. Maybe some update to redirectMap is needed somewhere?

Member Author

To preserve the original behavior of this method, which was to not redirect BBJ_NONE, would it be OK to use BBF_NONE_QUIRK here instead to determine whether we should redirect a BBJ_ALWAYS? This seems to work locally, e.g.:

if (blk->JumpsToNext() && ((blk->bbFlags & BBF_NONE_QUIRK) != 0)) // Functionally equivalent to BBJ_NONE

Member

It would be better, even though I generally think quirks should not affect behavior in this way. It seems like there is some form of bug here around how the redirection happens or how the map is constructed.

{
preHead = BasicBlock::bbNewBasicBlock(this, BBJ_ALWAYS, entry);
}
BasicBlock* preHead = BasicBlock::New(this, BBJ_ALWAYS, (isTopEntryLoop ? top : entry));
Member

I think we can just always branch to entry now, the logic before was trying to optimize the case for top-entry loops and fall through, but we don't need to do that anymore.

@BruceForstall (Member) left a comment

It's cool to see how much code is getting deleted (and how much more will probably follow).

I agree with Andy that you should get as close to zero diffs as possible with this change, putting in temporary workarounds if necessary for cases we can later remove.

In addition:

  1. Comment unrelated to this change: I don't like how HasJump() determines if we have a jump using bbJumpDest != nullptr. Note that bbJumpDest is part of a union. We should never access that value without first checking bbJumpKind (or asserting that bbJumpKind is one where we expect bbJumpDest to be set.) It would probably be useful to have functions:
    HasJumpDest() => KindIs(BBJ_ALWAYS,<others>)  // bbJumpDest is valid
    HasJumpSwt() => KindIs(BBJ_SWITCH)            // bbJumpSwt is valid
    HasJumpEhf() => KindIs(BBJ_EHFINALLYRET)      // bbJumpEhf is valid

(and presumably bbJumpOffs is only valid during a limited time during importing?)

These could be used in appropriate asserts as well (see the sketch after this list).

  2. Questions about next steps: (a) do we get rid of the BBJ_COND "fall through"? (b) do we get rid of "JumpsToNext()"? (c) do we get rid of bbFallsThrough()? (d) do we get rid of fgConnectFallThrough()? (e) do we rename/remove fgIsBetterFallThrough()?
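
Referring back to item 1, an illustrative-only sketch of what such accessors could look like, written against a simplified stand-in type (the real BasicBlock, its union members, and the full set of jump kinds live in the JIT's block.h and differ in detail):

    #include <cassert>

    // Simplified stand-in types; illustration only.
    enum BBjumpKinds { BBJ_ALWAYS, BBJ_COND, BBJ_SWITCH, BBJ_EHFINALLYRET, BBJ_RETURN };

    struct BBswtDesc; // switch descriptor (opaque here)
    struct BBehfDesc; // finally-ret descriptor (opaque here)

    struct BasicBlock
    {
        BBjumpKinds bbJumpKind;
        union {
            BasicBlock* bbJumpDest; // valid only for kinds that carry a single jump target
            BBswtDesc*  bbJumpSwt;  // valid only for BBJ_SWITCH
            BBehfDesc*  bbJumpEhf;  // valid only for BBJ_EHFINALLYRET
        };

        bool KindIs(BBjumpKinds k) const { return bbJumpKind == k; }

        // Each accessor answers "is this union member valid?" (the list of kinds
        // carrying a single target is abbreviated here).
        bool HasJumpDest() const { return KindIs(BBJ_ALWAYS) || KindIs(BBJ_COND); }
        bool HasJumpSwt() const  { return KindIs(BBJ_SWITCH); }
        bool HasJumpEhf() const  { return KindIs(BBJ_EHFINALLYRET); }

        // Guarded accessor: assert on the kind before touching the union member.
        BasicBlock* GetJumpDest() const
        {
            assert(HasJumpDest());
            return bbJumpDest;
        }
    };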

@@ -1101,15 +1097,15 @@ PhaseStatus Compiler::fgCloneFinally()
{
BasicBlock* newBlock = blockMap[block];
// Jump kind/target should not be set yet
-assert(newBlock->KindIs(BBJ_NONE));
+assert(!newBlock->HasJump());
Member

Shouldn't this be:

Suggested change
assert(!newBlock->HasJump());
assert(newBlock->KindIs(BBJ_ALWAYS));

? Or do you also want:

            assert(newBlock->KindIs(BBJ_ALWAYS) && !newBlock->HasJump());

Member Author

The second is what I was going for; I'll update it.

{
return true;
noway_assert(b1->KindIs(BBJ_ALWAYS));
Member

This assert seems unnecessary. At least, make it a normal assert (not a noway_assert)

BBJ_THROW, // SCK_ARG_EXCPN
BBJ_THROW, // SCK_ARG_RNG_EXCPN
BBJ_THROW, // SCK_FAIL_FAST
BBJ_ALWAYS, // SCK_NONE
Member

This seems odd. Does SCK_NONE ever get used?

Member Author

I guess not; I added an assert to test whether add->acdKind == SCK_NONE is ever true, and the SuperPMI replay was clean. A quick search of the source code doesn't yield any places where we assign SCK_NONE, so maybe I can add an assert here that ensures we don't use SCK_NONE? Then I can remove the conditional logic below for setting the jump target if newBlk is a BBJ_ALWAYS.

if (newBlk->KindIs(BBJ_ALWAYS))
{
assert(add->acdKind == SCK_NONE);
newBlk->SetJumpDest(newBlk->Next());
Member

What if newBlk is the last block of the function? Then is bbJumpDest == nullptr an indication of "fall off the end?" (previously, we'd have a BBJ_NONE and generate an int3 / breakpoint if that happened)

Member Author

Good point. When emitting jumps, I added an assert to see if we ever get a BBJ_ALWAYS with a null jump target, and it never hit during SuperPMI replays. Per your note on SCK_NONE, I think we can get rid of this if statement altogether.

@amanasifkhalid (Member Author) commented Nov 6, 2023

@AndyAyersMS @BruceForstall thank you both for the code reviews! I'll address your feedback shortly.

I suppose we could also try and look at the debug info we generate for some of these methods (say, pass --debuginfo to superpmi.py asmdiffs)...

I tried this with the peephole optimization always enabled, and only found differences in the number of IL offsets when the number of blocks differed (usually from flowgraph optimizations like fgCompactBlocks behaving less/more aggressively), so those diffs only applied for FullOpts. I spot-checked the MinOpts examples where the peephole optimization significantly reduced the code size (like the second case study above), and found no differences in the number of IL offsets -- only the actual offsets differed to reflect the different codegen. So I don't think this optimization will affect debugging? If you'd like, I can try modifying jit-analyze to check for diffs in the number of IL offsets; maybe I'm missing something, but it doesn't explicitly report diffs in debug info for me.

It would be helpful to see how much of the code growth comes in methods where more cloning happens, and whether aside from that there are other things going on that would be worth understanding.

I replayed libraries_tests.run.windows.arm64.Release with JitTimeLogCsv and diff'd methods by the number of loops cloned, and all of the methods with the top size regressions by percentage reported by SuperPMI had diffs in number of cloned loops. @AndyAyersMS I see in #94363 you have similar size regressions. I'll tweak the script I used to collect diffs a bit to improve its usability, and share it with you offline.

Seems like there is more than just this peephole involved, as x64 is still smaller than it was.

I'll take a look at the JIT dumps for the top improvements next to see where else the decreases are coming from (particularly for MinOpts).

@BruceForstall (Member)

I'd like to understand why there is more (or less) cloning. There's a lot of code that likes "top entry loops" so perhaps those aren't being distinguished as before?

More fundamentally: are a different set/number of loops being recognized by the loop recognizer, without "fall through"? Does loop inversion happen the same amount?

@amanasifkhalid (Member Author)

More fundamentally: are a different set/number of loops being recognized by the loop recognizer, without "fall through"?

I think this is the case. Looking at the DOTNET_JitTimeLogCsv output, for the methods where the diff JIT does more loop cloning, many are reported as having more loops to begin with, so it looks like I've broken loop recognition. I think there are still plenty of methods where we recognize the same loops, but because prior optimization passes behaved slightly differently, we're able to clone loops we previously couldn't. But I'll look at fixing loop recognition first, and see how much that cuts down on the code size increases.

@kunalspathak (Member)

I've added a small peephole optimization to skip emitting unconditional branches to the next block during codegen.

Is this different from #69041?

@amanasifkhalid (Member Author)

@kunalspathak they seem to have the same goal, but I think my approach is more aggressive, in that it checks to see if a BBJ_ALWAYS is functionally equivalent to BBJ_NONE during codegen. Without the opt I added, I saw plenty of unnecessary jumps to next after replacing BBJ_NONE with BBJ_ALWAYS, so since #69041 was implemented with the assumption that we have BBJ_NONE, I'm guessing that approach isn't as aggressive?

@AndyAyersMS (Member)

Looks like that was the problem.

@AndyAyersMS (Member) left a comment

LGTM. Thanks for hanging in there through a number of revisions.

@amanasifkhalid (Member Author)

Thank you for all the reviews!

@amanasifkhalid merged commit 52e65a5 into dotnet:main Nov 28, 2023
127 of 129 checks passed
@BruceForstall (Member)

I asked Andy if we should do similar work for BBJ_CALLFINALLY so we can split up BBJ_CALLFINALLY/BBJ_ALWAYS pairs, and we decided at the time to leave these be for now. I don't know if we have anything to gain from being able to split them up in terms of codegen, but removing all those edge cases around call/always pairs would be nice, so maybe we should add a similar successor pointer for BBJ_CALLFINALLY...

There is no point in allowing the BBJ_ALWAYS to be split up from (live in a different place from; have code in between; etc.) its paired BBJ_CALLFINALLY. All the codegen happens when processing the BBJ_CALLFINALLY, which uses the data from the BBJ_ALWAYS. The BBJ_ALWAYS itself generates nothing.

@jakobbotsch (Member)

Yeah, the main benefit would be to allow us to get rid of all code in the JIT that has to deal with the possibility of implicit fallthrough (bbFallsThrough and all its uses). I expect that would be a significant simplification of a lot of logic. Something to consider, but clearly BBJ_COND is the more important one for now, and most (or at least a lot) of the places that deal with fallthrough only end up having to deal with BBJ_COND and BBJ_NONE anyway.

@amanasifkhalid (Member Author)

Thanks for clarifying @BruceForstall. Maybe later on, we can consider replacing BBJ_CALLFINALLY/BBJ_ALWAYS pairs with a new jump kind that has two jump targets: the CALLFINALLY jump target, and then the ALWAYS target after. This should allow us to get rid of a lot of the special checks for these pairs when modifying BBJ_ALWAYS blocks.

@BruceForstall (Member)

bbFallsThrough

It's very weird to think about non-retless CALLFINALLY as "fall through": you can't insert a block between the CALLFINALLY and ALWAYS, and control flow doesn't "fall through" to the block after the ALWAYS since the finally returns to the ALWAYS target. I'm not sure what it means.

Maybe later on, we can consider replacing BBJ_CALLFINALLY/BBJ_ALWAYS pairs with a new jump kind that has two jump targets: the CALLFINALLY jump target, and then the ALWAYS target after. This should allow us to get rid of a lot of the special checks for these pairs when modifying BBJ_ALWAYS blocks.

If we could make that work, it would be great. Currently, the finally's EHFINALLYRET returns each ALWAYS block as a successor, and that ALWAYS block has its EHFINALLYRET as a predecessor. What you suggest would simplify a lot of current logic but might add additional special logic to flow graph interpretation. E.g., the EHFINALLYRET would I suppose yield all the continuation blocks as successors (makes sense), and each of them would have an EHFINALLYRET as predecessor, as well as any non-finally block predecessors.

@jakobbotsch (Member)

It's very weird to think about non-retless CALLFINALLY as "fall through": you can't insert a block between the CALLFINALLY and ALWAYS, and control flow doesn't "fall through" to the block after the ALWAYS since the finally returns to the ALWAYS target. I'm not sure what it means.

The function has the comment "Can a BasicBlock be inserted after this without altering the flowgraph" and the current design definitely means retless CALLFINALLY falls in the category (well, the opposite meaning is implied I'm sure). I guess that's why it returns true.

I think the representation where we store the continuation in the CALLFINALLY sounds natural. Finding the successors of the EHFINALLYRET would be done by looking at the regular predecessors of the handler entry, which should be all relevant CALLFINALLY blocks. (Actually couldn't we do that even today, instead of the side table added in #93377?)
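
As a purely hypothetical sketch of that enumeration (toy types only; the continuation field stored on the CALLFINALLY is the representation being proposed here, not what the JIT has today):

    #include <vector>

    // Toy flowgraph types, illustration only.
    struct BasicBlock
    {
        enum Kind { CALLFINALLY, EHFINALLYRET, OTHER };

        Kind                     kind;
        BasicBlock*              continuation; // hypothetical: stored on the CALLFINALLY itself
        std::vector<BasicBlock*> preds;        // regular predecessor list
    };

    // Successors of a finally's EHFINALLYRET = the continuations of every
    // CALLFINALLY that invokes this finally, found via the handler entry's preds.
    std::vector<BasicBlock*> ehFinallyRetSuccessors(const BasicBlock* handlerEntry)
    {
        std::vector<BasicBlock*> succs;
        for (BasicBlock* pred : handlerEntry->preds)
        {
            if ((pred->kind == BasicBlock::CALLFINALLY) && (pred->continuation != nullptr))
            {
                succs.push_back(pred->continuation);
            }
        }
        return succs;
    }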

@amanasifkhalid (Member Author)

The function has the comment "Can a BasicBlock be inserted after this without altering the flowgraph"

And based on the return values, the comment should probably be something like "Would inserting after this block alter the flowgraph", since it returns true for blocks with implicit fallthrough.

@BruceForstall (Member)

Finding the successors of the EHFINALLYRET would be done by looking at the regular predecessors of the handler entry, which should be all relevant CALLFINALLY blocks. (Actually couldn't we do that even today, instead of the side table added in #93377?)

The "side table" is mostly for performance, simplicity, use by the iterators, and consistency with how switches are represented.

What you say about using the regular predecessors of the finally entry makes sense. It wasn't done that way before, possibly because the previous code was written before predecessors were always available.

@AndyAyersMS (Member)

I think we can indeed get rid of these, and it would nice to no longer have all that special case handling everywhere.

We might need to constrain layout and/or add to codegen to handle the case where the retfinally target block does not end up immediately after the corresponding callfinally block. I don't know if codegen is flexible enough today to do the latter (basically introducing a label that's not at a block begin, and after that, a branch to the right spot).

@BruceForstall (Member)

I wrote a proposal to reconsider the BBJ_CALLFINALLY/BBJ_ALWAYS representation: #95355

Please add comments there about what might be required to make that work.

amanasifkhalid added a commit that referenced this pull request Nov 30, 2023
Follow-up to #94239. In MinOpts scenarios, we should remove branches to the next block regardless of whether BBF_NONE_QUIRK is set, as this yields code size and TP improvements.
@amanasifkhalid (Member Author)

Collated set of improvements/regressions (lower is better) as of 12/12/2023.

Notes Recent Score Orig Score arm64
Ubuntu
arm64
Windows
intel x64
Ubuntu
intel x64
amd x64
Windows
Benchmark
1.51 1.49 1.51
1.49
System.Memory.Span(Char).IndexOfAnyThreeValues(Size: 512)
1.40 1.41 1.40
1.41
System.Tests.Perf_Uri.EscapeDataString(input: "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1.39 1.38 1.39
1.38
System.Memory.Span(Char).LastIndexOfAnyValues(Size: 512)
1.31 1.31 1.31
1.31
Burgers.Test2
1.28 1.28 1.28
1.28
System.Memory.ReadOnlySpan.IndexOfString(input: "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
1.26 1.26 1.26
1.26
System.Memory.ReadOnlySpan.IndexOfString(input: "???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
1.25 1.25 1.25
1.25
SciMark2.kernel.benchSparseMult
1.25 1.25 1.25
1.25
System.Tests.Perf_Int128.ParseSpan(value: "170141183460469231731687303715884105727")
1.25 1.25 1.25
1.25
System.Tests.Perf_Int128.Parse(value: "170141183460469231731687303715884105727")
1.23 1.23 1.23
1.23
System.Tests.Perf_Int128.TryParse(value: "170141183460469231731687303715884105727")
1.22 1.22 1.22
1.22
System.Tests.Perf_Int128.TryParseSpan(value: "170141183460469231731687303715884105727")
1.21 1.21 1.21
1.21
Benchstone.BenchI.CSieve.Test
1.21 1.17 1.21
1.17
System.Collections.IterateFor(Int32).ImmutableSortedSet(Size: 512)
1.20 1.20 1.20
1.20
System.Text.Perf_Utf8Encoding.GetByteCount(Input: Chinese)
1.20 1.20 1.20
1.20
System.Text.Perf_Utf8Encoding.GetByteCount(Input: Cyrillic)
1.20 1.19 1.20
1.19
System.Text.Perf_Utf8Encoding.GetByteCount(Input: EnglishMostlyAscii)
1.19 1.19 1.19
1.19
System.Text.Perf_Utf8Encoding.GetByteCount(Input: Greek)
1.17 1.13 1.17
1.13
System.Tests.Perf_Enum.InterpolateIntoStringBuilder_NonFlags(value: 42)
1.16 1.16 1.16
1.16
System.Numerics.Tests.Perf_VectorOf(Int16).DivisionOperatorBenchmark
1.15 1.13 1.15
1.13
System.Memory.Span(Char).Reverse(Size: 512)
1.14 1.13 1.14
1.13
System.Tests.Perf_String.ToUpperInvariant(s: "This is a much longer piece of text that might benefit more from vectorization.")
1.13 1.13 1.13
1.13
System.Collections.ContainsFalse(Int32).ImmutableSortedSet(Size: 512)
1.13 1.10 1.13
1.10
System.Tests.Perf_String.ToUpper(s: "This is a much longer piece of text that might benefit more from vectorization.")
1.13 1.13 1.13
1.13
System.MathBenchmarks.Single.ScaleB
1.12 1.05 1.12
1.05
System.Tests.Perf_Enum.InterpolateIntoStringBuilder_Flags(value: 32)
1.12 1.12 1.12
1.12
SciMark2.kernel.benchFFT
1.12 1.12 1.12
1.12
System.Tests.Perf_Enum.InterpolateIntoSpan_NonFlags(value: 42)
1.11 1.07 1.11
1.07
System.Collections.ContainsFalse(String).FrozenSet(Size: 512)
1.11 1.10 1.11
1.10
System.Collections.IndexerSet(Int32).SortedList(Size: 512)
1.11 1.10 1.11
1.10
Benchstone.BenchI.Array1.Test
1.10 1.11 1.10
1.11
System.Tests.Perf_String.ToLowerInvariant(s: "This is a much longer piece of text that might benefit more from vectorization.")
1.10 1.09 1.10
1.09
System.Tests.Perf_String.ToLower(s: "This is a much longer piece of text that might benefit more from vectorization.")
1.09 1.10 1.09
1.10
Benchstone.MDBenchI.MDAddArray2.Test
1.09 1.10 1.09
1.10
System.Memory.Span(Char).Clear(Size: 512)
1.09 1.32 1.09
1.32
System.Tests.Perf_String.Concat_CharEnumerable
1.08 1.08 1.08
1.08
System.Collections.IterateFor(String).ImmutableSortedSet(Size: 512)
1.08 1.08 1.08
1.08
System.Collections.ContainsFalse(Int32).SortedSet(Size: 512)
1.08 1.07 1.37
1.37
0.85
0.84
System.Tests.Perf_String.IndexerCheckPathLength
1.07 1.07 1.07
1.07
System.Collections.ContainsFalse(Int32).ImmutableHashSet(Size: 512)
1.07 1.06 1.07
1.06
System.Text.Encodings.Web.Tests.Perf_Encoders.EncodeUtf8(arguments: Url,?2020,16)
1.06 1.06 1.06
1.06
PerfLabTests.CastingPerf.IFooFooIsIFoo
1.06 1.06 1.06
1.06
PerfLabTests.CastingPerf.ObjFooIsObj2
1.03 1.29 1.03
1.29
System.Collections.ContainsFalse(Int32).Queue(Size: 512)
1.00 0.91 1.00
0.91
IfStatements.IfStatements.And
1.00 0.82 1.00
0.82
IfStatements.IfStatements.Single
1.00 0.84 1.00
0.84
PerfLabTests.LowLevelPerf.StructWithInterfaceInterfaceMethod
1.00 0.89 1.00
0.89
PerfLabTests.CastingPerf2.CastingPerf.FooObjCastIfIsa
1.00 0.89 1.00
0.89
PerfLabTests.LowLevelPerf.InterfaceInterfaceMethodSwitchCallType
0.94 0.94 0.95
0.94
0.93
0.94
System.Collections.Tests.Perf_BitArray.BitArrayCopyToByteArray(Size: 512)
0.94 0.93 0.94
0.93
System.MathBenchmarks.Single.SinCosPi
0.94 0.94 0.94
0.94
System.Numerics.Tests.Perf_BitOperations.Log2_ulong
0.94 0.94 0.94
0.94
System.Numerics.Tests.Perf_BitOperations.Log2_uint
0.94 0.93 0.94
0.93
System.Collections.ContainsKeyTrue(Int32, Int32).Dictionary(Size: 512)
0.94 0.88 0.94
0.88
System.Text.Tests.Perf_Encoding.GetChars(size: 16, encName: "ascii")
0.93 0.93 0.93
0.93
System.Text.Json.Tests.Utf8JsonReaderCommentsTests.Utf8JsonReaderCommentParsing(CommentHandling: Skip, SegmentSize: 0, TestCase: LongMultiLine)
0.93 0.94 0.93
0.94
System.Buffers.Tests.SearchValuesCharTests.LastIndexOfAny(Values: "abcdefABCDEF0123456789")
0.93 0.93 0.93
0.93
0.93
0.93
ByteMark.BenchIDEAEncryption
0.93 0.93 0.92
0.92
0.94
0.94
System.Collections.ContainsFalse(Int32).Span(Size: 512)
0.92 0.93 0.92
0.93
System.Text.Encodings.Web.Tests.Perf_Encoders.EncodeUtf8(arguments: UnsafeRelaxed,hello "there",512)
0.91 0.93 0.91
0.93
MicroBenchmarks.Serializers.Xml_ToStream(IndexViewModel).XmlSerializer_
0.91 0.92 0.91
0.92
Benchstone.MDBenchF.MDSqMtx.Test
0.91 0.91 0.91
0.90
0.92
0.93
System.Text.Encodings.Web.Tests.Perf_Encoders.EncodeUtf8(arguments: UnsafeRelaxed,hello "there",16)
0.91 0.91 0.91
0.91
System.Collections.IterateForEachNonGeneric(String).Stack(Size: 512)
0.91 0.90 0.91
0.90
System.Collections.Tests.Perf_PriorityQueue(Guid, Guid).Dequeue_And_Enqueue(Size: 100)
0.90 0.88 0.90
0.88
System.Text.Encodings.Web.Tests.Perf_Encoders.EncodeUtf8(arguments: JavaScript,&Hello+(World)!,512)
0.90 0.90 0.90
0.90
Benchstone.BenchI.BubbleSort.Test
0.90 0.87 0.92
0.90
0.88
0.85
System.Text.Encodings.Web.Tests.Perf_Encoders.EncodeUtf8(arguments: JavaScript,&Hello+(World)!,16)
0.90 0.86 0.90
0.86
System.Collections.Perf_LengthBucketsFrozenDictionary.ToFrozenDictionary(Count: 10000, ItemsPerBucket: 1)
0.90 0.84 0.90
0.84
Benchstone.BenchI.Puzzle.Test
0.90 0.90 0.90
0.90
System.Collections.IterateForEachNonGeneric(String).ArrayList(Size: 512)
0.89 0.94 0.89
0.94
System.Collections.ContainsKeyTrue(Int32, Int32).ConcurrentDictionary(Size: 512)
0.89 0.89 0.89
0.89
System.Collections.Perf_LengthBucketsFrozenDictionary.TryGetValue_True_FrozenDictionary(Count: 100, ItemsPerBucket: 1)
0.88 0.85 0.88
0.85
System.Text.Tests.Perf_Encoding.GetByteCount(size: 16, encName: "ascii")
0.87 0.87 0.87
0.87
System.Text.Json.Tests.Utf8JsonReaderCommentsTests.Utf8JsonReaderCommentParsing(CommentHandling: Skip, SegmentSize: 100, TestCase: LongMultiLine)
0.87 0.82 0.87
0.82
System.Tests.Perf_Int32.ParseSpan(value: "12345")
0.87 0.87 0.87
0.87
Benchstone.BenchI.Array2.Test
0.87 0.87 0.87
0.87
PerfLabTests.LowLevelPerf.GenericClassGenericStaticMethod
0.86 0.85 0.88
0.84
0.85
0.87
System.Text.Encodings.Web.Tests.Perf_Encoders.EncodeUtf8(arguments: Url,&lorem ipsum=dolor sit amet,512)
0.86 0.87 0.86
0.87
Benchstone.BenchI.BenchE.Test
0.86 0.87 0.86
0.87
System.Buffers.Tests.SearchValuesByteTests.IndexOfAnyExcept(Values: "abcdefABCDEF0123456789Ü")
0.86 0.85 0.88
0.88
0.83
0.83
System.Numerics.Tests.Perf_BitOperations.PopCount_uint
0.85 0.87 0.88
0.88
0.88
0.88
0.78
0.84
ByteMark.BenchAssignJagged
0.85 0.85 0.85
0.85
System.Numerics.Tests.Perf_BigInteger.Parse(numberString: 1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012
0.85 0.84 0.85
0.84
System.Tests.Perf_Int32.TryParse(value: "12345")
0.84 0.84 0.84
0.84
Burgers.Test1
0.84 0.86 0.84
0.86
System.Buffers.Tests.SearchValuesCharTests.LastIndexOfAny(Values: "ßäöüÄÖÜ")
0.84 0.85 0.84
0.85
System.Tests.Perf_Int32.TryParseSpan(value: "12345")
0.83 0.81 0.83
0.81
System.Tests.Perf_Int32.ParseSpan(value: "2147483647")
0.83 0.79 0.83
0.79
Interop.StructureToPtr.MarshalDestroyStructure
0.83 0.82 0.83
0.82
System.Text.Tests.Perf_Encoding.GetByteCount(size: 512, encName: "ascii")
0.82 0.82 0.82
0.82
PerfLabTests.CastingPerf2.CastingPerf.IntObj
0.82 0.83 0.82
0.83
PerfLabTests.CastingPerf2.CastingPerf.ScalarValueTypeObj
0.82 0.82 0.82
0.82
System.Memory.Span(Char).IndexOfAnyFiveValues(Size: 512)
0.82 0.81 0.82
0.81
System.Numerics.Tests.Perf_BigInteger.Parse(numberString: -2147483648)
0.82 0.79 0.82
0.79
Interop.StructureToPtr.MarshalPtrToStructure
0.81 0.79 0.81
0.79
System.Tests.Perf_Int32.TryParseSpan(value: "2147483647")
0.81 0.80 0.82
0.81
0.80
0.80
System.Text.Encodings.Web.Tests.Perf_Encoders.EncodeUtf8(arguments: Url,&lorem ipsum=dolor sit amet,16)
0.81 0.80 0.81
0.80
Interop.StructureToPtr.MarshalStructureToPtr
0.80 0.81 0.81
0.83
0.80
0.80
System.Collections.Sort(IntStruct).List(Size: 512)
0.80 0.80 0.80
0.80
System.Collections.Tests.Perf_PriorityQueue(String, String).Enumerate(Size: 1000)
0.80 0.79 0.80
0.79
System.Tests.Perf_UInt32.ParseSpan(value: "4294967295")
0.80 0.80 0.80
0.80
System.Memory.Span(Byte).IndexOfAnyTwoValues(Size: 512)
0.79 0.83 0.80
0.82
0.78
0.84
System.Collections.Sort(IntStruct).Array(Size: 512)
0.78 0.75 0.78
0.75
System.Tests.Perf_Int32.TryParse(value: "2147483647")
0.78 0.78 0.78
0.78
System.Collections.IterateForEachNonGeneric(String).Queue(Size: 512)
0.78 0.77 0.69
0.69
0.88
0.86
System.Memory.Span(Char).IndexOfValue(Size: 512)
0.76 0.78 0.76
0.78
System.Text.RegularExpressions.Tests.Perf_Regex_Industry_Mariomkas.Count(Pattern: "(?:(?:25[0-5]
0.76 0.76 0.76
0.76
System.Tests.Perf_Int32.ParseHex(value: "80000000")
0.75 0.74 0.75
0.74
System.Tests.Perf_UInt32.TryParse(value: "4294967295")
0.75 0.75 0.74
0.75
0.75
0.76
System.Buffers.Tests.ReadOnlySequenceTests(Char).FirstSpanSingleSegment
0.74 0.73 0.74
0.73
System.Tests.Perf_Int32.ParseHex(value: "7FFFFFFF")
0.73 0.76 0.73
0.76
System.Tests.Perf_UInt32.Parse(value: "4294967295")
0.73 0.72 0.72
0.72
0.73
0.73
System.Buffers.Tests.ReadOnlySequenceTests(Char).FirstSpanTenSegments
0.70 0.86 0.70
0.86
System.Text.Tests.Perf_Encoding.GetByteCount(size: 512, encName: "utf-8")
0.70 0.69 0.70
0.69
System.Memory.Span(Char).Fill(Size: 512)
0.67 0.67 0.67
0.67
System.MathBenchmarks.Single.Max
0.67 0.67 0.67
0.67
System.MathBenchmarks.Single.Min
0.67 0.68 0.68
0.68
0.66
0.68
System.Tests.Perf_String.IndexerCheckBoundCheckHoist
0.66 0.67 0.66
0.67
0.66
0.67
System.Tests.Perf_String.IndexerCheckLengthHoisting
0.56 0.54 0.56
0.54
System.Collections.Tests.Perf_Dictionary.ContainsValue(Items: 3000)
