Align inner loops #44370

Merged
merged 59 commits into from
Jan 12, 2021

kunalspathak
Member

@kunalspathak kunalspathak commented Nov 6, 2020

Perform loop alignment at a 32B boundary by adding padding before hot inner loops. Contributes to #43227.

Description

Detect all the inner loops in a method (inside optimizer.cpp) and mark the basic blocks that represent a loop head with BBF_FIRST_BLOCK_IN_INNERLOOP. This flag is propagated when blocks are cloned, created, or moved (in flowgraph.cpp). During codegen (in codegenlinear.cpp), if the next basic block has BBF_FIRST_BLOCK_IN_INNERLOOP, we mark the current IG with IGF_ALIGN_LOOP; this is the IG that precedes the loop head IG. We emit one or more align instructions (see Implementations below) in that IG. During emission (emitxarch.cpp), when we see an align instruction, we compute how much padding is needed based on the target address and emit a sequence of NOPs.
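The last step above (turning a target address into a NOP count) can be sketched as follows. This is a minimal illustration, not the code in emitxarch.cpp; the function name `paddingToBoundary` is hypothetical and the boundary is assumed to be a power of two.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not a JIT function): number of NOP bytes needed so that
// `address` advances to the next multiple of `boundary`.
// Assumption: `boundary` is a power of two, for example 16 or 32.
inline std::size_t paddingToBoundary(std::uintptr_t address, std::size_t boundary)
{
    const std::size_t offset = static_cast<std::size_t>(address) & (boundary - 1);
    // The final mask maps "already aligned" (offset == 0) to 0 instead of `boundary`.
    return (boundary - offset) & (boundary - 1);
}
```

An address that is already on the boundary yields 0, and the worst case is `boundary - 1` bytes.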

Implementations

This PR contains two implementations that determine how much padding is added to align a loop: non-adaptive and adaptive padding.

Non-Adaptive

In the non-adaptive setup, the user can specify the boundary at which inner loops should be aligned with COMPlus_JitAlignLoopBoundary. By default, this is a 32-byte boundary. This is similar to the align-loops option that modern compilers expose. In this mode, the maximum padding is COMPlus_JitAlignLoopBoundary - 1, so by default we add up to 31 bytes of padding to align a loop. Padding is added only if the loop size <= COMPlus_JitAlignLoopMaxCodeSize, which defaults to 96 bytes (3 x 32B, i.e. 3 chunks of 32B).
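As a rough sketch of the non-adaptive rule, assuming the defaults quoted above: the struct and function names below are hypothetical, and the two fields only mirror the COMPlus_JitAlignLoopBoundary and COMPlus_JitAlignLoopMaxCodeSize settings rather than read them.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical configuration holder; the defaults mirror the values in the text.
struct NonAdaptiveConfig
{
    std::size_t boundary    = 32; // stands in for COMPlus_JitAlignLoopBoundary
    std::size_t maxLoopSize = 96; // stands in for COMPlus_JitAlignLoopMaxCodeSize
};

// Returns the padding in bytes, or 0 when the loop is too big to be aligned.
// Assumption: cfg.boundary is a power of two.
inline std::size_t nonAdaptivePadding(std::uintptr_t loopStart,
                                      std::size_t    loopSize,
                                      const NonAdaptiveConfig& cfg = NonAdaptiveConfig())
{
    if (loopSize > cfg.maxLoopSize)
    {
        return 0; // loop exceeds the size threshold, leave it unaligned
    }
    const std::size_t offset = static_cast<std::size_t>(loopStart) & (cfg.boundary - 1);
    return (cfg.boundary - offset) & (cfg.boundary - 1); // at most boundary - 1
}
```

With the defaults, a 40-byte loop starting one byte past a 32B boundary gets the worst-case 31 bytes of padding, while a 100-byte loop at the same address gets none.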

Adaptive

In my experiments, I found that limiting the padding amount helps eliminate some regressions that can occur from executing a series of NOP instructions. It is also sensible to adjust the padding amount to the size of the loop. Lastly, with the 32B non-adaptive approach, I noticed cases where we would not perform alignment because the target address or loop size didn't meet the heuristics (see Heuristics below). In those cases, we could have tried to align to a 16B boundary, because that would still be better than not aligning the loop. Taking all of that into account, the adaptive approach works as follows:

The largest padding that will be added is 15 bytes, for a loop that fits in one 32B chunk. If the loop is bigger and fits in two 32B chunks, the maximum padding is reduced to 7 bytes, and so forth. The reasoning is that the bigger the loop gets, the less effect alignment has on it. With this approach, we can align a loop that takes four 32B chunks if the padding needed is 1 byte; with the 32B non-adaptive approach, we would never align such loops. Overall, the padding size is also reduced compared to the 32B non-adaptive approach.

| Max Pad (bytes) | Minimum 32B blocks needed to fit the loop |
| --- | --- |
| 15 | 1 |
| 7 | 2 |
| 3 | 3 |
| 1 | 4 |
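The table follows a halving pattern, which can be written compactly. The sketch below is only an illustration of that pattern: the function names are hypothetical, and taking the "minimum 32B blocks" of a loop to be its size rounded up to a multiple of 32 is an assumption.

```cpp
#include <cstddef>

// Minimum number of 32B blocks that can hold a loop of `loopSize` bytes
// (assumption: this is simply the size rounded up to whole blocks).
inline std::size_t minBlocksFor(std::size_t loopSize)
{
    return (loopSize + 31) / 32;
}

// Maximum padding allowed when aligning to 32B: 15, 7, 3, 1 for 1..4 blocks.
// Returns 0 for loops that need more than four blocks (never padded).
inline std::size_t maxPadFor32B(std::size_t loopSize)
{
    const std::size_t blocks = minBlocksFor(loopSize);
    if (blocks == 0 || blocks > 4)
    {
        return 0;
    }
    return (16u >> (blocks - 1)) - 1;
}
```

For example, a 20-byte loop may receive up to 15 bytes of padding, a 70-byte loop (three blocks) up to 3 bytes, and a 130-byte loop none.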

If we cannot align the loop to a 32B boundary, we try to align it to a 16B boundary. In that case the maximum padding limit is reduced, as shown in the table below.

| Max Pad (bytes) | Minimum 32B blocks to fit the loop |
| --- | --- |
| 7 | 1 |
| 3 | 2 |
| 1 | 3 |
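Putting the two tables together, the adaptive decision could look roughly like the sketch below. It is an illustration under stated assumptions rather than the PR's implementation: `AlignDecision` and `adaptivePadding` are hypothetical names, and the block count is again taken as the loop size rounded up to whole 32B blocks.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical result type: how many bytes to pad and which boundary was used.
struct AlignDecision
{
    std::size_t padding;  // 0 means "do not pad"
    std::size_t boundary; // 32, 16, or 0 when no padding is added
};

inline AlignDecision adaptivePadding(std::uintptr_t loopStart, std::size_t loopSize)
{
    const std::size_t blocks = (loopSize + 31) / 32;
    const std::size_t offset = static_cast<std::size_t>(loopStart) & 31;

    // First choice: 32B boundary, limits 15/7/3/1 for 1..4 blocks.
    const std::size_t maxPad32 = (blocks >= 1 && blocks <= 4) ? (16u >> (blocks - 1)) - 1 : 0;
    const std::size_t pad32    = (32 - offset) & 31;
    if (pad32 != 0 && pad32 <= maxPad32)
    {
        return {pad32, 32};
    }

    // Fallback: 16B boundary, limits 7/3/1 for 1..3 blocks.
    const std::size_t maxPad16 = (blocks >= 1 && blocks <= 3) ? (8u >> (blocks - 1)) - 1 : 0;
    const std::size_t pad16    = (16 - (offset & 15)) & 15;
    if (pad16 != 0 && pad16 <= maxPad16)
    {
        return {pad16, 16};
    }

    return {0, 0}; // already aligned, or the padding needed exceeds both limits
}
```

A 20-byte loop at offset 10 within a 32B chunk would need 22 bytes to reach the 32B boundary, which exceeds the 15-byte limit, so the sketch falls back to 6 bytes of padding for the 16B boundary.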

Heuristics

We will emit the align instruction, but during emission there are several reasons why padding may not be added.

  1. Loop hotness: If the inner loop is not hot enough, we will not try to align it. This is controlled by COMPlus_JitAlignLoopMinBlockWeight, which is set to 10 by default.

  2. Loop size: As mentioned above, if the loop size exceeds the threshold (96 bytes for non-adaptive, variable for adaptive), no padding is added for the loop.

  3. Boundary: If the loop is already emitted at the required alignment boundary, no padding is added because none is needed. Likewise, no padding is added if the loop doesn't start at the boundary but starts at an offset from it that still lets the loop fit in the minimum number of 32B chunks. I also experimented with the JCC Erratum impact and added the COMPlus_JitAlignLoopForJcc flag, which tries to ensure that the back-edge jump doesn't fall on an exact 32B boundary. This is a DEBUG-only flag.

  4. Padding limit: In the adaptive approach, we may decide not to align the loop if the padding needed exceeds the maximum padding allowed for a loop of the given size.

  5. Update: Call inside a loop: If there are calls inside a loop, there is probably less benefit in aligning it.
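Heuristics 1, 3, and 5 can be expressed as a simple pre-filter, sketched below. Everything here is illustrative: `LoopInfo`, `chunksTouched`, and `worthPadding` are hypothetical names, comparing the head block weight with `<` against the threshold is an assumption, and the boundary check is one reading of "still fits in the minimum number of 32B chunks".

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical summary of what is known about a candidate loop.
struct LoopInfo
{
    double         headWeight;   // block weight of the loop head
    bool           containsCall; // true if a call is made inside the loop
    std::uintptr_t start;        // address at which the loop would be emitted
    std::size_t    size;         // loop size in bytes
};

// Number of 32B chunks touched by `size` bytes starting at `start`.
inline std::size_t chunksTouched(std::uintptr_t start, std::size_t size)
{
    const std::size_t offset = static_cast<std::size_t>(start) & 31;
    return (offset + size + 31) / 32;
}

// Returns false when one of the heuristics says padding is not worthwhile.
// The default of 10 stands in for COMPlus_JitAlignLoopMinBlockWeight.
inline bool worthPadding(const LoopInfo& loop, double minBlockWeight = 10.0)
{
    if (loop.headWeight < minBlockWeight)
    {
        return false; // heuristic 1: loop is not hot enough
    }
    if (loop.containsCall)
    {
        return false; // heuristic 5: call inside the loop
    }
    const std::size_t minimumChunks = (loop.size + 31) / 32;
    if (chunksTouched(loop.start, loop.size) == minimumChunks)
    {
        return false; // heuristic 3: already fits in the fewest possible chunks
    }
    return true; // size and padding limits (heuristics 2 and 4) are checked separately
}
```

A 20-byte loop at offset 4 within a chunk already fits in one chunk and is left alone, while the same loop at offset 24 straddles two chunks and remains a candidate for padding.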

Code size impact

The graph below shows the code size and allocation size impact. While the allocation size for 32B adaptive is a little more than for the 16B non-adaptive approach, the overall code size for 32B adaptive is the smallest among the implementations.

image

The graph below shows the (allocation size - code size) comparison of the various approaches. The difference is highest for the 32B non-adaptive implementation and lowest for 16B non-adaptive. 32B adaptive is marginally higher than 16B non-adaptive, but since its overall code size is minimal compared to 16B/32B non-adaptive, 32B adaptive is the winner. My goal was to get the difference to match the blue bar in the graph for no alignment. In other words, I wanted to get to the point where we only allocate memory for padding that is actually added. As I pointed out in #8748 (comment), the only way to not over-allocate memory is to predict in advance whether alignment is needed. Of the heuristics in the Heuristics section above, the loop size is easy to calculate in advance: if the loop is too big, I don't allocate extra space for the align instruction because we know that we won't align such a loop. For the other heuristics, however, I need to know the precise size of all instructions before the loop so that I can estimate in advance (before allocating the memory) whether the alignment will happen. For example, if the total size of all instructions before the loop is a multiple of 32 (size % 32 == 0), I know that the loop falls on a 32B boundary and needs no extra alignment. In that case, I can remove the align instruction for that loop and allocate 15 bytes less. But today, in a few places, we over-estimate the instruction size, and we don't know the precise size until we emit the machine code at the target address (which happens after allocating memory). If we can fix the over-estimation, we could get the allocation size down (for the 32B adaptive approach, the regression would go from 0.68% to 0.07%).
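The "predict in advance" idea from the paragraph above can be sketched as a small reservation rule. This is a hypothetical illustration, not code from the PR: `bytesToReserve` is an invented name, the 15-byte default is the adaptive maximum quoted above, and the method's code is assumed to start on a 32B boundary.

```cpp
#include <cstddef>

// Bytes to reserve for one loop's align instruction when sizing the allocation.
// Assumptions: the method's code starts on a 32B boundary, and `sizeIsExact`
// is true only when the sizes of all instructions before the loop are known
// precisely rather than over-estimated.
inline std::size_t bytesToReserve(std::size_t bytesBeforeLoop,
                                  bool        sizeIsExact,
                                  std::size_t maxPad = 15)
{
    if (sizeIsExact && bytesBeforeLoop % 32 == 0)
    {
        return 0; // loop is known to land on a 32B boundary, no padding needed
    }
    return maxPad; // otherwise reserve the worst case
}
```

With exact sizes, a loop preceded by 64 bytes of code reserves nothing; with over-estimated sizes the same loop still reserves the full 15 bytes, which is the over-allocation described above.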

image

No-alignment vs. 32B non-adaptive

Summary of Code Size diffs:

Total bytes of base: 51527293
Total bytes of diff: 51646256
Total bytes of delta: 118963 (0.23% of base)
diff is a regression.

Code size impact details
jit-analyze -b E:\alignment\32-bytes\final-PR\noalign -d E:\alignment\32-bytes\final-PR\32B 
Found 173 files with textual diffs.

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 51527293
Total bytes of diff: 51646256
Total bytes of delta: 118963 (0.23% of base)
    diff is a regression.

Top file regressions (bytes):
       15187 : System.Private.CoreLib.dasm (0.31% of base)
       12673 : FSharp.Core.dasm (0.39% of base)
       12514 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm (0.41% of base)
        7737 : System.Private.Xml.dasm (0.22% of base)
        5920 : System.Linq.dasm (0.65% of base)
        5576 : Microsoft.CodeAnalysis.dasm (0.32% of base)
        5498 : Microsoft.CodeAnalysis.CSharp.dasm (0.13% of base)
        4653 : Microsoft.CodeAnalysis.VisualBasic.dasm (0.08% of base)
        4201 : System.Data.Common.dasm (0.28% of base)
        3666 : System.Collections.Immutable.dasm (0.33% of base)
        3276 : System.Collections.Concurrent.dasm (0.97% of base)
        2738 : System.Linq.Parallel.dasm (0.16% of base)
        2231 : System.Private.DataContractSerialization.dasm (0.29% of base)
        1500 : System.Linq.Expressions.dasm (0.19% of base)
        1499 : Microsoft.Diagnostics.FastSerialization.dasm (1.50% of base)
        1401 : System.Threading.Tasks.Dataflow.dasm (0.16% of base)
        1321 : System.Collections.dasm (0.29% of base)
        1252 : Newtonsoft.Json.dasm (0.15% of base)
        1006 : System.Numerics.Tensors.dasm (0.32% of base)
         916 : Microsoft.CSharp.dasm (0.24% of base)

156 total files with Code Size differences (0 improved, 156 regressed), 112 unchanged.

Top method regressions (bytes):
         140 ( 4.34% of base) : System.Text.RegularExpressions.dasm - System.Text.RegularExpressions.RegexInterpreter:FindFirstChar():bool:this (2 methods)
         130 ( 0.96% of base) : System.Text.RegularExpressions.dasm - System.Text.RegularExpressions.RegexInterpreter:Go():this (2 methods)
         124 ( 2.04% of base) : System.Private.CoreLib.dasm - System.Globalization.TextInfo:ChangeCaseCommon(System.String):System.String:this (7 methods)
         101 ( 5.90% of base) : System.Private.Xml.dasm - System.Xml.Xsl.Xslt.XslAstAnalyzer:Analyze(System.Xml.Xsl.Xslt.Compiler):int:this
          98 (12.79% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Microsoft.CodeAnalysis.CSharp.Symbols.SourceAssemblySymbol:ForceComplete(Microsoft.CodeAnalysis.SourceLocation,System.Threading.CancellationToken):this
          87 ( 3.92% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Microsoft.CodeAnalysis.CSharp.Symbols.SourceMemberContainerTypeSymbol:ForceComplete(Microsoft.CodeAnalysis.SourceLocation,System.Threading.CancellationToken):this
          87 ( 5.98% of base) : System.Data.OleDb.dasm - System.Data.OleDb.OleDbDataReader:CreateBindingsFromMetaData(bool):System.Data.OleDb.Bindings[]:this
          85 ( 6.13% of base) : System.Private.CoreLib.dasm - DecCalc:ScaleResult(long,int,int):int
          84 ( 5.43% of base) : System.Private.Xml.dasm - System.Xml.XmlSqlBinaryReader:RescanNextToken():int:this
          73 (11.61% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - Microsoft.Diagnostics.Tracing.Parsers.Symbol.FileVersionTraceData:PayloadValue(int):System.Object:this
          73 ( 9.32% of base) : System.Collections.Immutable.dasm - System.Collections.Immutable.ImmutableInterlocked:Update(byref,System.Func`2[ImmutableArray`1,ImmutableArray`1]):bool (7 methods)
          70 ( 3.56% of base) : System.Data.Odbc.dasm - System.Data.Odbc.OdbcDataReader:RetrieveKeyInfoFromStatistics(QualifiedTableName,bool):int:this
          70 ( 1.91% of base) : System.Private.Xml.dasm - System.Xml.Serialization.TempAssembly:GenerateSerializerToStream(System.Xml.Serialization.XmlMapping[],System.Type[],System.String,System.Reflection.Assembly,System.Collections.Hashtable,System.IO.Stream):bool
          69 (12.50% of base) : System.Private.CoreLib.dasm - System.Buffers.Text.Utf8Formatter:TryFormatDecimalF(byref,System.Span`1[Byte],byref,ubyte):bool
          69 ( 5.23% of base) : System.Runtime.Numerics.dasm - System.Numerics.BigInteger:.ctor(System.ReadOnlySpan`1[Byte],bool,bool):this
          68 ( 5.59% of base) : System.Data.Common.dasm - System.Data.SqlTypes.SqlDecimal:Parse(System.String):System.Data.SqlTypes.SqlDecimal
          68 ( 3.98% of base) : System.Linq.dasm - System.Linq.Enumerable:Average(System.Collections.Generic.IEnumerable`1[Byte],System.Func`2[Byte,Nullable`1]):System.Nullable`1[Double] (3 methods)
          68 ( 3.98% of base) : System.Linq.dasm - System.Linq.Enumerable:Average(System.Collections.Generic.IEnumerable`1[Int16],System.Func`2[Int16,Nullable`1]):System.Nullable`1[Double] (3 methods)
          68 ( 3.98% of base) : System.Linq.dasm - System.Linq.Enumerable:Average(System.Collections.Generic.IEnumerable`1[Int32],System.Func`2[Int32,Nullable`1]):System.Nullable`1[Double] (3 methods)
          67 ( 2.12% of base) : FSharp.Core.dasm - <StartupCode$FSharp-Core>.$Quotations:eq@197(Microsoft.FSharp.Quotations.Tree,Microsoft.FSharp.Quotations.Tree):bool

Top method regressions (percentages):
          56 (68.29% of base) : System.Private.CoreLib.dasm - System.Threading.Interlocked:And(byref,long):long (2 methods)
          56 (68.29% of base) : System.Private.CoreLib.dasm - System.Threading.Interlocked:Or(byref,long):long (2 methods)
          23 (63.89% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - Microsoft.Diagnostics.Tracing.TraceEvent:CopyBlob(long,long,int)
          25 (62.50% of base) : Microsoft.CodeAnalysis.CSharp.dasm - PackedFlags:InitializeMethodKind(int):this
          25 (62.50% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - PackedFlags:InitializeMethodKind(int):this
          21 (56.76% of base) : System.Collections.Specialized.dasm - System.Collections.Specialized.BitVector32:CountBitsSet(short):short
          22 (56.41% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Microsoft.CodeAnalysis.VisualBasic.Syntax.InternalSyntax.BlockContextExtensions:RecoverFromMissingEnd(Microsoft.CodeAnalysis.VisualBasic.Syntax.InternalSyntax.BlockContext,Microsoft.CodeAnalysis.VisualBasic.Syntax.InternalSyntax.BlockContext)
          22 (55.00% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Microsoft.CodeAnalysis.VisualBasic.Symbols.SourceMemberMethodSymbol:set_SuppressDuplicateProcDefDiagnostics(bool):this
          23 (54.76% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Flags:EnsureMetadataVirtual():this
          23 (54.76% of base) : Microsoft.CSharp.dasm - Microsoft.CSharp.RuntimeBinder.Semantics.Symbol:LookupNext(long):Microsoft.CSharp.RuntimeBinder.Semantics.Symbol:this
          23 (54.76% of base) : System.Runtime.Numerics.dasm - System.Numerics.BigIntegerCalculator:AddDivisor(long,int,long,int):int
          24 (54.55% of base) : Microsoft.CSharp.dasm - Microsoft.CSharp.RuntimeBinder.Semantics.AggregateSymbol:FindBaseAgg(Microsoft.CSharp.RuntimeBinder.Semantics.AggregateSymbol):bool:this
          26 (54.17% of base) : System.Private.Xml.dasm - System.Xml.BinXmlSqlDecimal:MpNormalize(System.UInt32[],byref)
          21 (53.85% of base) : Microsoft.CodeAnalysis.dasm - Roslyn.Utilities.Hash:CombineFNVHash(int,System.String):int
          23 (52.27% of base) : System.Private.Xml.dasm - System.Xml.Serialization.XmlMapping:IsShallow(System.Xml.Serialization.XmlMapping[]):bool
          19 (51.35% of base) : System.Private.CoreLib.dasm - System.MemoryExtensions:ClampStart(System.ReadOnlySpan`1[Int32],int):int
          19 (51.35% of base) : System.Private.CoreLib.dasm - System.MemoryExtensions:ClampStart(System.ReadOnlySpan`1[Int64],long):int
          23 (51.11% of base) : System.IO.Compression.dasm - System.IO.Compression.ZipHelper:RequiresUnicode(System.String):bool
          23 (51.11% of base) : System.Net.Http.dasm - System.Net.Http.HttpResponseMessage:ContainsNewLineCharacter(System.String):bool:this
          23 (51.11% of base) : System.Net.Mail.dasm - System.Net.Mime.MailBnfHelper:HasCROrLF(System.String):bool

9891 total methods with Code Size differences (0 improved, 9891 regressed), 331070 unchanged.

Summary of Allocation Size diffs:

(Lower is better)

Total bytes of base: 51850481
Total bytes of diff: 52495660
Total bytes of delta: 645179 (1.24% of base)
diff is a regression.
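
For reference, the percentages and the (allocation size - code size) gap can be recomputed from the totals reported above for no-alignment vs. 32B non-adaptive. The helper names in this sketch are hypothetical; the byte counts are the ones in the summaries.

```cpp
#include <cstdint>

// Regression of `diff` relative to `base`, in percent of base.
inline double percentOfBase(std::int64_t base, std::int64_t diff)
{
    return 100.0 * static_cast<double>(diff - base) / static_cast<double>(base);
}

// Bytes allocated beyond the code that was actually emitted.
inline std::int64_t overAllocation(std::int64_t codeBytes, std::int64_t allocBytes)
{
    return allocBytes - codeBytes;
}

// Totals copied from the summaries above.
const std::int64_t kBaseCode  = 51527293; // no alignment, code size
const std::int64_t kDiffCode  = 51646256; // 32B non-adaptive, code size
const std::int64_t kBaseAlloc = 51850481; // no alignment, allocation size
const std::int64_t kDiffAlloc = 52495660; // 32B non-adaptive, allocation size
```

`percentOfBase(kBaseCode, kDiffCode)` is about 0.23 and `percentOfBase(kBaseAlloc, kDiffAlloc)` is about 1.24, matching the summaries. The over-allocation gap grows from 323188 bytes without alignment to 849404 bytes with 32B non-adaptive alignment.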

Allocation size impact details
jit-analyze -b E:\alignment\32-bytes\final-PR\noalign -d E:\alignment\32-bytes\final-PR\32B -m AllocSize 
Found 173 files with textual diffs.

Summary of Allocation Size diffs:
(Lower is better)

Total bytes of base: 51850481
Total bytes of diff: 52495660
Total bytes of delta: 645179 (1.24% of base)
    diff is a regression.

Top file regressions (bytes):
      124163 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm (4.04% of base)
       73159 : System.Private.CoreLib.dasm (1.47% of base)
       59352 : FSharp.Core.dasm (1.82% of base)
       37372 : System.Private.Xml.dasm (1.04% of base)
       28894 : System.Linq.dasm (3.15% of base)
       26323 : Microsoft.CodeAnalysis.CSharp.dasm (0.60% of base)
       26280 : Microsoft.CodeAnalysis.VisualBasic.dasm (0.47% of base)
       23576 : Microsoft.CodeAnalysis.dasm (1.35% of base)
       21778 : System.Data.Common.dasm (1.47% of base)
       18941 : System.Collections.Immutable.dasm (1.68% of base)
       12239 : System.Linq.Parallel.dasm (0.72% of base)
       12014 : System.Collections.Concurrent.dasm (3.56% of base)
       10915 : System.Private.DataContractSerialization.dasm (1.43% of base)
        8802 : System.Linq.Expressions.dasm (1.14% of base)
        8593 : System.Collections.dasm (1.88% of base)
        7828 : Microsoft.Diagnostics.FastSerialization.dasm (7.81% of base)
        6883 : System.Threading.Tasks.Dataflow.dasm (0.79% of base)
        6412 : Newtonsoft.Json.dasm (0.74% of base)
        5734 : System.Numerics.Tensors.dasm (1.80% of base)
        4945 : Microsoft.CSharp.dasm (1.31% of base)

163 total files with Allocation Size differences (0 improved, 163 regressed), 105 unchanged.

Top method regressions (bytes):
         500 (51.71% of base) : System.Configuration.ConfigurationManager.dasm - System.Configuration.SectionInformation:SetRuntimeConfigurationInformation(System.Configuration.BaseConfigurationRecord,System.Configuration.FactoryRecord,System.Configuration.SectionRecord):this
         448 (13.89% of base) : System.Text.RegularExpressions.dasm - System.Text.RegularExpressions.RegexInterpreter:FindFirstChar():bool:this (2 methods)
         434 (13.76% of base) : FSharp.Core.dasm - <StartupCode$FSharp-Core>.$Quotations:eq@197(Microsoft.FSharp.Quotations.Tree,Microsoft.FSharp.Quotations.Tree):bool
         434 ( 3.20% of base) : System.Text.RegularExpressions.dasm - System.Text.RegularExpressions.RegexInterpreter:Go():this (2 methods)
         341 (15.31% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Microsoft.CodeAnalysis.CSharp.Symbols.SourceMemberContainerTypeSymbol:ForceComplete(Microsoft.CodeAnalysis.SourceLocation,System.Threading.CancellationToken):this
         330 (23.52% of base) : System.Private.CoreLib.dasm - System.Type:FindMembers(int,int,System.Reflection.MemberFilter,System.Object):System.Reflection.MemberInfo[]:this
         313 (22.58% of base) : System.Private.CoreLib.dasm - DecCalc:ScaleResult(long,int,int):int
         310 ( 2.52% of base) : System.Formats.Asn1.dasm - System.Formats.Asn1.AsnWriter:WriteGeneralizedTimeCore(System.Formats.Asn1.Asn1Tag,System.DateTimeOffset,bool):this
         302 (22.88% of base) : System.Runtime.Numerics.dasm - System.Numerics.BigInteger:.ctor(System.ReadOnlySpan`1[Byte],bool,bool):this
         299 ( 8.41% of base) : System.DirectoryServices.Protocols.dasm - System.DirectoryServices.Protocols.LdapConnection:SendRequestHelper(System.DirectoryServices.Protocols.DirectoryRequest,byref):int:this
         279 ( 7.63% of base) : System.Private.CoreLib.dasm - System.Globalization.DateTimeFormatInfo:CreateTokenHashTable():System.Globalization.DateTimeFormatInfo+TokenHashValue[]:this
         271 (18.63% of base) : System.Data.OleDb.dasm - System.Data.OleDb.OleDbDataReader:CreateBindingsFromMetaData(bool):System.Data.OleDb.Bindings[]:this
         260 ( 7.12% of base) : System.Management.dasm - System.Management.PropertyData:MapValueToWmiValue(System.Object,byref,byref):System.Object
         260 ( 6.63% of base) : System.Private.Xml.dasm - System.Xml.Schema.XmlSchemaInference:InferSimpleType(System.String,byref):int
         256 ( 8.76% of base) : Newtonsoft.Json.dasm - Newtonsoft.Json.JsonValidatingReader:ValidateCurrentToken():this
         256 ( 2.46% of base) : System.Private.CoreLib.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this
         251 (39.90% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - Microsoft.Diagnostics.Tracing.Parsers.Symbol.FileVersionTraceData:PayloadValue(int):System.Object:this
         248 (11.00% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Microsoft.CodeAnalysis.CSharp.Symbols.SourceMemberContainerTypeSymbol:HasInstanceData(Microsoft.CodeAnalysis.CSharp.Syntax.MemberDeclarationSyntax):bool
         248 (29.14% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - Microsoft.Diagnostics.Tracing.Parsers.Symbol.FileVersionTraceData:ToXml(System.Text.StringBuilder):System.Text.StringBuilder:this
         248 (14.55% of base) : System.Private.CoreLib.dasm - System.SpanHelpers:SequenceCompareTo(byref,int,byref,int):int (7 methods)

Top method regressions (percentages):
          31 (1,550.00% of base) : System.Private.Xml.dasm - System.Xml.Xsl.Runtime.XmlAttributeCache:WriteValue(System.String):this
          31 (238.46% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Microsoft.CodeAnalysis.CSharp.Syntax.InternalSyntax.SyntaxListBuilder:Validate(int,int):this
          31 (238.46% of base) : xunit.performance.execution.dasm - BenchmarkIteratorImpl:SpinDelay(int)
          31 (206.67% of base) : System.Collections.dasm - System.Collections.Generic.SortedSet`1[__Canon][System.__Canon]:Log2(int):int
          31 (206.67% of base) : System.Collections.dasm - System.Collections.Generic.SortedSet`1[Byte][System.Byte]:Log2(int):int
          31 (206.67% of base) : System.Collections.dasm - System.Collections.Generic.SortedSet`1[Int16][System.Int16]:Log2(int):int
          31 (206.67% of base) : System.Collections.dasm - System.Collections.Generic.SortedSet`1[Int32][System.Int32]:Log2(int):int
          31 (206.67% of base) : System.Collections.dasm - System.Collections.Generic.SortedSet`1[Double][System.Double]:Log2(int):int
          31 (206.67% of base) : System.Collections.dasm - System.Collections.Generic.SortedSet`1[Vector`1][System.Numerics.Vector`1[System.Single]]:Log2(int):int
          31 (206.67% of base) : System.Collections.dasm - System.Collections.Generic.SortedSet`1[Int64][System.Int64]:Log2(int):int
          31 (193.75% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Microsoft.CodeAnalysis.VisualBasic.Syntax.InternalSyntax.SyntaxListBuilder:Validate(int,int):this
          31 (193.75% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Microsoft.CodeAnalysis.VisualBasic.Syntax.SyntaxListBuilder:Validate(int,int):this
          31 (172.22% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Byte][System.Byte]:get_Length():int:this
          31 (172.22% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Byte][System.Byte]:System.Collections.Generic.IReadOnlyCollection<'T>.get_Count():int:this
          31 (172.22% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Int16][System.Int16]:get_Length():int:this
          31 (172.22% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Int16][System.Int16]:System.Collections.Generic.IReadOnlyCollection<'T>.get_Count():int:this
          31 (172.22% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Int32][System.Int32]:get_Length():int:this
          31 (172.22% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Int32][System.Int32]:System.Collections.Generic.IReadOnlyCollection<'T>.get_Count():int:this
          31 (172.22% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Double][System.Double]:get_Length():int:this
          31 (172.22% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Double][System.Double]:System.Collections.Generic.IReadOnlyCollection<'T>.get_Count():int:this

19096 total methods with Allocation Size differences (0 improved, 19096 regressed), 321865 unchanged.

No-alignment vs. 16B non-adaptive

Summary of Code Size diffs:

(Lower is better)

Total bytes of base: 51527293
Total bytes of diff: 51583852
Total bytes of delta: 56559 (0.11% of base)
diff is a regression.

Code size impact details
jit-analyze -b E:\alignment\32-bytes\final-PR\noalign -d E:\alignment\32-bytes\final-PR\16B 
Found 173 files with textual diffs.

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 51527293
Total bytes of diff: 51583852
Total bytes of delta: 56559 (0.11% of base)
    diff is a regression.

Top file regressions (bytes):
        7955 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm (0.26% of base)
        7612 : FSharp.Core.dasm (0.23% of base)
        6835 : System.Private.CoreLib.dasm (0.14% of base)
        3238 : System.Private.Xml.dasm (0.09% of base)
        2668 : Microsoft.CodeAnalysis.CSharp.dasm (0.06% of base)
        2380 : Microsoft.CodeAnalysis.VisualBasic.dasm (0.04% of base)
        2243 : Microsoft.CodeAnalysis.dasm (0.13% of base)
        2206 : System.Linq.dasm (0.24% of base)
        2065 : System.Data.Common.dasm (0.14% of base)
        1606 : System.Collections.Immutable.dasm (0.14% of base)
        1257 : System.Private.DataContractSerialization.dasm (0.17% of base)
        1072 : System.Collections.Concurrent.dasm (0.32% of base)
         778 : System.Linq.Expressions.dasm (0.10% of base)
         706 : Microsoft.Diagnostics.FastSerialization.dasm (0.71% of base)
         645 : System.Collections.dasm (0.14% of base)
         590 : System.Linq.Parallel.dasm (0.03% of base)
         531 : System.Numerics.Tensors.dasm (0.17% of base)
         506 : Microsoft.VisualBasic.Core.dasm (0.11% of base)
         486 : Newtonsoft.Json.dasm (0.06% of base)
         473 : System.Net.Http.dasm (0.06% of base)

152 total files with Code Size differences (0 improved, 152 regressed), 116 unchanged.

Top method regressions (bytes):
          80 ( 2.54% of base) : FSharp.Core.dasm - <StartupCode$FSharp-Core>.$Quotations:eq@197(Microsoft.FSharp.Quotations.Tree,Microsoft.FSharp.Quotations.Tree):bool
          76 ( 2.36% of base) : System.Text.RegularExpressions.dasm - System.Text.RegularExpressions.RegexInterpreter:FindFirstChar():bool:this (2 methods)
          74 (11.76% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - Microsoft.Diagnostics.Tracing.Parsers.Symbol.FileVersionTraceData:PayloadValue(int):System.Object:this
          73 ( 9.32% of base) : System.Collections.Immutable.dasm - System.Collections.Immutable.ImmutableInterlocked:Update(byref,System.Func`2[ImmutableArray`1,ImmutableArray`1]):bool (7 methods)
          48 ( 3.72% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Microsoft.CodeAnalysis.CSharp.SyntaxFactory:NodesAreCorrectType(Microsoft.CodeAnalysis.SyntaxNodeOrTokenList):bool (7 methods)
          43 (31.39% of base) : System.Net.Http.dasm - System.Net.Mime.MailBnfHelper:CreateCharactersAllowedInComments():System.Boolean[]
          43 (31.39% of base) : System.Net.Mail.dasm - System.Net.Mime.MailBnfHelper:CreateCharactersAllowedInComments():System.Boolean[]
          42 ( 4.34% of base) : System.Configuration.ConfigurationManager.dasm - System.Configuration.SectionInformation:SetRuntimeConfigurationInformation(System.Configuration.BaseConfigurationRecord,System.Configuration.FactoryRecord,System.Configuration.SectionRecord):this
          39 ( 1.76% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Microsoft.CodeAnalysis.CSharp.Symbols.SourceMemberContainerTypeSymbol:ForceComplete(Microsoft.CodeAnalysis.SourceLocation,System.Threading.CancellationToken):this
          37 ( 2.67% of base) : System.Private.CoreLib.dasm - DecCalc:ScaleResult(long,int,int):int
          37 ( 6.70% of base) : System.Private.CoreLib.dasm - System.Buffers.Text.Utf8Formatter:TryFormatDecimalF(byref,System.Span`1[Byte],byref,ubyte):bool
          37 ( 9.18% of base) : System.Private.DataContractSerialization.dasm - NamespaceManager:LookupAttributePrefix(System.String):System.String:this
          37 ( 9.39% of base) : System.Private.Xml.Linq.dasm - System.Xml.Linq.XNode:CompareDocumentOrder(System.Xml.Linq.XNode,System.Xml.Linq.XNode):int
          35 (15.28% of base) : System.Linq.dasm - System.Linq.EnumerableSorter`2[__Canon,Int64][System.__Canon,System.Int64]:ComputeKeys(System.__Canon[],int):this
          35 (15.28% of base) : System.Linq.dasm - System.Linq.EnumerableSorter`2[Int64,Int64][System.Int64,System.Int64]:ComputeKeys(System.Int64[],int):this
          34 ( 4.44% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Microsoft.CodeAnalysis.CSharp.Symbols.SourceAssemblySymbol:ForceComplete(Microsoft.CodeAnalysis.SourceLocation,System.Threading.CancellationToken):this
          34 ( 8.08% of base) : System.Private.CoreLib.dasm - System.Text.UTF7Encoding:MakeTables():this
          34 ( 0.25% of base) : System.Text.RegularExpressions.dasm - System.Text.RegularExpressions.RegexInterpreter:Go():this (2 methods)
          31 (26.50% of base) : System.Net.Http.dasm - System.Net.Mime.MailBnfHelper:CreateCharactersAllowedInQuotedStrings():System.Boolean[]
          31 (26.50% of base) : System.Net.Http.dasm - System.Net.Mime.MailBnfHelper:CreateCharactersAllowedInDomainLiterals():System.Boolean[]

Top method regressions (percentages):
          14 (77.78% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Byte][System.Byte]:get_Length():int:this
          14 (77.78% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Byte][System.Byte]:System.Collections.Generic.IReadOnlyCollection<'T>.get_Count():int:this
          14 (77.78% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Int16][System.Int16]:get_Length():int:this
          14 (77.78% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Int16][System.Int16]:System.Collections.Generic.IReadOnlyCollection<'T>.get_Count():int:this
          14 (77.78% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Int32][System.Int32]:get_Length():int:this
          14 (77.78% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Int32][System.Int32]:System.Collections.Generic.IReadOnlyCollection<'T>.get_Count():int:this
          14 (77.78% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Double][System.Double]:get_Length():int:this
          14 (77.78% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Double][System.Double]:System.Collections.Generic.IReadOnlyCollection<'T>.get_Count():int:this
          14 (77.78% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Vector`1][System.Numerics.Vector`1[System.Single]]:get_Length():int:this
          14 (77.78% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Vector`1][System.Numerics.Vector`1[System.Single]]:System.Collections.Generic.IReadOnlyCollection<'T>.get_Count():int:this
          14 (77.78% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Int64][System.Int64]:get_Length():int:this
          14 (77.78% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Int64][System.Int64]:System.Collections.Generic.IReadOnlyCollection<'T>.get_Count():int:this
          14 (60.87% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Byte][System.Byte]:GetReverseIndex(int,int):int:this
          14 (60.87% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Int16][System.Int16]:GetReverseIndex(int,int):int:this
          14 (60.87% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Int32][System.Int32]:GetReverseIndex(int,int):int:this
          14 (60.87% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Double][System.Double]:GetReverseIndex(int,int):int:this
          14 (60.87% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Vector`1][System.Numerics.Vector`1[System.Single]]:GetReverseIndex(int,int):int:this
          14 (60.87% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Int64][System.Int64]:GetReverseIndex(int,int):int:this
          12 (60.00% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.ListModule:Length(Microsoft.FSharp.Collections.FSharpList`1[Byte]):int
          12 (60.00% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.ListModule:Length(Microsoft.FSharp.Collections.FSharpList`1[Int16]):int

8312 total methods with Code Size differences (0 improved, 8312 regressed), 332649 unchanged.

Summary of Allocation Size diffs:

(Lower is better)

Total bytes of base: 51850481
Total bytes of diff: 52161852
Total bytes of delta: 311371 (0.60% of base)
diff is a regression.

Allocation size impact details
jit-analyze -b E:\alignment\32-bytes\final-PR\noalign -d E:\alignment\32-bytes\final-PR\16B -m AllocSize 
Found 173 files with textual diffs.

Summary of Allocation Size diffs:
(Lower is better)

Total bytes of base: 51850481
Total bytes of diff: 52161852
Total bytes of delta: 311371 (0.60% of base)
    diff is a regression.

Top file regressions (bytes):
       60113 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm (1.95% of base)
       35303 : System.Private.CoreLib.dasm (0.71% of base)
       28517 : FSharp.Core.dasm (0.88% of base)
       17975 : System.Private.Xml.dasm (0.50% of base)
       13856 : System.Linq.dasm (1.51% of base)
       12833 : Microsoft.CodeAnalysis.CSharp.dasm (0.29% of base)
       12737 : Microsoft.CodeAnalysis.VisualBasic.dasm (0.23% of base)
       11442 : Microsoft.CodeAnalysis.dasm (0.65% of base)
       10641 : System.Data.Common.dasm (0.72% of base)
        9183 : System.Collections.Immutable.dasm (0.81% of base)
        5718 : System.Collections.Concurrent.dasm (1.69% of base)
        5627 : System.Linq.Parallel.dasm (0.33% of base)
        5273 : System.Private.DataContractSerialization.dasm (0.69% of base)
        4205 : System.Collections.dasm (0.92% of base)
        4177 : System.Linq.Expressions.dasm (0.54% of base)
        3792 : Microsoft.Diagnostics.FastSerialization.dasm (3.78% of base)
        3329 : System.Threading.Tasks.Dataflow.dasm (0.38% of base)
        3061 : Newtonsoft.Json.dasm (0.36% of base)
        2775 : System.Numerics.Tensors.dasm (0.87% of base)
        2391 : Microsoft.CSharp.dasm (0.63% of base)

163 total files with Allocation Size differences (0 improved, 163 regressed), 105 unchanged.

Top method regressions (bytes):
         244 (25.23% of base) : System.Configuration.ConfigurationManager.dasm - System.Configuration.SectionInformation:SetRuntimeConfigurationInformation(System.Configuration.BaseConfigurationRecord,System.Configuration.FactoryRecord,System.Configuration.SectionRecord):this
         224 ( 6.94% of base) : System.Text.RegularExpressions.dasm - System.Text.RegularExpressions.RegexInterpreter:FindFirstChar():bool:this (2 methods)
         210 ( 6.66% of base) : FSharp.Core.dasm - <StartupCode$FSharp-Core>.$Quotations:eq@197(Microsoft.FSharp.Quotations.Tree,Microsoft.FSharp.Quotations.Tree):bool
         210 ( 1.55% of base) : System.Text.RegularExpressions.dasm - System.Text.RegularExpressions.RegexInterpreter:Go():this (2 methods)
         165 ( 7.41% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Microsoft.CodeAnalysis.CSharp.Symbols.SourceMemberContainerTypeSymbol:ForceComplete(Microsoft.CodeAnalysis.SourceLocation,System.Threading.CancellationToken):this
         153 (11.04% of base) : System.Private.CoreLib.dasm - DecCalc:ScaleResult(long,int,int):int
         150 ( 1.22% of base) : System.Formats.Asn1.dasm - System.Formats.Asn1.AsnWriter:WriteGeneralizedTimeCore(System.Formats.Asn1.Asn1Tag,System.DateTimeOffset,bool):this
         150 (10.69% of base) : System.Private.CoreLib.dasm - System.Type:FindMembers(int,int,System.Reflection.MemberFilter,System.Object):System.Reflection.MemberInfo[]:this
         143 ( 4.02% of base) : System.DirectoryServices.Protocols.dasm - System.DirectoryServices.Protocols.LdapConnection:SendRequestHelper(System.DirectoryServices.Protocols.DirectoryRequest,byref):int:this
         139 (10.53% of base) : System.Runtime.Numerics.dasm - System.Numerics.BigInteger:.ctor(System.ReadOnlySpan`1[Byte],bool,bool):this
         135 ( 3.69% of base) : System.Private.CoreLib.dasm - System.Globalization.DateTimeFormatInfo:CreateTokenHashTable():System.Globalization.DateTimeFormatInfo+TokenHashValue[]:this
         128 ( 3.51% of base) : System.Management.dasm - System.Management.PropertyData:MapValueToWmiValue(System.Object,byref,byref):System.Object
         128 ( 1.23% of base) : System.Private.CoreLib.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this
         128 ( 3.26% of base) : System.Private.Xml.dasm - System.Xml.Schema.XmlSchemaInference:InferSimpleType(System.String,byref):int
         124 ( 8.52% of base) : System.Data.OleDb.dasm - System.Data.OleDb.OleDbDataReader:CreateBindingsFromMetaData(bool):System.Data.OleDb.Bindings[]:this
         120 ( 5.32% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Microsoft.CodeAnalysis.CSharp.Symbols.SourceMemberContainerTypeSymbol:HasInstanceData(Microsoft.CodeAnalysis.CSharp.Syntax.MemberDeclarationSyntax):bool
         120 (19.08% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - Microsoft.Diagnostics.Tracing.Parsers.Symbol.FileVersionTraceData:PayloadValue(int):System.Object:this
         120 (14.10% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - Microsoft.Diagnostics.Tracing.Parsers.Symbol.FileVersionTraceData:ToXml(System.Text.StringBuilder):System.Text.StringBuilder:this
         120 ( 4.11% of base) : Newtonsoft.Json.dasm - Newtonsoft.Json.JsonValidatingReader:ValidateCurrentToken():this
         120 ( 7.04% of base) : System.Private.CoreLib.dasm - System.SpanHelpers:SequenceCompareTo(byref,int,byref,int):int (7 methods)

Top method regressions (percentages):
          15 (750.00% of base) : System.Private.Xml.dasm - System.Xml.Xsl.Runtime.XmlAttributeCache:WriteValue(System.String):this
          15 (115.38% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Microsoft.CodeAnalysis.CSharp.Syntax.InternalSyntax.SyntaxListBuilder:Validate(int,int):this
          15 (115.38% of base) : xunit.performance.execution.dasm - BenchmarkIteratorImpl:SpinDelay(int)
          15 (100.00% of base) : System.Collections.dasm - System.Collections.Generic.SortedSet`1[__Canon][System.__Canon]:Log2(int):int
          15 (100.00% of base) : System.Collections.dasm - System.Collections.Generic.SortedSet`1[Byte][System.Byte]:Log2(int):int
          15 (100.00% of base) : System.Collections.dasm - System.Collections.Generic.SortedSet`1[Int16][System.Int16]:Log2(int):int
          15 (100.00% of base) : System.Collections.dasm - System.Collections.Generic.SortedSet`1[Int32][System.Int32]:Log2(int):int
          15 (100.00% of base) : System.Collections.dasm - System.Collections.Generic.SortedSet`1[Double][System.Double]:Log2(int):int
          15 (100.00% of base) : System.Collections.dasm - System.Collections.Generic.SortedSet`1[Vector`1][System.Numerics.Vector`1[System.Single]]:Log2(int):int
          15 (100.00% of base) : System.Collections.dasm - System.Collections.Generic.SortedSet`1[Int64][System.Int64]:Log2(int):int
          15 (93.75% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Microsoft.CodeAnalysis.VisualBasic.Syntax.InternalSyntax.SyntaxListBuilder:Validate(int,int):this
          15 (93.75% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Microsoft.CodeAnalysis.VisualBasic.Syntax.SyntaxListBuilder:Validate(int,int):this
          15 (83.33% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Byte][System.Byte]:get_Length():int:this
          15 (83.33% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Byte][System.Byte]:System.Collections.Generic.IReadOnlyCollection<'T>.get_Count():int:this
          15 (83.33% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Int16][System.Int16]:get_Length():int:this
          15 (83.33% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Int16][System.Int16]:System.Collections.Generic.IReadOnlyCollection<'T>.get_Count():int:this
          15 (83.33% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Int32][System.Int32]:get_Length():int:this
          15 (83.33% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Int32][System.Int32]:System.Collections.Generic.IReadOnlyCollection<'T>.get_Count():int:this
          15 (83.33% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Double][System.Double]:get_Length():int:this
          15 (83.33% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Double][System.Double]:System.Collections.Generic.IReadOnlyCollection<'T>.get_Count():int:this

18120 total methods with Allocation Size differences (0 improved, 18120 regressed), 322841 unchanged.

No-alignment vs. 32B adaptive

Summary of Code Size diffs:

(Lower is better)

Total bytes of base: 51527293
Total bytes of diff: 51567317
Total bytes of delta: 40024 (0.08% of base)
diff is a regression.

Code size impact details
jit-analyze -b E:\alignment\32-bytes\final-PR\noalign -d E:\alignment\32-bytes\final-PR\32BElastic-pred 

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 51527293
Total bytes of diff: 51567317
Total bytes of delta: 40024 (0.08% of base)
    diff is a regression.

Top file regressions (bytes):
        5469 : System.Private.CoreLib.dasm (0.11% of base)
        4650 : FSharp.Core.dasm (0.14% of base)
        2714 : System.Private.Xml.dasm (0.08% of base)
        2415 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm (0.08% of base)
        2299 : Microsoft.CodeAnalysis.CSharp.dasm (0.05% of base)
        1808 : Microsoft.CodeAnalysis.dasm (0.10% of base)
        1730 : Microsoft.CodeAnalysis.VisualBasic.dasm (0.03% of base)
        1679 : System.Data.Common.dasm (0.11% of base)
        1310 : System.Linq.dasm (0.14% of base)
        1210 : System.Collections.Immutable.dasm (0.11% of base)
        1166 : System.Collections.Concurrent.dasm (0.35% of base)
         749 : System.Private.DataContractSerialization.dasm (0.10% of base)
         648 : System.Threading.Tasks.Dataflow.dasm (0.07% of base)
         575 : System.Collections.dasm (0.13% of base)
         475 : System.Numerics.Tensors.dasm (0.15% of base)
         388 : Microsoft.Diagnostics.FastSerialization.dasm (0.39% of base)
         384 : System.Text.RegularExpressions.dasm (0.15% of base)
         381 : System.Net.Http.dasm (0.05% of base)
         376 : System.Linq.Expressions.dasm (0.05% of base)
         373 : System.Configuration.ConfigurationManager.dasm (0.11% of base)

152 total files with Code Size differences (0 improved, 152 regressed), 116 unchanged.

Top method regressions (bytes):
          67 ( 2.12% of base) : FSharp.Core.dasm - <StartupCode$FSharp-Core>.$Quotations:eq@197(Microsoft.FSharp.Quotations.Tree,Microsoft.FSharp.Quotations.Tree):bool
          66 ( 0.49% of base) : System.Text.RegularExpressions.dasm - System.Text.RegularExpressions.RegexInterpreter:Go():this (2 methods)
          58 ( 9.22% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - Microsoft.Diagnostics.Tracing.Parsers.Symbol.FileVersionTraceData:PayloadValue(int):System.Object:this
          58 ( 1.80% of base) : System.Text.RegularExpressions.dasm - System.Text.RegularExpressions.RegexInterpreter:FindFirstChar():bool:this (2 methods)
          52 ( 7.14% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Microsoft.CodeAnalysis.CSharp.Symbols.SourceMethodSymbol:ForceComplete(Microsoft.CodeAnalysis.SourceLocation,System.Threading.CancellationToken):this
          50 ( 6.53% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Microsoft.CodeAnalysis.CSharp.Symbols.SourceAssemblySymbol:ForceComplete(Microsoft.CodeAnalysis.SourceLocation,System.Threading.CancellationToken):this
          44 ( 5.18% of base) : System.Collections.Immutable.dasm - System.Collections.Immutable.ImmutableInterlocked:Update(byref,System.Func`3[ImmutableArray`1,Int64,ImmutableArray`1],long):bool (7 methods)
          44 ( 7.56% of base) : System.Runtime.Serialization.Formatters.dasm - System.Runtime.Serialization.Formatters.Binary.ObjectWriter:WriteRectangle(System.Runtime.Serialization.Formatters.Binary.WriteObjectInfo,int,System.Int32[],System.Array,System.Runtime.Serialization.Formatters.Binary.NameInfo,System.Int32[]):this
          43 (31.39% of base) : System.Net.Http.dasm - System.Net.Mime.MailBnfHelper:CreateCharactersAllowedInComments():System.Boolean[]
          43 (31.39% of base) : System.Net.Mail.dasm - System.Net.Mime.MailBnfHelper:CreateCharactersAllowedInComments():System.Boolean[]
          38 ( 1.93% of base) : System.Data.Odbc.dasm - System.Data.Odbc.OdbcDataReader:RetrieveKeyInfoFromStatistics(QualifiedTableName,bool):int:this
          37 ( 8.69% of base) : System.Data.Common.dasm - System.Data.SqlTypes.SqlString:CompareBinary2(System.Data.SqlTypes.SqlString,System.Data.SqlTypes.SqlString):int
          37 ( 6.70% of base) : System.Private.CoreLib.dasm - System.Buffers.Text.Utf8Formatter:TryFormatDecimalF(byref,System.Span`1[Byte],byref,ubyte):bool
          36 ( 4.96% of base) : System.Private.Xml.dasm - System.Xml.Xsl.XPathConvert:StringToDouble(System.String):double
          33 (13.25% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Microsoft.CodeAnalysis.CSharp.Imports:Complete(System.Threading.CancellationToken):this
          33 (27.97% of base) : System.IO.Compression.dasm - System.IO.Compression.HuffmanTree:GetStaticLiteralTreeLength():System.Byte[]
          33 ( 2.50% of base) : System.Runtime.Numerics.dasm - System.Numerics.BigInteger:.ctor(System.ReadOnlySpan`1[Byte],bool,bool):this
          32 ( 3.31% of base) : System.Configuration.ConfigurationManager.dasm - System.Configuration.SectionInformation:SetRuntimeConfigurationInformation(System.Configuration.BaseConfigurationRecord,System.Configuration.FactoryRecord,System.Configuration.SectionRecord):this
          31 (26.50% of base) : System.Net.Http.dasm - System.Net.Mime.MailBnfHelper:CreateCharactersAllowedInQuotedStrings():System.Boolean[]
          31 (26.50% of base) : System.Net.Http.dasm - System.Net.Mime.MailBnfHelper:CreateCharactersAllowedInDomainLiterals():System.Boolean[]

Top method regressions (percentages):
          15 (37.50% of base) : System.Collections.Immutable.dasm - Node[Double][System.Double]:get_Max():double:this
          15 (37.50% of base) : System.Collections.Immutable.dasm - Node[Double][System.Double]:get_Min():double:this
          15 (36.59% of base) : System.Private.CoreLib.dasm - Container[__Canon,__Canon][System.__Canon,System.__Canon]:RemoveAllKeys():this
          15 (35.71% of base) : Microsoft.CodeAnalysis.dasm - Microsoft.Cci.MetadataWriter:Count(System.String,ushort):int
          15 (34.88% of base) : System.Collections.Concurrent.dasm - System.Collections.Concurrent.ConcurrentDictionary`2[__Canon,Int64][System.__Canon,System.Int64]:<get_IsEmpty>g__AreAllBucketsEmpty|50_0():bool:this
          15 (34.88% of base) : System.Collections.Concurrent.dasm - System.Collections.Concurrent.ConcurrentDictionary`2[Byte,Int64][System.Byte,System.Int64]:<get_IsEmpty>g__AreAllBucketsEmpty|50_0():bool:this
          15 (34.88% of base) : System.Collections.Concurrent.dasm - System.Collections.Concurrent.ConcurrentDictionary`2[Int16,Int64][System.Int16,System.Int64]:<get_IsEmpty>g__AreAllBucketsEmpty|50_0():bool:this
          15 (34.88% of base) : System.Collections.Concurrent.dasm - System.Collections.Concurrent.ConcurrentDictionary`2[Int32,Int64][System.Int32,System.Int64]:<get_IsEmpty>g__AreAllBucketsEmpty|50_0():bool:this
          15 (34.88% of base) : System.Collections.Concurrent.dasm - System.Collections.Concurrent.ConcurrentDictionary`2[Double,Int64][System.Double,System.Int64]:<get_IsEmpty>g__AreAllBucketsEmpty|50_0():bool:this
          15 (34.88% of base) : System.Collections.Concurrent.dasm - System.Collections.Concurrent.ConcurrentDictionary`2[Vector`1,Int64][System.Numerics.Vector`1[System.Single],System.Int64]:<get_IsEmpty>g__AreAllBucketsEmpty|50_0():bool:this
          15 (34.88% of base) : System.Collections.Concurrent.dasm - System.Collections.Concurrent.ConcurrentDictionary`2[Int64,Int64][System.Int64,System.Int64]:<get_IsEmpty>g__AreAllBucketsEmpty|50_0():bool:this
          14 (32.56% of base) : System.Reflection.MetadataLoadContext.dasm - System.Reflection.TypeLoading.GetTypeCoreCache:ComputeHashCode(System.ReadOnlySpan`1[Byte]):int
          14 (32.56% of base) : System.Security.Claims.dasm - System.Security.Claims.ClaimsIdentity:IsCircular(System.Security.Claims.ClaimsIdentity):bool:this
          15 (31.91% of base) : System.Private.Xml.dasm - System.Xml.Serialization.SchemaObjectWriter:WriteIndent():this
          14 (31.82% of base) : System.Private.CoreLib.dasm - System.Reflection.MetadataToken:IsTokenOfType(int,System.Reflection.MetadataTokenType[]):bool
          43 (31.39% of base) : System.Net.Http.dasm - System.Net.Mime.MailBnfHelper:CreateCharactersAllowedInComments():System.Boolean[]
          43 (31.39% of base) : System.Net.Mail.dasm - System.Net.Mime.MailBnfHelper:CreateCharactersAllowedInComments():System.Boolean[]
          15 (31.25% of base) : System.Private.DataContractSerialization.dasm - System.Runtime.Serialization.Json.XmlJsonWriter:WriteIndent():this
          15 (30.00% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SynthesizedComInterface:GetNextAvailableDispId(System.Collections.Generic.HashSet`1[Int32],byref):int
          11 (29.73% of base) : System.Collections.Concurrent.dasm - System.Collections.Concurrent.ConcurrentDictionary`2[__Canon,Int64][System.__Canon,System.Int64]:GetCountInternal():int:this

6756 total methods with Code Size differences (0 improved, 6756 regressed), 334205 unchanged.

Summary of Allocation Size diffs:

(Lower is better)

Total bytes of base: 51850481
Total bytes of diff: 52214682
Total bytes of delta: 364201 (0.70% of base)
diff is a regression.

Allocation size impact details
jit-analyze -b E:\alignment\32-bytes\final-PR\noalign -d E:\alignment\32-bytes\final-PR\32BElastic-pred -m AllocSize 

Summary of Allocation Size diffs:
(Lower is better)

Total bytes of base: 51850481
Total bytes of diff: 52214682
Total bytes of delta: 364201 (0.70% of base)
    diff is a regression.

Top file regressions (bytes):
       61298 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm (1.99% of base)
       39293 : System.Private.CoreLib.dasm (0.79% of base)
       30947 : FSharp.Core.dasm (0.95% of base)
       22265 : System.Private.Xml.dasm (0.62% of base)
       19541 : System.Linq.dasm (2.13% of base)
       16142 : Microsoft.CodeAnalysis.VisualBasic.dasm (0.29% of base)
       15773 : Microsoft.CodeAnalysis.CSharp.dasm (0.36% of base)
       14496 : System.Data.Common.dasm (0.98% of base)
       13797 : Microsoft.CodeAnalysis.dasm (0.79% of base)
       11058 : System.Collections.Immutable.dasm (0.98% of base)
        8717 : System.Linq.Parallel.dasm (0.51% of base)
        6933 : System.Collections.Concurrent.dasm (2.05% of base)
        5933 : System.Private.DataContractSerialization.dasm (0.78% of base)
        5285 : System.Collections.dasm (1.16% of base)
        4582 : System.Linq.Expressions.dasm (0.59% of base)
        4122 : Microsoft.Diagnostics.FastSerialization.dasm (4.11% of base)
        3706 : Newtonsoft.Json.dasm (0.43% of base)
        3554 : System.Threading.Tasks.Dataflow.dasm (0.41% of base)
        3243 : System.Configuration.ConfigurationManager.dasm (0.93% of base)
        3180 : System.Numerics.Tensors.dasm (1.00% of base)

164 total files with Allocation Size differences (0 improved, 164 regressed), 104 unchanged.

Top method regressions (bytes):
         244 (25.23% of base) : System.Configuration.ConfigurationManager.dasm - System.Configuration.SectionInformation:SetRuntimeConfigurationInformation(System.Configuration.BaseConfigurationRecord,System.Configuration.FactoryRecord,System.Configuration.SectionRecord):this
         240 ( 1.77% of base) : System.Text.RegularExpressions.dasm - System.Text.RegularExpressions.RegexInterpreter:Go():this (2 methods)
         224 ( 6.94% of base) : System.Text.RegularExpressions.dasm - System.Text.RegularExpressions.RegexInterpreter:FindFirstChar():bool:this (2 methods)
         210 ( 6.66% of base) : FSharp.Core.dasm - <StartupCode$FSharp-Core>.$Quotations:eq@197(Microsoft.FSharp.Quotations.Tree,Microsoft.FSharp.Quotations.Tree):bool
         180 (12.83% of base) : System.Private.CoreLib.dasm - System.Type:FindMembers(int,int,System.Reflection.MemberFilter,System.Object):System.Reflection.MemberInfo[]:this
         173 ( 4.87% of base) : System.DirectoryServices.Protocols.dasm - System.DirectoryServices.Protocols.LdapConnection:SendRequestHelper(System.DirectoryServices.Protocols.DirectoryRequest,byref):int:this
         165 ( 7.41% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Microsoft.CodeAnalysis.CSharp.Symbols.SourceMemberContainerTypeSymbol:ForceComplete(Microsoft.CodeAnalysis.SourceLocation,System.Threading.CancellationToken):this
         158 ( 4.33% of base) : System.Management.dasm - System.Management.PropertyData:MapValueToWmiValue(System.Object,byref,byref):System.Object
         158 ( 4.03% of base) : System.Private.Xml.dasm - System.Xml.Schema.XmlSchemaInference:InferSimpleType(System.String,byref):int
         153 (11.04% of base) : System.Private.CoreLib.dasm - DecCalc:ScaleResult(long,int,int):int
         150 ( 1.22% of base) : System.Formats.Asn1.dasm - System.Formats.Asn1.AsnWriter:WriteGeneralizedTimeCore(System.Formats.Asn1.Asn1Tag,System.DateTimeOffset,bool):this
         150 ( 4.10% of base) : System.Private.CoreLib.dasm - System.Globalization.DateTimeFormatInfo:CreateTokenHashTable():System.Globalization.DateTimeFormatInfo+TokenHashValue[]:this
         139 (10.53% of base) : System.Runtime.Numerics.dasm - System.Numerics.BigInteger:.ctor(System.ReadOnlySpan`1[Byte],bool,bool):this
         128 ( 2.77% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Microsoft.CodeAnalysis.VisualBasic.Syntax.InternalSyntax.Scanner:ScanNumericLiteral(Microsoft.CodeAnalysis.VisualBasic.Syntax.InternalSyntax.SyntaxList`1[[Microsoft.CodeAnalysis.VisualBasic.Syntax.InternalSyntax.VisualBasicSyntaxNode, Microsoft.CodeAnalysis.VisualBasic, Version=1.1.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]]):Microsoft.CodeAnalysis.VisualBasic.Syntax.InternalSyntax.SyntaxToken:this
         128 ( 7.30% of base) : System.DirectoryServices.Protocols.dasm - System.DirectoryServices.Protocols.LdapSessionOptions:StartTransportLayerSecurity(System.DirectoryServices.Protocols.DirectoryControlCollection):this
         128 ( 1.23% of base) : System.Private.CoreLib.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this
         124 ( 9.47% of base) : System.Data.Common.dasm - System.Data.Select:CreateIndex():this
         124 ( 8.52% of base) : System.Data.OleDb.dasm - System.Data.OleDb.OleDbDataReader:CreateBindingsFromMetaData(bool):System.Data.OleDb.Bindings[]:this
         124 ( 3.52% of base) : System.Private.Xml.dasm - System.Xml.Schema.Compiler:Compile():bool:this
         120 ( 4.33% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Microsoft.CodeAnalysis.CSharp.Syntax.InternalSyntax.LanguageParser:ParseNamespaceBody(byref,byref,byref,ushort):this

Top method regressions (percentages):
          15 (750.00% of base) : System.Private.Xml.dasm - System.Xml.Xsl.Runtime.XmlAttributeCache:WriteValue(System.String):this
          15 (115.38% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Microsoft.CodeAnalysis.CSharp.Syntax.InternalSyntax.SyntaxListBuilder:Validate(int,int):this
          15 (115.38% of base) : xunit.performance.execution.dasm - BenchmarkIteratorImpl:SpinDelay(int)
          15 (100.00% of base) : System.Collections.dasm - System.Collections.Generic.SortedSet`1[__Canon][System.__Canon]:Log2(int):int
          15 (100.00% of base) : System.Collections.dasm - System.Collections.Generic.SortedSet`1[Byte][System.Byte]:Log2(int):int
          15 (100.00% of base) : System.Collections.dasm - System.Collections.Generic.SortedSet`1[Int16][System.Int16]:Log2(int):int
          15 (100.00% of base) : System.Collections.dasm - System.Collections.Generic.SortedSet`1[Int32][System.Int32]:Log2(int):int
          15 (100.00% of base) : System.Collections.dasm - System.Collections.Generic.SortedSet`1[Double][System.Double]:Log2(int):int
          15 (100.00% of base) : System.Collections.dasm - System.Collections.Generic.SortedSet`1[Vector`1][System.Numerics.Vector`1[System.Single]]:Log2(int):int
          15 (100.00% of base) : System.Collections.dasm - System.Collections.Generic.SortedSet`1[Int64][System.Int64]:Log2(int):int
          15 (93.75% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Microsoft.CodeAnalysis.VisualBasic.Syntax.InternalSyntax.SyntaxListBuilder:Validate(int,int):this
          15 (93.75% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Microsoft.CodeAnalysis.VisualBasic.Syntax.SyntaxListBuilder:Validate(int,int):this
          15 (83.33% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Byte][System.Byte]:get_Length():int:this
          15 (83.33% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Byte][System.Byte]:System.Collections.Generic.IReadOnlyCollection<'T>.get_Count():int:this
          15 (83.33% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Int16][System.Int16]:get_Length():int:this
          15 (83.33% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Int16][System.Int16]:System.Collections.Generic.IReadOnlyCollection<'T>.get_Count():int:this
          15 (83.33% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Int32][System.Int32]:get_Length():int:this
          15 (83.33% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Int32][System.Int32]:System.Collections.Generic.IReadOnlyCollection<'T>.get_Count():int:this
          15 (83.33% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Double][System.Double]:get_Length():int:this
          15 (83.33% of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.FSharpList`1[Double][System.Double]:System.Collections.Generic.IReadOnlyCollection<'T>.get_Count():int:this

19974 total methods with Allocation Size differences (0 improved, 19974 regressed), 320987 unchanged.

Update

Overestimation fix

Since the allocation size regressed heavily because of over-estimated align instructions, in the recent update I tried to fix that problem. The only reason the exact padding needed to align a loop could not be determined before allocating memory was that certain instructions were over-estimated and then emitted in a more compact form, so the offsets used to calculate the padding no longer matched the emitted code. The majority of the over-estimation came from #21453, where we optimize the encoding of certain instructions by trimming the VEX prefix from 3 bytes to 2 bytes. To make sure that does not affect the alignment calculation, I have disabled this optimization until we reach the last align instruction. Disabling compact VEX encoding should not affect performance. The only downside is that the code size of methods containing align instructions will be larger, but there won't be any change in the amount of memory allocated. For the remaining over-estimated instructions (see more details on them at #8748 (comment)), we add a NOP after them. Again, this happens only until we reach the last align instruction of a method, after which no compensation for over-estimation is done.
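To illustrate why over-estimation matters, here is a minimal sketch of the arithmetic involved. It is not the emitter's actual code: `paddingToAlign` and `nopCompensation` are hypothetical helper names, and the sketch assumes the boundary is a power of two (such as the default 32 bytes).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not a JIT function): number of padding bytes needed so
// that `offset` lands on the next multiple of `boundary`. Assumes `boundary`
// is a power of two, e.g. 16 or 32.
inline std::size_t paddingToAlign(std::uint64_t offset, std::size_t boundary)
{
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    return static_cast<std::size_t>((boundary - (offset & (boundary - 1))) & (boundary - 1));
}

// Hypothetical helper: if an instruction was estimated at `estimatedSize`
// bytes but actually encodes in `actualSize` bytes, emitting the difference
// as NOP bytes keeps every later offset equal to the estimate.
inline std::size_t nopCompensation(std::size_t estimatedSize, std::size_t actualSize)
{
    assert(actualSize <= estimatedSize);
    return estimatedSize - actualSize;
}
```

For example, if the loop head is estimated to start at offset 0x3d, `paddingToAlign(0x3d, 32)` gives 3 bytes of padding, which places the loop at 0x40. If an earlier instruction turns out 1 byte shorter than estimated and nothing compensates, the same 3 bytes of padding would place the loop at 0x3f, one byte short of the boundary. Keeping the longer encoding, or adding `nopCompensation(3, 2) == 1` NOP byte, keeps the estimated offset valid.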

With that, the gap between allocation size and code size shrinks, as seen in the graphs below.

image

image

Summary of Code Size diffs:

(Lower is better)
Total bytes of base: 51624344
Total bytes of diff: 51643618
Total bytes of delta: 19274 (0.04% of base)
diff is a regression.

Summary of Allocation Size diffs:

(Lower is better)

Total bytes of base: 51942836
Total bytes of diff: 51957731
Total bytes of delta: 14895 (0.03% of base)
diff is a regression.
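As a quick sanity check on the summaries above, the delta and percentage can be recomputed from the base and diff totals. This is only a sketch of the arithmetic, not how jit-analyze is implemented; `SizeDiff` and `computeSizeDiff` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type for illustration only.
struct SizeDiff
{
    std::int64_t deltaBytes;    // diff - base; positive means a regression
    double       percentOfBase; // delta expressed as a percentage of base
};

// Recomputes the "Total bytes of delta" line from the base and diff totals.
inline SizeDiff computeSizeDiff(std::int64_t baseBytes, std::int64_t diffBytes)
{
    assert(baseBytes > 0);
    const std::int64_t delta = diffBytes - baseBytes;
    return {delta, 100.0 * static_cast<double>(delta) / static_cast<double>(baseBytes)};
}
```

With the code size totals, `computeSizeDiff(51624344, 51643618)` yields a delta of 19274 bytes, roughly 0.037% of base, which rounds to the reported 0.04%. With the allocation size totals, `computeSizeDiff(51942836, 51957731)` yields 14895 bytes, roughly 0.029% of base, reported as 0.03%.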

Perf impact

I have been testing various Microbenchmarks with my loop alignment changes (both adaptive and non-adaptive) and sharing my findings in #44051. I also ran all the Microbenchmarks on my machine for no-alignment, 32B (non-adaptive), 16B (non-adaptive), 32BAdaptive and 32BJCC, and then compared the results using a noise threshold of 1ns and a statistical threshold of 3%.

In the following graph, the X-axis is the benchmark ID and the Y-axis is base/diff. Since both base and diff are measured in nanoseconds, the higher the ratio, the better the performance.

image

Here are some of the key observations:

  • The 32B adaptive approach beats the 32B non-adaptive approach.
  • Fixing JCC while aligning at 32B doesn't improve performance compared to the 32B adaptive approach.

If we zoom into the graph and switch the Y-axis to a log10 scale, here are some observations:

  • 32B adaptive starts improving sooner, after 171 benchmarks, compared to the next best approach, 32B non-adaptive, which gains performance after 241 benchmarks.
  • We get the maximum performance benefit sooner with the 32B adaptive approach.

image

I analyzed some of the regressions, and some of them came from memory alignment, although I did not verify all of them.

Defaults

Looking at the code size and perf impact, the 32B adaptive alignment approach is ON by default. Loop alignment can be switched off using COMPlus_JitAlignLoops=0.

Flags

Here is the set of flags added as part of this PR:

  • COMPlus_JitAlignLoopMinBlockWeight: Debug flag that controls the minimum block weight of a loop after which alignment will happen. Default value is 10.
  • COMPlus_JitAlignLoopMaxCodeSize: Debug flag that controls the maximum loop size for which alignment will happen. Default value is 96 bytes for non-adaptive; ignored for adaptive.
  • COMPlus_JitAlignLoopBoundary: Debug flag that controls the boundary at which loops will be aligned. Default value is 32 bytes.
  • COMPlus_JitAlignLoopForJcc: Debug flag that controls whether JCC adjustment should be done during non-adaptive alignment. Default value is 0.
  • COMPlus_JitAlignLoopAdaptive: Debug flag that controls whether adaptive or non-adaptive alignment should happen. Default value is 1.
  • COMPlus_JitAlignLoops: Controls whether loop alignment should be done. Default value is 1.
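The way these flags interact can be sketched roughly as below. This is a simplified model, not the JIT's code: `AlignConfig` and `isAlignmentCandidate` are hypothetical names, the defaults are the ones listed above, the block weight is treated as a plain number, and treating the minimum weight as an inclusive bound is an assumption. The adaptive mode's own padding limits and the JCC adjustment are not modeled.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical mirror of the flags above, with their documented defaults.
struct AlignConfig
{
    bool        alignLoops     = true; // COMPlus_JitAlignLoops
    bool        adaptive       = true; // COMPlus_JitAlignLoopAdaptive
    bool        adjustForJcc   = false; // COMPlus_JitAlignLoopForJcc (not modeled below)
    std::size_t boundary       = 32;   // COMPlus_JitAlignLoopBoundary, in bytes
    std::size_t maxCodeSize    = 96;   // COMPlus_JitAlignLoopMaxCodeSize, in bytes
    unsigned    minBlockWeight = 10;   // COMPlus_JitAlignLoopMinBlockWeight
};

// Simplified check: would a loop with this head-block weight and code size
// be considered for alignment under the given configuration?
inline bool isAlignmentCandidate(const AlignConfig& cfg, unsigned loopHeadWeight, std::size_t loopSizeBytes)
{
    if (!cfg.alignLoops)
    {
        return false; // alignment switched off entirely
    }
    if (loopHeadWeight < cfg.minBlockWeight)
    {
        return false; // loop is not hot enough (inclusive bound is an assumption)
    }
    if (!cfg.adaptive && loopSizeBytes > cfg.maxCodeSize)
    {
        return false; // the size limit applies to non-adaptive mode only
    }
    return true;
}
```

Under the defaults, a loop with weight 16 is a candidate regardless of its size, because the size limit is ignored in adaptive mode. With `adaptive = false`, the same loop is a candidate only if it is at most 96 bytes long.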

Other changes

  • When COMPlus_JitDasmWithAddress is set, also emit the chunk boundary in the disassembly. The alignment boundary is whatever value is set in COMPlus_JitAlignLoopBoundary. If an instruction crosses the boundary, the logging will also indicate that.
Sample disassembly output
 00007ffb`c8a0f418        0F84B5010000         je       G_M37683_IG30
 00007ffb`c8a0f41e        0FB74906             movzx    rcx, word  ptr [rcx+6]
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (movzx: 2) 32B boundary ...............................
 00007ffb`c8a0f422        3BD9                 cmp      ebx, ecx
 00007ffb`c8a0f424        0F84A3010000         je       G_M37683_IG29
 00007ffb`c8a0f42a        4983C604             add      r14, 4
 00007ffb`c8a0f42e        4983C4FC             add      r12, -4
						;; bbWeight=2    PerfScore 28.50
G_M37683_IG07:              ;; offset=00B2H
 00007ffb`c8a0f432        4983FC04             cmp      r12, 4
 00007ffb`c8a0f436        7DBC                 jge      SHORT G_M37683_IG06
						;; bbWeight=16    PerfScore 20.00
G_M37683_IG08:              ;; offset=00B8H
 00007ffb`c8a0f438        4D85E4               test     r12, r12
 00007ffb`c8a0f43b        7E2A                 jle      SHORT G_M37683_IG10
 00007ffb`c8a0f43d        0FB7DF               movzx    rbx, di
; ............................... 32B boundary ...............................
 00007ffb`c8a0f440                             align    
						;; bbWeight=4    PerfScore 7.00
G_M37683_IG09:              ;; offset=00C0H
 00007ffb`c8a0f440        420FB70476           movzx    rax, word  ptr [rsi+2*r14]
 00007ffb`c8a0f445        3BC3                 cmp      eax, ebx
 00007ffb`c8a0f447        0F8483010000         je       G_M37683_IG32
 00007ffb`c8a0f44d        49FFC6               inc      r14
 00007ffb`c8a0f450        49FFCC               dec      r12
 00007ffb`c8a0f453        4D85E4               test     r12, r12
 00007ffb`c8a0f456        7FE8                 jg       SHORT G_M37683_IG09
						;; bbWeight=16    PerfScore 80.00
G_M37683_IG10:              ;; offset=00D8H
 00007ffb`c8a0f458        4D3BF7               cmp      r14, r15
 00007ffb`c8a0f45b        0F8D36010000         jge      G_M37683_IG27
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (jge: 1) 32B boundary ...............................
 00007ffb`c8a0f461        488BD5               mov      rdx, rbp
 00007ffb`c8a0f464        498BCF               mov      rcx, r15
 00007ffb`c8a0f467        492BCE               sub      rcx, r14
 00007ffb`c8a0f46a        4883F908             cmp      rcx, 8
 00007ffb`c8a0f46e        7D08                 jge      SHORT G_M37683_IG12
						;; bbWeight=4    PerfScore 13.00
G_M37683_IG11:              ;; offset=00F0H
 00007ffb`c8a0f470        488BCA               mov      rcx, rdx
 00007ffb`c8a0f473        E8C0C8FFFF           call     System.Diagnostics.Debug:Fail(System.String,System.String)
						;; bbWeight=1    PerfScore 1.25
G_M37683_IG12:              ;; offset=00F8H
 00007ffb`c8a0f478        4A8D0C76             lea      rcx, bword ptr [rsi+2*r14]
 00007ffb`c8a0f47c        488BD1               mov      rdx, rcx
 00007ffb`c8a0f47f        F6C21F               test     dl, 31
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (test: 2) 32B boundary ...............................
  • Another change that I added is to dump the total allocation request made to the runtime for the code produced for a method. This helped me understand the effect of alignment instructions that I emitted without emitting the corresponding padding.
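The numbers in the boundary annotations of the sample disassembly above (for example `(jge: 1)`) match the count of instruction bytes that fall past the 32B boundary. A small C++ sketch of that calculation follows; `bytesPastBoundary` is a hypothetical helper written for this illustration, not a function from the JIT.

```cpp
#include <cstdint>

// Returns how many bytes of an instruction lie past the alignment boundary it
// straddles (the last one, if it straddles several), or 0 if the instruction
// stays within one chunk. `boundary` is assumed to be a power of two.
inline unsigned bytesPastBoundary(std::uint64_t address, unsigned size, unsigned boundary)
{
    if (size == 0)
    {
        return 0;
    }
    const std::uint64_t mask           = ~static_cast<std::uint64_t>(boundary - 1);
    const std::uint64_t lastByte       = address + size - 1;
    const std::uint64_t lastChunkStart = lastByte & mask;
    if ((address & mask) == lastChunkStart)
    {
        return 0; // first and last byte live in the same chunk
    }
    return static_cast<unsigned>(lastByte - lastChunkStart + 1);
}
```

For the `jge` at `00007ffb`c8a0f45b` (6 bytes), the last byte is at `...f460`, which is the first byte of the next 32B chunk, so one byte spills over.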

Update

Impacted loops

From my preliminary pmi run on .NET libraries, with this feature, we align 1908 loops present in 1635 methods. This number is lower than in my earlier implementation, where I was also aligning loops containing calls. In that case, we were aligning 4692 loops present in 4239 methods, but since alignment will not benefit such loops anyway, I am disabling alignment if the loop has a call.

Edge cases

There were certain edge cases that need to be fixed:

  • If a loop is cloned and/or a new block is added by resolution phase of register allocator, there were cases where the new block jumped to a previous block such that it covered a loop that is already marked for alignment. I now detect such patterns and do not mark loop 4~7 as needing alignment.
            4 <------x
            |        |
            V        |
            5 <---x  |
            |     |  |
            V     |  |
            6 ----x  |
            |        |
            V        |
            7 -------x

 loop 5~6 : marked for alignment
 block 7 : added by resolution phase
 
  • Critical edge: If a critical edge is added by the register allocator, then when calculating the loop size we might encounter an instruction group that has align instructions for the next loop that follows the current loop. For such cases, do not take the size of the align instruction into account because it is reserved for the next loop.
            4 <------x
            |        |
            V        |
            5 -------x
            | <align>
            V
            6 <------x
            |        |
            V        |
            7 -------x

 loop 5~6 : marked for alignment
 block 7 : added by resolution phase
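The first edge case can be sketched in C++ as a check on ranges. This is only an illustration under assumptions: block numbers from the diagram stand in for IG numbers, and `LoopRange`, `encloses`, and `shouldMarkForAlignment` are hypothetical names rather than the JIT's data structures.

```cpp
#include <vector>

// A loop described by the numbers of its first and last block (or IG).
struct LoopRange
{
    unsigned head;
    unsigned tail;
    bool     aligned; // already marked for alignment
};

// True when `outer` covers `inner` and is not the very same range.
inline bool encloses(const LoopRange& outer, const LoopRange& inner)
{
    const bool covers = (outer.head <= inner.head) && (inner.tail <= outer.tail);
    const bool same   = (outer.head == inner.head) && (outer.tail == inner.tail);
    return covers && !same;
}

// A newly discovered loop is not marked if it encloses an already aligned loop,
// as with loop 4~7 enclosing loop 5~6 in the first diagram.
inline bool shouldMarkForAlignment(const std::vector<LoopRange>& known, const LoopRange& candidate)
{
    for (const LoopRange& loop : known)
    {
        if (loop.aligned && encloses(candidate, loop))
        {
            return false;
        }
    }
    return true;
}
```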
 

Follow-ups

If we get this PR in, we will monitor the performance and stability of our Microbenchmarks. If we don't see the expected results in our perf lab, we will turn OFF this feature. Other follow-ups that we will be doing are:

  • Currently, the padding is added just before the loop, and that might sometimes have an adverse effect on the performance of the method because the NOPs might land in the hot path and cycles might be spent on them. The goal is to add padding at various blind spots that occur not just immediately before the loop but anywhere from the beginning of the method to the beginning of the loop. An example would be to add NOPs after an unconditional jump so that they don't get executed and we still get our loop aligned.
  • Update As described above, we need to fix the over-estimation of instructions for xarch so we can precisely determine if loop alignment is needed and if not, avoid allocating memory for align instruction.

@kunalspathak
Member Author

@dotnet/jit-contrib , @danmosemsft , @adamsitnik , @DrewScoggins

@adamsitnik adamsitnik added the tenet-performance Performance related issue label Nov 13, 2020
Member

@AndyAyersMS AndyAyersMS left a comment

Left you some preliminary feedback; still looking things over.

#if defined(TARGET_XARCH)
// https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf
bool isJccAffectedIns = ((lastIns >= INS_i_jmp && lastIns < INS_align) || (lastIns == INS_call) ||
Member

This bit seems fragile (lastIns >= INS_i_jmp && lastIns < INS_align) -- what guarantee is there this will remain true?

Member Author

I agree but I didn't find a better way to capture all jmp instructions. Is there any other better way to do that?

Member

@BruceForstall BruceForstall left a comment

I still think that the proper place for handling the INS_align size is in branch tightening (emitJumpDistBind): they seem to be the same problem, namely a set of instructions that are pessimistically estimated at a larger size, then algorithmically shrunk to the final chosen size. In this way, the INS_align padding could be predetermined precisely before the actual memory allocation in allocMem occurs.

For alignment to be correct, it would require that estimated instruction sizes are correct, and we know there are cases today where we are not correct, and overestimate some sizes. While that is a bug, and we emit warnings in the JitDump about it, it would defeat alignment. Therefore, we could fix this for alignment by first determining if there are any INS_align in the function. If so, any time we see an instruction actual size less than estimated size, we make up the difference by emitting NOPs.

It looks like the code has only been implemented for xarch. Is the plan to validate this first, then expand to support arm32/64? In that case, maybe you should define a FEATURE_ALIGN_LOOPS for TARGET_XARCH, which would make it easier to find the relevant code when expanding to handle arm.

void emitter::emitVariableLoopAlign(unsigned short alignmentBoundary)
{
unsigned short nPaddingBytes = alignmentBoundary - 1;
Member

Is it really worth making these short instead of int?

Member Author

I just felt that unsigned short type should be good enough to hold the values in these variables. Do you prefer using unsigned instead?

@jkotas
Member

jkotas commented Nov 17, 2020

Would it make sense to take into account whether the loop contains any calls? If the loop contains calls, it is less likely to benefit from alignment.

@kunalspathak
Member Author

Would it make sense to take into account whether the loop contains any calls? If the loop contains calls, it is less likely to benefit from alignment.

Thanks for the suggestion @jkotas . I have added a check to not perform alignment if there is a call in the loop.

@ViktorHofer
Member

// Auto-generated message

69e114c which was merged 12/7 removed the intermediate src/coreclr/src/ folder. This PR needs to be updated as it touches files in that directory which causes conflicts.

To update your commits you can use this bash script: https://gist.github.com/ViktorHofer/6d24f62abdcddb518b4966ead5ef3783. Feel free to use the comment section of the gist to improve the script for others.

@kunalspathak
Member Author

@AndyAyersMS , @BruceForstall - This PR is ready to review again. I have added several updates in PR description that describes the approach taken to reduce the allocation size.

Member

@AndyAyersMS AndyAyersMS left a comment

Reviewed everything save for emit.cpp

@kunalspathak kunalspathak force-pushed the loopalignment branch 2 times, most recently from 4aef008 to b530d14 Compare December 18, 2020 03:10
@AndyAyersMS AndyAyersMS added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-Infrastructure-coreclr labels Dec 18, 2020
Member

@AndyAyersMS AndyAyersMS left a comment

A few small comments on things I looked at before. I still need to pull down your latest and look at emit.cpp.

I wonder if you could write up some pseudo-code outlining what we do during emission as it is getting hard to keep track of all the adjustments and counter-adjustments that go on.

Something like:

Before final codegen, we have:

  • marked some IGs as having align instructions at their ends
  • kept track of the last IG that has align instructions
  • alignment instructions size is computed assuming no size overestimates

During final codegen, we must concurrently track

  • need for alignment padding
  • overestimates of non-align instruction sizes upstream of alignment instructions
  • the impact of overestimates on branch distances

If no instructions sizes are over-estimated then each alignment instruction pads to the size computed before final codegen.

To minimize likelihood of over-estimates certain instructions that are known to be prone to over-estimation are kept at the initial sizes if they are upstream of any alignment instructions.

Remaining over-estimated instructions are handled by padding out to their original size estimate....

(probably have the details wrong here but hopefully you get the idea)

@kunalspathak
Member Author

I wonder if you could write up some pseudo-code outlining

Summary

  • During generateMachineCode(), inside genCodeForBBList():

    • When iterating over all the BBs, if we see that the next BB has the flag BBF_LOOP_ALIGN, we insert an align instruction (with a maximum size of 15 bytes) in the current IG and mark it with the IGF_LOOP_ALIGN flag. The flag indicates that the corresponding IG has an align instruction and that the loop starts from the next IG.

    • Whenever we see the "first back-edge" going from BB X to Y, where Y had BBF_LOOP_ALIGN flag, we set the igLoopBackEdge of IG corresponding to BB X to point to IG of BB Y. This helps us in calculating the smallest possible loop size in future steps.

    • When an align instruction is added to an IG, we also add it to an IG-level list, emitCurIGAlignList.

    • Whenever a new IG is created, we copy the align instructions from emitCurIGAlignList to the global-level list emitAlignList.

  • During generateMachineCode(), inside emitLoopAlignAdjustments():

    • Iterate over all the align instructions present in emitAlignList and figure out the real padding needed. The padding calculation is done inside emitCalculatePaddingForLoopAlignment().

    • Whenever we determine that paddingNeeded is less than the max value (15 bytes) we set during creation of align instruction, we update the idCodeSize() with the real padding value. We also update the igOffs and igSize to account for the difference.

    • Based on the heuristics, if we determine that padding is not needed because loop is already aligned or it is expensive to add padding, we will reset the flag IGF_LOOP_ALIGN of that IG.

    • At this point, update emitLastAlignedIgNum that represents the last IG containing align instruction that needed non-zero padding. Details on this below.

  • During genEmitMachineCode():

    • If we see an INS_align instruction, we call emitOutputAlign(), which reads the idCodeSize() of that instruction to determine how much padding is needed and adds NOPs.

    • Under #DEBUG, as a safety check, inside emitOutputAlign(), I again calculate the padding needed based on the buffer address and make sure that the padding amount we estimated matches the actual padding amount.

    • If we encounter an instruction that was over-estimated, we need to compensate the over-estimation so our calculation related to the padding amount for loop alignment remains valid. As a result, we add compensatory NOP after every such over-estimated instruction. This is done near the end of emitOutputInstr() method. However, this compensation is only done until we emit the last align instruction in a method that needed non-zero padding. emitLastAlignedIgNum we tracked earlier is used in determining if we should compensate for over-estimation or not.

    • There is one more place where we optimize the VEX prefix of an instruction from 3 bytes to 2 bytes; we won't do that until we get past emitLastAlignedIgNum.

    • Lastly, during emitOutputLJ(), if the jump was ever shortened, we adjust emitOffsAdj accordingly so that future forward jump instructions can calculate the distance accurately. But since we compensate for that over-estimation with NOPs, we need to adjust emitOffsAdj back to its original value in emitOutputInstr(). The over-estimation of jumps happens as a side effect of shortening align instructions after emitJumpDistBind(). In the future, we will combine emitLoopAlignAdjustments() and emitJumpDistBind() so that the jump instruction sizes are accurate.
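The NOP compensation described in the summary can be sketched roughly as follows. This is a simplified C++ model, not the JIT's emitter: `InstrSketch` and `emitWithCompensation` are hypothetical, instruction bytes are replaced by a placeholder value, and comparing `igNum` against the last aligned IG number is an assumption about how "until the last align instruction" is decided.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-in for an instruction descriptor.
struct InstrSketch
{
    unsigned igNum;         // instruction group the instruction belongs to
    unsigned estimatedSize; // size assumed before final codegen
    unsigned actualSize;    // size actually produced by the encoder
};

// Writes placeholder bytes for each instruction and, up to and including the
// last IG that holds an align instruction with non-zero padding, fills any
// over-estimation with NOP bytes (0x90) so that later offsets stay as planned.
inline std::vector<std::uint8_t> emitWithCompensation(const std::vector<InstrSketch>& instrs,
                                                      unsigned lastAlignedIgNum)
{
    const std::uint8_t placeholder = 0xCC; // stands in for real instruction bytes
    const std::uint8_t nop         = 0x90;

    std::vector<std::uint8_t> out;
    for (const InstrSketch& ins : instrs)
    {
        out.insert(out.end(), static_cast<std::size_t>(ins.actualSize), placeholder);

        const bool upstreamOfAlign = ins.igNum <= lastAlignedIgNum;
        if (upstreamOfAlign && (ins.actualSize < ins.estimatedSize))
        {
            out.insert(out.end(), static_cast<std::size_t>(ins.estimatedSize - ins.actualSize), nop);
        }
    }
    return out;
}
```

Instructions after the last aligned IG keep their shorter actual size, which is why the over-estimation there only shows up as unused space at the end of the allocation.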

Member

@AndyAyersMS AndyAyersMS left a comment

Added comments for emit.cpp.

/* It is fatal to under-estimate the instruction size */
noway_assert(id->idCodeSize() >= csz);
// It is fatal to under-estimate the instruction size, except for alignment instructions
noway_assert(estimatedSize >= actualSize);
Member

What is it that prevents align instructions from hitting this assert?

Member Author

Because the idCodeSize() for align instruction is always accurate from what we calculated in emitLoopAlignAdjustments().

Comment on lines +4646 to +4670
// If igInLoop's next IG is a loop and needs alignment, then igInLoop should be the last IG
// of the current loop and should have backedge to current loop header.
Member

I thought we were going to split these critical edges earlier on...?

Member Author

@kunalspathak kunalspathak Dec 18, 2020

I thought this was simpler than having a portion of the alignment logic in the register allocator as well. With this approach, we can simply check if the last IG was also marked as IGF_LOOP_ALIGN and, if so, just remove the align instruction size from it.

// if currIG has back-edge to dstIG.
//
// Notes:
// If the current loop covers a loop that is already marked as align, then remove
Member

Can you describe this differently?

Suggested change
// If the current loop covers a loop that is already marked as align, then remove
// If the current loop encloses a loop that is already marked as align, then remove

Also what happens if the two loops intersect but one doesn't enclose the other?

Member Author

Could you give an example?

// 3b. If paddingNeeded > maxPaddingAmount, then recalculate to align to 16B boundary.
// 3b. If paddingNeeded == 0, then return 0. // already aligned at 16B
// 3c. If paddingNeeded > maxPaddingAmount, then return 0. // expensive to align
// 3d. If the loop already fits in minimum 32B blocks, then return 0. // already best aligned
Member

Why wouldn't we check this first? If padding can't reduce the number of chunks then it seems like we should not pad..?

Member Author

I first try to determine the best alignment boundary as part of 3b. Only if we decide to move ahead do we check 3d. So in 3d, I check the current offset of the loop with regard to alignmentBoundary (32B or 16B). If we did it before, we might just be checking the current offset % 32.
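The ordering discussed here (pick the boundary first, then check whether the loop already occupies the minimum number of chunks) can be sketched as below. This C++ sketch follows the numbered comment steps quoted above under assumptions: `maxPaddingAmount` is passed in rather than derived, and all function names are hypothetical rather than the actual emitCalculatePaddingForLoopAlignment() code.

```cpp
// Bytes needed to move `offset` up to the next multiple of `boundary`.
inline unsigned paddingToBoundary(unsigned offset, unsigned boundary)
{
    return (boundary - (offset % boundary)) % boundary;
}

// Number of `boundary`-sized chunks touched by [offset, offset + size).
inline unsigned blocksSpanned(unsigned offset, unsigned size, unsigned boundary)
{
    if (size == 0)
    {
        return 0;
    }
    return ((offset + size - 1) / boundary) - (offset / boundary) + 1;
}

// Sketch of the adaptive decision: returns the padding to emit, or 0 to skip.
inline unsigned calculatePadding(unsigned offset, unsigned loopSize, unsigned maxPaddingAmount)
{
    unsigned boundary = 32;
    unsigned padding  = paddingToBoundary(offset, boundary);

    // If a 32B boundary is too far away, fall back to a 16B boundary.
    if (padding > maxPaddingAmount)
    {
        boundary = 16;
        padding  = paddingToBoundary(offset, boundary);
    }
    if (padding == 0)
    {
        return 0; // already aligned
    }
    if (padding > maxPaddingAmount)
    {
        return 0; // too expensive to align
    }
    // Checked last, against whichever boundary was chosen above.
    const unsigned minBlocks = (loopSize + boundary - 1) / boundary;
    if (blocksSpanned(offset, loopSize, boundary) == minBlocks)
    {
        return 0; // padding cannot reduce the number of chunks
    }
    return padding;
}
```

For example, a 24-byte loop at offset 0x3d spans two 32B chunks but needs only one, so the sketch returns 3 bytes of padding.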

Member

@BruceForstall BruceForstall left a comment

I left a few comments; I should look more

if (block->bbJumpDest->isLoopAlign())
{
GetEmitter()->emitSetLoopBackEdge(block->bbJumpDest);
Member

This condition seems odd to me: Presumably, you've already determined that an "isLoopAlign" block is the top of a loop. But now, any block that jumps to that is considered a back edge. What if the lexically first block of the loop is the target of branches that aren't loop branches: forward branches jumping to the loop top, or non-loop backward branches? (The non-loop backward branches might not matter since presumably your algorithm will hit an in-loop backward branch first, but possibly marking them could confuse other code? E.g., what if the first block in a non-first loop branched back to the top of the first loop? I'm not sure if all these cases can occur, due to loop canonicalization, etc., but this case seems not too specific)

Member Author

emitSetLoopBackEdge() checks the igNum and accordingly decides if the edge is a back-edge or not.
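A rough C++ sketch of that kind of check follows. It is an assumption about the shape of the logic, not the actual emitSetLoopBackEdge() body: `InsGroupSketch` and `trySetLoopBackEdge` are hypothetical, and IG numbers are assumed to increase in layout order.

```cpp
// Minimal stand-in for an instruction group.
struct InsGroupSketch
{
    unsigned              igNum;        // assumed to increase in layout order
    const InsGroupSketch* loopBackEdge; // target of the recorded back-edge, if any
};

// Records `target` as the back-edge of `current` only when the branch goes
// backward (or to itself); forward branches to the loop head are ignored.
inline bool trySetLoopBackEdge(InsGroupSketch& current, const InsGroupSketch& target)
{
    if (target.igNum > current.igNum)
    {
        return false; // forward branch, not a back-edge
    }
    current.loopBackEdge = &target;
    return true;
}
```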

@tannergooding
Member

Disabling compact VEX encoding should not affect performance.

Have we measured this across a range of hardware to confirm? I'd be worried that this could subtly impact certain SIMD loops due to decoding bandwidth and number of instructions that can fit into the icache.

@kunalspathak
Member Author

Have we measured this across a range of hardware to confirm? I'd be worried that this could subtly impact certain SIMD loops due to decoding bandwidth and number of instructions that can fit into the icache.

No, I haven't measured the performance across various hardware, but currently, we optimize very few methods having inner loops out of the total methods that have inner loop. To give an example, we just align loops in 1515 out of 380K methods in .NET libraries which is around 0.4% of them. We might mark an inner loop needs alignment in a method, but there are various heuristics to determine if the loop is worth aligning and if not, we would skip doing alignment. The prominent reason where alignment is rejected is the loop size. For such methods, we will continue to keep compacted VEX encoding. Even for methods, for which we would align an inner loop, we would continue to disable VEX encoding only up to the last align instruction after which we will re-enable it. So in my previous example, even out of those 0.4% methods, we will see compacted VEX encoding after the inner loop block ends.

There will definitely be edge cases which might regress as you pointed out, but I can neither prove that will be the case without knowing the exact hardware/code, nor do I have enough data to determine the cost vs. benefit of not doing loop alignment with this approach. The alternative to disabling the VEX encoding is to fix the estimation for such instructions, which we both concluded offline will be a non-trivial task and might take a couple of weeks to accomplish.

So given that we are very selective about which inner loops get aligned, that even in those methods we do not completely disable VEX encoding, the improvements we get in normal code from alignment, and the challenges of fixing the existing over-estimation, I feel the existing option is worth pursuing. Regardless, we should strive to fix the over-estimation of VEX encoding, which will help fix any regression that we might notice in SIMD loops.

By the way, could you elaborate on SIMD loops or share a sample code?

@AndyAyersMS
Member

We also intend to fix the encoding size issues once this change is in. So if there is a perf impact from changing encoding, it should hopefully be short-lived.

@tannergooding
Member

tannergooding commented Jan 11, 2021

By the way, could you elaborate on SIMD loops or share a sample code?

Many of our most performance-oriented functions use some sort of vectorization and looping, as they often go hand in hand. In many cases these loops are also fairly small (typically between 16 and 64 bytes).
Just searching for usages of Vector128<T> comes up with many samples: https://source.dot.net/#System.Private.CoreLib/Vector128_1.cs,a7d647b6968ad0f9,references

For example, in StringUtilities we have an "inner loop" here which is ~9 SIMD instructions, plus the loop logic: https://source.dot.net/#http2cat/StringUtilities.cs,115
It's followed by another loop which will be "inner" on non-AVX and trailing on AVX hardware and is likewise ~6 SIMD instructions: https://source.dot.net/#http2cat/StringUtilities.cs,143

In the case of the AVX loop, the loop is ~63 bytes today and as such likely benefits from both alignment and small encoding on modern hardware
If we start using the 3-byte prefix, this is going to jump another 5 bytes in size (5 of the SIMD instructions are currently the 2-byte prefix) and will push it over 64-bytes (most of the modern platforms are using 1-2x 32-byte fetch window(s))

Some of this will likely be mitigated by better alignment, but we may also not see improvements or may see regressions due to these changes in other cases.

I feel, existing option is worth pursuing

I do as well, I just am wanting to ensure we are testing any known "hot" scenarios for potential regressions 😄

My main thought is still what we had discussed offline. Which is that the primary concern (that I know of) is that due to instruction size misestimation, we won't want to cause the method allocations to grow too much. However, the maximum difference is likely going to be a handful of bytes.

For example, if you have the following: https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIAYACY8gOgCUBXAOwwEt8YLAMIR8AB14AbGFADKMgG68wMXAG4a9Jq049+ggJI8ovLrmXrNjZu258BLIxhNmLLABoAOJBuqaAzEykDEIMAN40DFFMgdy42ABmMAzAEBCSDACyABTAAJ4YMABUDKZiHBhoDGAAFthQJRAV5ZWlPNVNPACUkdFh0b3R0Qr1DDBcACYMALxtLQwA1B12akNDg2sAajBgGNCkAKxIADy4+YUAfAzYCggAWjIQMwzbu/tHp+cwFywPUBC+DZDCYQIF9MFrBgjKBQnZ7GGzACCtxYABkINgJq94YckNkyhUuixEbhZAAhAowbJdXyQyG8BIMbIAQiENR2AGsKYVcEYSWBeLw2NguABzKkKOHQKo3e6PLo9ah0tYRJXKunEADsDAS2EkuBgtPVUQAvjQIWtoQwMOJGEjbqQWABVLhibBgDnogDu2UlbygMtufwgNItQytNrE5GeyIQjpdbo9AAleKKar6pQHrkH5YC1caAPQFhhklyitp7a3shgmNMYBjQCYyMPDUZNDB2hixx0ABRk+AqMFICHIpE82UjdCqkfIVTocboofz6qt7ej9rjLD7UAHhWHo/Hk+n4lnDHn/nIS5bUVjLFk8Kp2Q4uBq0AwRS67ZaVXbdGJuCdJxyDxBUjWNW972gR9n1fKB30/ZoKkWF5M1xY4YLfH4RDsH8KlYElAJ4YDqSvZdlQJesFlmbF3hOM5KSwzoMDA9UvyQqiUP9ND6MuYQmJYoYTQYL0aikZJ8VdJDjlmcYpjgTicQ+Hjvj4uwl2VC0tWtKAOENDYzWXDYAG1MhgDBXwmAxxEkbJTPMiBLOsgB5MQ+AgMxiVFUVYFwcxJSMSRTFMUUugAXQ2MQTBGQptCQFI0gyNlOW5VQ+VwAUhRFcVshoqBuK+K5ak5KpcvyhiGAAL3lDZVWVK18DwDkYwdPjxHqGAAHFYGwQooAAFTqLhsiKj0qiq/51I1bUn1MDAum7FhMggSVMka7IGtwDkuhmWZ5wAMQOvaAH1Dr2liDJNIA=

The actual emitted size is 93-bytes, but the estimated size (due to 3-byte prefix) is 99-bytes.

In the case where you are aligning both branch targets (L000D and L0057) to 16-bytes and everything is perfectly estimated, you'd request 3-bytes to make L000D->L0010 and L0057->L005A and then 6-bytes to make L005A->L0060.
So the entire function allocates 102 bytes and you have zero "waste".

However, in the case of misestimation (today's world), you'd have L000E and L005D as the estimated branch locations, so you'd request 2 bytes to make L000E->L0010 and L005D->L005F, and then 1 byte to make L005F->L0060.
You'd then find that the first vxorps had its size misestimated and had the 2-byte prefix so it was 1-byte shorter. You now need 3-bytes and only requested 2-bytes. However, you do have three bytes because the vxorps was misestimated by 1 (over estimating size has always been non-ideal but not breaking).

The same will be true for every subsequent alignment needed, whether the misestimation was due to VEX or short JMP, etc. That is the overall size of the method is still correct because in every case where it was less, it will have been because some other instruction was over-estimated (today I believe that is just VEX and short jumps) and so the necessary bytes should still exist and be available for use as part of the padding (they effectively end up as trailing padding today anyways).
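The arithmetic in this example can be checked with a short C++ sketch. `padTo` and `totalWithPadding` are hypothetical helpers written for this illustration; the offsets and sizes are the ones quoted in the comment above.

```cpp
// Bytes needed to move `offset` up to the next multiple of `boundary`.
inline unsigned padTo(unsigned offset, unsigned boundary)
{
    return (boundary - (offset % boundary)) % boundary;
}

// Total allocation when two branch targets are aligned in sequence: the second
// target first shifts by the padding inserted for the first one.
inline unsigned totalWithPadding(unsigned codeSize, unsigned firstTarget, unsigned secondTarget, unsigned boundary)
{
    const unsigned pad1 = padTo(firstTarget, boundary);
    const unsigned pad2 = padTo(secondTarget + pad1, boundary);
    return codeSize + pad1 + pad2;
}
```

With exact sizes (93 bytes, targets at 0x0D and 0x57) the paddings are 3 and 6 bytes; with the estimated sizes (99 bytes, targets at 0x0E and 0x5D) they are 2 and 1 bytes. Both cases request 102 bytes in total, which is the point being made about the overall method size.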

Member

@BruceForstall BruceForstall left a comment

Can you answer the Debug/Release code diff question?

@@ -822,6 +831,60 @@ insGroup* emitter::emitSavIG(bool emitAdd)
}
#endif

#ifdef FEATURE_LOOP_ALIGN
// Did we have any align instructions in this group?
if (emitCurIGAlignList)
Member

Can there ever be more than a single align instruction in an IG?

Member Author

In case of non-adaptive, we can have multiple align instructions that all represent a single padding. E.g. for non-adaptive 32B, there will be 3 align instructions

  • align (15)
  • align (15)
  • align (1)
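The split into several align instructions can be illustrated with a short C++ sketch. `splitPadding` is a hypothetical helper; the 15-byte cap per align instruction is taken from the summary earlier in this conversation.

```cpp
#include <vector>

// Splits a total padding amount into chunks no larger than `maxPerInstr`,
// one chunk per align instruction.
inline std::vector<unsigned> splitPadding(unsigned totalPadding, unsigned maxPerInstr = 15)
{
    std::vector<unsigned> chunks;
    if (maxPerInstr == 0)
    {
        return chunks; // nothing sensible to emit
    }
    while (totalPadding > 0)
    {
        const unsigned chunk = (totalPadding < maxPerInstr) ? totalPadding : maxPerInstr;
        chunks.push_back(chunk);
        totalPadding -= chunk;
    }
    return chunks;
}
```

For the non-adaptive 32B case, the maximum padding of 31 bytes splits into 15, 15, and 1, matching the three align instructions listed above.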

@@ -250,6 +250,7 @@ struct insGroup
unsigned int igFuncIdx; // Which function/funclet does this belong to? (Index into Compiler::compFuncInfos array.)
unsigned short igFlags; // see IGF_xxx below
unsigned short igSize; // # of bytes of code in this group
insGroup* igLoopBackEdge; // "last" back-edge that branches back to an aligned loop head.
Member

This seems potentially expensive, e.g. for MinOpts scenario where it's not used.

Should it be under #if FEATURE_LOOP_ALIGN?

Member Author

Sure, makes sense.

@@ -13705,6 +13815,43 @@ size_t emitter::emitOutputInstr(insGroup* ig, instrDesc* id, BYTE** dp)
emitDispIns(id, false, dspOffs, true, emitCurCodeOffs(*dp), *dp, (dst - *dp));
}

#ifdef FEATURE_LOOP_ALIGN
Member

Why is this code within #ifdef DEBUG block? That's going to lead to Debug/Release code differences

Member Author

It shouldn't be, my bad. I guess the #endif got removed after rebase. I will fix it.

@kunalspathak
Member Author

kunalspathak commented Jan 11, 2021

The same will be true for every subsequent alignment needed, whether the misestimation was due to VEX or short JMP, etc. That is the overall size of the method is still correct because in every case where it was less, it will have been because some other instruction was over-estimated (today I believe that is just VEX and short jumps) and so the necessary bytes should still exist and be available for use as part of the padding (they effectively end up as trailing padding today anyways).

@tannergooding - Yes, that's what we concluded, and I went ahead and explored that option. Unfortunately, it was not that simple. I noticed cases where the combination of multiple instructions involving VEX misprediction plus align instruction adjustment (with the approach you suggested of taking the bytes we reserved for mispredicted instructions) leaves us in a situation where we reserved M bytes of padding for an align instruction, but it now needs N bytes (because of the adjustments for mispredictions, align, etc.), and it turns out that M + K < N, where K is the number of extra bytes we allocated due to misprediction and could use for the align instruction. At that point, I started exploring this alternative of just turning off the optimization until we hit the last align instruction.
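
For reference, the M + K < N condition can be written as a small predicate. This is only a sketch of the arithmetic; the function and parameter names are hypothetical and do not come from the emitter.

```cpp
#include <cstddef>

// Hypothetical names, not from the JIT sources.
// reservedPadding (M): bytes reserved for the align instruction when sizes were estimated.
// reclaimedBytes  (K): extra bytes left over from over-estimated earlier instructions.
// neededPadding   (N): bytes the align instruction needs at its actual address.
// Returns true when the reserved bytes plus the reclaimed bytes cannot cover the need.
bool paddingShortfall(std::size_t reservedPadding, std::size_t reclaimedBytes, std::size_t neededPadding)
{
    return reservedPadding + reclaimedBytes < neededPadding;
}
```

When this returns true, there is no room to emit the full padding, which is the situation that motivated turning off the optimization until the last align instruction.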

In the case where you are aligning both branch targets (L000D and L0057) to 16-bytes

We only align backward branch targets because those represent loops. So, in your example, that is L000D. I tried the test, and it looks like we will not align the loop: it needs a minimum of 3 blocks to fit and it already occupies exactly that many, so we won't try to align it further. With that, we retain the VEX encoding.

Assembly code
G_M21588_IG01:              
 00007fff`56918d40        C5F877               vzeroupper
G_M21588_IG02:              
 00007fff`56918d43        4963C1               movsxd   rax, r9d
G_M21588_IG03:              
 00007fff`56918d46        4803C2               add      rax, rdx
G_M21588_IG04:              
 00007fff`56918d49        C5FC57C0             vxorps   ymm0, ymm0, ymm0
G_M21588_IG05:              
 00007fff`56918d4d                             align
G_M21588_IG06:              
 00007fff`56918d4d        C5FE6F0A             vmovdqu  ymm1, ymmword ptr[rdx]
G_M21588_IG07:              
 00007fff`56918d51        C5F564D0             vpcmpgtb ymm2, ymm1, ymm0
G_M21588_IG08:              
 00007fff`56918d55        C5FDD7CA             vpmovmskb ecx, ymm2
G_M21588_IG09:              
 00007fff`56918d59        83F9FF               cmp      ecx, -1
G_M21588_IG10:              
 00007fff`56918d5c        7539                 jne      SHORT G_M21588_IG24
G_M21588_IG11:              
 00007fff`56918d5e        C5F560D0             vpunpcklbw ymm2, ymm1, ymm0
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (punpcklbw: 2) 32B boundary ...............................
G_M21588_IG12:              
 00007fff`56918d62        C5F568C8             vpunpckhbw ymm1, ymm1, ymm0
G_M21588_IG13:              
 00007fff`56918d66        C4E36D46D920         vperm2i128 ymm3, ymm2, ymm1, 32
G_M21588_IG14:              
 00007fff`56918d6c        C4E36D46C931         vperm2i128 ymm1, ymm2, ymm1, 49
G_M21588_IG15:              
 00007fff`56918d72        C4C17E7F18           vmovdqu  ymmword ptr[r8], ymm3
G_M21588_IG16:              
 00007fff`56918d77        C4C17E7F4820         vmovdqu  ymmword ptr[r8+32], ymm1
G_M21588_IG17:              
 00007fff`56918d7d        4883C220             add      rdx, 32
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (add: 1) 32B boundary ...............................
G_M21588_IG18:              
 00007fff`56918d81        4983C040             add      r8, 64
G_M21588_IG19:              
 00007fff`56918d85        488D48E0             lea      rcx, [rax-32]
G_M21588_IG20:              
 00007fff`56918d89        483BD1               cmp      rdx, rcx
G_M21588_IG21:              
 00007fff`56918d8c        76BF                 jbe      SHORT G_M21588_IG06
G_M21588_IG22:              
 00007fff`56918d8e        B801000000           mov      eax, 1
G_M21588_IG23:              
 00007fff`56918d93        EB02                 jmp      SHORT G_M21588_IG25
G_M21588_IG24:              
 00007fff`56918d95        33C0                 xor      eax, eax
G_M21588_IG25:              
 00007fff`56918d97        0FB6C0               movzx    rax, al
G_M21588_IG26:              
 00007fff`56918d9a        C5F877               vzeroupper 
 00007fff`56918d9d        C3                   ret
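
A rough sketch of why no padding is emitted for this loop, based on my reading of the listing rather than on the JIT's actual code. The loop runs from IG06 (address ending in 8d4d) to the end of the jbe (address ending in 8d8e), i.e. 65 bytes. A 65-byte loop needs at least 3 blocks of 32B, and at its current address it already spans exactly 3, so padding cannot reduce the count. The helper names and the fixed 32B block size are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: blocks are 32 bytes, matching the "32B boundary" markers in the listing.
const std::size_t kBlockSize = 32;

// Fewest 32B blocks a loop of `loopSize` bytes can occupy.
std::size_t minBlocksNeeded(std::size_t loopSize)
{
    assert(loopSize > 0);
    return (loopSize + kBlockSize - 1) / kBlockSize;
}

// Number of 32B blocks the loop touches when it starts at `loopStart`.
std::size_t blocksSpanned(std::size_t loopStart, std::size_t loopSize)
{
    assert(loopSize > 0);
    std::size_t firstBlock = loopStart / kBlockSize;
    std::size_t lastBlock  = (loopStart + loopSize - 1) / kBlockSize;
    return lastBlock - firstBlock + 1;
}

// Padding is only worth emitting if it could reduce the number of blocks touched.
bool paddingWouldHelp(std::size_t loopStart, std::size_t loopSize)
{
    return blocksSpanned(loopStart, loopSize) > minBlocksNeeded(loopSize);
}
```

For the loop above, `paddingWouldHelp(0x4d, 65)` is false because both counts are 3. A 64-byte loop at the same offset would span 3 blocks while needing only 2, so it would be a candidate for padding.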

kunalspathak merged commit 16c48d6 into dotnet:master on Jan 12, 2021
JulieLeeMSFT added this to the 6.0.0 milestone on Jan 15, 2021
ghost locked as resolved and limited conversation to collaborators on Feb 14, 2021
Labels
area-CodeGen-coreclr: CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
tenet-performance: Performance related issue