Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[mono] Append -sroa and -instcombine to LLVM-JIT pass manager #37458

Merged
merged 1 commit into from
Jun 8, 2020

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented Jun 4, 2020

Partially fixes #37449

static ReadOnlySpan<byte> Arr
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    get => new byte[] {1, 2, 3, 4};
}

[MethodImpl(MethodImplOptions.NoInlining)]
public static byte GetByte(int i) => Arr[0];

Codegen for GetByte() in LLVM-JIT mode:

0000000000000000 <gram_GetByte__int_>:
<BB>:1
   0:   48 83 ec 28             sub    $0x28,%rsp
   4:   c5 f8 57 c0             vxorps %xmm0,%xmm0,%xmm0
   8:   c5 f8 29 04 24          vmovaps %xmm0,(%rsp)
   d:   48 b8 b0 bd 55 d2 3f    movabs $0x563fd255bdb0,%rax
  14:   56 00 00
  17:   48 89 04 24             mov    %rax,(%rsp)
  1b:   c7 44 24 08 04 00 00    movl   $0x4,0x8(%rsp)
  22:   00
  23:   48 8b 04 24             mov    (%rsp),%rax
  27:   48 89 44 24 10          mov    %rax,0x10(%rsp)
  2c:   8b 44 24 08             mov    0x8(%rsp),%eax
  30:   89 44 24 18             mov    %eax,0x18(%rsp)
  34:   8b 44 24 0c             mov    0xc(%rsp),%eax
  38:   89 44 24 1c             mov    %eax,0x1c(%rsp)
  3c:   83 7c 24 18 00          cmpl   $0x0,0x18(%rsp)
  41:   74 0c                   je     4f <gram_GetByte__int_+0x4f>
  43:   48 8b 44 24 10          mov    0x10(%rsp),%rax
  48:   8a 00                   mov    (%rax),%al
  4a:   48 83 c4 28             add    $0x28,%rsp
  4e:   c3                      retq
  4f:   48 b8 d0 ae 42 d2 3f    movabs $0x563fd242aed0,%rax
  56:   56 00 00
  59:   bf 02 01 00 00          mov    $0x102,%edi
  5e:   ff 10                   callq  *(%rax)

Codegen after I appended -sroa -instcombine (makes sense to add/move extra -simplifycfg for hygiene) to the end of LLVM passes list:

0000000000000000 <gram_GetByte__int_>:
<BB>:1
   0:   48 b8 d0 ef a6 ba 31    movabs $0x5631baa6efd0,%rax
   7:   56 00 00
   a:   8a 00                   mov    (%rax),%al
   c:   c3                      retq

This https://godbolt.org/z/YKjzsV link explains motivation (on the left is the "final" LLVM IR our llvm-jit produces after the default optimizations). Zoltan noticed that LLVM-AOT where we use opt -O2 instead of custom pass order optimized that code perfectly.

The other issue remains: for some reason we don't inline get_Arr() without AggressiveInlining, coreclr does inline it:

Successfully inlined Program:get_Arr():System.ReadOnlySpan`1[Byte] (12 IL bytes) (depth 1) [below ALWAYS_INLINE size]

@ghost
Copy link

ghost commented Jun 4, 2020

Tagging subscribers to this area: @lewurm
Notify danmosemsft if you want to be subscribed.

@vargaz
Copy link
Contributor

vargaz commented Jun 5, 2020

We don't inline get_Arr () because it hits the inlining cost limit of 60.

monojenkins pushed a commit to monojenkins/mono that referenced this pull request Jun 6, 2020
Mono's inliner is quite conservative (see dotnet/runtime#34286) so we have to workaround some limitations, e.g.
```csharp
static ReadOnlySpan<byte> Arr => new byte[] {1, 2, 3, 4};

// access example:
byte GetByte(int i) => Arr[i];
```
#### Current codegen for `GetByte` (with dotnet/runtime#37458 included)
```asm
0000000000000000 <gram_GetByte__int_>:
<BB>:1
   0:   48 83 ec 18             sub    $0x18,%rsp
   4:   c5 f8 57 c0             vxorps %xmm0,%xmm0,%xmm0
   8:   c5 f8 29 04 24          vmovaps %xmm0,(%rsp)
   d:   48 b8 50 22 73 d2 6e    movabs $0x556ed2732250,%rax
  14:   55 00 00
  17:   48 89 e7                mov    %rsp,%rdi
  1a:   ff 10                   callq  *(%rax)   ;;;; get_Arr() is not inlined!
  1c:   83 7c 24 08 00          cmpl   $0x0,0x8(%rsp)
  21:   74 0b                   je     2e <gram_GetByte__int_+0x2e>
  23:   48 8b 04 24             mov    (%rsp),%rax
  27:   8a 00                   mov    (%rax),%al
  29:   48 83 c4 18             add    $0x18,%rsp
  2d:   c3                      retq
  2e:   48 b8 68 22 73 d2 6e    movabs $0x556ed2732268,%rax
  35:   55 00 00
  38:   bf 02 01 00 00          mov    $0x102,%edi
  3d:   ff 10                   callq  *(%rax)
```

As you can see, `get_Arr()` is not inlined - it happened because the calculated size is 60 and the major contributor is `ReadOnlySpan<>.ctor` which is marked with [AggressiveInlining], so Mono inlined that `.ctor` into `get_Arr` and the total cost for `get_Arr` increased to 60.

#### Codegen with this PR:
```asm
0000000000000000 <gram_GetByte__int_>:
<BB>:1
   0:   48 b8 00 6e 1f 91 11    movabs $0x5611911f6e00,%rax
   7:   56 00 00
   a:   8a 00                   mov    (%rax),%al
   c:   c3                      retq
```
Thus, the fix helped to inline more and even reduced function size.

There are other places where we append method call costs to `inline_costs` but I decided to start with this one for constructors and only for LLVM to minimize possible regressions.

Possible regressions match current CoreCLR behavior, bad case:  [sharplab.io](https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIAYACY8gOgCUBXAOwwEt8YLAMIR8AB14AbGFADKMgG68wMXAG4aNYgGYm5JE1IMhNAN4MLNCxYDaAWRgYAFhAAmASXGSAFA+dvPMUkAeTE+CC5cFgBBAHNY2FxcXgUYdy5JXi4s2IBKAF0rax09A2IUBgAhXli/F1dvXKKLU2brC2YATm8AIgAJHtyNanb2rt6YQeHR4vJunskptrG53sWh5dn5iCWRmb1ujb2Z8Z6AdV39g96do6vTqEv90/Xp59We1yeTj4BCb4YAF9NMcmLpmGUKtVao56mcoNgxGIZI02q1Qe0APSYqo1OpuRoMXi4IkZLIwVwMPAMGAIZFgDAUhhwBgQADWmws0PxDTu1mB1DaJQhTAqABVVBhUaD0VdsQx3FT8AwuBAMDS6TAGTlcTD/JSMBAGMAYKTMlwmVlDQwJbgpU0MdZ5cAOOreG6SSzVer2ZzdTz4YjkVBGm8LALAUA===)
@EgorBo EgorBo merged commit d01bb68 into dotnet:master Jun 8, 2020
EgorBo added a commit to mono/mono that referenced this pull request Jun 18, 2020
#19930)

Mono's inliner is quite conservative (see dotnet/runtime#34286) so we have to workaround some limitations, e.g.
```csharp
static ReadOnlySpan<byte> Arr => new byte[] {1, 2, 3, 4};

// access example:
byte GetByte(int i) => Arr[i];
```
#### Current codegen for `GetByte` (with dotnet/runtime#37458 included)
```asm
0000000000000000 <gram_GetByte__int_>:
<BB>:1
   0:   48 83 ec 18             sub    $0x18,%rsp
   4:   c5 f8 57 c0             vxorps %xmm0,%xmm0,%xmm0
   8:   c5 f8 29 04 24          vmovaps %xmm0,(%rsp)
   d:   48 b8 50 22 73 d2 6e    movabs $0x556ed2732250,%rax
  14:   55 00 00
  17:   48 89 e7                mov    %rsp,%rdi
  1a:   ff 10                   callq  *(%rax)   ;;;; get_Arr() is not inlined!
  1c:   83 7c 24 08 00          cmpl   $0x0,0x8(%rsp)
  21:   74 0b                   je     2e <gram_GetByte__int_+0x2e>
  23:   48 8b 04 24             mov    (%rsp),%rax
  27:   8a 00                   mov    (%rax),%al
  29:   48 83 c4 18             add    $0x18,%rsp
  2d:   c3                      retq
  2e:   48 b8 68 22 73 d2 6e    movabs $0x556ed2732268,%rax
  35:   55 00 00
  38:   bf 02 01 00 00          mov    $0x102,%edi
  3d:   ff 10                   callq  *(%rax)
```

As you can see, `get_Arr()` is not inlined - it happened because the calculated size is 60 and the major contributor is `ReadOnlySpan<>.ctor` which is marked with [AggressiveInlining], so Mono inlined that `.ctor` into `get_Arr` and the total cost for `get_Arr` increased to 60.

#### Codegen with this PR:
```asm
0000000000000000 <gram_GetByte__int_>:
<BB>:1
   0:   48 b8 00 6e 1f 91 11    movabs $0x5611911f6e00,%rax
   7:   56 00 00
   a:   8a 00                   mov    (%rax),%al
   c:   c3                      retq
```
Thus, the fix helped to inline more and even reduced function size.

There are other places where we append method call costs to `inline_costs` but I decided to start with this one for constructors and only for LLVM to minimize possible regressions.

Possible regressions match current CoreCLR behavior, bad case:  [sharplab.io](https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIAYACY8gOgCUBXAOwwEt8YLAMIR8AB14AbGFADKMgG68wMXAG4aNYgGYm5JE1IMhNAN4MLNCxYDaAWRgYAFhAAmASXGSAFA+dvPMUkAeTE+CC5cFgBBAHNY2FxcXgUYdy5JXi4s2IBKAF0rax09A2IUBgAhXli/F1dvXKKLU2brC2YATm8AIgAJHtyNanb2rt6YQeHR4vJunskptrG53sWh5dn5iCWRmb1ujb2Z8Z6AdV39g96do6vTqEv90/Xp59We1yeTj4BCb4YAF9NMcmLpmGUKtVao56mcoNgxGIZI02q1Qe0APSYqo1OpuRoMXi4IkZLIwVwMPAMGAIZFgDAUhhwBgQADWmws0PxDTu1mB1DaJQhTAqABVVBhUaD0VdsQx3FT8AwuBAMDS6TAGTlcTD/JSMBAGMAYKTMlwmVlDQwJbgpU0MdZ5cAOOreG6SSzVer2ZzdTz4YjkVBGm8LALAUA===)

Co-authored-by: EgorBo <[email protected]>
@ghost ghost locked as resolved and limited conversation to collaborators Dec 8, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[mono] ReadOnlySpan<byte> hack doesn't work in Mono
4 participants