[mono] Intrinsify API's in SpanHelpers.ByteMemOps.cs #99161

Closed · 3 tasks done · Tracked by #43051
fanyang-mono opened this issue Mar 1, 2024 · 18 comments · Fixed by #101622

Comments

fanyang-mono (Member) commented Mar 1, 2024

Two new APIs were added and intrinsified by CoreCLR. They are used in the libraries, which caused a performance regression in Mono; see dotnet/perf-autofiling-issues#29872. These new APIs were added via #98623.

The APIs to intrinsify are:

  • SpanHelpers.Memmove -> update the existing intrinsics support for Buffer.Memmove; see the current code below and the sketch that follows this list
    if (in_corlib && !strcmp (m_class_get_name (cmethod->klass), "Buffer")) {
        if (!strcmp (cmethod->name, "Memmove") && fsig->param_count == 3 && m_type_is_byref (fsig->params [0]) && m_type_is_byref (fsig->params [1]) && !cmethod->is_inflated) {
            MonoBasicBlock *end_bb;
            NEW_BBLOCK (cfg, end_bb);
            // do nothing if len == 0 (even if src or dst are nulls)
            MONO_EMIT_NEW_BIALU_IMM (cfg, OP_COMPARE_IMM, -1, args [2]->dreg, 0);
            MONO_EMIT_NEW_BRANCH_BLOCK (cfg, OP_IBEQ, end_bb);
            // throw NRE if src or dst are nulls
            MONO_EMIT_NEW_BIALU_IMM (cfg, OP_COMPARE_IMM, -1, args [0]->dreg, 0);
            MONO_EMIT_NEW_COND_EXC (cfg, EQ, "NullReferenceException");
            MONO_EMIT_NEW_BIALU_IMM (cfg, OP_COMPARE_IMM, -1, args [1]->dreg, 0);
            MONO_EMIT_NEW_COND_EXC (cfg, EQ, "NullReferenceException");
            MONO_INST_NEW (cfg, ins, OP_MEMMOVE);
            ins->sreg1 = args [0]->dreg; // i1* dst
            ins->sreg2 = args [1]->dreg; // i1* src
            ins->sreg3 = args [2]->dreg; // i32/i64 len
            MONO_ADD_INS (cfg->cbb, ins);
            MONO_START_BB (cfg, end_bb);
        }
    }
  • SpanHelpers.ClearWithoutReferences
  • SpanHelpers.Fill
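
A minimal sketch of the direction this could take, assuming the existing Buffer.Memmove lowering can be reused for SpanHelpers.Memmove. The widened class-name check, the signature shape, and the placement of the ClearWithoutReferences/Fill handling are illustrative assumptions, not the change that actually landed in #101622.

    if (in_corlib && (!strcmp (m_class_get_name (cmethod->klass), "Buffer") ||
                      !strcmp (m_class_get_name (cmethod->klass), "SpanHelpers"))) {
        // Assumption: SpanHelpers.Memmove has the same (byref dst, byref src, nuint len)
        // shape as Buffer.Memmove, so the existing OP_MEMMOVE lowering could be reused.
        if (!strcmp (cmethod->name, "Memmove") && fsig->param_count == 3 &&
            m_type_is_byref (fsig->params [0]) && m_type_is_byref (fsig->params [1]) &&
            !cmethod->is_inflated) {
            // ...emit the same zero-length branch, null checks, and OP_MEMMOVE as in the
            // Buffer.Memmove code above...
        }
        // SpanHelpers.ClearWithoutReferences and SpanHelpers.Fill would need their own
        // lowering (e.g. a memset-style opcode); they are not sketched here.
    }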
ghost commented Mar 1, 2024

Tagging subscribers to this area: @SamMonoRT, @fanyang-mono
See info in area-owners.md if you want to be subscribed.

Author: fanyang-mono
Assignees: -
Labels: area-Codegen-Intrinsics-mono

Milestone: 9.0.0

lewing (Member) commented Apr 24, 2024

This is one of the regressions in dotnet/perf-autofiling-issues#30686, so it is potentially quite impactful on wasm.
cc @steveisok

lewing (Member) commented Apr 24, 2024

cc @EgorBo @JulieLeeMSFT

EgorBo (Member) commented Apr 24, 2024

> cc @EgorBo @JulieLeeMSFT

It was decided that it should be handled on the Mono side: #99059

lewing (Member) commented Apr 25, 2024

> cc @EgorBo @JulieLeeMSFT
>
> It was decided that it should be handled on the Mono side: #99059

Yes, this issue is about fixing it in mono. I pinged you because this is a ship-blocking regression for wasm and I wanted to make sure that was clear.

lewing (Member) commented Apr 25, 2024

It is probably worth understanding why the other mono-AOT-llvm runtimes were not impacted in the same way.

[image]

lewing (Member) commented Apr 25, 2024

Interpreter changes were added in #99115.

EgorBo (Member) commented Apr 25, 2024

> this is a ship-blocking regression

I think we can just land #99059 if so. cc @fanyang-mono

fanyang-mono (Member, Author) commented

It is unfortunate that this wasn't flagged as a ship-blocking issue for wasm at the time this regression was introduced. I could work on getting this in for Preview 5 instead of #99059.

lewing (Member) commented Apr 25, 2024

> It is unfortunate that this wasn't flagged as a ship-blocking issue for wasm at the time this regression was introduced. I could work on getting this in for Preview 5 instead of #99059.

We couldn't flag it then because we couldn't identify it: the wasm runs in dotnet/performance were not working. I was only able to identify it now that the performance team has backfilled the missing runs and I made time to dig into them again. But to be clear, this needs to be fixed before RTM, not Preview 5.

lewing (Member) commented Apr 25, 2024

And thank you to the perf team for doing the backfill; it made it possible to positively identify the cause. I'm still curious why it impacted wasm AOT so much more heavily than the rest of mono.

cc @LoopedBard3 @DrewScoggins @sblom

EgorBo (Member) commented Apr 25, 2024

I presume it was previously intrinsified into llvm.memset/llvm.memcpy calls. The thing is, those calls were not GC-interrupt friendly, so technically this performance regression actually improves GC latency.

lewing (Member) commented Apr 25, 2024

> I presume it was previously intrinsified into llvm.memset/llvm.memcpy calls. The thing is, those calls were not GC-interrupt friendly, so technically this performance regression actually improves GC latency.

GC latency in single-threaded wasm is not impacted at all by that.

fanyang-mono (Member, Author) commented

I am also curious why this regression has a bigger impact on wasm.

fanyang-mono (Member, Author) commented Apr 26, 2024

@lewing I dug into the data. This is the same microbenchmark chart, but on arm64 with Mono LLVM AOT, which doesn't show the same regression as wasm:
[screenshot]

@matouskozak Do you know if this chart contains accurate data?

matouskozak (Member) commented Apr 27, 2024

> ... on arm64 with Mono LLVM AOT, which doesn't show the same regression as wasm

I very much hope so. We are missing the data between February and March because Mono AOT-llvm was affected in the same way WASM was. Since we didn't observe a significant regression after we regained the measurements, we didn't ask for a backfill of the missing values. Do you think the measurements could be flawed?

I ran a quick local measurement using the benchmarks_local.py script for the range 79dd9ba...5ef47c8 and got 300/270 ns, so nowhere near the regression magnitude that WASM had.

fanyang-mono (Member, Author) commented

I looked into this issue a little bit more and confirmed the following:

  • Why did System.Collections.CopyTo<Int32>.ReadOnlySpan(Size: 2048) regress so much on wasm AOT?
    WASM AOT mode is an LLVM-AOT mode with interpreter fallback. Comparing the results between wasm AOT and the wasm interpreter, they match. So it seems that something that used to be AOT'ed is no longer AOT'ed after "Move memset/memcpy helpers to managed impl" (#98623) was merged. The WASM AOT chart also shows that, with the fix here, performance went back down to the level it was at before the regression.

WASM interpreter data: [screenshot]

WASM AOT data: [screenshot]

  • Why didn't System.Collections.CopyTo<Int32>.ReadOnlySpan(Size: 2048) regress on mono LLVM-AOT?
    This microbenchmark is written with generics, and normal AOT mode usually doesn't AOT generic methods; the relevant intrinsics are also LLVM-only. Additionally, the LLVM-AOT mode that we measure with microbenchmarks is LLVM-AOT with JIT fallback, and the numbers match the JIT on x64.

Mono LLVM-AOT: [screenshot]

Mono JIT: [screenshot]

  • Are we measuring Mono AOT-llvm correctly?
    Yes. On x64, Vector4.Add is only intrinsified with LLVM, and I saw that microbenchmark run much faster on Mono LLVM-AOT than on Mono JIT.

Mono LLVM-AOT: [screenshot]

Mono JIT: [screenshot]

github-actions bot locked and limited conversation to collaborators May 31, 2024