Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move arithmetic helpers to managed code #109087

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

am11
Copy link
Member

@am11 am11 commented Oct 21, 2024

Only the FCThrow ones in jithelpers have been moved. The JIT_GetRefAny method was left out due to the existing TODO regarding type inheritance.

@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Oct 21, 2024
Copy link
Contributor

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should check the perf impact on the affected architectures to see whether it is something we can live with.

There is alternatives with better perf possible: Inline the exception throwing in the JITed code (and optimize it out if the JIT can prove that it cannot happen). We do that on arm64 today.

src/coreclr/vm/JitQCallHelpers.h Outdated Show resolved Hide resolved
src/libraries/System.Private.CoreLib/src/System/Math.cs Outdated Show resolved Hide resolved
src/libraries/System.Private.CoreLib/src/System/Math.cs Outdated Show resolved Hide resolved
src/libraries/System.Private.CoreLib/src/System/Math.cs Outdated Show resolved Hide resolved
src/libraries/System.Private.CoreLib/src/System/Math.cs Outdated Show resolved Hide resolved
@am11
Copy link
Member Author

am11 commented Oct 24, 2024

@AaronRobinsonMSFT, win-x86 leg is failing this assert:

Assert failure(PID 2788 [0x00000ae4], Thread: 3704 [0x0e78]): CONTRACT VIOLATION by CreateCOMPlusExceptionObject at "D:\a\_work\1\s\src\coreclr\vm\excep.cpp":5725

GC_TRIGGERS encountered in a GC_NOTRIGGER scope

                        CONTRACT in CreateCOMPlusExceptionObject at "D:\a\_work\1\s\src\coreclr\vm\excep.cpp":5725
                        CONTRACT in MethodDescCallSite::CallTargetWorker at "D:\a\_work\1\s\src\coreclr\vm\callhelpers.cpp":307
                        GCX_COOP in Assembly::ExecuteMainMethod at "D:\a\_work\1\s\src\coreclr\vm\assembly.cpp":1380
                        CONTRACT in Assembly::ExecuteMainMethod at "D:\a\_work\1\s\src\coreclr\vm\assembly.cpp":1366
                        GCX_COOP in CorHost2::ExecuteAssembly at "D:\a\_work\1\s\src\coreclr\vm\corhost.cpp":329
                        CONTRACT in CorHost2::ExecuteAssembly at "D:\a\_work\1\s\src\coreclr\vm\corhost.cpp":269

Does it mean we need to use something like FC_GC_POLL_RET besides FCALL_CONTRACT?

@jkotas
Copy link
Member

jkotas commented Oct 24, 2024

Does it mean we need to use something like FC_GC_POLL_RET

FC_GC_POLL_RET expands to HMF. It would not be an improvement.

I think the crash that you are seeing is caused by calling convention mismatch for 64-bit helpers. The existing helpers use VV flavor of the FCall macros, but your new helpers do not.

The existing NAOT implementation deals with the calling convention issues with using QCalls that use standard calling convention instead of managed calling convention.

@am11 am11 marked this pull request as ready for review October 24, 2024 16:03
@jkotas
Copy link
Member

jkotas commented Oct 24, 2024

Did you have a chance to check the perf impact?

@am11
Copy link
Member Author

am11 commented Oct 24, 2024

Did you have a chance to check the perf impact?

I haven't yet. I will test on windows x86 next which will hopefully give enough insights to avoid linux-arm testing (I don't have the device and qemu arm can only handle nativeaot apps).

@am11
Copy link
Member Author

am11 commented Oct 24, 2024

Windows x86 release

Before:

Type Operation Avg Time (ticks) Min (ticks) Max (ticks)
int Division 1 0 5799
int Modulus 1 1 4882
uint Division 1 1 3683
uint Modulus 1 0 2902
long Division 0 0 3167
long Modulus 0 0 2929
ulong Division 0 0 3170
ulong Modulus 0 0 2649

After:

Type Operation Avg Time (ticks) Min (ticks) Max (ticks)
int Division 0 0 3040
int Modulus 0 0 2553
uint Division 0 0 6176
uint Modulus 0 0 3496
long Division 3 1 42000
long Modulus 3 1 3622
ulong Division 3 1 5184
ulong Modulus 1 1 3982
Benchmark code (standalone console app)
using System;
using System.Diagnostics;
using System.Linq;

class Program
{
    static void Main()
    {
        int[] intValues = { 10, 100, -100, 1_000_000, 123456789 };
        uint[] uintValues = { 10U, 100U, 1_000U, 1_000_000U, 123456789U };
        long[] longValues = { 10L, 100L, -100L, 1_000_000_000L, 12345678901234L };
        ulong[] ulongValues = { 10UL, 100UL, 1_000UL, 1_000_000_000UL, 12345678901234UL };

        Console.Write("Warming up");
        for (int i = 0; i < 3; i++)
        {
            PerformOperations(intValues, (a, b) => a / b, (a, b) => a % b);
            PerformOperations(uintValues, (a, b) => a / b, (a, b) => a % b);
            PerformOperations(longValues, (a, b) => a / b, (a, b) => a % b);
            PerformOperations(ulongValues, (a, b) => a / b, (a, b) => a % b);
            Console.Write(".");
            System.Threading.Thread.Sleep(500);
        }
        Console.WriteLine("\n");

        var intResults = Benchmark("int", intValues, (a, b) => a / b, (a, b) => a % b);
        var uintResults = Benchmark("uint", uintValues, (a, b) => a / b, (a, b) => a % b);
        var longResults = Benchmark("long", longValues, (a, b) => a / b, (a, b) => a % b);
        var ulongResults = Benchmark("ulong", ulongValues, (a, b) => a / b, (a, b) => a % b);

        Console.WriteLine("### Benchmark Results");
        Console.WriteLine("| Type  | Operation | Avg Time (ticks) | Min (ticks) | Max (ticks) |");
        Console.WriteLine("| ----- | --------- | ---------------- | ----------- | ----------- |");
        PrintResults("int", intResults);
        PrintResults("uint", uintResults);
        PrintResults("long", longResults);
        PrintResults("ulong", ulongResults);
    }

    static (long AvgDivision, long MinDivision, long MaxDivision, long AvgModulus, long MinModulus, long MaxModulus)
        Benchmark<T>(string typeName, T[] values, Func<T, T, T> divideFunc, Func<T, T, T> modFunc)
    {
        int runs = 10;
        long[] divisionTimes = new long[runs];
        long[] modulusTimes = new long[runs];

        for (int i = 0; i < runs; i++)
        {
            divisionTimes[i] = MeasureTime(() => PerformDivision(values, divideFunc));
            modulusTimes[i] = MeasureTime(() => PerformModulus(values, modFunc));
        }

        var avgDivisionTime = FilterAndAverage(divisionTimes);
        var avgModulusTime = FilterAndAverage(modulusTimes);

        return (avgDivisionTime, divisionTimes.Min(), divisionTimes.Max(), avgModulusTime, modulusTimes.Min(), modulusTimes.Max());
    }

    static void PerformOperations<T>(T[] values, Func<T, T, T> divideFunc, Func<T, T, T> modFunc)
    {
        for (int i = 1; i < values.Length; i++)
        {
            var resultDiv = divideFunc(values[i], values[i - 1]);
            var resultMod = modFunc(values[i], values[i - 1]);
        }
    }

    static void PerformDivision<T>(T[] values, Func<T, T, T> divideFunc)
    {
        for (int i = 1; i < values.Length; i++)
        {
            var result = divideFunc(values[i], values[i - 1]);
        }
    }

    static void PerformModulus<T>(T[] values, Func<T, T, T> modFunc)
    {
        for (int i = 1; i < values.Length; i++)
        {
            var result = modFunc(values[i], values[i - 1]);
        }
    }

    static long MeasureTime(Action action)
    {
        Stopwatch stopwatch = new Stopwatch();
        stopwatch.Start();
        action();
        stopwatch.Stop();
        return stopwatch.ElapsedTicks;
    }

    static long FilterAndAverage(long[] times)
    {
        var sortedTimes = times.OrderBy(t => t).ToArray();
        var filteredTimes = sortedTimes.Skip(1).Take(sortedTimes.Length - 2);
        return (long)filteredTimes.Average();
    }

    static void PrintResults(string typeName, (long AvgDivision, long MinDivision, long MaxDivision, long AvgModulus, long MinModulus, long MaxModulus) results)
    {
        Console.WriteLine($"| {typeName}  | Division  | {results.AvgDivision}          | {results.MinDivision}         | {results.MaxDivision}         |");
        Console.WriteLine($"| {typeName}  | Modulus   | {results.AvgModulus}          | {results.MinModulus}         | {results.MaxModulus}         |");
    }
}

@jkotas
Copy link
Member

jkotas commented Oct 24, 2024

These numbers are hard to interpret.

  • This change should not affect 32-bit division on x86, so the first 4 rows should be just noise.
  • For 64-bit division, my guestimate from these numbers is that this change makes 64-bit division and modulus about 30% slower on x86. 30% is a significant regression.

@dotnet/jit-contrib What would it take to inline the exception throwing for helper-based division and modulus (similar to what we do on arm64 without helpers) so that we can avoid this regression?

@EgorBo
Copy link
Member

EgorBo commented Oct 24, 2024

@dotnet/jit-contrib What would it take to inline the exception throwing for helper-based division and modulus (similar to what we do on arm64 without helpers) so that we can avoid this regression?

I guess that should be a trivial change, I can share a branch in a bit

@am11
Copy link
Member Author

am11 commented Oct 24, 2024

@EgorBo, thanks for looking into it. Maybe we can aim to get rid of USE_HELPERS_FOR_INT_DIV as part of it? (it's always true on linux-arm so all the #if !USE_HELPERS_FOR_INT_DIV type of branches are code rot).

@EgorBo
Copy link
Member

EgorBo commented Oct 24, 2024

Maybe we can aim to get rid of USE_HELPERS_FOR_INT_DIV as part of it?

not sure I understand, do all arm32 targets have div instructions?

@EgorBo
Copy link
Member

EgorBo commented Oct 24, 2024

@dotnet/jit-contrib What would it take to inline the exception throwing for helper-based division and modulus (similar to what we do on arm64 without helpers) so that we can avoid this regression?

Ah, I think I misunderstood the question at first. is inlining the 0/overflow checks in front of the helper calls supposed to improve @am11's benchmarks? presumably the checks are just moved from one place to another? (well, except the cases where jit can fold them, but this benchmark is not the case?)

@am11
Copy link
Member Author

am11 commented Oct 24, 2024

Maybe we can aim to get rid of USE_HELPERS_FOR_INT_DIV as part of it?

not sure I understand, do all arm32 targets have div instructions?

Comment was about this TODO:

// TODO-ARM-CQ: Use shift for division by power of 2
// TODO-ARM-CQ: Check for sdiv/udiv at runtime and generate it if available
#define USE_HELPERS_FOR_INT_DIV 1 // BeagleBoard (ARMv7A) doesn't support SDIV/UDIV

v7-A models before Cortex-A15 don't support sdiv/udiv instructions. We can determine it at runtime by looking at the CPUID's ID_ISAR0 register's DIVIDE field, which provides information about hardware division support. But seems like we always set USE_HELPERS_FOR_INT_DIV to 1 and never actually exercised these type of code paths:

void CodeGen::genCodeForDivMod(GenTreeOp* tree)

@EgorBo
Copy link
Member

EgorBo commented Oct 24, 2024

We can determine it at runtime by looking at the CPUID's ID_ISAR0 register's DIVIDE field

What about NativeAOT/R2R ?

@am11
Copy link
Member Author

am11 commented Oct 24, 2024

We can determine it at runtime by looking at the CPUID's ID_ISAR0 register's DIVIDE field

What about NativeAOT/R2R ?

I'd imagine it would be another entry in https://github.com/dotnet/runtime/blob/d450d9c9ee4dd5a98812981dac06d2f92bdb8213/src/native/minipal/cpufeatures.h, aka the usual approach. :)

@jkotas
Copy link
Member

jkotas commented Oct 24, 2024

is inlining the 0/overflow checks in front of the helper calls supposed to improve @am11's benchmarks?

Yes, it will save a call frame. The extra call frame introduced by the changes in this PR is where the overhead is.

main: managed method -> FCall (with argument checks that require HMF) -> internal C-runtime div helper

PR (notice the extra frame): managed method -> managed helper call (argument checks) -> FCall (without argument checks) -> internal C-runtime div helper

Inlined checks: managed method (with argument checks) -> FCall (without argument checks) -> internal C-runtime div helper

arm32

We do not need to be doing any extra work to improve arm32 perf by detecting div instruction.

My primary concern is to avoid regressions on win x86. Win x86 is broadly used, for Windows desktop apps in particular.

@am11
Copy link
Member Author

am11 commented Oct 25, 2024

We do not need to be doing any extra work to improve arm32 perf by detecting div instruction.

#if USE_HELPERS_FOR_INT_DIV can be just #ifdef TARGET_ARM and the stale else cases can be deleted which were never exercised because USE_HELPERS_FOR_INT_DIV was always 1 on arm (undefined elsewhere since 2015 initial commit). My comment was based on the TODO in targetarm.h which is likely never happening.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-VM-coreclr community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants