Move arithmetic helpers to managed code #109087
Tagging subscribers to this area: @mangod9 |
We should check the perf impact on the affected architectures to see whether it is something we can live with.
There are alternatives with better perf: inline the exception throwing in the JITed code (and optimize it out if the JIT can prove that it cannot happen). We do that on arm64 today.
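As a rough illustration of what "inlining the exception throwing" means, the sketch below writes the two checks out in C#. It is not the JIT's actual output: `InlinedCheckSketch`, `RawDivide`, `Divide`, and `DivideByTen` are hypothetical names, and `RawDivide` only stands in for a division helper that performs no argument checks of its own.

```csharp
using System;

static class InlinedCheckSketch
{
    // Hypothetical stand-in for a division helper that does no argument checks.
    static int RawDivide(int dividend, int divisor) => dividend / divisor;

    // The checks sit in front of the helper call, in the caller's own code.
    public static int Divide(int dividend, int divisor)
    {
        if (divisor == 0)
            throw new DivideByZeroException();
        if (divisor == -1 && dividend == int.MinValue)
            throw new OverflowException();
        return RawDivide(dividend, divisor);
    }

    // With a constant divisor that is neither 0 nor -1, both checks are
    // provably dead, so a compiler that sees them inline can drop them.
    public static int DivideByTen(int dividend) => RawDivide(dividend, 10);
}
```

Here `InlinedCheckSketch.Divide(7, 2)` returns 3, while a zero divisor throws `DivideByZeroException` before the helper is reached.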
@AaronRobinsonMSFT, win-x86 leg is failing this assert:
Does it mean we need to use something like |
I think the crash that you are seeing is caused by a calling convention mismatch for 64-bit helpers. The existing helpers use The existing NAOT implementation deals with the calling convention issues by using QCalls that use the standard calling convention instead of the managed calling convention. |
Did you have a chance to check the perf impact? |
I haven't yet. I will test on windows x86 next which will hopefully give enough insights to avoid linux-arm testing (I don't have the device and qemu arm can only handle nativeaot apps). |
Windows x86 release Before:
After:
Benchmark code (standalone console app)
using System;
using System.Diagnostics;
using System.Linq;
class Program
{
static void Main()
{
int[] intValues = { 10, 100, -100, 1_000_000, 123456789 };
uint[] uintValues = { 10U, 100U, 1_000U, 1_000_000U, 123456789U };
long[] longValues = { 10L, 100L, -100L, 1_000_000_000L, 12345678901234L };
ulong[] ulongValues = { 10UL, 100UL, 1_000UL, 1_000_000_000UL, 12345678901234UL };
Console.Write("Warming up");
for (int i = 0; i < 3; i++)
{
PerformOperations(intValues, (a, b) => a / b, (a, b) => a % b);
PerformOperations(uintValues, (a, b) => a / b, (a, b) => a % b);
PerformOperations(longValues, (a, b) => a / b, (a, b) => a % b);
PerformOperations(ulongValues, (a, b) => a / b, (a, b) => a % b);
Console.Write(".");
System.Threading.Thread.Sleep(500);
}
Console.WriteLine("\n");
var intResults = Benchmark("int", intValues, (a, b) => a / b, (a, b) => a % b);
var uintResults = Benchmark("uint", uintValues, (a, b) => a / b, (a, b) => a % b);
var longResults = Benchmark("long", longValues, (a, b) => a / b, (a, b) => a % b);
var ulongResults = Benchmark("ulong", ulongValues, (a, b) => a / b, (a, b) => a % b);
Console.WriteLine("### Benchmark Results");
Console.WriteLine("| Type | Operation | Avg Time (ticks) | Min (ticks) | Max (ticks) |");
Console.WriteLine("| ----- | --------- | ---------------- | ----------- | ----------- |");
PrintResults("int", intResults);
PrintResults("uint", uintResults);
PrintResults("long", longResults);
PrintResults("ulong", ulongResults);
}
static (long AvgDivision, long MinDivision, long MaxDivision, long AvgModulus, long MinModulus, long MaxModulus)
Benchmark<T>(string typeName, T[] values, Func<T, T, T> divideFunc, Func<T, T, T> modFunc)
{
int runs = 10;
long[] divisionTimes = new long[runs];
long[] modulusTimes = new long[runs];
for (int i = 0; i < runs; i++)
{
divisionTimes[i] = MeasureTime(() => PerformDivision(values, divideFunc));
modulusTimes[i] = MeasureTime(() => PerformModulus(values, modFunc));
}
var avgDivisionTime = FilterAndAverage(divisionTimes);
var avgModulusTime = FilterAndAverage(modulusTimes);
return (avgDivisionTime, divisionTimes.Min(), divisionTimes.Max(), avgModulusTime, modulusTimes.Min(), modulusTimes.Max());
}
static void PerformOperations<T>(T[] values, Func<T, T, T> divideFunc, Func<T, T, T> modFunc)
{
for (int i = 1; i < values.Length; i++)
{
var resultDiv = divideFunc(values[i], values[i - 1]);
var resultMod = modFunc(values[i], values[i - 1]);
}
}
static void PerformDivision<T>(T[] values, Func<T, T, T> divideFunc)
{
for (int i = 1; i < values.Length; i++)
{
var result = divideFunc(values[i], values[i - 1]);
}
}
static void PerformModulus<T>(T[] values, Func<T, T, T> modFunc)
{
for (int i = 1; i < values.Length; i++)
{
var result = modFunc(values[i], values[i - 1]);
}
}
static long MeasureTime(Action action)
{
Stopwatch stopwatch = new Stopwatch();
stopwatch.Start();
action();
stopwatch.Stop();
return stopwatch.ElapsedTicks;
}
static long FilterAndAverage(long[] times)
{
var sortedTimes = times.OrderBy(t => t).ToArray();
var filteredTimes = sortedTimes.Skip(1).Take(sortedTimes.Length - 2);
return (long)filteredTimes.Average();
}
static void PrintResults(string typeName, (long AvgDivision, long MinDivision, long MaxDivision, long AvgModulus, long MinModulus, long MaxModulus) results)
{
Console.WriteLine($"| {typeName} | Division | {results.AvgDivision} | {results.MinDivision} | {results.MaxDivision} |");
Console.WriteLine($"| {typeName} | Modulus | {results.AvgModulus} | {results.MinModulus} | {results.MaxModulus} |");
}
}
These numbers are hard to interpret.
@dotnet/jit-contrib What would it take to inline the exception throwing for helper-based division and modulus (similar to what we do on arm64 without helpers) so that we can avoid this regression? |
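One likely reason the numbers are hard to read is that each timed region in the benchmark above covers only four delegate-invoked operations, so Stopwatch granularity and delegate overhead can dominate. The following is a minimal sketch of an alternative measurement, not what was run for this PR: `DivBenchSketch` and its members are hypothetical names, and the iteration count is left to the caller.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

static class DivBenchSketch
{
    // Written to after each run so the division results are observably used.
    static long s_sink;

    // Divides each element by its predecessor, repeated `iterations` times.
    // Direct division (no delegates) keeps the measured path short.
    [MethodImpl(MethodImplOptions.NoInlining)]
    static long DivideAll(long[] values, int iterations)
    {
        long sink = 0;
        for (int n = 0; n < iterations; n++)
            for (int i = 1; i < values.Length; i++)
                sink += values[i] / values[i - 1];
        return sink;
    }

    // Returns the average cost of one division in nanoseconds.
    public static double NanosecondsPerDivision(long[] values, int iterations)
    {
        if (values.Length < 2 || iterations < 1)
            throw new ArgumentException("Need at least two values and one iteration.");

        s_sink = DivideAll(values, iterations / 10 + 1); // warm-up pass

        long start = Stopwatch.GetTimestamp();
        s_sink = DivideAll(values, iterations);
        long elapsedTicks = Stopwatch.GetTimestamp() - start;

        long operations = (long)iterations * (values.Length - 1);
        return elapsedTicks * (1e9 / Stopwatch.Frequency) / operations;
    }
}
```

Calling `DivBenchSketch.NanosecondsPerDivision(longValues, 10_000_000)` on the before and after builds would give one per-operation figure for each, which is easier to compare than min/max ticks. Tiered compilation can still affect a short warm-up, so a harness such as BenchmarkDotNet remains the more reliable option.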
I guess that should be a trivial change, I can share a branch in a bit |
@EgorBo, thanks for looking into it. Maybe we can aim to get rid of |
not sure I understand, do all arm32 targets have div instructions? |
Ah, I think I misunderstood the question at first. Is inlining the 0/overflow checks in front of the helper calls supposed to improve @am11's benchmarks? Presumably the checks are just moved from one place to another? (Well, except the cases where the JIT can fold them, but that is not the case in this benchmark?) |
Comment was about this TODO: runtime/src/coreclr/jit/targetarm.h Lines 10 to 12 in d450d9c
v7-A models before Cortex-A15 don't support sdiv/udiv instructions. We can determine it at runtime by looking at the CPUID's ID_ISAR0 register's DIVIDE field, which provides information about hardware division support. But it seems like we always set runtime/src/coreclr/jit/codegenarm.cpp Line 1140 in d450d9c
What about NativeAOT/R2R ? |
I'd imagine it would be another entry in https://github.com/dotnet/runtime/blob/d450d9c9ee4dd5a98812981dac06d2f92bdb8213/src/native/minipal/cpufeatures.h, aka the usual approach. :) |
Yes, it will save a call frame. The extra call frame introduced by the changes in this PR is where the overhead is.
main: managed method -> FCall (with argument checks that require HMF) -> internal C-runtime div helper
PR (notice the extra frame): managed method -> managed helper call (argument checks) -> FCall (without argument checks) -> internal C-runtime div helper
Inlined checks: managed method (with argument checks) -> FCall (without argument checks) -> internal C-runtime div helper
We do not need to be doing any extra work to improve arm32 perf by detecting div instruction. My primary concern is to avoid regressions on win x86. Win x86 is broadly used, for Windows desktop apps in particular. |
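The "PR" and "inlined checks" call shapes described above can be approximated in plain C#. This is only a sketch of the frame structure: `CallShapeSketch` and its methods are hypothetical names, `DivInternal` stands in for the check-free FCall, and the `MethodImpl` attributes merely imitate whether a separate frame exists.

```csharp
using System;
using System.Runtime.CompilerServices;

static class CallShapeSketch
{
    // Stand-in for the FCall without argument checks; NoInlining keeps its frame.
    [MethodImpl(MethodImplOptions.NoInlining)]
    static long DivInternal(long dividend, long divisor) => dividend / divisor;

    // "PR" shape: a separate managed helper frame performs the checks,
    // then calls the check-free helper (two calls from the caller's view).
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static long DivViaManagedHelper(long dividend, long divisor)
    {
        if (divisor == 0)
            throw new DivideByZeroException();
        if (divisor == -1 && dividend == long.MinValue)
            throw new OverflowException();
        return DivInternal(dividend, divisor);
    }

    // "Inlined checks" shape: the checks are meant to end up in the caller,
    // leaving only the call to the check-free helper.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static long DivWithInlinedChecks(long dividend, long divisor)
    {
        if (divisor == 0)
            throw new DivideByZeroException();
        if (divisor == -1 && dividend == long.MinValue)
            throw new OverflowException();
        return DivInternal(dividend, divisor);
    }
}
```

Both methods return the same results and throw the same exceptions; the only intended difference is the number of frames between the caller and `DivInternal`.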
Only the FCThrow ones in jithelpers have been moved. The JIT_GetRefAny method was left out due to the existing TODO regarding type inheritance.