Reimplement stubs to improve performance #65738
Conversation
Performance improvements - plaintext benchmark server-side start time and client-side read throughput (which is linearly proportional to requests/s). Please take the results with a grain of salt; they vary a lot and these are averages of 7 runs with the lowest outlier removed. But the trend is stable.
(Result tables: Win x64, Linux Intel x64, Linux AMD x64)
I've measured similar trends on arm64 Linux in the past, but I need to re-measure after the recent cleanup and code unification changes.
It seems disabling the mapping logging has broken the build; I am looking into it.
        jmp     QWORD PTR [DATA_SLOT(LookupStub, ResolveWorkerTarget)]
LEAF_END_MARKED LookupStubCode, _TEXT

LEAF_ENTRY DispatchStubCode, _TEXT
I am surprised that you were able to get away with the extra indirection in DispatchStubCode. It will be interesting to see the results of the microbenchmark runs.
I am not sure what indirection you mean. This was the original code:
runtime/src/coreclr/vm/amd64/virtualcallstubcpu.hpp
Lines 286 to 289 in a1a653c

BYTE    _entryPoint [2];   // 48 B8                    mov    rax,
size_t  _expectedMT;       // xx xx xx xx xx xx xx xx  64-bit address
BYTE    part1 [3];         // 48 39 XX                 cmp    [THIS_REG], rax
BYTE    nopOp;             // 90                       nop    ; 1-byte nop to align _implTarget

runtime/src/coreclr/vm/amd64/virtualcallstubcpu.hpp
Lines 183 to 191 in a1a653c

BYTE    part1[2];          // 48 B8                    mov    rax,
size_t  _implTarget;       // xx xx xx xx xx xx xx xx  64-bit address
BYTE    part2 [1];         // 75                       jne
BYTE    _failDispl;        // xx                       failLabel
BYTE    part3 [2];         // FF E0                    jmp    rax
                           // failLabel:
BYTE    part4 [2];         // 48 B8                    mov    rax,
size_t  _failTarget;       // xx xx xx xx xx xx xx xx  64-bit address
BYTE    part5 [2];         // FF E0                    jmp    rax
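For contrast, under the new scheme the code template is shared and reads its operands through DATA_SLOT indirections (as in the LookupStubCode excerpt above) instead of having them baked in as immediates. A rough sketch of the per-stub data block such a template would reference - field names and layout here are assumptions for illustration, not the PR's exact definitions:

// Sketch only: a shared DispatchStubCode template would load/jump through
// these slots, e.g. mov rax, QWORD PTR [DATA_SLOT(DispatchStub, ExpectedMT)]
// and jmp QWORD PTR [DATA_SLOT(DispatchStub, ImplTarget)].
struct DispatchStubData
{
    size_t ExpectedMT;   // compared against [THIS_REG]
    PCODE  ImplTarget;   // jump target when the MethodTable matches
    PCODE  FailTarget;   // jump target when the check fails
};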
I meant the performance overhead from the extra indirections introduced by this refactoring. The original code had two immediates, the new code has two indirections.
@adamsitnik What's the best way to do a dotnet/performance benchmark run for a PR like this one to see what improved/regressed on different target platforms?
Ah, got it, I thought you meant it the other way around.
What's the best way to do a dotnet/performance benchmark run for a PR like this one to see what improved/regressed on different target platforms?
We have two options:
- Run the benchmarks using two different coreruns (before & after), store the results in dedicated folders, and then compare them using the ResultsComparer tool (see https://github.com/dotnet/performance/blob/ef3edbc52d92c6b30ba1c316082f46865e5ff1d6/docs/benchmarking-workflow-dotnet-runtime.md#preventing-regressions). It's the best option if the implementation might change and you might need to re-run the benchmarks using an updated corerun.
- Run the benchmarks using two different coreruns without storing the results in dedicated folders, and use BDN to perform the statistical test on the fly (see https://github.com/dotnet/performance/blob/03207f183f042f6fc6b9f341df7a0e36b7175f5d/src/benchmarks/micro/README.md#private-runtime-builds). It's good if you have only a few benchmarks or don't intend to change the implementation.
src/coreclr/vm/jitinterface.cpp (outdated diff)
@@ -8804,7 +8804,15 @@ void CEEInfo::getFunctionEntryPoint(CORINFO_METHOD_HANDLE ftnHnd,
    // Resolve methodImpl.
    ftn = ftn->GetMethodTable()->MapMethodDeclToMethodImpl(ftn);

    ret = (void *)ftn->TryGetMultiCallableAddrOfCode(accessFlags);
    if (!ftn->IsFCall() && ftn->MayHavePrecode() && ftn->GetPrecodeType() == PRECODE_FIXUP)
I think this condition is not right for the tiered-compilation-disabled case (not the default, but still used by some first parties). Does this need to call into the code version manager?
It is possible (although all the tests pass with tiered compilation disabled except for one where the failure looks unrelated to this). What do you think is the specific problem here?
I think that the tiered-compilation-disabled case will go through an unnecessary indirection when the target method is already JITed.
I've tried to add !ftn->HasStableEntryPoint() to the condition. It severely degraded the plaintext benchmark performance. I've also thought that !ftn->IsPointingToStableNativeCode() might be needed, but that function seems to never report true together with the preexisting condition (based on my testing where I added an assert and none of the coreclr pri1 tests hit it, with or without tiered compilation enabled).
I think the logic should be something like this:
if (ftn->IsVersionable())
{
    IAT_PVALUE // must go via indirection to enable versioning
}
else
{
    if (the target has final code)
        IAT_VALUE for the final code // this includes FCalls and methods that are already JITed (w/o tiering)
    else
        IAT_PVALUE // must go via indirection to pick up the final code once it is ready
}
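A minimal C++ sketch of that shape, using MethodDesc accessors mentioned elsewhere in this thread; whether HasStableEntryPoint() is the right notion of "has final code", or whether the code version manager needs to be consulted instead, is left open here, so treat this as illustrative only rather than the PR's final code:

// Illustrative only; "ret"/"accessType" stand for the values getFunctionEntryPoint
// reports back to the JIT, and the precode is assumed to be a FixupPrecode.
if (ftn->IsVersionable())
{
    // Must go via the indirection slot so call sites pick up new code versions.
    ret = (void *)((FixupPrecode*)ftn->GetOrCreatePrecode())->GetTargetSlot();
    accessType = IAT_PVALUE;
}
else if (ftn->IsFCall() || ftn->HasStableEntryPoint())
{
    // Final code exists and will not change: hand the JIT a direct address.
    ret = (void *)ftn->TryGetMultiCallableAddrOfCode(accessFlags);
    accessType = IAT_VALUE;
}
else
{
    // Final code not ready yet: indirect through the fixup precode slot so the
    // call site picks it up once it is published.
    ret = (void *)((FixupPrecode*)ftn->GetOrCreatePrecode())->GetTargetSlot();
    accessType = IAT_PVALUE;
}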
I will give it a try
@janvorli since you seem to be refactoring a lot of this, is it now possible to find out where a currently jitted method will end up in memory, precisely (or at least approximately), so that "relocs" in jitted code could rely on it?
@EgorBo this change only changes the allocation of stubs; it doesn't change in any way where jitted code will end up. But I am not sure I fully understood what you are after.
This change implements `FixupPrecodeStub`, `PrecodeStub`, `CallCountingStub` and the VSD stubs `LookupStub`, `DispatchStub` and `ResolveStub` using a new mechanism with fixed code and separate RW data. The `LoaderHeap` was updated to support a new kind of allocation using interleaved code and data pages to support this new mechanism.

The JIT now generates code that uses an indirection slot to jump to methods using `FixupPrecode`, improving performance of the ASPNet plaintext benchmark by 3-4% depending on the target platform (measured on x64 Windows / Linux and arm64 Linux).

I have also removed the Holders, as the stubs are naturally properly aligned due to the way they are allocated. There is now only a single variant of each stub; there are no long / short ones anymore, as they are not needed - the indirect jumps we use now are not range limited.

Most of the stubs code is now target agnostic and the originally split implementation is now in a single place for all targets. Only a few constants are defined as target specific.

The code for the stubs is no longer generated as bytes by C++ code, but rather written in asm and compiled. These precompiled templates are then used as a source to copy the code from. The x86 is a bit more complex than that, because it doesn't support PC-relative indirect addressing, so we need to relocate all accesses to the data slots when generating the code pages. As a further improvement, we could generate just a single page of the code and then map it many times. This is left for future work.

ARM64 Unix differs from the other targets / platforms - there are various page sizes in use. So the asm templates are generated for 4k..64k page sizes and the variant is picked at runtime based on the page size reported by the OS.

This also removes a lot of writeable mappings created for modifications of the stub code when W^X is enabled; in the plaintext benchmark they were reduced by 75%. That results in a significant reduction of the .NET application startup time with W^X enabled.

I think the `LoaderHeap` would benefit from some refactoring, but I'd prefer leaving it for a follow up. It seems that for the sake of the review, it is better to keep it as is.

The change also implements logging of the number of mappings and their exact locations. This helped me to drive the work and I am planning to use it for further changes. It can be removed in the future once we reach a final state.

There are still opportunities for improvement, but these stubs allowed me to scrape off the most significant portion of the mappings.
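To make the interleaved code/data idea concrete, here is a conceptual sketch (the type, names and offset arithmetic are assumptions for illustration, not the actual LoaderHeap interface): each RX code page filled from the precompiled template is paired with an RW data page mapped right after it, so every stub can reach its own data at a fixed, template-known distance from its code.

// Conceptual sketch of one interleaved code/data page pair.
struct InterleavedStubPage
{
    BYTE*  codePage;   // RX page whose contents come from the asm stub template
    size_t pageSize;   // OS page size; the RW data page follows immediately

    // Code of the i-th stub in the page.
    void* GetStubCode(size_t index, size_t stubSize) const
    {
        return codePage + index * stubSize;
    }

    // Data of the i-th stub: same offset, one page further. The fixed code->data
    // distance is what lets the shared template reach its data slots with a
    // single PC-relative indirection (the DATA_SLOT pattern).
    void* GetStubData(size_t index, size_t stubSize) const
    {
        return codePage + pageSize + index * stubSize;
    }
};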
* Change the CallCountingStub to not use the return address of an unbalanced call as a stub-identifying token. The counter address is used instead on all targets.
* Fix some tabs instead of spaces.
* Fix getTargetMethodDesc - in some cases, we get the address of the start of the FixupPrecode too.
* Remove a leftover comment.
The assembler was generating 32-bit conditional relative jumps instead of ones with an 8-bit displacement. I've found that the presence of a global label between the jump site and the destination makes the assembler do that. Changing the PATCH_LABEL macro fixed it.
{
    Precode* pPrecode = Precode::GetPrecodeFromEntryPoint((PCODE)hlpDynamicFuncTable[dynamicFtnNum].pfnHelper);
    _ASSERTE(pPrecode->GetType() == PRECODE_FIXUP);
    *ppIndirection = ((FixupPrecode*)pPrecode)->GetTargetSlot();
What guarantees that we have the final code of the method by this point?
If this is a valid optimization, we should be doing it when filling hlpDynamicFuncTable.
We don't need to have the final code. This just passes the fixup precode indirection slot to the JIT. If we don't have the final code, the slot points to the fixup part of the stub. When we have the JITted code, the slot points to it. Without this optimization, the JIT would jump to the beginning of the fixup stub and the indirect jump in there would jump through the slot, so this saves one jump. I have found this using the performance repo microbenchmarks, where the casting benchmarks were consistently slower with my change. With this fix, they are now consistently faster.
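The saved jump can be pictured like this (x64-style pseudo-assembly in comments, simplified for illustration; names like FixupPrecodeOfHelper and TargetSlot are placeholders for the helper's fixup precode and its indirection slot):

// Without passing the slot to the JIT: two jumps before reaching the helper code.
//     call  FixupPrecodeOfHelper            ; direct call emitted by the JIT
//     ...
//   FixupPrecodeOfHelper:
//     jmp   QWORD PTR [TargetSlot]          ; indirect jump through the slot
//
// With *ppIndirection = GetTargetSlot(): one jump.
//     call  QWORD PTR [TargetSlot]          ; indirect call emitted by the JIT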
We can modify the hlpDynamicFuncTable entry to be able to carry either the indirection or the target address, plus a flag indicating which one it is. I was going back and forth in my head on whether to do that or do it the way I've ended up doing it.
However, there is an _AddrIsJITHelper function on 32-bit Windows in the debugger controller code that compares the addresses of all the helpers with an address the debugger has stepped into, to avoid the debugger stopping in unmanaged runtime code. While I believe that code should not be reached for these precode stubs (we should get the TRACE_STUB trace type), I need to double check that. I am talking about the case when the helper was not JITted yet, so the indirection would go to the middle of the fixup stub and the code there would not detect that it was in a helper.
This just passes the fixup precode indirection slot to the JIT. If we don't have the final code, the slot points to the fixup part of the stub.
Ah ok, I had missed that.
I have removed the VSD stubs change; it was causing steady-state performance degradation in more complex ASPNet benchmarks like Orchard or Fortunes. I am planning to re-add the Lookup stub in a follow-up change, but for this PR, it was safer to remove it all.
Extensive benchmarking has shown that the new model for VSD stubs causes steady-state performance degradation for code using a lot of virtual calls. It wasn't showing up in the plaintext benchmark I had used as the driver of the change. This change loses part of the startup perf improvements, but I am planning to add back the Lookup stubs in a follow-up change. Those should not be perf critical, and they are likely where the startup improvements come from.
Before merging this change, I would like to share some new details on the performance. It turned out that while the plaintext and JSON benchmarks that I was using as a driver were showing gains all over the place, Orchard and Fortunes, which I've tested after this PR was created, are showing a 2.5-4.5% regression in read throughput on Citrine Intel Windows and Linux. But on Citrine AMD Linux, the same metric is showing a 12-17% gain. The AMD machine has more CPU cores, so that might be one of the reasons.
I assume this perf improvement is because of this PR 👍 (according to the graph, the commit range is 05cb7f5...110cb9f)
Windows x64 improvements: dotnet/perf-autofiling-issues#4226, dotnet/perf-autofiling-issues#4225
Windows Arm64 improvements: dotnet/perf-autofiling-issues#4251, dotnet/perf-autofiling-issues#4252
Ubuntu Arm64 improvements: dotnet/perf-autofiling-issues#4258, dotnet/perf-autofiling-issues#4260
@janvorli we detected a slight x64 regression for System.Linq.Tests.Perf_Enumerable.FirstWithPredicate_LastElementMatches(input: List).
@jozkee I am sorry for missing your message before. Yes, I believe the regression was caused by that change. There are a few corner cases where the new stubs implementation has a slight negative effect on micro-benchmarks. Some delegate calls are one of those cases, and this test was one where I've seen it have a visible impact when I was developing the change.