Reimplement stubs to improve performance #65738
Conversation
Performance improvements - plaintext benchmark server-side start time and client-side read throughput (which is linearly proportional to requests/s). Please take the results with a grain of salt; they vary a lot and these are averages of 7 runs with the lowest outlier removed. But the trend is stable.
(Result tables: Win x64, Linux Intel x64, Linux AMD x64)
I've measured similar trends on arm64 Linux in the past, but I need to re-measure after the recent cleanup and code unification changes.
It seems disabling the mapping logging has broken the build; I am looking into it.
        jmp     QWORD PTR [DATA_SLOT(LookupStub, ResolveWorkerTarget)]
LEAF_END_MARKED LookupStubCode, _TEXT

LEAF_ENTRY DispatchStubCode, _TEXT
I am surprised that you were able to get away with the extra indirection in DispatchStubCode. It will be interesting to see the results of the microbenchmark runs.
I am not sure what indirection you mean. This was the original code:
runtime/src/coreclr/vm/amd64/virtualcallstubcpu.hpp
Lines 286 to 289 in a1a653c

BYTE    _entryPoint [2];   // 48 B8                    mov    rax,
size_t  _expectedMT;       // xx xx xx xx xx xx xx xx  64-bit address
BYTE    part1 [3];         // 48 39 XX                 cmp    [THIS_REG], rax
BYTE    nopOp;             // 90                       nop    ; 1-byte nop to align _implTarget

runtime/src/coreclr/vm/amd64/virtualcallstubcpu.hpp
Lines 183 to 191 in a1a653c

BYTE    part1[2];          // 48 B8                    mov    rax,
size_t  _implTarget;       // xx xx xx xx xx xx xx xx  64-bit address
BYTE    part2 [1];         // 75                       jne
BYTE    _failDispl;        // xx                       failLabel
BYTE    part3 [2];         // FF E0                    jmp    rax
                           // failLabel:
BYTE    part4 [2];         // 48 B8                    mov    rax,
size_t  _failTarget;       // xx xx xx xx xx xx xx xx  64-bit address
BYTE    part5 [2];         // FF E0                    jmp    rax
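For contrast, under the new scheme the code template is shared and reads its operands through DATA_SLOT indirections (as in the LookupStubCode excerpt above) instead of having them baked in as immediates. A rough sketch of the per-stub data block such a template would reference - field names and layout here are assumptions for illustration, not the PR's exact definitions:

// Sketch only: a shared DispatchStubCode template would load/jump through
// these slots, e.g. mov rax, QWORD PTR [DATA_SLOT(DispatchStub, ExpectedMT)]
// and jmp QWORD PTR [DATA_SLOT(DispatchStub, ImplTarget)].
struct DispatchStubData
{
    size_t ExpectedMT;   // compared against [THIS_REG]
    PCODE  ImplTarget;   // jump target when the MethodTable matches
    PCODE  FailTarget;   // jump target when the check fails
};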
I meant the performance overhead from the extra indirections introduced by this refactoring. The original code had two immediates, the new code has two indirections.
@adamsitnik What's the best way to do a dotnet/performance benchmark run for a PR like this one to see what improved/regressed on different target platforms?
Ah, got it, I thought you meant it the other way around.
What's the best way to do a dotnet/performance benchmark run for a PR like this one to see what improved/regressed on different target platforms?
We have two options:
- Run the benchmarks using two different coreruns (before & after), store the results in dedicated folders, and then compare them using the ResultsComparer tool (see https://github.com/dotnet/performance/blob/ef3edbc52d92c6b30ba1c316082f46865e5ff1d6/docs/benchmarking-workflow-dotnet-runtime.md#preventing-regressions). It's the best option if the implementation might change and you might need to re-run the benchmarks using an updated corerun.
- Run the benchmarks using two different coreruns without storing the results in dedicated folders, and use BDN to perform the statistical test on the fly (see https://github.com/dotnet/performance/blob/03207f183f042f6fc6b9f341df7a0e36b7175f5d/src/benchmarks/micro/README.md#private-runtime-builds). It's good if you have only a few benchmarks or don't intend to change the implementation.
src/coreclr/vm/jitinterface.cpp (outdated diff)
@@ -8804,7 +8804,15 @@ void CEEInfo::getFunctionEntryPoint(CORINFO_METHOD_HANDLE ftnHnd,
    // Resolve methodImpl.
    ftn = ftn->GetMethodTable()->MapMethodDeclToMethodImpl(ftn);

    ret = (void *)ftn->TryGetMultiCallableAddrOfCode(accessFlags);
    if (!ftn->IsFCall() && ftn->MayHavePrecode() && ftn->GetPrecodeType() == PRECODE_FIXUP)
I think this condition is not right for the tiered-compilation-disabled case (not the default, but still used by some first parties). Does this need to call into the code version manager?
It is possible (although all the tests pass with tiered compilation disabled except for one where the failure looks unrelated to this). What do you think is the specific problem here?
I think that the tiered-compilation-disabled case will go through an unnecessary indirection when the target method is already JITed.
I've tried to add !ftn->HasStableEntryPoint() to the condition. It severely degraded the plaintext benchmark performance. I've also thought that !ftn->IsPointingToStableNativeCode() might be needed, but that function seems to never report true together with the preexisting condition (based on my testing where I added an assert and none of the coreclr pri1 tests hit it, with or without tiered compilation enabled).
I think the logic should be something like this:
if (ftn->IsVersionable())
{
    IAT_PVALUE // must go via indirection to enable versioning
}
else
{
    if (the target has final code)
        IAT_VALUE for the final code // this includes FCalls and methods that are already JITed (w/o tiering)
    else
        IAT_PVALUE // must go via indirection to pick up the final code once it is ready
}
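A minimal C++ sketch of that shape, using MethodDesc accessors mentioned elsewhere in this thread; whether HasStableEntryPoint() is the right notion of "has final code", or whether the code version manager needs to be consulted instead, is left open here, so treat this as illustrative only rather than the PR's final code:

// Illustrative only; "ret"/"accessType" stand for the values getFunctionEntryPoint
// reports back to the JIT, and the precode is assumed to be a FixupPrecode.
if (ftn->IsVersionable())
{
    // Must go via the indirection slot so call sites pick up new code versions.
    ret = (void *)((FixupPrecode*)ftn->GetOrCreatePrecode())->GetTargetSlot();
    accessType = IAT_PVALUE;
}
else if (ftn->IsFCall() || ftn->HasStableEntryPoint())
{
    // Final code exists and will not change: hand the JIT a direct address.
    ret = (void *)ftn->TryGetMultiCallableAddrOfCode(accessFlags);
    accessType = IAT_VALUE;
}
else
{
    // Final code not ready yet: indirect through the fixup precode slot so the
    // call site picks it up once it is published.
    ret = (void *)((FixupPrecode*)ftn->GetOrCreatePrecode())->GetTargetSlot();
    accessType = IAT_PVALUE;
}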
I will give it a try
@janvorli since you seem to be refactoring a lot of this, is it now possible to find out where a currently jitted method will end up in memory, precisely (or at least approximately), so that "relocs" in jitted code could rely on it?
@EgorBo this change only changes the allocation of stubs; it doesn't change in any way where jitted code will end up. But I am not sure I fully understood what you are after.
This change implements `FixupPrecodeStub`, `PrecodeStub`, `CallCountingStub` and the VSD stubs `LookupStub`, `DispatchStub` and `ResolveStub` using a new mechanism with fixed code and separate RW data. The `LoaderHeap` was updated to support a new kind of allocation using interleaved code and data pages to support this new mechanism.

The JIT now generates code that uses an indirection slot to jump to methods using `FixupPrecode`, improving performance of the ASPNet plaintext benchmark by 3-4% depending on the target platform (measured on x64 Windows / Linux and arm64 Linux).

I have also removed the Holders, as the stubs are naturally properly aligned due to the way they are allocated. There is now only a single variant of each stub; there are no long / short ones anymore, as they are not needed - the indirect jumps we use now are not range limited.

Most of the stubs code is now target agnostic and the originally split implementation is now in a single place for all targets. Only a few constants are defined as target specific.

The code for the stubs is no longer generated as bytes by C++ code, but rather written in asm and compiled. These precompiled templates are then used as a source to copy the code from. The x86 is a bit more complex than that, because it doesn't support PC-relative indirect addressing, so we need to relocate all accesses to the data slots when generating the code pages. As a further improvement, we could generate just a single page of the code and then map it many times. This is left for future work.

ARM64 Unix differs from the other targets / platforms - there are various page sizes in use. So the asm templates are generated for 4k..64k page sizes and the variant is picked at runtime based on the page size reported by the OS.

This also removes a lot of writeable mappings created for modifications of the stub code when W^X is enabled; in the plaintext benchmark they were reduced by 75%. That results in a significant reduction of the .NET application startup time with W^X enabled.

I think the `LoaderHeap` would benefit from some refactoring, but I'd prefer leaving it for a follow up. It seems that for the sake of the review, it is better to keep it as is.

The change also implements logging of the number of mappings and their exact locations. This helped me to drive the work and I am planning to use it for further changes. It can be removed in the future once we reach a final state.

There are still opportunities for improvement, but these stubs allowed me to scrape off the most significant portion of the mappings.
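To make the interleaved code/data idea concrete, here is a conceptual sketch (the type, names and offset arithmetic are assumptions for illustration, not the actual LoaderHeap interface): each RX code page filled from the precompiled template is paired with an RW data page mapped right after it, so every stub can reach its own data at a fixed, template-known distance from its code.

// Conceptual sketch of one interleaved code/data page pair.
struct InterleavedStubPage
{
    BYTE*  codePage;   // RX page whose contents come from the asm stub template
    size_t pageSize;   // OS page size; the RW data page follows immediately

    // Code of the i-th stub in the page.
    void* GetStubCode(size_t index, size_t stubSize) const
    {
        return codePage + index * stubSize;
    }

    // Data of the i-th stub: same offset, one page further. The fixed code->data
    // distance is what lets the shared template reach its data slots with a
    // single PC-relative indirection (the DATA_SLOT pattern).
    void* GetStubData(size_t index, size_t stubSize) const
    {
        return codePage + pageSize + index * stubSize;
    }
};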
* Change the CallCountingStub to not use the return address of an unbalanced call as a stub-identifying token. The counter address is used instead on all targets.
* Fix some tabs instead of spaces.
* Fix getTargetMethodDesc - in some cases, we get the address of the start of the FixupPrecode too.
* Remove a leftover comment.
The assembler was generating 32-bit conditional relative jumps instead of ones with an 8-bit displacement. I've found that the presence of a global label between the jump site and the destination makes the assembler do that. Changing the PATCH_LABEL macro fixed it.
{
    Precode* pPrecode = Precode::GetPrecodeFromEntryPoint((PCODE)hlpDynamicFuncTable[dynamicFtnNum].pfnHelper);
    _ASSERTE(pPrecode->GetType() == PRECODE_FIXUP);
    *ppIndirection = ((FixupPrecode*)pPrecode)->GetTargetSlot();
What guarantees that we have the final code of the method by this point?
If this is a valid optimization, we should be doing it when filling hlpDynamicFuncTable.
We don't need to have the final code. This just passes the fixup precode indirection slot to the JIT. If we don't have the final code, the slot points to the fixup part of the stub. When we have the JITted code, the slot points to it. Without this optimization, the JIT would jump to the beginning of the fixup stub and the indirect jump in there would jump through the slot, so this saves one jump. I have found this using the performance repo microbenchmarks, where the casting benchmarks were consistently slower with my change. With this fix, they are now consistently faster.
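The saved jump can be pictured like this (x64-style pseudo-assembly in comments, simplified for illustration; names like FixupPrecodeOfHelper and TargetSlot are placeholders for the helper's fixup precode and its indirection slot):

// Without passing the slot to the JIT: two jumps before reaching the helper code.
//     call  FixupPrecodeOfHelper            ; direct call emitted by the JIT
//     ...
//   FixupPrecodeOfHelper:
//     jmp   QWORD PTR [TargetSlot]          ; indirect jump through the slot
//
// With *ppIndirection = GetTargetSlot(): one jump.
//     call  QWORD PTR [TargetSlot]          ; indirect call emitted by the JIT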
We can modify the hlpDynamicFuncTable entry to be able to carry either the indirection or the target address, plus a flag indicating which one it is. I was going back and forth in my head on whether to do that or do it the way I've ended up doing it.
However, there is an _AddrIsJITHelper function on 32-bit Windows in the debugger controller code that compares the addresses of all the helpers with an address the debugger has stepped into, to avoid the debugger stopping in unmanaged runtime code. While I believe that code should not be reached for these precode stubs (we should get the TRACE_STUB trace type), I need to double check that. I am talking about the case when the helper was not JITted yet, so the indirection would go to the middle of the fixup stub and the code there would not detect that it was in a helper.
This just passes the fixup precode indirection slot to the JIT. If we don't have the final code, the slot points to the fixup part of the stub.
Ah ok, I had missed that.
I have removed the VSD stubs change; it was causing steady-state performance degradation in more complex ASPNet benchmarks like Orchard or Fortunes. I am planning to re-add the Lookup stub in a follow-up change, but for this PR, it was safer to remove it all.
Extensive benchmarking has shown that the new model for VSD stubs causes steady-state performance degradation for code using a lot of virtual calls. It wasn't showing up in the plaintext benchmark I had used as the driver of the change. This change loses part of the startup perf improvements, but I am planning to add back the Lookup stubs in a follow-up change. Those should not be perf critical, and they are likely where the startup improvements come from.
Before merging this change, I would like to share some new details on the performance. It turned out that while the plaintext and JSON benchmarks that I was using as a driver were showing gains all over the place, Orchard and Fortunes, which I've tested after this PR was created, are showing a 2.5-4.5% regression in read throughput on Citrine Intel Windows and Linux. But on Citrine AMD Linux, the same metric is showing a 12-17% gain. The AMD machine has more CPU cores, so that might be one of the reasons.
I assume this perf improvement is because of this PR 👍 (according to the graph, the commit range is 05cb7f5...110cb9f)
Windows x64 improvements: dotnet/perf-autofiling-issues#4226, dotnet/perf-autofiling-issues#4225
Windows Arm64 improvements: dotnet/perf-autofiling-issues#4251, dotnet/perf-autofiling-issues#4252
Ubuntu Arm64 improvements: dotnet/perf-autofiling-issues#4258, dotnet/perf-autofiling-issues#4260
@janvorli we detected a slight x64 regression for System.Linq.Tests.Perf_Enumerable.FirstWithPredicate_LastElementMatches(input: List).
@jozkee I am sorry for missing your message before. Yes, I believe the regression was caused by that change. There are a few corner cases where the new stubs implementation has a slight negative effect on micro-benchmarks. Some delegate calls are one of those cases, and this test was one where I've seen it have a visible impact when I was developing the change.